Re: [HACKERS] logical decoding of two-phase transactions

Started by Nikhil Sontakke, over 7 years ago, 424 messages
#1 Nikhil Sontakke
nikhils@2ndquadrant.com
4 attachment(s)

Hi,

PFA, latest patchset which incorporates the additional feedback.

There's an additional test case in
0005-Additional-test-case-to-demonstrate-decoding-rollbac.patch which
uses a sleep in the "change" plugin API to allow a concurrent rollback
on the 2PC being currently decoded. Andres generally doesn't like this
approach :-), but there are no timing/interlocking issues now, and the
sleep just helps us do a concurrent rollback, so it might be ok now,
all things considered. Anyways, it's an additional patch for now.

Yea, I still don't think it's ok. The tests won't be reliable. There
are ways to make this reliable, e.g. by forcing a lock to be acquired that's
externally held or such. Might even be doable just with a weird custom
datatype.

Ok, I will look at ways to do away with the sleep.

The attached patchset implements a non-sleep-based approach by
sending the 2PC XID to the pg_logical_slot_get_changes() function as
an option for the test_decoding plugin. So, an example invocation
will now look like:

SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
NULL, 'skip-empty-xacts', '1', 'check-xid', '$xid2pc');

If the test_decoding pg_decode_change() API sees a valid xid
argument, it will wait for that transaction to be aborted. Another
backend can then come in and merrily abort this ongoing 2PC in the
background. Once it's aborted, the pg_decode_change() API will go ahead
and hit an ERROR in the systable scan APIs. That should take care of
Andres' concern about using sleep in the tests. The relevant TAP test
has been added to this patchset.

@@ -423,6 +423,16 @@ systable_getnext(SysScanDesc sysscan)
else
htup = heap_getnext(sysscan->scan, ForwardScanDirection);

+     /*
+      * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+      * error out
+      */
+     if (TransactionIdIsValid(CheckXidAlive) &&
+                     TransactionIdDidAbort(CheckXidAlive))
+                     ereport(ERROR,
+                             (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+                              errmsg("transaction aborted during system catalog scan")));
+
return htup;
}

Don't we have to check TransactionIdIsInProgress() first? Cf. the header
comments in tqual.c. Note this is also not guaranteed to be correct
after a crash (where no clog entry will exist for an aborted xact), but
we probably shouldn't get here in that case - but better be safe.

I suspect it'd be better reformulated as
TransactionIdIsValid(CheckXidAlive) &&
!TransactionIdIsInProgress(CheckXidAlive) &&
!TransactionIdDidCommit(CheckXidAlive)

What do you think?

Modified the checks as per the above suggestion.

I was wondering if anything else would be needed for user-defined
catalog tables.

We don't need to do anything else for user-defined catalog tables
since they will also get accessed via the systable_* scan APIs.

Hmm, lemme see if we can do it outside of the plugin. But note that a
plugin might want to decode some 2PC at prepare time and another at
"commit prepared" time.

The test_decoding pg_decode_filter_prepare() API implements a simple
filter strategy now. If the GID contains a substring "nodecode", then
it filters out decoding of such a 2PC at prepare time. Have added
steps to test this in the relevant test case in this patch.
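
As a rough illustration of the filter strategy described above, the
substring test could look like the following. This is a standalone sketch,
not the test_decoding code: gid_skips_prepare_decoding() is a hypothetical
helper, and it only models the decision itself (return true to skip decoding
at prepare time, matching the filter_prepare_cb convention in the attached
documentation changes), not the callback signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: decide whether a two-phase transaction should be
 * skipped at PREPARE time based on its GID. Returning true means "filter
 * it out now"; such a transaction would then be decoded as a regular
 * transaction at COMMIT PREPARED time.
 */
static bool
gid_skips_prepare_decoding(const char *gid)
{
	assert(gid != NULL);

	/* Any GID containing the substring "nodecode" is filtered. */
	return strstr(gid, "nodecode") != NULL;
}
```

Because the answer depends only on the GID string, it is the same every time
the function is called for a given transaction, which is what the
documentation in the patch asks of a prepare filter.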

I believe this patchset handles all pending issues along with relevant
test cases. Comments, further feedback appreciated.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch (application/octet-stream)
From 8e32c723490df640ffd071716930715ba087814c Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:15:24 +0530
Subject: [PATCH 1/4] Cleaning up of flags in ReorderBufferTXN structure

---
 src/backend/replication/logical/reorderbuffer.c | 34 ++++++++++++-------------
 src/include/replication/reorderbuffer.h         | 33 ++++++++++++++----------
 2 files changed, 37 insertions(+), 30 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9b55b94227..fb71631434 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -643,7 +643,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_first_lsn < cur_txn->first_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_first_lsn = cur_txn->first_lsn;
 	}
@@ -663,7 +663,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
 	}
@@ -686,7 +686,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
 
 	txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
 
-	Assert(!txn->is_known_as_subxact);
+	Assert(!rbtxn_is_known_subxact(txn));
 	Assert(txn->first_lsn != InvalidXLogRecPtr);
 	return txn;
 }
@@ -746,7 +746,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 
 	if (!new_sub)
 	{
-		if (subtxn->is_known_as_subxact)
+		if (rbtxn_is_known_subxact(subtxn))
 		{
 			/* already associated, nothing to do */
 			return;
@@ -762,7 +762,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 		}
 	}
 
-	subtxn->is_known_as_subxact = true;
+	subtxn->txn_flags |= RBTXN_IS_SUBXACT;
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
@@ -972,7 +972,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	{
 		ReorderBufferChange *cur_change;
 
-		if (txn->serialized)
+		if (rbtxn_is_serialized(txn))
 		{
 			/* serialize remaining changes */
 			ReorderBufferSerializeTXN(rb, txn);
@@ -1001,7 +1001,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		{
 			ReorderBufferChange *cur_change;
 
-			if (cur_txn->serialized)
+			if (rbtxn_is_serialized(cur_txn))
 			{
 				/* serialize remaining changes */
 				ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1167,7 +1167,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * they originally were happening inside another subtxn, so we won't
 		 * ever recurse more than one level deep here.
 		 */
-		Assert(subtxn->is_known_as_subxact);
+		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
 		ReorderBufferCleanupTXN(rb, subtxn);
@@ -1208,7 +1208,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/*
 	 * Remove TXN from its containing list.
 	 *
-	 * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
 	 * from the LSN-ordered list of toplevel TXNs.
@@ -1223,7 +1223,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(found);
 
 	/* remove entries spilled to disk */
-	if (txn->serialized)
+	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
 	/* deallocate */
@@ -1240,7 +1240,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
 		return;
 
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1854,7 +1854,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 			 * final_lsn to that of their last change; this causes
 			 * ReorderBufferRestoreCleanup to do the right thing.
 			 */
-			if (txn->serialized && txn->final_lsn == 0)
+			if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
 			{
 				ReorderBufferChange *last =
 				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2002,7 +2002,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
 	 * operate on its top-level transaction instead.
 	 */
 	txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 	Assert(txn->base_snapshot == NULL);
@@ -2109,7 +2109,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->has_catalog_changes = true;
+	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2126,7 +2126,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return false;
 
-	return txn->has_catalog_changes;
+	return rbtxn_has_catalog_changes(txn);
 }
 
 /*
@@ -2146,7 +2146,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
 		return false;
 
 	/* a known subtxn? operate on top-level txn instead */
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 
@@ -2267,7 +2267,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
-	txn->serialized = true;
+	txn->txn_flags |= RBTXN_IS_SERIALIZED;
 
 	if (fd != -1)
 		CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1f52f6bde7..ec9515d156 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -150,18 +150,34 @@ typedef struct ReorderBufferChange
 	dlist_node	node;
 } ReorderBufferChange;
 
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT          0x0002
+#define RBTXN_IS_SERIALIZED       0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn)    (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk?  It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
 typedef struct ReorderBufferTXN
 {
+	int     txn_flags;
+
 	/*
 	 * The transactions transaction id, can be a toplevel or sub xid.
 	 */
 	TransactionId xid;
 
-	/* did the TX have catalog changes */
-	bool		has_catalog_changes;
-
 	/* Do we know this is a subxact?  Xid of top-level txn if so */
-	bool		is_known_as_subxact;
 	TransactionId toplevel_xid;
 
 	/*
@@ -229,15 +245,6 @@ typedef struct ReorderBufferTXN
 	 */
 	uint64		nentries_mem;
 
-	/*
-	 * Has this transaction been spilled to disk?  It's not always possible to
-	 * deduce that fact by comparing nentries with nentries_mem, because e.g.
-	 * subtransactions of a large transaction might get serialized together
-	 * with the parent - if they're restored to memory they'd have
-	 * nentries_mem == nentries.
-	 */
-	bool		serialized;
-
 	/*
 	 * List of ReorderBufferChange structs, including new Snapshots and new
 	 * CommandIds
-- 
2.15.2 (Apple Git-101.1)

0002-Support-decoding-of-two-phase-transactions-at-PREPAR.patch (application/octet-stream)
From 83b83aed7769a4971c83823f50a3799cf74bd8d5 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:30:30 +0530
Subject: [PATCH 2/4] Support decoding of two-phase transactions at PREPARE

Until now two-phase transactions were decoded at COMMIT, just like
regular transaction. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, that often encode information into
the GID itself.

Includes documentation changes.
---
 doc/src/sgml/logicaldecoding.sgml               | 127 ++++++++++++++-
 src/backend/replication/logical/decode.c        | 147 +++++++++++++++--
 src/backend/replication/logical/logical.c       | 203 ++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 185 +++++++++++++++++++--
 src/include/replication/logical.h               |   2 +-
 src/include/replication/output_plugin.h         |  46 ++++++
 src/include/replication/reorderbuffer.h         |  68 ++++++++
 7 files changed, 746 insertions(+), 32 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db968641e..a89e4d5184 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -385,7 +385,12 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeChangeCB change_cb;
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
+    LogicalDecodeAbortCB abort_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeAbortPreparedCB abort_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
 } OutputPluginCallbacks;
@@ -457,7 +462,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     them are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -558,6 +569,71 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The optional <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Commit Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>commit_prepared_cb</function> callback is called whenever
+      a commit prepared transaction has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+     <title>Rollback Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>abort_prepared_cb</function> callback is called whenever
+      a rollback prepared transaction has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort">
+     <title>Transaction Abort Callback</title>
+
+     <para>
+      The required <function>abort_cb</function> callback is called whenever
+      a transaction abort has to be initiated. This can happen if we are
+      decoding a transaction that has been prepared for two-phase commit and
+      a concurrent rollback happens while we are decoding it.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -567,7 +643,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but yet
+      uncommitted) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -644,6 +726,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decode
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -665,7 +780,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, in case of decoding a prepared (but yet uncommitted)
+      transaction or decoding of an uncommitted transaction, this message 
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 59c003de9c..99d801eb94 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecord.h"
+#include "access/twophase.h"
 
 #include "catalog/pg_control.h"
 
@@ -73,6 +74,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 			 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 			xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed);
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -281,16 +284,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+				/* check that output plugin is capable of twophase decoding */
+				if (!ctx->options.enable_twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   XLogRecGetData(buf->record), &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -633,9 +653,90 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+	 * Regular commit simply triggers a replay of transaction changes from the
+	 * reorder buffer. For COMMIT PREPARED that however already happened at
+	 * PREPARE time, and so we only need to notify the subscriber that the GID
+	 * finally committed.
+	 *
+	 * For output plugins that do not support PREPARE-time decoding of
+	 * two-phase transactions, we never even see the PREPARE and all two-phase
+	 * transactions simply fall through to the second branch.
+	 */
+	if (TransactionIdIsValid(parsed->twophase_xid) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder,
+								   parsed->twophase_xid, parsed->twophase_gid))
+	{
+		Assert(xid == parsed->twophase_xid);
+		/* we are processing COMMIT PREPARED */
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* replay actions of all transaction + subtransactions in order */
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	/*
+	 * Process invalidation messages, even if we're not interested in the
+	 * transaction's contents, since the various caches need to always be
+	 * consistent.
+	 */
+	if (parsed->nmsgs > 0)
+	{
+		if (!ctx->fast_forward)
+			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+										  parsed->nmsgs, parsed->msgs);
+		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is setup correctly for the main transaction in case all changes
+	 * happened in subtransactions
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+		}
+		ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+		return;
+	}
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 }
 
 /*
@@ -647,6 +748,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 			xl_xact_parsed_abort *parsed, TransactionId xid)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it's ROLLBACK PREPARED then handle it via callbacks.
+	 */
+	if (TransactionIdIsValid(xid) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		parsed->dbId == ctx->slot->data.database &&
+		!FilterByOrigin(ctx, origin_id) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
 
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3cd4eefb9b..f3b28c3dab 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				 XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -187,6 +197,11 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->abort = abort_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	ctx->out = makeStringInfo();
@@ -611,6 +626,33 @@ startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions *opt, bool i
 	/* do the actual work: call callback */
 	ctx->callbacks.startup_cb(ctx, opt, is_init);
 
+	/*
+	 * If the plugin claims to support two-phase transactions, then
+	 * check that the plugin implements all callbacks necessary to decode
+	 * two-phase transactions - we either have to have all of them or none.
+	 * The filter_prepare callback is optional, but can only be defined when
+	 * two-phase decoding is enabled (i.e. the three other callbacks are
+	 * defined).
+	 */
+	if (opt->enable_twophase)
+	{
+		int twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+			(ctx->callbacks.commit_prepared_cb != NULL) +
+			(ctx->callbacks.abort_prepared_cb != NULL);
+
+		/* Plugins with incorrect number of two-phase callbacks are broken. */
+		if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+			ereport(ERROR,
+					(errmsg("Output plugin registered only %d twophase callbacks. ",
+							twophase_callbacks)));
+	}
+
+	/* filter_prepare is optional, but requires two-phase decoding */
+	if ((ctx->callbacks.filter_prepare_cb != NULL) && (!opt->enable_twophase))
+		ereport(ERROR,
+				(errmsg("Output plugin does not support two-phase decoding, but "
+						"registered filter_prepare callback.")));
+
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
 }
@@ -708,6 +750,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
@@ -785,6 +943,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of twophase at PREPARE time is not enabled. In that
+	 * case all twophase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->options.enable_twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fb71631434..2fffc90606 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,6 +337,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/* free data that's contained */
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
 
 	if (txn->tuplecid_hash != NULL)
 	{
@@ -1389,25 +1394,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
  * and subtransactions (using a k-way merge) and replay the changes in lsn
  * order.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	volatile Snapshot snapshot_now;
 	volatile CommandId command_id = FirstCommandId;
 	bool		using_subtxn;
 	ReorderBufferIterTXNState *volatile iterstate = NULL;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -1711,7 +1709,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					break;
 			}
 		}
-
 		/*
 		 * There's a speculative insertion remaining, just clean in up, it
 		 * can't have been successful, otherwise we'd gotten a confirmation
@@ -1727,8 +1724,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Call abort/commit/prepare callback, depending on the transaction
+		 * state.
+		 *
+		 * If the transaction aborted during apply (which currently can happen
+		 * only for prepared transactions), simply call the abort callback.
+		 *
+		 * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+		 * (for regular ones).
+		 */
+		if (rbtxn_rollback(txn))
+			rb->abort(rb, txn, commit_lsn);
+		else if (rbtxn_prepared(txn))
+			rb->prepare(rb, txn, commit_lsn);
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1755,7 +1766,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (snapshot_now->copied)
 			ReorderBufferFreeSnap(rb, snapshot_now);
 
-		/* remove potential on-disk data, and deallocate */
+		/*
+		 * remove potential on-disk data, and deallocate.
+		 *
+		 * We remove it even for prepared transactions (GID is enough to
+		 * commit/abort those later).
+		 */
 		ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
@@ -1789,6 +1805,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_END_TRY();
 }
 
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/*
+	 * Always call the prepare filter. It's the job of the prepare filter to
+	 * give us the *same* response for a given xid across multiple calls
+	 * (including ones on restart)
+	 */
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Either way, 2PC transactions do not contain any reorderbuffers, so
+	 * allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+		rb->commit_prepared(rb, txn, commit_lsn);
+	}
+	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+		rb->abort_prepared(rb, txn, commit_lsn);
+	}
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(rb, txn);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c25ac1fa85..5fdda65031 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -47,7 +47,7 @@ typedef struct LogicalDecodingContext
 
 	/*
 	 * Marks the logical decoding context as fast forward decoding one. Such a
-	 * context does not have plugin loaded so most of the the following
+	 * context does not have plugin loaded so most of the following
 	 * properties are unused.
 	 */
 	bool		fast_forward;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 1ee0a56f03..c9140e7001 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -27,6 +27,7 @@ typedef struct OutputPluginOptions
 {
 	OutputPluginOutputType output_type;
 	bool		receive_rewrites;
+	bool		enable_twophase;
 } OutputPluginOptions;
 
 /*
@@ -77,6 +78,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -109,7 +150,12 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeChangeCB change_cb;
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
+	LogicalDecodeAbortCB abort_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeAbortPreparedCB abort_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 } OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ec9515d156..285c9b53da 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -154,6 +155,11 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_PREPARE             0x0008
+#define RBTXN_COMMIT_PREPARED     0x0010
+#define RBTXN_ROLLBACK_PREPARED   0x0020
+#define RBTXN_COMMIT              0x0040
+#define RBTXN_ROLLBACK            0x0080
 
 /* does the txn have catalog changes */
 #define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -167,6 +173,16 @@ typedef struct ReorderBufferChange
  * nentries_mem == nentries.
  */
 #define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn)            (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn)     (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn)   (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn)              (txn->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn)            (txn->txn_flags & RBTXN_ROLLBACK)
 
 typedef struct ReorderBufferTXN
 {
@@ -179,6 +195,8 @@ typedef struct ReorderBufferTXN
 
 	/* Do we know this is a subxact?  Xid of top-level txn if so */
 	TransactionId toplevel_xid;
+	/* In case of 2PC we need to pass GID to output plugin */
+	char		 *gid;
 
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
@@ -324,6 +342,37 @@ typedef void (*ReorderBufferCommitCB) (
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+									  ReorderBuffer *rb,
+									  ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+										ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (
 										ReorderBuffer *rb,
@@ -369,6 +418,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferAbortPreparedCB abort_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -416,6 +470,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
 void ReorderBufferCommit(ReorderBuffer *, TransactionId,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 						 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -439,6 +498,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
2.15.2 (Apple Git-101.1)

Attachment: 0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch (application/octet-stream)
From e08e16fb46d5be7927c01de05daa1fd1ef9b19d7 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 26 Jul 2018 18:45:26 +0530
Subject: [PATCH 3/4] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. The decoding logic on the receipt
of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.
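
For readers following the thread, the control flow described above can be
modelled roughly as below. This is only a minimal, self-contained sketch and
not the patch's code: Xid, XactState, DecodeAction and all three functions are
hypothetical stand-ins for CheckXidAlive, the TransactionIdIsInProgress() /
TransactionIdDidCommit() lookups, and the PG_CATCH handling in
ReorderBufferCommitInternal(). It also simplifies by treating "some other
error" as a plain flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the PostgreSQL definitions. */
typedef uint32_t Xid;
#define INVALID_XID ((Xid) 0)

typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } XactState;
typedef enum { DECODE_CONTINUE, DECODE_STOP_QUIETLY, DECODE_RETHROW } DecodeAction;

/*
 * Models the check added to the systable_* functions: a tracked xid that is
 * neither in progress nor committed must have aborted.
 */
static bool
tracked_xid_aborted(Xid check_xid, XactState state)
{
	return check_xid != INVALID_XID &&
		state != XACT_IN_PROGRESS &&
		state != XACT_COMMITTED;
}

/*
 * Models the PG_CATCH decision: a rollback error ends decoding quietly,
 * any other error is re-thrown to the caller.
 */
static DecodeAction
on_scan_error(bool is_rollback_error)
{
	return is_rollback_error ? DECODE_STOP_QUIETLY : DECODE_RETHROW;
}

/* One catalog access during decoding; other_error models an unrelated failure. */
static DecodeAction
catalog_access(Xid check_xid, XactState state, bool other_error)
{
	if (tracked_xid_aborted(check_xid, state))
		return on_scan_error(true);
	if (other_error)
		return on_scan_error(false);
	return DECODE_CONTINUE;
}

/* Usage: walk through the cases the commit message distinguishes. */
static void
demo(void)
{
	/* prepared but not yet resolved: decoding carries on */
	assert(catalog_access(100, XACT_IN_PROGRESS, false) == DECODE_CONTINUE);
	/* concurrently rolled back: decoding stops without raising an error */
	assert(catalog_access(100, XACT_ABORTED, false) == DECODE_STOP_QUIETLY);
	/* no xid is tracked (already committed at setup): check is skipped */
	assert(catalog_access(INVALID_XID, XACT_COMMITTED, false) == DECODE_CONTINUE);
	/* unrelated failures still propagate */
	assert(catalog_access(100, XACT_IN_PROGRESS, true) == DECODE_RETHROW);
}
```

The point of the split is that only the rollback sqlerrcode is swallowed;
everything else keeps the existing re-throw behaviour.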
---
 src/backend/access/index/genam.c                | 35 +++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 32 ++++++++++++++++++----
 src/backend/utils/time/snapmgr.c                | 25 ++++++++++++++++--
 src/include/utils/snapmgr.h                     |  4 ++-
 4 files changed, 88 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 9d08775687..9220dcce83 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -25,6 +25,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -423,6 +424,17 @@ systable_getnext(SysScanDesc sysscan)
 	else
 		htup = heap_getnext(sysscan->scan, ForwardScanDirection);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -476,6 +488,18 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 		result = HeapTupleSatisfiesVisibility(tup, freshsnap, scan->rs_cbuf);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
 	}
+
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -593,6 +617,17 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out.
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2fffc90606..96d52d32c1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -599,7 +599,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1405,6 +1405,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
 	volatile CommandId command_id = FirstCommandId;
 	bool		using_subtxn;
 	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	MemoryContext ccxt = CurrentMemoryContext;
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
@@ -1431,7 +1432,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1672,7 +1673,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1692,7 +1693,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 						/*
 						 * Every time the CommandId is incremented, we could
@@ -1777,6 +1778,20 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
 	PG_CATCH();
 	{
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
+		/*
+		 * If the catalog scan reported that the transaction was rolled
+		 * back, then abort on the other side as well.
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			elog(LOG, "stopping decoding of %s (%u)",
+				 txn->gid != NULL ? txn->gid : "", txn->xid);
+			rb->abort(rb, txn, commit_lsn);
+		}
+
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
 
@@ -1800,7 +1815,14 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
 		/* remove potential on-disk data, and deallocate */
 		ReorderBufferCleanupTXN(rb, txn);
 
-		PG_RE_THROW();
+		/* re-throw only if it's not an abort */
+		if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+		else
+			FlushErrorState();
 	}
 	PG_END_TRY();
 }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index edf59efc29..0354fc9da9 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -151,6 +151,13 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
@@ -1995,10 +2002,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This is to re-check XID status while accessing catalog.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2007,8 +2018,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * setup CheckXidAlive if it's not committed yet. We don't check
+	 * if the xid aborted. That will happen during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2018,6 +2038,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 83806f3040..bad2053477 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -100,8 +100,10 @@ extern char *ExportSnapshot(Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
2.15.2 (Apple Git-101.1)

Attachment: 0004-Teach-test_decoding-plugin-to-work-with-2PC.patch (application/octet-stream)
From 16cddbd9ca4b7a1615014257df4cb02b0a005416 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:31:15 +0530
Subject: [PATCH 4/4] Teach test_decoding plugin to work with 2PC

Implement all callbacks required for decoding 2PC in this test_decoding
plugin. Includes relevant test cases as well.

Additionally, this adds a new option, "check-xid". If this option points
to a valid xid, then the pg_decode_change() API will wait for it to
be aborted externally. This allows us to test a concurrent rollback of
a prepared transaction while it is being decoded.
---
 contrib/test_decoding/Makefile              |   5 +-
 contrib/test_decoding/expected/prepared.out | 185 ++++++++++++++++++++++++----
 contrib/test_decoding/sql/prepared.sql      |  77 ++++++++++--
 contrib/test_decoding/t/001_twophase.pl     | 119 ++++++++++++++++++
 contrib/test_decoding/test_decoding.c       | 179 +++++++++++++++++++++++++++
 5 files changed, 532 insertions(+), 33 deletions(-)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index afcab930f7..3f0b1c6ebd 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
 # installation, allow to do so, but only if requested explicitly.
 installcheck-force: regresscheck-install-force isolationcheck-install-force
 
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
 
 submake-regress:
 	$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -67,3 +67,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
 	isolationcheck isolationcheck-install-force
 
 temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+	$(prove_check)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..934c8f1509 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,50 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
  init
 (1 row)
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (2);
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (4);
 -- test prepared xact containing ddl
 BEGIN;
@@ -26,45 +57,149 @@ INSERT INTO test_prepared1 VALUES (5);
 ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
                                   data                                   
 -------------------------------------------------------------------------
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:4
  COMMIT
  BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:5
  table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
  COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
  COMMIT
  BEGIN
  table public.test_prepared2: INSERT: id[integer]:9
  COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+ relation | locktype | mode 
+----------+----------+------
+(0 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                    data                                    
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints 
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
 
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..60725419fe 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -2,21 +2,25 @@
 SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 INSERT INTO test_prepared1 VALUES (2);
 
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 INSERT INTO test_prepared1 VALUES (4);
 
@@ -27,24 +31,83 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
 
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
 INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- make sure stuff still works
 INSERT INTO test_prepared1 VALUES (8);
 INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints 
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- cleanup
 DROP TABLE test_prepared1;
 DROP TABLE test_prepared2;
 
--- show results
+-- show results. There should be nothing to show
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..99a9249689
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction 
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 1c439b57b0..b981d1693c 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +54,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
 				bool last_write);
 static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
 					 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+					ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
 static void pg_decode_change(LogicalDecodingContext *ctx,
 				 ReorderBufferTXN *txn, Relation rel,
 				 ReorderBufferChange *change);
@@ -62,6 +69,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 				  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 				  bool transactional, const char *prefix,
 				  Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+							  ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+							 ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -80,9 +99,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pg_decode_change;
 	cb->truncate_cb = pg_decode_truncate;
 	cb->commit_cb = pg_decode_commit_txn;
+	cb->abort_cb = pg_decode_abort_txn;
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
 }
 
 
@@ -102,11 +126,14 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
 	opt->output_type = OUTPUT_PLUGIN_TEXTUAL_OUTPUT;
 	opt->receive_rewrites = false;
+	/* this plugin supports decoding of 2pc */
+	opt->enable_twophase = true;
 
 	foreach(option, ctx->output_plugin_options)
 	{
@@ -183,6 +210,32 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid") == 0)
+		{
+			if (elem->arg)
+			{
+				errno = 0;
+				data->check_xid = (TransactionId)
+					strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno == EINVAL || errno == ERANGE)
+					ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid is not a valid number: \"%s\"",
+								strVal(elem->arg))));
+			}
+			else
+				ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid needs an input value")));
+
+			if (!TransactionIdIsValid(data->check_xid))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("Specify positive value for parameter \"%s\","
+								" you specified \"%s\"",
+								elem->defname, strVal(elem->arg))));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -251,6 +304,116 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+	else
+		appendStringInfoString(ctx->out, "ABORT");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -416,6 +579,22 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* if check_xid is specified */
+	if (TransactionIdIsValid(data->check_xid))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid);
+		while (TransactionIdIsInProgress(data->check_xid))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid) &&
+			   !TransactionIdDidCommit(data->check_xid))
+			elog(LOG, "%u aborted", data->check_xid);
+
+		Assert(TransactionIdDidAbort(data->check_xid));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
-- 
2.15.2 (Apple Git-101.1)
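
A side note on the "check-xid" parsing in the patch above: strtoul() is not required to set errno for input with no digits at all, so checking errno alone may let a string like "abc" through as 0. The following is a minimal standalone sketch of a stricter parse, assuming a 32-bit xid where 0 is the invalid value. The helper name parse_check_xid is hypothetical and is not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): parse a string into a 32-bit
 * transaction id. Returns false on empty input, a minus sign, trailing
 * garbage, overflow, or the value 0 (assumed here to mean "invalid xid").
 */
bool
parse_check_xid(const char *str, uint32_t *xid)
{
	char	   *end;
	unsigned long val;

	assert(xid != NULL);

	if (str == NULL || *str == '\0')
		return false;

	/* strtoul accepts a leading '-' and wraps the result, so reject it */
	if (strchr(str, '-') != NULL)
		return false;

	errno = 0;
	val = strtoul(str, &end, 0);

	if (errno != 0)
		return false;			/* out of range for unsigned long */
	if (end == str || *end != '\0')
		return false;			/* no digits, or junk after the number */
	if (val == 0 || val > UINT32_MAX)
		return false;			/* invalid xid or wider than 32 bits */

	*xid = (uint32_t) val;
	return true;
}
```

In the plugin itself the false return would presumably map to the existing ereport(ERROR, ...) path. The sketch only shows the validation order: sign, errno, end pointer, then range.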

#2Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Nikhil Sontakke (#1)

On 01/08/18 16:00, Nikhil Sontakke wrote:

I was wondering if anything else would be needed for user-defined
catalog tables..

We don't need to do anything else for user-defined catalog tables
since they will also get accessed via the systable_* scan APIs.

They can be, but currently they might not be. So this requires at least
a big fat warning in the docs and a description of how to access user
catalogs from plugins correctly (i.e. to always use the systable_* API
on them). It would be nice if we could check for it in Assert builds at
least.
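
To illustrate why the access path matters, here is a toy standalone model. It is not PostgreSQL code, and all names in it (ToyScan, toy_heap_getnext, toy_systable_getnext, check_xid_alive, check_xid_aborted) are made up for the sketch. It assumes the checked wrapper behaves roughly like the systable_getnext() hunk in the patch: fetch the row, then refuse to continue if the transaction being decoded has aborted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;			/* stand-in for TransactionId */
#define INVALID_XID ((Xid) 0)

typedef enum
{
	SCAN_OK,
	SCAN_END,
	SCAN_XACT_ABORTED
} ScanStatus;

/* Toy "catalog" scan over a fixed array of rows. */
typedef struct
{
	const int  *rows;
	size_t		nrows;
	size_t		pos;
} ToyScan;

Xid			check_xid_alive = INVALID_XID;	/* xid being decoded, if any */
bool		check_xid_aborted = false;		/* set by a "concurrent" rollback */

/* Unchecked path: returns rows regardless of what happened to the xact. */
ScanStatus
toy_heap_getnext(ToyScan *scan, int *row)
{
	assert(scan != NULL && row != NULL);

	if (scan->pos >= scan->nrows)
		return SCAN_END;
	*row = scan->rows[scan->pos++];
	return SCAN_OK;
}

/* Checked path: same fetch, then bail out if the decoded xact aborted. */
ScanStatus
toy_systable_getnext(ToyScan *scan, int *row)
{
	ScanStatus	st = toy_heap_getnext(scan, row);

	if (check_xid_alive != INVALID_XID && check_xid_aborted)
		return SCAN_XACT_ABORTED;
	return st;
}
```

A plugin that calls the unchecked function directly never sees SCAN_XACT_ABORTED and keeps reading rows that may already be stale, which is the case the docs warning would need to cover.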

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#3Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#2)

On 2018-08-01 21:55:18 +0200, Petr Jelinek wrote:

On 01/08/18 16:00, Nikhil Sontakke wrote:

I was wondering if anything else would be needed for user-defined
catalog tables..

We don't need to do anything else for user-defined catalog tables
since they will also get accessed via the systable_* scan APIs.

They can be, but currently they might not be. So this requires at least
a big fat warning in the docs and a description of how to access user
catalogs from plugins correctly (i.e. to always use the systable_* API
on them). It would be nice if we could check for it in Assert builds at
least.

Yea, I agree. I think we should just consider putting similar checks in
the general scan APIs. With an unlikely() and the easy predictability of
these checks, I think we should be fine, overhead-wise.
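
A rough standalone sketch of what such a check could look like. The unlikely() definition below mirrors the usual __builtin_expect idiom; the struct ToyRel, the variable decoding_xid and the function scan_allowed are hypothetical names for illustration only, not the actual scan API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Branch hint: tell the compiler the condition is almost always false. */
#if defined(__GNUC__) || defined(__clang__)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define unlikely(x) ((x) != 0)
#endif

/* Hypothetical stand-in for a relation descriptor. */
typedef struct
{
	bool		is_catalog;		/* system catalog or user catalog table */
} ToyRel;

/* Hypothetical stand-in for CheckXidAlive; 0 means "not decoding". */
uint32_t	decoding_xid = 0;

/*
 * Returns false when a general scan should be refused: we are decoding an
 * in-progress transaction and the relation is not a catalog. Outside of
 * decoding the first comparison fails immediately, so the branch is cheap
 * and highly predictable.
 */
bool
scan_allowed(const ToyRel *rel)
{
	assert(rel != NULL);

	if (unlikely(decoding_xid != 0 && !rel->is_catalog))
		return false;
	return true;
}
```

In the normal (non-decoding) case the cost is one load and one compare on a variable that rarely changes, which is why the overhead should be negligible.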

Greetings,

Andres Freund

#4Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Andres Freund (#3)
4 attachment(s)

They can be, but currently they might not be. So this requires at least
a big fat warning in the docs and a description of how to access user
catalogs from plugins correctly (i.e. to always use the systable_* API
on them). It would be nice if we could check for it in Assert builds at
least.

Ok, modified the sgml documentation for the above.

Yea, I agree. I think we should just consider putting similar checks in
the general scan APIs. With an unlikely() and the easy predictability of
these checks, I think we should be fine, overhead-wise.

Ok, added unlikely() checks in the heap_* scan APIs.

Revised patchset attached.

Regards,
Nikhils

--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch (application/octet-stream)
From 7640800a2a1342a4ddeed4ffa049bf80aa99d4e1 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Thu, 26 Jul 2018 18:45:26 +0530
Subject: [PATCH 3/4] Gracefully handle concurrent aborts of uncommitted
 transactions that are being decoded alongside.

When a transaction aborts, its changes are considered unnecessary for
other transactions. That means the changes may be either cleaned up by
vacuum or removed from HOT chains (thus made inaccessible through
indexes), and there may be other such consequences.

When decoding committed transactions this is not an issue, and we
never decode transactions that abort before the decoding starts.

But for in-progress transactions - for example when decoding prepared
transactions on PREPARE (and not COMMIT PREPARED as before), this
may cause failures when the output plugin consults catalogs (both
system and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend decoding a
specific uncommitted transaction. The decoding logic on the receipt
of such an sqlerrcode aborts the ongoing decoding and returns
gracefully.
---
 doc/src/sgml/logicaldecoding.sgml               |  5 ++-
 src/backend/access/heap/heapam.c                | 51 +++++++++++++++++++++++++
 src/backend/access/index/genam.c                | 35 +++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 32 +++++++++++++---
 src/backend/utils/time/snapmgr.c                | 25 +++++++++++-
 src/include/utils/snapmgr.h                     |  4 +-
 6 files changed, 143 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a89e4d5184..d76afbda05 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -421,7 +421,10 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
 ALTER TABLE user_catalog_table SET (user_catalog_table = true);
 CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
 </programlisting>
-     Any actions leading to transaction ID assignment are prohibited. That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the <literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment are prohibited. That, among others,
      includes writing to tables, performing DDL changes, and
      calling <literal>txid_current()</literal>.
     </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 72395a50b8..ae9d24c164 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1834,6 +1834,17 @@ heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot)
 HeapTuple
 heap_getnext(HeapScanDesc scan, ScanDirection direction)
 {
+	/*
+	 * We don't expect direct calls to heap_getnext with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(scan->rs_rd) ||
+		  RelationIsUsedAsCatalogTable(scan->rs_rd))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_getnext call")));
+
 	/* Note: no locking manipulations needed */
 
 	HEAPDEBUG_1;				/* heap_getnext( info ) */
@@ -1914,6 +1925,16 @@ heap_fetch(Relation relation,
 	OffsetNumber offnum;
 	bool		valid;
 
+	/*
+	 * We don't expect direct calls to heap_fetch with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_fetch call")));
+
 	/*
 	 * Fetch and pin the appropriate page of the relation.
 	 */
@@ -2046,6 +2067,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	bool		valid;
 	bool		skip;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search_buffer with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_hot_search_buffer call")));
+
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
 		*all_dead = first_call;
@@ -2187,6 +2218,16 @@ heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
 	Buffer		buffer;
 	HeapTupleData heapTuple;
 
+	/*
+	 * We don't expect direct calls to heap_hot_search with
+	 * valid CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_hot_search call")));
+
 	buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 	LockBuffer(buffer, BUFFER_LOCK_SHARE);
 	result = heap_hot_search_buffer(tid, relation, buffer, snapshot,
@@ -2216,6 +2257,16 @@ heap_get_latest_tid(Relation relation,
 	ItemPointerData ctid;
 	TransactionId priorXmax;
 
+	/*
+	 * We don't expect direct calls to heap_get_latest_tid with valid
+	 * CheckXidAlive for regular tables. Track that below.
+	 */
+	if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+		!(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+			 errmsg("improper heap_get_latest_tid call")));
+
 	/* this is to avoid Assert failures on bad input */
 	if (!ItemPointerIsValid(tid))
 		return;
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 9d08775687..9220dcce83 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -25,6 +25,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -423,6 +424,17 @@ systable_getnext(SysScanDesc sysscan)
 	else
 		htup = heap_getnext(sysscan->scan, ForwardScanDirection);
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
@@ -476,6 +488,18 @@ systable_recheck_tuple(SysScanDesc sysscan, HeapTuple tup)
 		result = HeapTupleSatisfiesVisibility(tup, freshsnap, scan->rs_cbuf);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
 	}
+
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return result;
 }
 
@@ -593,6 +617,17 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	if (htup && sysscan->iscan->xs_recheck)
 		elog(ERROR, "system catalog scans with lossy index conditions are not implemented");
 
+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));
+
 	return htup;
 }
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2fffc90606..96d52d32c1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -599,7 +599,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
 			txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
 		/* setup snapshot to allow catalog access */
-		SetupHistoricSnapshot(snapshot_now, NULL);
+		SetupHistoricSnapshot(snapshot_now, NULL, xid);
 		PG_TRY();
 		{
 			rb->message(rb, txn, lsn, false, prefix, message_size, message);
@@ -1405,6 +1405,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
 	volatile CommandId command_id = FirstCommandId;
 	bool		using_subtxn;
 	ReorderBufferIterTXNState *volatile iterstate = NULL;
+	MemoryContext ccxt = CurrentMemoryContext;
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
@@ -1431,7 +1432,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
 	ReorderBufferBuildTupleCidHash(rb, txn);
 
 	/* setup the initial snapshot */
-	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+	SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 	/*
 	 * Decoding needs access to syscaches et al., which in turn use
@@ -1672,7 +1673,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
 
 
 					/* and continue with the new one */
-					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+					SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 					break;
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
@@ -1692,7 +1693,7 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
 						snapshot_now->curcid = command_id;
 
 						TeardownHistoricSnapshot(false);
-						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
+						SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash, xid);
 
 						/*
 						 * Every time the CommandId is incremented, we could
@@ -1777,6 +1778,20 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
 	PG_CATCH();
 	{
 		/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+		MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+		ErrorData  *errdata = CopyErrorData();
+
+		/*
+		 * if the catalog scan access returned an error of
+		 * rollback, then abort on the other side as well
+		 */
+		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			elog(LOG, "stopping decoding of %s (%u)",
+				 txn->gid[0] != '\0'? txn->gid:"", txn->xid);
+			rb->abort(rb, txn, commit_lsn);
+		}
+
 		if (iterstate)
 			ReorderBufferIterTXNFinish(rb, iterstate);
 
@@ -1800,7 +1815,14 @@ ReorderBufferCommitInternal(ReorderBufferTXN *txn,
 		/* remove potential on-disk data, and deallocate */
 		ReorderBufferCleanupTXN(rb, txn);
 
-		PG_RE_THROW();
+		/* re-throw only if it's not an abort */
+		if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+		{
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+		else
+			FlushErrorState();
 	}
 	PG_END_TRY();
 }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index edf59efc29..0354fc9da9 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -151,6 +151,13 @@ static Snapshot SecondarySnapshot = NULL;
 static Snapshot CatalogSnapshot = NULL;
 static Snapshot HistoricSnapshot = NULL;
 
+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+
 /*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
@@ -1995,10 +2002,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
  *
+ * If a valid xid is passed in, we check if it is uncommitted and track it in
+ * CheckXidAlive.  This is to re-check XID status while accessing catalog.
+ *
  * Needed for logical decoding.
  */
 void
-SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
+SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids,
+					  TransactionId snapshot_xid)
 {
 	Assert(historic_snapshot != NULL);
 
@@ -2007,8 +2018,17 @@ SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
 
 	/* setup (cmin, cmax) lookup hash */
 	tuplecid_data = tuplecids;
-}
 
+	/*
+	 * setup CheckXidAlive if it's not committed yet. We don't check
+	 * if the xid aborted. That will happen during catalog access.
+	 */
+	if (TransactionIdIsValid(snapshot_xid) &&
+		!TransactionIdDidCommit(snapshot_xid))
+		CheckXidAlive = snapshot_xid;
+	else
+		CheckXidAlive = InvalidTransactionId;
+}
 
 /*
  * Make catalog snapshots behave normally again.
@@ -2018,6 +2038,7 @@ TeardownHistoricSnapshot(bool is_error)
 {
 	HistoricSnapshot = NULL;
 	tuplecid_data = NULL;
+	CheckXidAlive = InvalidTransactionId;
 }
 
 bool
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 83806f3040..bad2053477 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -100,8 +100,10 @@ extern char *ExportSnapshot(Snapshot snapshot);
 
 /* Support for catalog timetravel for logical decoding */
 struct HTAB;
+extern TransactionId CheckXidAlive;
 extern struct HTAB *HistoricSnapshotGetTupleCids(void);
-extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids);
+extern void SetupHistoricSnapshot(Snapshot snapshot_now, struct HTAB *tuplecids,
+								  TransactionId snapshot_xid);
 extern void TeardownHistoricSnapshot(bool is_error);
 extern bool HistoricSnapshotActive(void);
 
-- 
2.15.2 (Apple Git-101.1)

0004-Teach-test_decoding-plugin-to-work-with-2PC.patch (application/octet-stream)
From 6940c92d8c1b0f517c0e785fb62253fc0390044d Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:31:15 +0530
Subject: [PATCH 4/4] Teach test_decoding plugin to work with 2PC

Implement all callbacks required for decoding 2PC in this test_decoding
plugin. Includes relevant test cases as well.

Additionally, includes a new option "check-xid". If this option points
to a valid xid, then the pg_decode_change() API will wait for it to
be aborted externally. This allows us to test concurrent rollback of
a prepared transaction while it is being decoded.
---
 contrib/test_decoding/Makefile              |   5 +-
 contrib/test_decoding/expected/prepared.out | 185 ++++++++++++++++++++++++----
 contrib/test_decoding/sql/prepared.sql      |  77 ++++++++++--
 contrib/test_decoding/t/001_twophase.pl     | 119 ++++++++++++++++++
 contrib/test_decoding/test_decoding.c       | 179 +++++++++++++++++++++++++++
 5 files changed, 532 insertions(+), 33 deletions(-)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index afcab930f7..3f0b1c6ebd 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -26,7 +26,7 @@ installcheck:;
 # installation, allow to do so, but only if requested explicitly.
 installcheck-force: regresscheck-install-force isolationcheck-install-force
 
-check: regresscheck isolationcheck
+check: regresscheck isolationcheck 2pc-check
 
 submake-regress:
 	$(MAKE) -C $(top_builddir)/src/test/regress all
@@ -67,3 +67,6 @@ isolationcheck-install-force: all | submake-isolation submake-test_decoding temp
 	isolationcheck isolationcheck-install-force
 
 temp-install: EXTRA_INSTALL=contrib/test_decoding
+
+2pc-check: temp-install
+	$(prove_check)
diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d4ff..934c8f1509 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,50 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
  init
 (1 row)
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (2);
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (4);
 -- test prepared xact containing ddl
 BEGIN;
@@ -26,45 +57,149 @@ INSERT INTO test_prepared1 VALUES (5);
 ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
                                   data                                   
 -------------------------------------------------------------------------
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:4
  COMMIT
  BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:5
  table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
  COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
  COMMIT
  BEGIN
  table public.test_prepared2: INSERT: id[integer]:9
  COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+ relation | locktype | mode 
+----------+----------+------
+(0 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                    data                                    
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints 
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
 
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e72639767e..60725419fe 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -2,21 +2,25 @@
 SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 INSERT INTO test_prepared1 VALUES (2);
 
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 INSERT INTO test_prepared1 VALUES (4);
 
@@ -27,24 +31,83 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
 
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
 INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- make sure stuff still works
 INSERT INTO test_prepared1 VALUES (8);
 INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints 
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- cleanup
 DROP TABLE test_prepared1;
 DROP TABLE test_prepared2;
 
--- show results
+-- show results. There should be nothing to show
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..99a9249689
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction 
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 1c439b57b0..b981d1693c 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +54,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
 				bool last_write);
 static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
 					 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+					ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
 static void pg_decode_change(LogicalDecodingContext *ctx,
 				 ReorderBufferTXN *txn, Relation rel,
 				 ReorderBufferChange *change);
@@ -62,6 +69,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 				  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 				  bool transactional, const char *prefix,
 				  Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+					  ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+							  ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+							 ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -80,9 +99,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pg_decode_change;
 	cb->truncate_cb = pg_decode_truncate;
 	cb->commit_cb = pg_decode_commit_txn;
+	cb->abort_cb = pg_decode_abort_txn;
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
 }
 
 
@@ -102,11 +126,14 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
 	opt->output_type = OUTPUT_PLUGIN_TEXTUAL_OUTPUT;
 	opt->receive_rewrites = false;
+	/* this plugin supports decoding of 2pc */
+	opt->enable_twophase = true;
 
 	foreach(option, ctx->output_plugin_options)
 	{
@@ -183,6 +210,32 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid") == 0)
+		{
+			if (elem->arg)
+			{
+				errno = 0;
+				data->check_xid = (TransactionId)
+					strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno == EINVAL || errno == ERANGE)
+					ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid is not a valid number: \"%s\"",
+								strVal(elem->arg))));
+			}
+			else
+				ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid needs an input value")));
+
+			if (data->check_xid <= 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("Specify positive value for parameter \"%s\","
+								" you specified \"%s\"",
+								elem->defname, strVal(elem->arg))));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -251,6 +304,116 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+	else
+		appendStringInfoString(ctx->out, "ABORT");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -416,6 +579,22 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* if check_xid is specified */
+	if (TransactionIdIsValid(data->check_xid))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid);
+		while (TransactionIdIsInProgress(data->check_xid))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid) &&
+			   !TransactionIdDidCommit(data->check_xid))
+			elog(LOG, "%u aborted", data->check_xid);
+
+		Assert(TransactionIdDidAbort(data->check_xid));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
-- 
2.15.2 (Apple Git-101.1)

0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch (application/octet-stream)
From 8e32c723490df640ffd071716930715ba087814c Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:15:24 +0530
Subject: [PATCH 1/4] Cleaning up of flags in ReorderBufferTXN structure

---
 src/backend/replication/logical/reorderbuffer.c | 34 ++++++++++++-------------
 src/include/replication/reorderbuffer.h         | 33 ++++++++++++++----------
 2 files changed, 37 insertions(+), 30 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9b55b94227..fb71631434 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -643,7 +643,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_first_lsn < cur_txn->first_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_first_lsn = cur_txn->first_lsn;
 	}
@@ -663,7 +663,7 @@ AssertTXNLsnOrder(ReorderBuffer *rb)
 			Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
 
 		/* known-as-subtxn txns must not be listed */
-		Assert(!cur_txn->is_known_as_subxact);
+		Assert(!rbtxn_is_known_subxact(cur_txn));
 
 		prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
 	}
@@ -686,7 +686,7 @@ ReorderBufferGetOldestTXN(ReorderBuffer *rb)
 
 	txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
 
-	Assert(!txn->is_known_as_subxact);
+	Assert(!rbtxn_is_known_subxact(txn));
 	Assert(txn->first_lsn != InvalidXLogRecPtr);
 	return txn;
 }
@@ -746,7 +746,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 
 	if (!new_sub)
 	{
-		if (subtxn->is_known_as_subxact)
+		if (rbtxn_is_known_subxact(subtxn))
 		{
 			/* already associated, nothing to do */
 			return;
@@ -762,7 +762,7 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 		}
 	}
 
-	subtxn->is_known_as_subxact = true;
+	subtxn->txn_flags |= RBTXN_IS_SUBXACT;
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
@@ -972,7 +972,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	{
 		ReorderBufferChange *cur_change;
 
-		if (txn->serialized)
+		if (rbtxn_is_serialized(txn))
 		{
 			/* serialize remaining changes */
 			ReorderBufferSerializeTXN(rb, txn);
@@ -1001,7 +1001,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		{
 			ReorderBufferChange *cur_change;
 
-			if (cur_txn->serialized)
+			if (rbtxn_is_serialized(cur_txn))
 			{
 				/* serialize remaining changes */
 				ReorderBufferSerializeTXN(rb, cur_txn);
@@ -1167,7 +1167,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		 * they originally were happening inside another subtxn, so we won't
 		 * ever recurse more than one level deep here.
 		 */
-		Assert(subtxn->is_known_as_subxact);
+		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
 		ReorderBufferCleanupTXN(rb, subtxn);
@@ -1208,7 +1208,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/*
 	 * Remove TXN from its containing list.
 	 *
-	 * Note: if txn->is_known_as_subxact, we are deleting the TXN from its
+	 * Note: if txn is known as subxact, we are deleting the TXN from its
 	 * parent's list of known subxacts; this leaves the parent's nsubxacts
 	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
 	 * from the LSN-ordered list of toplevel TXNs.
@@ -1223,7 +1223,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(found);
 
 	/* remove entries spilled to disk */
-	if (txn->serialized)
+	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
 	/* deallocate */
@@ -1240,7 +1240,7 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_iter	iter;
 	HASHCTL		hash_ctl;
 
-	if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
+	if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
 		return;
 
 	memset(&hash_ctl, 0, sizeof(hash_ctl));
@@ -1854,7 +1854,7 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
 			 * final_lsn to that of their last change; this causes
 			 * ReorderBufferRestoreCleanup to do the right thing.
 			 */
-			if (txn->serialized && txn->final_lsn == 0)
+			if (rbtxn_is_serialized(txn) && txn->final_lsn == 0)
 			{
 				ReorderBufferChange *last =
 				dlist_tail_element(ReorderBufferChange, node, &txn->changes);
@@ -2002,7 +2002,7 @@ ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
 	 * operate on its top-level transaction instead.
 	 */
 	txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 	Assert(txn->base_snapshot == NULL);
@@ -2109,7 +2109,7 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
 
 	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
 
-	txn->has_catalog_changes = true;
+	txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }
 
 /*
@@ -2126,7 +2126,7 @@ ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
 	if (txn == NULL)
 		return false;
 
-	return txn->has_catalog_changes;
+	return rbtxn_has_catalog_changes(txn);
 }
 
 /*
@@ -2146,7 +2146,7 @@ ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
 		return false;
 
 	/* a known subtxn? operate on top-level txn instead */
-	if (txn->is_known_as_subxact)
+	if (rbtxn_is_known_subxact(txn))
 		txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
 									NULL, InvalidXLogRecPtr, false);
 
@@ -2267,7 +2267,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
-	txn->serialized = true;
+	txn->txn_flags |= RBTXN_IS_SERIALIZED;
 
 	if (fd != -1)
 		CloseTransientFile(fd);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1f52f6bde7..ec9515d156 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -150,18 +150,34 @@ typedef struct ReorderBufferChange
 	dlist_node	node;
 } ReorderBufferChange;
 
+/* ReorderBufferTXN flags */
+#define RBTXN_HAS_CATALOG_CHANGES 0x0001
+#define RBTXN_IS_SUBXACT          0x0002
+#define RBTXN_IS_SERIALIZED       0x0004
+
+/* does the txn have catalog changes */
+#define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
+/* is the txn known as a subxact? */
+#define rbtxn_is_known_subxact(txn)    (txn->txn_flags & RBTXN_IS_SUBXACT)
+/*
+ * Has this transaction been spilled to disk?  It's not always possible to
+ * deduce that fact by comparing nentries with nentries_mem, because e.g.
+ * subtransactions of a large transaction might get serialized together
+ * with the parent - if they're restored to memory they'd have
+ * nentries_mem == nentries.
+ */
+#define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
+
 typedef struct ReorderBufferTXN
 {
+	int     txn_flags;
+
 	/*
 	 * The transactions transaction id, can be a toplevel or sub xid.
 	 */
 	TransactionId xid;
 
-	/* did the TX have catalog changes */
-	bool		has_catalog_changes;
-
 	/* Do we know this is a subxact?  Xid of top-level txn if so */
-	bool		is_known_as_subxact;
 	TransactionId toplevel_xid;
 
 	/*
@@ -229,15 +245,6 @@ typedef struct ReorderBufferTXN
 	 */
 	uint64		nentries_mem;
 
-	/*
-	 * Has this transaction been spilled to disk?  It's not always possible to
-	 * deduce that fact by comparing nentries with nentries_mem, because e.g.
-	 * subtransactions of a large transaction might get serialized together
-	 * with the parent - if they're restored to memory they'd have
-	 * nentries_mem == nentries.
-	 */
-	bool		serialized;
-
 	/*
 	 * List of ReorderBufferChange structs, including new Snapshots and new
 	 * CommandIds
-- 
2.15.2 (Apple Git-101.1)

0002-Support-decoding-of-two-phase-transactions-at-PREPAR.patch (application/octet-stream)
From 83b83aed7769a4971c83823f50a3799cf74bd8d5 Mon Sep 17 00:00:00 2001
From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
Date: Wed, 13 Jun 2018 16:30:30 +0530
Subject: [PATCH 2/4] Support decoding of two-phase transactions at PREPARE

Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, which often encode information into
the GID itself.

Includes documentation changes.
---
 doc/src/sgml/logicaldecoding.sgml               | 127 ++++++++++++++-
 src/backend/replication/logical/decode.c        | 147 +++++++++++++++--
 src/backend/replication/logical/logical.c       | 203 ++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 185 +++++++++++++++++++--
 src/include/replication/logical.h               |   2 +-
 src/include/replication/output_plugin.h         |  46 ++++++
 src/include/replication/reorderbuffer.h         |  68 ++++++++
 7 files changed, 746 insertions(+), 32 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8db968641e..a89e4d5184 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -385,7 +385,12 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeChangeCB change_cb;
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
+    LogicalDecodeAbortCB abort_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeAbortPreparedCB abort_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
 } OutputPluginCallbacks;
@@ -457,7 +462,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     them are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -558,6 +569,71 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The optional <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Commit Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>commit_prepared_cb</function> callback is called whenever
+      a commit prepared transaction has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+     <title>Rollback Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>abort_prepared_cb</function> callback is called whenever
+      a rollback prepared transaction has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort">
+     <title>Transaction Abort Callback</title>
+
+     <para>
+      The required <function>abort_cb</function> callback is called whenever
+      a transaction abort has to be initiated. This can happen if we are
+      decoding a transaction that has been prepared for two-phase commit and
+      a concurrent rollback happens while we are decoding it.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -567,7 +643,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or any other uncommitted transaction, this
+      change callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -644,6 +726,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decoding
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -665,7 +780,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or any other uncommitted transaction, this message
+      callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 59c003de9c..99d801eb94 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -34,6 +34,7 @@
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecord.h"
+#include "access/twophase.h"
 
 #include "catalog/pg_control.h"
 
@@ -73,6 +74,8 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 			 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 			xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed);
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -281,16 +284,33 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				break;
 			}
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+				/* check that output plugin is capable of twophase decoding */
+				if (!ctx->options.enable_twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   XLogRecGetData(buf->record), &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
@@ -633,9 +653,90 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+	 * Regular commit simply triggers a replay of transaction changes from the
+	 * reorder buffer. For COMMIT PREPARED that however already happened at
+	 * PREPARE time, and so we only need to notify the subscriber that the GID
+	 * finally committed.
+	 *
+	 * For output plugins that do not support PREPARE-time decoding of
+	 * two-phase transactions, we never even see the PREPARE and all two-phase
+	 * transactions simply fall through to the second branch.
+	 */
+	if (TransactionIdIsValid(parsed->twophase_xid) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder,
+								   parsed->twophase_xid, parsed->twophase_gid))
+	{
+		Assert(xid == parsed->twophase_xid);
+		/* we are processing COMMIT PREPARED */
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* replay actions of all transaction + subtransactions in order */
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	/*
+	 * Process invalidation messages, even if we're not interested in the
+	 * transaction's contents, since the various caches need to always be
+	 * consistent.
+	 */
+	if (parsed->nmsgs > 0)
+	{
+		if (!ctx->fast_forward)
+			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+										  parsed->nmsgs, parsed->msgs);
+		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is setup correctly for the main transaction in case all changes
+	 * happened in subtransanctions
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+		}
+		ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+		return;
+	}
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 }
 
 /*
@@ -647,6 +748,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 			xl_xact_parsed_abort *parsed, TransactionId xid)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	XLogRecPtr	origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it's ROLLBACK PREPARED then handle it via callbacks.
+	 */
+	if (TransactionIdIsValid(xid) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		parsed->dbId == ctx->slot->data.database &&
+		!FilterByOrigin(ctx, origin_id) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
 
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 3cd4eefb9b..f3b28c3dab 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -60,6 +60,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				 XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -187,6 +197,11 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->abort = abort_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	ctx->out = makeStringInfo();
@@ -611,6 +626,33 @@ startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions *opt, bool i
 	/* do the actual work: call callback */
 	ctx->callbacks.startup_cb(ctx, opt, is_init);
 
+	/*
+	 * If the plugin claims to support two-phase transactions, then
+	 * check that the plugin implements all callbacks necessary to decode
+	 * two-phase transactions - we either have to have all of them or none.
+	 * The filter_prepare callback is optional, but can only be defined when
+	 * two-phase decoding is enabled (i.e. the three other callbacks are
+	 * defined).
+	 */
+	if (opt->enable_twophase)
+	{
+		int twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+			(ctx->callbacks.commit_prepared_cb != NULL) +
+			(ctx->callbacks.abort_prepared_cb != NULL);
+
+		/* Plugins with incorrect number of two-phase callbacks are broken. */
+		if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+			ereport(ERROR,
+					(errmsg("Output plugin registered only %d twophase callbacks. ",
+							twophase_callbacks)));
+	}
+
+	/* filter_prepare is optional, but requires two-phase decoding */
+	if ((ctx->callbacks.filter_prepare_cb != NULL) && (!opt->enable_twophase))
+		ereport(ERROR,
+				(errmsg("Output plugin does not support two-phase decoding, but "
+						"registered filter_prepared callback.")));
+
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
 }
@@ -708,6 +750,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
@@ -785,6 +943,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of twophase at PREPARE time is not enabled. In that
+	 * case all twophase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->options.enable_twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index fb71631434..2fffc90606 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -337,6 +337,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/* free data that's contained */
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
 
 	if (txn->tuplecid_hash != NULL)
 	{
@@ -1389,25 +1394,18 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
  * and subtransactions (using a k-way merge) and replay the changes in lsn
  * order.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	volatile Snapshot snapshot_now;
 	volatile CommandId command_id = FirstCommandId;
 	bool		using_subtxn;
 	ReorderBufferIterTXNState *volatile iterstate = NULL;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -1711,7 +1709,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					break;
 			}
 		}
-
 		/*
 		 * There's a speculative insertion remaining, just clean in up, it
 		 * can't have been successful, otherwise we'd gotten a confirmation
@@ -1727,8 +1724,22 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		ReorderBufferIterTXNFinish(rb, iterstate);
 		iterstate = NULL;
 
-		/* call commit callback */
-		rb->commit(rb, txn, commit_lsn);
+		/*
+		 * Call abort/commit/prepare callback, depending on the transaction
+		 * state.
+		 *
+		 * If the transaction aborted during apply (which currently can happen
+		 * only for prepared transactions), simply call the abort callback.
+		 *
+		 * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+		 * (for regular ones).
+		 */
+		if (rbtxn_rollback(txn))
+			rb->abort(rb, txn, commit_lsn);
+		else if (rbtxn_prepared(txn))
+			rb->prepare(rb, txn, commit_lsn);
+		else
+			rb->commit(rb, txn, commit_lsn);
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -1755,7 +1766,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		if (snapshot_now->copied)
 			ReorderBufferFreeSnap(rb, snapshot_now);
 
-		/* remove potential on-disk data, and deallocate */
+		/*
+		 * remove potential on-disk data, and deallocate.
+		 *
+		 * We remove it even for prepared transactions (GID is enough to
+		 * commit/abort those later).
+		 */
 		ReorderBufferCleanupTXN(rb, txn);
 	}
 	PG_CATCH();
@@ -1789,6 +1805,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_END_TRY();
 }
 
+
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/*
+	 * Always call the prepare filter. It's the job of the prepare filter to
+	 * give us the *same* response for a given xid across multiple calls
+	 * (including ones on restart)
+	 */
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyways, 2PC transactions do not contain any reorderbuffers. So allow
+	 * it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+		rb->commit_prepared(rb, txn, commit_lsn);
+	}
+	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+		rb->abort_prepared(rb, txn, commit_lsn);
+	}
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(rb, txn);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c25ac1fa85..5fdda65031 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -47,7 +47,7 @@ typedef struct LogicalDecodingContext
 
 	/*
 	 * Marks the logical decoding context as fast forward decoding one. Such a
-	 * context does not have plugin loaded so most of the the following
+	 * context does not have plugin loaded so most of the following
 	 * properties are unused.
 	 */
 	bool		fast_forward;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 1ee0a56f03..c9140e7001 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -27,6 +27,7 @@ typedef struct OutputPluginOptions
 {
 	OutputPluginOutputType output_type;
 	bool		receive_rewrites;
+	bool		enable_twophase;
 } OutputPluginOptions;
 
 /*
@@ -77,6 +78,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
+
+ /*
+  * Called before decoding of PREPARE record to decide whether this
+  * transaction should be decoded with separate calls to prepare and
+  * commit_prepared/abort_prepared callbacks or wait till COMMIT PREPARED and
+  * sent as usual transaction.
+  */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -109,7 +150,12 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeChangeCB change_cb;
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
+	LogicalDecodeAbortCB abort_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeAbortPreparedCB abort_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 } OutputPluginCallbacks;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ec9515d156..285c9b53da 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -154,6 +155,11 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
+#define RBTXN_PREPARE             0x0008
+#define RBTXN_COMMIT_PREPARED     0x0010
+#define RBTXN_ROLLBACK_PREPARED   0x0020
+#define RBTXN_COMMIT              0x0040
+#define RBTXN_ROLLBACK            0x0080
 
 /* does the txn have catalog changes */
 #define rbtxn_has_catalog_changes(txn) (txn->txn_flags & RBTXN_HAS_CATALOG_CHANGES)
@@ -167,6 +173,16 @@ typedef struct ReorderBufferChange
  * nentries_mem == nentries.
  */
 #define rbtxn_is_serialized(txn)       (txn->txn_flags & RBTXN_IS_SERIALIZED)
+/* is this txn prepared? */
+#define rbtxn_prepared(txn)            (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn)     (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn)   (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn)              (txn->txn_flags & RBTXN_COMMIT)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn)            (txn->txn_flags & RBTXN_ROLLBACK)
 
 typedef struct ReorderBufferTXN
 {
@@ -179,6 +195,8 @@ typedef struct ReorderBufferTXN
 
 	/* Do we know this is a subxact?  Xid of top-level txn if so */
 	TransactionId toplevel_xid;
+	/* In case of 2PC we need to pass GID to output plugin */
+	char		 *gid;
 
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
@@ -324,6 +342,37 @@ typedef void (*ReorderBufferCommitCB) (
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+									  ReorderBuffer *rb,
+									  ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+										ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+											   ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (
 										ReorderBuffer *rb,
@@ -369,6 +418,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferAbortPreparedCB abort_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -416,6 +470,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapshot
 void ReorderBufferCommit(ReorderBuffer *, TransactionId,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 						 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -439,6 +498,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
2.15.2 (Apple Git-101.1)

#5Arseny Sher
a.sher@postgrespro.ru
In reply to: Nikhil Sontakke (#4)

Hello,

I have looked through the patches. I will first describe relatively
serious issues I see and then proceed with minor nitpicks.

- On decoding of aborted xacts. The idea to throw an error once we
detect the abort is appealing, however I think you will have problems
with subxacts in the current implementation. What if a subxact issues
DDL and then aborts, but the main transaction successfully commits?

- Decoding transactions at the PREPARE record changes the rules of the
"we ship all commits after lsn 'x'" game. Namely, it will break initial
tablesync: what if the consistent snapshot was formed *after* PREPARE,
but before COMMIT PREPARED, and the plugin decides to employ 2pc?
Instead of getting the initial contents + a continuous stream of
changes, the receiver will miss the prepared xact contents and raise a
'prepared xact doesn't exist' error. I think the starting point to
address this is to forbid two-phase decoding of xacts whose PREPARE lsn
is less than snapbuilder's start_decoding_at.

- Currently we will call the abort_prepared cb even if we failed to
actually prepare the xact due to a concurrent abort. I think it is
confusing for users. We should either handle this by remembering not to
invoke abort_prepared in these cases or at least document this
behaviour, leaving this problem to the receiver side.

- I find it suspicious that DecodePrepare completely ignores the actions
of SnapBuildCommitTxn. For example, to execute invalidations, the latter
sets the base snapshot if our xact (or its subxacts) did DDL and the
snapshot is not set yet. I can't come up with a concrete example where
this would bite at the moment, but it should be considered.
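
To make the second point above concrete, here is a minimal sketch of the
suggested restriction. The helper name and its arguments are
hypothetical (nothing like it exists in the patch), and XLogRecPtr is
redefined locally only so that the snippet compiles on its own:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL definitions, so the sketch compiles alone. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper (not in the patch): may this PREPARE be decoded in
 * two-phase mode?  If the PREPARE record lies before the point where
 * decoding starts, the receiver never saw the PREPARE, so the xact should
 * be sent as a regular transaction at COMMIT PREPARED instead.
 */
static bool
prepare_decodable_as_twophase(XLogRecPtr prepare_lsn,
                              XLogRecPtr start_decoding_at,
                              bool plugin_wants_twophase)
{
    assert(prepare_lsn != InvalidXLogRecPtr);

    /* PREPARE predates the start of decoding: never use two-phase here */
    if (prepare_lsn < start_decoding_at)
        return false;

    /* otherwise the plugin's choice stands */
    return plugin_wants_twophase;
}
```

The same answer would also have to be given again when COMMIT PREPARED
or ROLLBACK PREPARED is decoded, so the check should depend only on
values that can be recomputed at that point.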

Now, the bikeshedding.

First patch:
- I am one of those people upthread who don't think that converting
flags to bitmask is beneficial -- especially given that many of them
are mutually exclusive, e.g. xact can't be committed and aborted at
the same time. Apparently you have left this to the committer though.

Second patch:
- Applying gives me
Applying: Support decoding of two-phase transactions at PREPARE
.git/rebase-apply/patch:871: trailing whitespace.

+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but yet
+      uncommitted) transaction or decoding of an uncommitted transaction, this
+      change callback is ensured sane access to catalog tables regardless of
+      simultaneous rollback by another backend of this very same transaction.

I don't think we should explain this, at least in such words. As
mentioned upthread, we should warn about allowed systable_* accesses
instead. Same for message_cb.

+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is setup correctly for the main transaction in case all changes
+	 * happened in subtransanctions
+	 */

While we do certainly need to associate subxacts here, the explanation
looks weird to me. I would leave just the 'Tell the reorderbuffer about
the surviving subtransactions' as in DecodeCommit.

}
-
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation

Spurious newline deletion.

- I would rename ReorderBufferCommitInternal to ReorderBufferReplay:
we replay the xact there, not commit.

- If the xact is empty, we will not prepare it (nor call the prepare
cb), even if the output plugin asked us to. However, we will still call
the commit_prepared cb.

- ReorderBufferTxnIsPrepared and ReorderBufferPrepareNeedSkip do the
same thing and should be merged, with comments explaining that the
answer must be stable.

- filter_prepare_cb callback existence is checked in both decode.c and
in filter_prepare_cb_wrapper.
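
A rough sketch of what a single merged entry point could look like. All
names here are hypothetical, and the filter signature is simplified (the
patch's callback also receives ctx and txn). It assumes two-phase
decoding is enabled, so a missing filter means "decode everything at
PREPARE", as in filter_prepare_cb_wrapper:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Simplified filter signature; returns true to filter the xact out. */
typedef bool (*FilterPrepareFn) (TransactionId xid, const char *gid);

typedef struct MiniReorderBuffer
{
    FilterPrepareFn filter_prepare; /* NULL means no filter registered */
} MiniReorderBuffer;

/*
 * Hypothetical merged helper: true if the xact is (or was) decoded at
 * PREPARE time.  It is asked at PREPARE and again at COMMIT/ROLLBACK
 * PREPARED, so the filter must give the same answer for a given
 * (xid, gid) every time, including after a restart.
 */
static bool
txn_decoded_at_prepare(const MiniReorderBuffer *rb,
                       TransactionId xid, const char *gid)
{
    assert(rb != NULL && gid != NULL);

    /* the only place where the existence of the callback is checked */
    if (rb->filter_prepare == NULL)
        return true;

    return !rb->filter_prepare(xid, gid);
}

/* Example filter: depends only on the GID, so it is stable by design. */
static bool
filter_internal_gids(TransactionId xid, const char *gid)
{
    (void) xid;
    return strncmp(gid, "internal_", 9) == 0;
}
```

A filter that consulted mutable state (a GUC, a catalog lookup, the
wall clock) could answer differently at COMMIT PREPARED than it did at
PREPARE, which is exactly the case the comments should warn about.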

+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyways, 2PC transactions do not contain any reorderbuffers. So allow
+	 * it to be created below.
+	 */

The surrounding code looks sane, but I think that ReorderBufferTXN for
our xact must *not* exist at this moment: if we are going to
COMMIT/ABORT PREPARED it, it must have been replayed and the RBTXN
purged immediately after. Also, instead of the vague '2PC transactions
do not contain any reorderbuffers' I would say something like 'create
dummy ReorderBufferTXN to pass it to the callback'.

- In DecodeAbort:
+	/*
+	 * If it's ROLLBACK PREPARED then handle it via callbacks.
+	 */
+	if (TransactionIdIsValid(xid) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+

How can xid be invalid here?

- It might be worthwhile to put the check
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		parsed->dbId == ctx->slot->data.database &&
+		!FilterByOrigin(ctx, origin_id) &&

which appears 3 times now into a separate function.
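
Something along these lines, as a standalone sketch. The struct and the
function name are hypothetical; in decode.c the helper would take ctx,
buf and the parsed record directly. Note that the three call sites are
not identical today (DecodePrepare also accepts InvalidOid as dbId and
checks fast_forward), and the sketch follows the DecodePrepare form:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical flattened view of what the three call sites look at. */
typedef struct DecodeFilterInput
{
    bool snapshot_needs_skip;   /* result of SnapBuildXactNeedsSkip() */
    Oid  xact_db;               /* parsed->dbId */
    Oid  slot_db;               /* ctx->slot->data.database */
    bool fast_forward;          /* ctx->fast_forward */
    bool filtered_by_origin;    /* result of FilterByOrigin() */
} DecodeFilterInput;

/* Returns true if the xact should not be passed to the output plugin. */
static bool
decode_xact_should_skip(const DecodeFilterInput *in)
{
    assert(in != NULL);

    return in->snapshot_needs_skip ||
        (in->xact_db != InvalidOid && in->xact_db != in->slot_db) ||
        in->fast_forward ||
        in->filtered_by_origin;
}
```

Having one function would also force a decision on whether the dbId
and fast_forward handling in DecodeAbort should match the other two.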

+	 * two-phase transactions - we either have to have all of them or none.
+	 * The filter_prepare callback is optional, but can only be defined when

Kind of contradictory (all of them or none, but optional), might be
formulated more accurately.

+	/*
+	 * Capabilities of the output plugin.
+	 */
+	bool        enable_twophase;

I would rename this to 'supports_twophase' since this is not an option
but a description of the plugin capabilities.

+	/* filter_prepare is optional, but requires two-phase decoding */
+	if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+		ereport(ERROR,
+				(errmsg("Output plugin does not support two-phase decoding, but "
+						"registered filter_prepared callback.")));

Don't think we need to check that...

+		 * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+		 * (for regular ones).
+		 */
+		if (rbtxn_rollback(txn))
+			rb->abort(rb, txn, commit_lsn);

This is dead code, since we don't have decoding of in-progress xacts
yet.

Third patch:
+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */

I would explain here that this xid is checked for abort after each
catalog scan, and refer the reader to SetupHistoricSnapshot for details.

+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+			!TransactionIdIsInProgress(CheckXidAlive) &&
+			!TransactionIdDidCommit(CheckXidAlive))
+			ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));

Probably centralize checks in one function? As well as 'We don't expect
direct calls to heap_fetch...' ones.
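
As a rough sketch of what a centralized check might look like: the code below is self-contained and uses stand-in types rather than the real transaction APIs. SketchXid, SketchXidStatus, sketch_checked_xid_aborted and sketch_demo_lookup are all hypothetical names, and the status lookup is passed in as a function pointer purely so the example can be exercised on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t SketchXid;
#define SKETCH_INVALID_XID ((SketchXid) 0)

typedef enum SketchXidStatus
{
	SKETCH_XID_IN_PROGRESS,
	SKETCH_XID_COMMITTED,
	SKETCH_XID_ABORTED
} SketchXidStatus;

/* Hypothetical callback that reports the current status of an xid. */
typedef SketchXidStatus (*sketch_status_lookup) (SketchXid xid);

/*
 * Single place for the "did the transaction we are decoding abort?" test.
 * An invalid xid means no transaction is being watched, so nothing aborted.
 */
static bool
sketch_checked_xid_aborted(SketchXid check_xid, sketch_status_lookup lookup)
{
	if (check_xid == SKETCH_INVALID_XID)
		return false;
	assert(lookup != NULL);
	return lookup(check_xid) == SKETCH_XID_ABORTED;
}

/* Demo lookup: xid 42 aborted, xid 43 committed, everything else running. */
static SketchXidStatus
sketch_demo_lookup(SketchXid xid)
{
	if (xid == 42)
		return SKETCH_XID_ABORTED;
	if (xid == 43)
		return SKETCH_XID_COMMITTED;
	return SKETCH_XID_IN_PROGRESS;
}
```

Every scan routine would call the one helper and raise the error itself, so the condition is written only once.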

P.S. Looks like you have torn the thread chain: In-Reply-To header of
mail [1] is missing. Please don't do that.

[1]: /messages/by-id/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#6Andres Freund
andres@anarazel.de
In reply to: Arseny Sher (#5)

On 2018-08-06 21:06:13 +0300, Arseny Sher wrote:

Hello,

I have looked through the patches. I will first describe relatively
serious issues I see and then proceed with small nitpicking.

- On decoding of aborted xacts. The idea to throw an error once we
detect the abort is appealing, however I think you will have problems
with subxacts in the current implementation. What if a subxact issues
DDL and is then aborted, but the main transaction successfully commits?

I don't see a fundamental issue here. I've not reviewed the current
patchset meaningfully, however. Do you see a fundamental issue here?

- Decoding transactions at PREPARE record changes rules of the "we ship
all commits after lsn 'x'" game. Namely, it will break initial
tablesync: what if consistent snapshot was formed *after* PREPARE, but
before COMMIT PREPARED, and the plugin decides to employ 2pc? Instead
of getting initial contents + a continuous stream of changes, the receiver
will miss the prepared xact contents and raise 'prepared xact doesn't
exist' error. I think the starting point to address this is to forbid
two-phase decoding of xacts with lsn of PREPARE less than
snapbuilder's start_decoding_at.

Yea, that sounds like it needs to be addressed.

- Currently we will call abort_prepared cb even if we failed to actually
prepare xact due to concurrent abort. I think it is confusing for
users. We should either handle this by remembering not to invoke
abort_prepared in these cases or at least document this behaviour,
leaving this problem to the receiver side.

What precisely do you mean by "concurrent abort"?

- I find it suspicious that DecodePrepare completely ignores actions of
SnapBuildCommitTxn. For example, to execute invalidations, the latter
sets the base snapshot if our xact (or its subxacts) did DDL and the snapshot
is not set yet. I can't come up with a concrete example
where this would bite at the moment, but it should be considered.

Yea, I think this needs to mirror the actions (and thus generalize the
code to avoid duplication).

Now, the bikeshedding.

First patch:
- I am one of those people upthread who don't think that converting
flags to bitmask is beneficial -- especially given that many of them
are mutually exclusive, e.g. xact can't be committed and aborted at
the same time. Apparently you have left this to the committer though.

Similar.

- Andres

#7Arseny Sher
a.sher@postgrespro.ru
In reply to: Andres Freund (#6)

Andres Freund <andres@anarazel.de> writes:

- On decoding of aborted xacts. The idea to throw an error once we
detect the abort is appealing, however I think you will have problems
with subxacts in the current implementation. What if a subxact issues
DDL and is then aborted, but the main transaction successfully commits?

I don't see a fundamental issue here. I've not reviewed the current
patchset meaningfully, however. Do you see a fundamental issue here?

Hmm, yes, this is not an issue for this patch because after reading
PREPARE record we know all aborted subxacts and won't try to decode
their changes. However, this will be raised once we decide to decode
in-progress transactions. Checking for all subxids is expensive;
moreover, WAL doesn't provide all of them until commit... it might be
easier to prevent vacuuming of aborted stuff while decoding needs it.
Matter for another patch, anyway.

- Currently we will call abort_prepared cb even if we failed to actually
prepare xact due to concurrent abort. I think it is confusing for
users. We should either handle this by remembering not to invoke
abort_prepared in these cases or at least document this behaviour,
leaving this problem to the receiver side.

What precisely do you mean by "concurrent abort"?

With current patch, the following is possible:
* We start decoding of some prepared xact;
* Xact aborts (ABORT PREPARED) for any reason;
* Decoding process notices this on catalog scan and calls abort()
callback;
* Later decoding process reads abort record and calls abort_prepared
callback.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#8Nikhil Sontakke
nikhils@2ndquadrant.com
In reply to: Arseny Sher (#5)

Hi Arseny,

- Decoding transactions at PREPARE record changes rules of the "we ship
all commits after lsn 'x'" game. Namely, it will break initial
tablesync: what if consistent snapshot was formed *after* PREPARE, but
before COMMIT PREPARED, and the plugin decides to employ 2pc? Instead
of getting initial contents + a continuous stream of changes, the receiver
will miss the prepared xact contents and raise 'prepared xact doesn't
exist' error. I think the starting point to address this is to forbid
two-phase decoding of xacts with lsn of PREPARE less than
snapbuilder's start_decoding_at.

It will be the job of the plugin to return a consistent answer for
every GID that is encountered. In this case, the plugin will decode
the transaction at COMMIT PREPARED time and not at PREPARE time.

- Currently we will call abort_prepared cb even if we failed to actually
prepare xact due to concurrent abort. I think it is confusing for
users. We should either handle this by remembering not to invoke
abort_prepared in these cases or at least document this behaviour,
leaving this problem to the receiver side.

The point is, when we reach the "ROLLBACK PREPARED", we have no idea
if the "PREPARE" was aborted by this rollback happening concurrently.
So it's possible that the 2PC has been successfully decoded and we
would have to send the rollback to the other side which would need to
check if it needs to rollback locally.

- I find it suspicious that DecodePrepare completely ignores actions of
SnapBuildCommitTxn. For example, to execute invalidations, the latter
sets the base snapshot if our xact (or its subxacts) did DDL and the snapshot
is not set yet. I can't come up with a concrete example
where this would bite at the moment, but it should be considered.

I had discussed this area with Petr and we didn't see any issues either at the time.

Now, the bikeshedding.

First patch:
- I am one of those people upthread who don't think that converting
flags to bitmask is beneficial -- especially given that many of them
are mutually exclusive, e.g. xact can't be committed and aborted at
the same time. Apparently you have left this to the committer though.

Hmm, there seems to be divided opinion on this. I am willing to go
back to using the booleans if there's opposition and if the committer
so wishes. Note that this patch will end up adding 4/5 more booleans
in that case (we add new ones for prepare, commit prepare, abort,
rollback prepare etc).

Second patch:
- Applying gives me
Applying: Support decoding of two-phase transactions at PREPARE
.git/rebase-apply/patch:871: trailing whitespace.

+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but yet
+      uncommitted) transaction or decoding of an uncommitted transaction, this
+      change callback is ensured sane access to catalog tables regardless of
+      simultaneous rollback by another backend of this very same transaction.

I don't think we should explain this, at least in such words. As
mentioned upthread, we should warn about allowed systable_* accesses
instead. Same for message_cb.

Looks like you are looking at an earlier patchset. The latest patchset
has removed the above.

+       /*
+        * Tell the reorderbuffer about the surviving subtransactions. We need to
+        * do this because the main transaction itself has not committed since we
+        * are in the prepare phase right now. So we need to be sure the snapshot
+        * is setup correctly for the main transaction in case all changes
+        * happened in subtransanctions
+        */

While we do certainly need to associate subxacts here, the explanation
looks weird to me. I would leave just the 'Tell the reorderbuffer about
the surviving subtransactions' as in DecodeCommit.

}
-
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation

Spurious newline deletion.

- I would rename ReorderBufferCommitInternal to ReorderBufferReplay:
we replay the xact there, not commit.

- If xact is empty, we will not prepare it (and call cb),
even if the output plugin asked us. However, we will call
commit_prepared cb.

- ReorderBufferTxnIsPrepared and ReorderBufferPrepareNeedSkip do the
same and should be merged with comments explaining that the answer
must be stable.

- filter_prepare_cb callback existence is checked in both decode.c and
in filter_prepare_cb_wrapper.

+       /*
+        * The transaction may or may not exist (during restarts for example).
+        * Anyways, 2PC transactions do not contain any reorderbuffers. So allow
+        * it to be created below.
+        */

Code around looks sane, but I think that ReorderBufferTXN for our xact
must *not* exist at this moment: if we are going to COMMIT/ABORT
PREPARED it, it must have been replayed and RBTXN purged immediately
after. Also, instead of misty '2PC transactions do not contain any
reorderbuffers' I would say something like 'create dummy
ReorderBufferTXN to pass it to the callback'.

- In DecodeAbort:
+       /*
+        * If it's ROLLBACK PREPARED then handle it via callbacks.
+        */
+       if (TransactionIdIsValid(xid) &&
+               !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+

How can xid be invalid here?

- It might be worthwhile to put the check
+               !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+               parsed->dbId == ctx->slot->data.database &&
+               !FilterByOrigin(ctx, origin_id) &&

which now appears 3 times, into a separate function.

+        * two-phase transactions - we either have to have all of them or none.
+        * The filter_prepare callback is optional, but can only be defined when

Kind of contradictory (all of them or none, but optional); this might be
formulated more accurately.

+       /*
+        * Capabilities of the output plugin.
+        */
+       bool        enable_twophase;

I would rename this to 'supports_twophase' since this is not an option
but a description of the plugin capabilities.

+       /* filter_prepare is optional, but requires two-phase decoding */
+       if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+               ereport(ERROR,
+                               (errmsg("Output plugin does not support two-phase decoding, but "
+                                               "registered filter_prepared callback.")));

Don't think we need to check that...

+                * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+                * (for regular ones).
+                */
+               if (rbtxn_rollback(txn))
+                       rb->abort(rb, txn, commit_lsn);

This is dead code, since we don't have decoding of in-progress xacts
yet.

Yes, the above check can be done away with.

Third patch:
+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */

I would explain here that this xid is checked for abort after each
catalog scan, and refer the reader to SetupHistoricSnapshot for details.

+       /*
+        * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+        * error out
+        */
+       if (TransactionIdIsValid(CheckXidAlive) &&
+                       !TransactionIdIsInProgress(CheckXidAlive) &&
+                       !TransactionIdDidCommit(CheckXidAlive))
+                       ereport(ERROR,
+                               (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+                                errmsg("transaction aborted during system catalog scan")));

Probably centralize checks in one function? As well as 'We don't expect
direct calls to heap_fetch...' ones.

P.S. Looks like you have torn the thread chain: In-Reply-To header of
mail [1] is missing. Please don't do that.

That wasn't me. I was also annoyed and surprised to see a new email
thread separate from the earlier containing 100 or so messages.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9Arseny Sher
a.sher@postgrespro.ru
In reply to: Nikhil Sontakke (#8)

Nikhil Sontakke <nikhils@2ndquadrant.com> writes:

- Decoding transactions at PREPARE record changes rules of the "we ship
all commits after lsn 'x'" game. Namely, it will break initial
tablesync: what if consistent snapshot was formed *after* PREPARE, but
before COMMIT PREPARED, and the plugin decides to employ 2pc? Instead
of getting initial contents + a continuous stream of changes, the receiver
will miss the prepared xact contents and raise 'prepared xact doesn't
exist' error. I think the starting point to address this is to forbid
two-phase decoding of xacts with lsn of PREPARE less than
snapbuilder's start_decoding_at.

It will be the job of the plugin to return a consistent answer for
every GID that is encountered. In this case, the plugin will decode
the transaction at COMMIT PREPARED time and not at PREPARE time.

I can't imagine a scenario in which a plugin would want to send COMMIT
PREPARED instead of replaying the xact fully on the CP record, given that it
had never seen the PREPARE record. On the other hand, tracking such situations
on the plugin side would make the plugin's life unnecessarily complicated:
either it has to dig into snapbuilder/slot internals to learn when the snapshot
became consistent (which is currently impossible, as this lsn is not
saved anywhere, btw), or it must fsync each of its decisions to do or not to
do 2PC.

Technically, my concern covers not only tablesync, but just plain
decoding start: we don't want to ship COMMIT PREPARED if the downstream
had never had chance to see PREPARE.

As for tablesync, looking at the current implementation, I think
we would need to do something along the following lines:
- Tablesync worker performs COPY.
- It then speaks with main apply worker to arrange (origin)
lsn of sync point, as it does now.
- Tablesync worker applies changes up to arranged lsn; it never uses
two-phase decoding, all xacts are replayed on COMMIT PREPARED.
Moreover, instead of going into SYNCDONE state immediately after
reaching needed lsn, it stops replaying usual commits but continues
to receive changes to finish all transactions which were prepared
before sync point (we would need some additional support from
reorderbuffer to learn when this happens). Only then it goes into
SYNCDONE.
- Behaviour of the main apply worker doesn't change: it
ignores changes of the table in question before sync point and
applies them after sync point. It also can use 2PC decoding of any
transaction or not, as it desires.
I believe this approach would implement tablesync correctly (all changes
are applied, but only once) with minimal fuss.

- Currently we will call abort_prepared cb even if we failed to actually
prepare xact due to concurrent abort. I think it is confusing for
users. We should either handle this by remembering not to invoke
abort_prepared in these cases or at least document this behaviour,
leaving this problem to the receiver side.

The point is, when we reach the "ROLLBACK PREPARED", we have no idea
if the "PREPARE" was aborted by this rollback happening concurrently.
So it's possible that the 2PC has been successfully decoded and we
would have to send the rollback to the other side which would need to
check if it needs to rollback locally.

I understand this. But I find this confusing for the users, so I propose
to
- Either document that "you might get abort_prepared cb called even
after abort cb was invoked for the same transaction";
- Or consider adding some infrastructure to reorderbuffer to
remember not to call abort_prepared in these cases. Due to possible
reboots, I think this means that we must not call
ReorderBufferCleanupTXN immediately after a failed attempt to replay
the xact on PREPARE, but instead mark it as 'aborted' and keep it until we
see the ABORT PREPARED record. If we see that the xact is marked as aborted,
we don't call abort_prepared_cb. That way, even if the decoder restarts
in between, we will see PREPARE in WAL, inquire about the xact status (even
if we skip it as already replayed) and mark it as aborted again.
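
To make the second option concrete, here is a minimal self-contained sketch of the bookkeeping. It does not use the real ReorderBufferTXN; PrepareReplaySketch, TxnSketch, note_prepare_replay and should_call_abort_prepared are hypothetical names, and the only rule encoded is the one stated above: skip abort_prepared when the xact was marked as aborted during replay at PREPARE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What happened when we tried to replay the xact at PREPARE (hypothetical). */
typedef enum PrepareReplaySketch
{
	PREPARE_NOT_SEEN,		/* no replay attempt recorded for this xact */
	PREPARE_REPLAYED,		/* changes and the prepare callback were sent */
	PREPARE_ABORTED			/* replay hit a concurrent abort; abort cb ran */
} PrepareReplaySketch;

/* Stand-in for the per-transaction state kept until ABORT PREPARED. */
typedef struct TxnSketch
{
	PrepareReplaySketch prepare_state;
} TxnSketch;

/* Record the outcome of the replay attempt instead of purging the xact. */
static void
note_prepare_replay(TxnSketch *txn, bool aborted_concurrently)
{
	assert(txn != NULL);
	txn->prepare_state = aborted_concurrently ? PREPARE_ABORTED
											  : PREPARE_REPLAYED;
}

/* At the ABORT PREPARED record: call abort_prepared unless marked aborted. */
static bool
should_call_abort_prepared(const TxnSketch *txn)
{
	assert(txn != NULL);
	return txn->prepare_state != PREPARE_ABORTED;
}
```

After a decoder restart the state would be rebuilt by calling note_prepare_replay again when PREPARE is re-read from WAL, which is why the mark does not need to be persisted in this sketch.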

- I find it suspicious that DecodePrepare completely ignores actions of
SnapBuildCommitTxn. For example, to execute invalidations, the latter
sets the base snapshot if our xact (or its subxacts) did DDL and the snapshot
is not set yet. I can't come up with a concrete example
where this would bite at the moment, but it should be considered.

I had discussed this area with Petr and we didn't see any issues either at the time.

Ok, simplifying, what SnapBuildCommitTxn practically does is
* Decide whether we are interested in tracking this xact effects, and
if we are, mark it as committed.
* Build and distribute snapshot to all RBTXNs, if it is important.
* Set base snap of our xact if it did DDL, to execute invalidations
during replay.

I see that we don't need to do the first two bullets during DecodePrepare:
xact effects are still invisible to everyone but itself after
PREPARE. As for the xact seeing its own changes, it is implemented via
logging cmin/cmax and we don't need to mark the xact as committed for that
(cf. ReorderBufferCopySnap).

Regarding the third point... I think in 2PC decoding we might need to
execute invalidations twice:
1) After replaying xact on PREPARE to forget about catalog changes
xact did -- it is not yet committed and must be invisible to
other xacts until CP. In the latest patchset invalidations are
executed only if there is at least one change in xact (it has base
snap). It looks fine: we can't spoil catalogs if there were nothing
to decode. Better to explain that somewhere.
2) After decoding COMMIT PREPARED to make changes visible. In current
patchset it is always done. Actually, *this* is the reason
RBTXN might already exist when we enter ReorderBufferFinishPrepared,
not "(during restarts for example)" as comment says there:
if there were inval messages, RBTXN will be created
in DecodeCommit during their addition.

BTW, "that we might need to execute invalidations, add snapshot" in
SnapBuildCommitTxn looks like a kludge to me: I suppose it is better to
do that at ReorderBufferXidSetCatalogChanges.

Now, another issue is registering xact as committed in
SnapBuildCommitTxn during COMMIT PREPARED processing. Since RBTXN is
always purged after xact replay on PREPARE, the only medium we have for
noticing catalog changes during COMMIT PREPARED is invalidation messages
attached to the CP record. This raises the following question.
* If there is a guarantee that whenever xact makes catalog changes it
generates invalidation messages, then this code is fine. However,
currently ReorderBufferXidSetCatalogChanges is also called on
XLOG_HEAP_INPLACE processing and in SnapBuildProcessNewCid, which
is useless if such guarantee exists.
* If, on the other hand, there is no such guarantee, this code is
broken.

- I am one of those people upthread who don't think that converting
flags to bitmask is beneficial -- especially given that many of them
are mutually exclusive, e.g. xact can't be committed and aborted at
the same time. Apparently you have left this to the committer though.

Hmm, there seems to be divided opinion on this. I am willing to go
back to using the booleans if there's opposition and if the committer
so wishes. Note that this patch will end up adding 4/5 more booleans
in that case (we add new ones for prepare, commit prepare, abort,
rollback prepare etc).

Well, you can unite mutually exclusive fields into one enum or char with
macros defining possible values. Transaction can't be committed and
aborted at the same time, etc.
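
A minimal sketch of that layout, assuming hypothetical names throughout (TxnOutcomeSketch, TxnStateSketch, the TXN_FLAG_* macros and both helpers are not from the patch): the mutually exclusive states live in one enum, and only the genuinely independent properties stay as bit flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mutually exclusive outcomes: a transaction is in exactly one of these. */
typedef enum TxnOutcomeSketch
{
	TXN_OUTCOME_UNKNOWN = 0,
	TXN_OUTCOME_PREPARED,
	TXN_OUTCOME_COMMITTED,
	TXN_OUTCOME_COMMIT_PREPARED,
	TXN_OUTCOME_ABORTED,
	TXN_OUTCOME_ROLLBACK_PREPARED
} TxnOutcomeSketch;

/* Independent properties that may be combined freely. */
#define TXN_FLAG_HAS_CATALOG_CHANGES	0x01
#define TXN_FLAG_IS_SUBXACT				0x02

typedef struct TxnStateSketch
{
	TxnOutcomeSketch outcome;
	uint8_t		flags;
} TxnStateSketch;

/* Setting a new outcome replaces the old one; the flags are left untouched. */
static void
txn_set_outcome(TxnStateSketch *txn, TxnOutcomeSketch outcome)
{
	assert(txn != NULL);
	txn->outcome = outcome;
}

/* True once the transaction has reached a final state. */
static bool
txn_is_finished(const TxnStateSketch *txn)
{
	assert(txn != NULL);
	switch (txn->outcome)
	{
		case TXN_OUTCOME_COMMITTED:
		case TXN_OUTCOME_COMMIT_PREPARED:
		case TXN_OUTCOME_ABORTED:
		case TXN_OUTCOME_ROLLBACK_PREPARED:
			return true;
		default:
			return false;
	}
}
```

With this shape a "committed and aborted" combination cannot be represented at all, which is the property a set of separate booleans or bits would not give.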

+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but yet
+      uncommitted) transaction or decoding of an uncommitted transaction, this
+      change callback is ensured sane access to catalog tables regardless of
+      simultaneous rollback by another backend of this very same transaction.

I don't think we should explain this, at least in such words. As
mentioned upthread, we should warn about allowed systable_* accesses
instead. Same for message_cb.

Looks like you are looking at an earlier patchset. The latest patchset
has removed the above.

I see, sorry.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#10Ajin Cherian
itsajin@gmail.com
In reply to: Arseny Sher (#9)
1 attachment(s)

Hello,

Trying to revive this patch which attempts to support logical decoding of
two phase transactions. I've rebased and polished Nikhil's patch on the
current HEAD. Some of the logic in the previous patchset has already been
committed as part of large-in-progress transaction commits, like the
handling of concurrent aborts, so all that logic has been left out. I think
some of the earlier comments have already been addressed or are no longer
relevant. Do have a look at the patch and let me know what you think. I will
try and address any pending issues going forward.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

0001-Support-decoding-of-two-phase-transactions-at-PREPAR.patch (application/octet-stream)
From 07dafb931715df67914cd1b2e50e7d71d873fa9b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 7 Sep 2020 00:41:31 -0400
Subject: [PATCH] Support decoding of two-phase transactions at PREPARE

Until now two-phase transactions were decoded at COMMIT, just like
regular transaction. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, that often encode information into
the GID itself.

Includes documentation changes.
---
 contrib/test_decoding/expected/prepared.out     | 185 ++++++++++++++++++---
 contrib/test_decoding/sql/prepared.sql          |  77 ++++++++-
 contrib/test_decoding/test_decoding.c           | 181 +++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml               | 127 ++++++++++++++-
 src/backend/replication/logical/decode.c        | 141 ++++++++++++++--
 src/backend/replication/logical/logical.c       | 203 ++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 188 ++++++++++++++++++++--
 src/include/replication/output_plugin.h         |  46 ++++++
 src/include/replication/reorderbuffer.h         |  78 ++++++++-
 9 files changed, 1159 insertions(+), 67 deletions(-)

diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..94fb0c9 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,50 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
  init
 (1 row)
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (2);
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (4);
 -- test prepared xact containing ddl
 BEGIN;
@@ -26,45 +57,149 @@ INSERT INTO test_prepared1 VALUES (5);
 ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
                                   data                                   
 -------------------------------------------------------------------------
  BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:4
  COMMIT
  BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:5
  table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
  COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
  COMMIT
  BEGIN
  table public.test_prepared2: INSERT: id[integer]:9
  COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+ relation | locktype | mode 
+----------+----------+------
+(0 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                    data                                    
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
 
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..ca801e4 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -2,21 +2,25 @@
 SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 INSERT INTO test_prepared1 VALUES (2);
 
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 INSERT INTO test_prepared1 VALUES (4);
 
@@ -27,24 +31,83 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
 
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
 INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- make sure stuff still works
 INSERT INTO test_prepared1 VALUES (8);
 INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- cleanup
 DROP TABLE test_prepared1;
 DROP TABLE test_prepared2;
 
--- show results
+-- show results. There should be nothing to show
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 3474515..d5438c5 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +54,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
 							bool last_write);
 static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+								ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
 static void pg_decode_change(LogicalDecodingContext *ctx,
 							 ReorderBufferTXN *txn, Relation rel,
 							 ReorderBufferChange *change);
@@ -84,6 +91,19 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
+
 
 void
 _PG_init(void)
@@ -102,6 +122,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pg_decode_change;
 	cb->truncate_cb = pg_decode_truncate;
 	cb->commit_cb = pg_decode_commit_txn;
+	cb->abort_cb = pg_decode_abort_txn;
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
@@ -112,6 +133,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
+
 }
 
 
@@ -132,11 +158,14 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
 	opt->output_type = OUTPUT_PLUGIN_TEXTUAL_OUTPUT;
 	opt->receive_rewrites = false;
+	/* this plugin supports decoding of 2pc */
+	opt->enable_twophase = true;
 
 	foreach(option, ctx->output_plugin_options)
 	{
@@ -223,6 +252,32 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid") == 0)
+		{
+			if (elem->arg)
+			{
+				errno = 0;
+				data->check_xid = (TransactionId)
+					strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno == EINVAL || errno == ERANGE)
+					ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid is not a valid number: \"%s\"",
+								strVal(elem->arg))));
+			}
+			else
+				ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid needs an input value")));
+
+			if (!TransactionIdIsValid(data->check_xid))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("Specify positive value for parameter \"%s\","
+								" you specified \"%s\"",
+								elem->defname, strVal(elem->arg))));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -293,6 +348,116 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+	else
+		appendStringInfoString(ctx->out, "ABORT");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -451,6 +616,22 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* if check_xid is specified */
+	if (TransactionIdIsValid(data->check_xid))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid);
+		while (TransactionIdIsInProgress(data->check_xid))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid) &&
+			   !TransactionIdDidCommit(data->check_xid))
+			elog(LOG, "%u aborted", data->check_xid);
+
+		Assert(TransactionIdDidAbort(data->check_xid));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8d4fdf6..22ee70f 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -386,7 +386,12 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeChangeCB change_cb;
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
+    LogicalDecodeAbortCB abort_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeAbortPreparedCB abort_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
@@ -477,7 +482,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin provides the callbacks needed
+     to decode it. The transaction currently being decoded may be aborted
+     concurrently by a <command>ROLLBACK PREPARED</command> command.
+     In that case, the logical decoding of this transaction is aborted
+     too.
     </para>
 
     <note>
@@ -578,6 +589,71 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The optional <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Commit Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+     <title>Rollback Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>abort_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort">
+     <title>Transaction Abort Callback</title>
+
+     <para>
+      The required <function>abort_cb</function> callback is called whenever
+      a transaction abort has to be initiated. This can happen if we are
+      decoding a transaction that has been prepared for two-phase commit and
+      a concurrent rollback happens while we are decoding it.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +663,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or any other uncommitted transaction, this
+      change callback might also error out due to a concurrent rollback of
+      that same transaction. In that case, the logical decoding of the
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +746,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded at this prepare
+       stage or later, as a regular one-phase transaction, at
+       <command>COMMIT PREPARED</command> time. To signal that
+       decoding at prepare time should be skipped, return <literal>true</literal>;
+       return <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      parameter is passed separately because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +800,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or any other uncommitted transaction, this message
+      callback might also error out due to a concurrent rollback of
+      that same transaction. In that case, the logical decoding of the
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f21f61d..63d5acf 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						xl_xact_parsed_prepare * parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
+				/* check that output plugin is capable of twophase decoding */
+				if (!ctx->options.enable_twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *)XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -647,9 +667,82 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+	 * A regular commit simply triggers a replay of transaction changes from
+	 * the reorder buffer. For COMMIT PREPARED, however, that replay already
+	 * happened at PREPARE time, so we only need to notify the subscriber that
+	 * the GID finally committed.
+	 *
+	 * For output plugins that do not support PREPARE-time decoding of
+	 * two-phase transactions, we never even see the PREPARE and all two-phase
+	 * transactions simply fall through to the second branch.
+	 */
+	if (TransactionIdIsValid(parsed->twophase_xid) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder,
+								   parsed->twophase_xid, parsed->twophase_gid))
+	{
+		Assert(xid == parsed->twophase_xid);
+		/* we are processing COMMIT PREPARED */
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* replay actions of all transaction + subtransactions in order */
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	/*
+	 * Process invalidation messages, even if we're not interested in the
+	 * transaction's contents, since the various caches need to always be
+	 * consistent.
+	 */
+	if (parsed->nmsgs > 0)
+	{
+		if (!ctx->fast_forward)
+			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+										  parsed->nmsgs, parsed->msgs);
+		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 }
 
 /*
@@ -661,6 +754,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 			xl_xact_parsed_abort *parsed, TransactionId xid)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it's ROLLBACK PREPARED then handle it via callbacks.
+	 */
+	if (TransactionIdIsValid(xid) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		parsed->dbId == ctx->slot->data.database &&
+		!FilterByOrigin(ctx, origin_id) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
 
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 0f6af95..02b726c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -58,6 +58,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+		                XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+		                         TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+		                  XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+		                          XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+		                         XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -206,6 +216,11 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->abort = abort_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -684,6 +699,33 @@ startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions *opt, bool i
 	/* do the actual work: call callback */
 	ctx->callbacks.startup_cb(ctx, opt, is_init);
 
+	/*
+	 * If the plugin claims to support two-phase transactions, then
+	 * check that the plugin implements all callbacks necessary to decode
+	 * two-phase transactions - we either have to have all of them or none.
+	 * The filter_prepare callback is optional, but can only be defined when
+	 * two-phase decoding is enabled (i.e. the three other callbacks are
+	 * defined).
+	 */
+	if (opt->enable_twophase)
+	{
+		int twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+			(ctx->callbacks.commit_prepared_cb != NULL) +
+			(ctx->callbacks.abort_prepared_cb != NULL);
+
+		/* Plugins with incorrect number of two-phase callbacks are broken. */
+		if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+			ereport(ERROR,
+					(errmsg("Output plugin registered only %d twophase callbacks. ",
+							twophase_callbacks)));
+	}
+
+	/* filter_prepare is optional, but requires two-phase decoding */
+	if ((ctx->callbacks.filter_prepare_cb != NULL) && (!opt->enable_twophase))
+		ereport(ERROR,
+				(errmsg("Output plugin does not support two-phase decoding, but "
+						"registered filter_prepare callback.")));
+
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
 }
@@ -782,6 +824,122 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -858,6 +1016,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of twophase at PREPARE time is not enabled. In that
+	 * case all twophase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->options.enable_twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1975d62..6d9cfbd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -413,6 +413,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/* free data that's contained */
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
 
 	if (txn->tuplecid_hash != NULL)
 	{
@@ -1987,7 +1992,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2254,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					break;
 			}
 		}
-
 		/*
 		 * There's a speculative insertion remaining, just clean in up, it
 		 * can't have been successful, otherwise we'd gotten a confirmation
@@ -2278,7 +2282,24 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call abort/commit/prepare callback, depending on the transaction
+			 * state.
+			 *
+			 * If the transaction aborted during apply (which currently can happen
+			 * only for prepared transactions), simply call the abort callback.
+			 *
+			 * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_rollback(txn))
+				rb->abort(rb, txn, commit_lsn);
+			else if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2395,23 +2416,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+                            ReorderBuffer *rb, TransactionId xid,
+					        XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					        TimestampTz commit_time,
+					        RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2453,6 +2467,140 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+   ReorderBufferTXN *txn;
+
+   txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+   return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+                   XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                   TimestampTz commit_time,
+                   RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+   ReorderBufferTXN *txn;
+
+   txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+                               false);
+
+   /* unknown transaction, nothing to replay */
+   if (txn == NULL)
+       return;
+
+   ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+                               commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+                    XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                    TimestampTz commit_time,
+                    RepOriginId origin_id, XLogRecPtr origin_lsn,
+                    char *gid)
+{
+   ReorderBufferTXN *txn;
+
+   txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+                               false);
+
+   /* unknown transaction, nothing to replay */
+   if (txn == NULL)
+       return;
+
+   txn->txn_flags |= RBTXN_PREPARE;
+   txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+   strcpy(txn->gid, gid);
+
+   ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+                               commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+                          const char *gid)
+{
+   ReorderBufferTXN *txn;
+
+   txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+                               false);
+
+   /*
+    * Always call the prepare filter. It's the job of the prepare filter to
+    * give us the *same* response for a given xid across multiple calls
+    * (including ones on restart)
+    */
+   return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit)
+{
+   ReorderBufferTXN *txn;
+
+   /*
+    * The transaction may or may not exist (during restarts for example).
+    * Anyways, 2PC transactions do not contain any reorderbuffers. So allow
+    * it to be created below.
+    */
+   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+                               true);
+
+   txn->final_lsn = commit_lsn;
+   txn->end_lsn = end_lsn;
+   txn->commit_time = commit_time;
+   txn->origin_id = origin_id;
+   txn->origin_lsn = origin_lsn;
+   /* this txn is obviously prepared */
+   txn->txn_flags |= RBTXN_PREPARE;
+   txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+   strcpy(txn->gid, gid);
+
+   if (is_commit)
+   {
+       txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+       rb->commit_prepared(rb, txn, commit_lsn);
+   }
+   else
+   {
+       txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+       rb->abort_prepared(rb, txn, commit_lsn);
+   }
+
+   /* cleanup: make sure there's no cache pollution */
+   ReorderBufferExecuteInvalidations(rb, txn);
+   ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2495,7 +2643,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+    /*
+     * remove potential on-disk data, and deallocate.
+     *
+     * We remove it even for prepared transactions (GID is enough to
+     * commit/abort those later).
+     */
+
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..f6ca87f 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -27,6 +27,7 @@ typedef struct OutputPluginOptions
 {
 	OutputPluginOutputType output_type;
 	bool		receive_rewrites;
+	bool		enable_twophase;
 } OutputPluginOptions;
 
 /*
@@ -78,6 +79,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   XLogRecPtr commit_lsn);
 
 /*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
+/*
  * Called for the generic logical decoding messages.
  */
 typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -170,7 +211,12 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeChangeCB change_cb;
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
+	LogicalDecodeAbortCB abort_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeAbortPreparedCB abort_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1ae17d5..820840a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -162,9 +163,14 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
-#define RBTXN_IS_STREAMED         0x0008
-#define RBTXN_HAS_TOAST_INSERT    0x0010
-#define RBTXN_HAS_SPEC_INSERT     0x0020
+#define RBTXN_PREPARE             0x0008
+#define RBTXN_COMMIT_PREPARED     0x0010
+#define RBTXN_ROLLBACK_PREPARED   0x0020
+#define RBTXN_COMMIT              0x0040
+#define RBTXN_ROLLBACK            0x0080
+#define RBTXN_IS_STREAMED         0x0100
+#define RBTXN_HAS_TOAST_INSERT    0x0200
+#define RBTXN_HAS_SPEC_INSERT     0x0400
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -218,6 +224,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* is this txn prepared? */
+#define rbtxn_prepared(txn)            ((txn)->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn)     ((txn)->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn)   ((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn)              ((txn)->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn)            ((txn)->txn_flags & RBTXN_ROLLBACK)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -229,6 +246,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of 2PC we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -390,6 +410,39 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+                                     ReorderBuffer *rb,
+                                     ReorderBufferTXN *txn,
+                                     XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             TransactionId xid,
+                                             const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+                                       ReorderBuffer *rb,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+                                              ReorderBuffer *rb,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr abort_lsn);
+
+
+
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -482,6 +535,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferAbortPreparedCB abort_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -548,6 +606,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -571,6 +634,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#10)

On Mon, Sep 7, 2020 at 10:54 AM Ajin Cherian <itsajin@gmail.com> wrote:

Hello,

I'm trying to revive this patch, which adds support for logical decoding of two-phase transactions. I've rebased and polished Nikhil's patch on the current HEAD. Some of the logic in the previous patchset, such as the handling of concurrent aborts, has already been committed as part of the work on large in-progress transactions, so I have left that logic out.

I am not sure about your point regarding concurrent aborts. I think
this patch still needs some changes in that area. Have you tested
this behavior? The code below in ReorderBufferProcessTXN() is reached
on a concurrent abort, and currently the Asserts shown below will
fail.

    if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
    {
        /*
         * This error can only occur when we are sending the data in
         * streaming mode and the streaming is not finished yet.
         */
        Assert(streaming);
        Assert(stream_started);

Nikhil has a test for the same
(0004-Teach-test_decoding-plugin-to-work-with-2PC.Jan4) in his last
email [1]. You might want to use it to test this behavior. I think you
can also keep the tests as a separate patch as Nikhil had.
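
To illustrate the condition under discussion, here is a minimal,
self-contained sketch. It does not use the real PostgreSQL types:
DemoTxn, DEMO_RBTXN_PREPARE and both helper functions are hypothetical
stand-ins, and the relaxed check is only one possible way to adjust the
Asserts, not necessarily what the final patch does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the RBTXN_PREPARE bit in txn_flags. */
#define DEMO_RBTXN_PREPARE 0x0008

/* Hypothetical, stripped-down stand-in for ReorderBufferTXN. */
typedef struct DemoTxn
{
	unsigned int txn_flags;
} DemoTxn;

/* What the existing Asserts require: the abort was hit while streaming. */
static bool
abort_expected_streaming_only(bool streaming, bool stream_started)
{
	return streaming && stream_started;
}

/*
 * One possible relaxation (an assumption, not committed code): a concurrent
 * abort is also legitimate while decoding a prepared transaction.
 */
static bool
abort_expected_with_twophase(const DemoTxn *txn, bool streaming,
							 bool stream_started)
{
	bool		prepared = (txn->txn_flags & DEMO_RBTXN_PREPARE) != 0;

	return prepared || (streaming && stream_started);
}
```

With streaming off, the first helper returns false for every
transaction, which corresponds to the failing Asserts; the second
accepts a prepared transaction in the same situation.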

One other comment:
===================
@@ -27,6 +27,7 @@ typedef struct OutputPluginOptions
{
OutputPluginOutputType output_type;
bool receive_rewrites;
+ bool enable_twophase;
} OutputPluginOptions;
..
..
@@ -684,6 +699,33 @@ startup_cb_wrapper(LogicalDecodingContext *ctx,
OutputPluginOptions *opt, bool i
/* do the actual work: call callback */
ctx->callbacks.startup_cb(ctx, opt, is_init);

+ /*
+ * If the plugin claims to support two-phase transactions, then
+ * check that the plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ if (opt->enable_twophase)
+ {
+ int twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("Output plugin registered only %d twophase callbacks. ",
+ twophase_callbacks)));
+ }

I don't know why the patch implements the option to enable two-phase
decoding this way. Can't we do it the way the 'stream-changes' option
was implemented in commit 7259736a6e? See how ctx->streaming is set;
this parameter can be set in a similar way.
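
A minimal sketch of that idea, assuming the two-phase setting is kept
as a flag on the decoding context and derived from which callbacks the
plugin registered, the way ctx->streaming appears to be derived from
the stream callbacks. DemoCallbacks, demo_noop and
demo_twophase_enabled are hypothetical names, not PostgreSQL API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type; the real callbacks take more arguments. */
typedef void (*DemoCB) (void);

/* Hypothetical stand-in for the two-phase part of OutputPluginCallbacks. */
typedef struct DemoCallbacks
{
	DemoCB		prepare_cb;
	DemoCB		commit_prepared_cb;
	DemoCB		abort_prepared_cb;
} DemoCallbacks;

/* Placeholder callback used only to populate the struct. */
static void
demo_noop(void)
{
}

/*
 * Enable two-phase decoding as soon as any two-phase callback is present.
 * A plugin that registered only some of them is then caught when the
 * missing callback is first needed, rather than by a count at startup.
 */
static bool
demo_twophase_enabled(const DemoCallbacks *cb)
{
	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->abort_prepared_cb != NULL;
}
```

The result would be stored once when the context is set up, so the
plugin's startup callback would not need a separate enable_twophase
field in OutputPluginOptions.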

--
With Regards,
Amit Kapila.

#12Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#11)
2 attachment(s)

On Mon, Sep 7, 2020 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Nikhil has a test for the same
(0004-Teach-test_decoding-plugin-to-work-with-2PC.Jan4) in his last
email [1]. You might want to use it to test this behavior. I think you
can also keep the tests as a separate patch as Nikhil had.

Done. I've added the tests and also tweaked the code to make sure that
aborts during two-phase commits are also handled.

I don't know why the patch implements the option to enable two-phase
decoding this way. Can't we do it the way the 'stream-changes' option
was implemented in commit 7259736a6e? See how ctx->streaming is set;
this parameter can be set in a similar way.

Done. I've moved the callback checks into the corresponding
wrappers.
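
As a rough illustration of a check that lives inside a wrapper, here is
a self-contained sketch. DemoContext, demo_prepare and
demo_prepare_cb_wrapper are hypothetical stand-ins, and the wrapper
returns false where real code would presumably raise an error with
ereport(ERROR, ...).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical prepare callback type; the real one takes ctx, txn and LSN. */
typedef void (*DemoPrepareCB) (const char *gid);

/* Hypothetical, stripped-down stand-in for LogicalDecodingContext. */
typedef struct DemoContext
{
	bool		twophase;		/* two-phase decoding enabled? */
	DemoPrepareCB prepare_cb;	/* may be NULL if the plugin omitted it */
} DemoContext;

/* Counts invocations so the effect of the wrapper can be observed. */
static int	demo_prepare_calls = 0;

static void
demo_prepare(const char *gid)
{
	(void) gid;
	demo_prepare_calls++;
}

/*
 * Validate at call time instead of at startup: refuse to proceed if
 * two-phase decoding is off or the plugin did not supply the callback.
 */
static bool
demo_prepare_cb_wrapper(DemoContext *ctx, const char *gid)
{
	if (!ctx->twophase)
		return false;
	if (ctx->prepare_cb == NULL)
		return false;			/* real code would report an error here */

	ctx->prepare_cb(gid);
	return true;
}
```

The trade-off is that a plugin with a missing callback is only detected
when a matching record is decoded, not when the slot is started.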

Regards,
Ajin Cherian
Fujitsu Australia

Attachments:

0001-Support-decoding-of-two-phase-transactions-at-PREPAR.patch (application/octet-stream)
From 3716f5eee7130f9d40b42a59440fefe13849bb8a Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 9 Sep 2020 05:33:35 -0400
Subject: [PATCH 1/2] Support decoding of two-phase transactions at PREPARE

Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

On the subscriber, the transactions will be executed as two-phase
transactions, with the same GID. This is important for various
external transaction managers, which often encode information into
the GID itself.

Includes documentation changes.
---
 contrib/test_decoding/expected/prepared.out     | 185 ++++++++++++++++++---
 contrib/test_decoding/sql/prepared.sql          |  77 ++++++++-
 contrib/test_decoding/test_decoding.c           | 181 ++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml               | 127 +++++++++++++-
 src/backend/replication/logical/decode.c        | 141 ++++++++++++++--
 src/backend/replication/logical/logical.c       | 194 ++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 209 +++++++++++++++++++++---
 src/include/replication/output_plugin.h         |  46 ++++++
 src/include/replication/reorderbuffer.h         |  78 ++++++++-
 9 files changed, 1165 insertions(+), 73 deletions(-)

diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..94fb0c9 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,50 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
  init
 (1 row)
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (2);
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (4);
 -- test prepared xact containing ddl
 BEGIN;
@@ -26,45 +57,149 @@ INSERT INTO test_prepared1 VALUES (5);
 ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
                                   data                                   
 -------------------------------------------------------------------------
  BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:4
  COMMIT
  BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:5
  table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
  COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
  COMMIT
  BEGIN
  table public.test_prepared2: INSERT: id[integer]:9
  COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+ relation | locktype | mode 
+----------+----------+------
+(0 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                    data                                    
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
 
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..ca801e4 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -2,21 +2,25 @@
 SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 INSERT INTO test_prepared1 VALUES (2);
 
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 INSERT INTO test_prepared1 VALUES (4);
 
@@ -27,24 +31,83 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
 
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
 INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- make sure stuff still works
 INSERT INTO test_prepared1 VALUES (8);
 INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- cleanup
 DROP TABLE test_prepared1;
 DROP TABLE test_prepared2;
 
--- show results
+-- show results. There should be nothing to show
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 3474515..d5438c5 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +54,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
 							bool last_write);
 static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+								ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
 static void pg_decode_change(LogicalDecodingContext *ctx,
 							 ReorderBufferTXN *txn, Relation rel,
 							 ReorderBufferChange *change);
@@ -84,6 +91,19 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
+
 
 void
 _PG_init(void)
@@ -102,6 +122,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pg_decode_change;
 	cb->truncate_cb = pg_decode_truncate;
 	cb->commit_cb = pg_decode_commit_txn;
+	cb->abort_cb = pg_decode_abort_txn;
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
@@ -112,6 +133,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
+
 }
 
 
@@ -132,11 +158,14 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
 	opt->output_type = OUTPUT_PLUGIN_TEXTUAL_OUTPUT;
 	opt->receive_rewrites = false;
+	/* this plugin supports decoding of 2pc */
+	opt->enable_twophase = true;
 
 	foreach(option, ctx->output_plugin_options)
 	{
@@ -223,6 +252,32 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid") == 0)
+		{
+			if (elem->arg)
+			{
+				errno = 0;
+				data->check_xid = (TransactionId)
+					strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno == EINVAL || errno == ERANGE)
+					ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid is not a valid number: \"%s\"",
+								strVal(elem->arg))));
+			}
+			else
+				ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid needs an input value")));
+
+			if (!TransactionIdIsValid(data->check_xid))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("Specify positive value for parameter \"%s\","
+								" you specified \"%s\"",
+								elem->defname, strVal(elem->arg))));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -293,6 +348,116 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+	else
+		appendStringInfoString(ctx->out, "ABORT");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -451,6 +616,22 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* if check_xid is specified */
+	if (TransactionIdIsValid(data->check_xid))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid);
+		while (TransactionIdIsInProgress(data->check_xid))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid) &&
+			   !TransactionIdDidCommit(data->check_xid))
+			elog(LOG, "%u aborted", data->check_xid);
+
+		Assert(TransactionIdDidAbort(data->check_xid));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 8d4fdf6..22ee70f 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -386,7 +386,12 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeChangeCB change_cb;
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
+    LogicalDecodeAbortCB abort_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeAbortPreparedCB abort_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
@@ -477,7 +482,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin provides the callbacks needed
+     for decoding it. The transaction that is currently being decoded may
+     be aborted concurrently by a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction is
+     aborted as well.
     </para>
 
     <note>
@@ -578,6 +589,71 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The optional <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Commit Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+     <title>Rollback Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>abort_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort">
+     <title>Transaction Abort Callback</title>
+
+     <para>
+      The required <function>abort_cb</function> callback is called whenever
+      a transaction abort has to be initiated. This can happen if we are
+      decoding a transaction that has been prepared for two-phase commit and
+      a concurrent rollback happens while we are decoding it.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +663,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or any other uncommitted transaction, this
+      change callback might also error out due to a concurrent rollback of
+      that same transaction. In that case, the logical decoding of the
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +746,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded at this prepare
+       stage, or later as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time. To signal that
+       decoding at prepare time should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +800,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or any other uncommitted transaction, this message
+      callback might also error out due to a concurrent rollback of
+      that same transaction. In that case, the logical decoding of the
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f21f61d..63d5acf 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						xl_xact_parsed_prepare * parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
+				/* check that output plugin is capable of twophase decoding */
+				if (!ctx->options.enable_twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *)XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -647,9 +667,82 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+	 * A regular commit simply triggers a replay of transaction changes from
+	 * the reorder buffer. For COMMIT PREPARED, however, that replay already
+	 * happened at PREPARE time, so we only need to notify the subscriber that
+	 * the GID finally committed.
+	 *
+	 * For output plugins that do not support PREPARE-time decoding of
+	 * two-phase transactions, we never even see the PREPARE and all two-phase
+	 * transactions simply fall through to the second branch.
+	 */
+	if (TransactionIdIsValid(parsed->twophase_xid) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder,
+								   parsed->twophase_xid, parsed->twophase_gid))
+	{
+		Assert(xid == parsed->twophase_xid);
+		/* we are processing COMMIT PREPARED */
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* replay actions of all transaction + subtransactions in order */
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to DecodeCommit().
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	/*
+	 * Process invalidation messages, even if we're not interested in the
+	 * transaction's contents, since the various caches need to always be
+	 * consistent.
+	 */
+	if (parsed->nmsgs > 0)
+	{
+		if (!ctx->fast_forward)
+			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+										  parsed->nmsgs, parsed->msgs);
+		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 }
 
 /*
@@ -661,6 +754,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 			xl_xact_parsed_abort *parsed, TransactionId xid)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it's ROLLBACK PREPARED then handle it via callbacks.
+	 */
+	if (TransactionIdIsValid(xid) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		parsed->dbId == ctx->slot->data.database &&
+		!FilterByOrigin(ctx, origin_id) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
 
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 0f6af95..a569ab8 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -58,6 +58,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -206,6 +216,11 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->abort = abort_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -782,6 +797,140 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits, the prepare callback is mandatory */
+	if (ctx->options.enable_twophase && ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits, the commit prepared callback is mandatory */
+	if (ctx->options.enable_twophase && ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits, the abort prepared callback is mandatory */
+	if (ctx->options.enable_twophase && ctx->callbacks.abort_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register abort_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -858,6 +1007,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of twophase at PREPARE time is not enabled. In that
+	 * case all twophase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->options.enable_twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1975d62..d6556be 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -413,6 +413,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/* free data that's contained */
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
 
 	if (txn->tuplecid_hash != NULL)
 	{
@@ -1987,7 +1992,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2254,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					break;
 			}
 		}
-
 		/*
 		 * There's a speculative insertion remaining, just clean in up, it
 		 * can't have been successful, otherwise we'd gotten a confirmation
@@ -2278,7 +2282,24 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call abort/commit/prepare callback, depending on the transaction
+			 * state.
+			 *
+			 * If the transaction aborted during apply (which currently can happen
+			 * only for prepared transactions), simply call the abort callback.
+			 *
+			 * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_rollback(txn))
+				rb->abort(rb, txn, commit_lsn);
+			else if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2361,8 +2382,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * This error can only occur when we are sending the data in
 			 * streaming mode and the streaming is not finished yet.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2370,10 +2391,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+			if (streaming && stream_started)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+				rb->abort(rb, txn, commit_lsn);
+			}
 		}
 		else
 		{
@@ -2395,23 +2425,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2453,6 +2476,140 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/*
+	 * Always call the prepare filter. It's the job of the prepare filter to
+	 * give us the *same* response for a given xid across multiple calls
+	 * (including ones on restart)
+	 */
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, a 2PC transaction holds no buffered changes at this point, so
+	 * allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+		rb->commit_prepared(rb, txn, commit_lsn);
+	}
+	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+		rb->abort_prepared(rb, txn, commit_lsn);
+	}
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(rb, txn);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2495,7 +2652,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * Remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
+
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..f6ca87f 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -27,6 +27,7 @@ typedef struct OutputPluginOptions
 {
 	OutputPluginOutputType output_type;
 	bool		receive_rewrites;
+	bool		enable_twophase;
 } OutputPluginOptions;
 
 /*
@@ -78,6 +79,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   XLogRecPtr commit_lsn);
 
 /*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
+/*
  * Called for the generic logical decoding messages.
  */
 typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -170,7 +211,12 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeChangeCB change_cb;
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
+	LogicalDecodeAbortCB abort_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeAbortPreparedCB abort_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1ae17d5..820840a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -162,9 +163,14 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
-#define RBTXN_IS_STREAMED         0x0008
-#define RBTXN_HAS_TOAST_INSERT    0x0010
-#define RBTXN_HAS_SPEC_INSERT     0x0020
+#define RBTXN_PREPARE             0x0008
+#define RBTXN_COMMIT_PREPARED     0x0010
+#define RBTXN_ROLLBACK_PREPARED   0x0020
+#define RBTXN_COMMIT              0x0040
+#define RBTXN_ROLLBACK            0x0080
+#define RBTXN_IS_STREAMED         0x0100
+#define RBTXN_HAS_TOAST_INSERT    0x0200
+#define RBTXN_HAS_SPEC_INSERT     0x0400
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -218,6 +224,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* is this txn prepared? */
+#define rbtxn_prepared(txn)            (((txn)->txn_flags & RBTXN_PREPARE) != 0)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn)     (((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn)   (((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn)              (((txn)->txn_flags & RBTXN_COMMIT) != 0)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn)            (((txn)->txn_flags & RBTXN_ROLLBACK) != 0)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -229,6 +246,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of 2PC we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -390,6 +410,39 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+                                     ReorderBuffer *rb,
+                                     ReorderBufferTXN *txn,
+                                     XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             TransactionId xid,
+                                             const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+                                       ReorderBuffer *rb,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+                                              ReorderBuffer *rb,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr abort_lsn);
+
+
+
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -482,6 +535,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferAbortPreparedCB abort_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -548,6 +606,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -571,6 +634,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
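
For readers following along, here is a minimal standalone sketch of how the end of ReorderBufferProcessTXN() in the patch above picks between the abort, prepare and commit callbacks based on txn_flags. The two flag values are copied from the reorderbuffer.h hunk; DemoTxn, DemoCallback and demo_final_callback() are hypothetical names used only for illustration and are not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as defined in the patch's reorderbuffer.h hunk. */
#define RBTXN_PREPARE	0x0008
#define RBTXN_ROLLBACK	0x0080

/* Hypothetical stand-in for ReorderBufferTXN; only the flags matter here. */
typedef struct DemoTxn
{
	uint32_t	txn_flags;
} DemoTxn;

/* Hypothetical labels for the three reorder buffer callbacks. */
typedef enum DemoCallback
{
	DEMO_CB_ABORT,
	DEMO_CB_PREPARE,
	DEMO_CB_COMMIT
} DemoCallback;

/* Fully parenthesized flag tests, so any pointer expression is safe. */
#define demo_prepared(txn)	(((txn)->txn_flags & RBTXN_PREPARE) != 0)
#define demo_rollback(txn)	(((txn)->txn_flags & RBTXN_ROLLBACK) != 0)

/*
 * Mirrors the if/else chain in the patch: an aborted transaction wins over
 * a prepared one, and anything else is treated as a regular commit.
 */
static DemoCallback
demo_final_callback(const DemoTxn *txn)
{
	assert(txn != NULL);

	if (demo_rollback(txn))
		return DEMO_CB_ABORT;
	if (demo_prepared(txn))
		return DEMO_CB_PREPARE;
	return DEMO_CB_COMMIT;
}
```

The order of the tests matters in this sketch: a transaction carrying both RBTXN_PREPARE and RBTXN_ROLLBACK is reported through the abort path, which matches the comment in the patch about transactions that abort during apply.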

0002-Tap-test-to-test-concurrent-aborts-during-2-phase-co.patch (application/octet-stream)
From 5ded273d7e9005f102474bc773b249dfb544a75b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 9 Sep 2020 05:50:11 -0400
Subject: [PATCH 2/2] Tap test to test concurrent aborts during 2 phase commits

This test specifically exercises a concurrent abort while logical decoding
is ongoing. The xid of the 2PC is passed to the plugin as an option.
On receipt of a valid "check-xid", the change API in the test_decoding
plugin waits for that transaction to be aborted.
---
 contrib/test_decoding/Makefile          |   1 +
 contrib/test_decoding/t/001_twophase.pl | 119 ++++++++++++++++++++++++++++++++
 2 files changed, 120 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index ed9a3d6..175b971 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,6 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top
+TAP_TESTS = 1
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..99a9249
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test specifically exercises a concurrent abort while logical decoding
+# is ongoing. We pass the xid of the 2PC to the plugin as an option.
+# On receipt of a valid "check-xid", the change API in the test_decoding
+# plugin will wait for that transaction to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
-- 
1.8.3.1

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#12)

On Wed, Sep 9, 2020 at 3:33 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Sep 7, 2020 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Nikhil has a test for the same
(0004-Teach-test_decoding-plugin-to-work-with-2PC.Jan4) in his last
email [1]. You might want to use it to test this behavior. I think you
can also keep the tests as a separate patch as Nikhil had.

Done. I've added the tests and also tweaked code to make sure that the aborts during 2 phase commits are also handled.

Okay, I'll look into your changes. Before that, I went through this
entire thread today to check for design problems and found two major
issues with the original proposal: (a) handling concurrent aborts,
which I think we should be able to deal with in a way similar to what
we have done for decoding of in-progress transactions, and (b) what
happens if someone explicitly locks pg_class or pg_attribute in
exclusive mode (say, by LOCK pg_attribute ...); it seems a deadlock
can happen in that case [0]. AFAIU, people seem to think that if there
is no realistic scenario where a deadlock can happen apart from the
user explicitly locking a system catalog, then we might be able to get
away with just not decoding such xacts at prepare time, or blocking
them in some other way, since such a lock will block the entire system
anyway. I am not sure what the right thing is, but something has to be
done to avoid any sort of deadlock here.
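
As an illustration of what "not decoding such xacts at prepare time" could look like from the plugin side, here is a minimal standalone sketch modeled on the GID-based filter that the test_decoding patch in this thread uses. It is plain C with no PostgreSQL headers; the function name skip_decode_at_prepare is hypothetical, and the "_nodecode" marker is only the convention used by the test plugin, not a proposal for detecting catalog locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for a filter_prepare-style callback.
 *
 * Returns true when the prepared transaction should NOT be decoded at
 * PREPARE time; in the patch's tests such a transaction is decoded at
 * COMMIT PREPARED instead. The "_nodecode" substring check mirrors the
 * test_decoding plugin; a real plugin would choose its own criterion.
 */
static bool
skip_decode_at_prepare(const char *gid)
{
    assert(gid != NULL);    /* the caller must supply a GID */

    return strstr(gid, "_nodecode") != NULL;
}
```

With this rule, a GID such as 'test_prepared_nodecode' is skipped at prepare time, while 'test_prepared#1' is decoded at PREPARE.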

Another thing, I noticed is that originally we have subscriber-side
support as well, see [1] (see *pgoutput* patch) but later dropped it
due to some reasons [2]. I think we should have pgoutput support as
well, so see what is required to get that incorporated.

I would also like to summarize my thinking on the usefulness of this
feature. One of the authors of this patch, Stas, wants this for
conflict-free logical replication; see [3] for more details. Craig
seems to suggest [3] that this will allow us to avoid conflicting
schema changes at different nodes, though it is not clear to me if
that is possible without some external code support, because we don't
send schema changes in logical replication; maybe Craig can shed some
light on this. Another use case I am thinking of is whether this can
be used for scaling out reads as well. Because of 2PC, we can ensure
that subscribers have all the data committed on the master. Now, we
can design a system where different nodes are owners of some set of
tables and we can always get the data of those tables reliably from
those nodes, and then one can have some external process that will
route the reads accordingly. I know that the last idea is a bit
hand-wavy, but it seems to be possible after this feature.
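
To make the read-routing idea a little more concrete, here is a minimal standalone sketch of the "external process" lookup: a static map from table name to the node that owns it. Everything in it is an assumption for illustration (the table names, the node names, and the route_read function); nothing like this exists in the patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ownership map: which node is authoritative for a table. */
struct table_owner
{
    const char *table;
    const char *node;
};

static const struct table_owner owners[] = {
    {"orders", "node_a"},
    {"customers", "node_b"},
};

/*
 * Return the name of the node that owns the given table, or NULL if the
 * table is not in the map (the caller then has to pick a fallback).
 */
static const char *
route_read(const char *table)
{
    size_t      i;

    assert(table != NULL);

    for (i = 0; i < sizeof(owners) / sizeof(owners[0]); i++)
    {
        if (strcmp(owners[i].table, table) == 0)
            return owners[i].node;
    }
    return NULL;
}
```

A router built this way would send reads of "orders" to node_a and reads of "customers" to node_b. How the map is kept in sync with actual ownership is left open here, as it is in the idea above.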

[0]: /messages/by-id/20170328012546.473psm6546bgsi2c@alap3.anarazel.de
[1]: /messages/by-id/CAMGcDxchx=0PeQBVLzrgYG2AQ49QSRxHj5DCp7yy0QrJR0S0nA@mail.gmail.com
[2]: /messages/by-id/CAMGcDxc-kuO9uq0zRCRwbHWBj_rePY9=raR7M9pZGWoj9EOGdg@mail.gmail.com
[3]: /messages/by-id/CAMsr+YHQzGxnR-peT4SbX2-xiG2uApJMTgZ4a3TiRBM6COyfqg@mail.gmail.com

--
With Regards,
Amit Kapila.

#14Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#13)
3 attachment(s)

On Sat, Sep 12, 2020 at 9:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Another thing, I noticed is that originally we have subscriber-side
support as well, see [1] (see *pgoutput* patch) but later dropped it
due to some reasons [2]. I think we should have pgoutput support as
well, so see what is required to get that incorporated.

I have added the rebased patch-set for pgoutput and subscriber side
changes as well. This also includes a test case in subscriber.

regards,
Ajin Cherian

Attachments:

0001-Support-decoding-of-two-phase-transactions.patch (application/octet-stream)
From 68d5222520c245427d6e3a9ecc95657a36212f09 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 15 Sep 2020 07:17:26 -0400
Subject: [PATCH 1/3] Support decoding of two-phase transactions

Until now, two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes documentation changes.
---
 contrib/test_decoding/expected/prepared.out     | 185 ++++++++++++++++++---
 contrib/test_decoding/sql/prepared.sql          |  77 ++++++++-
 contrib/test_decoding/test_decoding.c           | 181 ++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml               | 127 +++++++++++++-
 src/backend/replication/logical/decode.c        | 141 ++++++++++++++--
 src/backend/replication/logical/logical.c       | 194 ++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 209 +++++++++++++++++++++---
 src/include/replication/output_plugin.h         |  46 ++++++
 src/include/replication/reorderbuffer.h         |  78 ++++++++-
 9 files changed, 1165 insertions(+), 73 deletions(-)

diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..94fb0c9 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,50 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
  init
 (1 row)
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (2);
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (4);
 -- test prepared xact containing ddl
 BEGIN;
@@ -26,45 +57,149 @@ INSERT INTO test_prepared1 VALUES (5);
 ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
                                   data                                   
 -------------------------------------------------------------------------
  BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:4
  COMMIT
  BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:5
  table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
  COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
  COMMIT
  BEGIN
  table public.test_prepared2: INSERT: id[integer]:9
  COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+ relation | locktype | mode 
+----------+----------+------
+(0 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                    data                                    
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
 
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..ca801e4 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -2,21 +2,25 @@
 SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 INSERT INTO test_prepared1 VALUES (2);
 
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 INSERT INTO test_prepared1 VALUES (4);
 
@@ -27,24 +31,83 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
 
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
 INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- make sure stuff still works
 INSERT INTO test_prepared1 VALUES (8);
 INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- cleanup
 DROP TABLE test_prepared1;
 DROP TABLE test_prepared2;
 
--- show results
+-- show results. There should be nothing to show
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e60ab34..149e7f6 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -49,6 +54,8 @@ static void pg_output_begin(LogicalDecodingContext *ctx,
 							bool last_write);
 static void pg_decode_commit_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pg_decode_abort_txn(LogicalDecodingContext *ctx,
+								ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
 static void pg_decode_change(LogicalDecodingContext *ctx,
 							 ReorderBufferTXN *txn, Relation rel,
 							 ReorderBufferChange *change);
@@ -88,6 +95,19 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
+
 
 void
 _PG_init(void)
@@ -106,6 +126,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pg_decode_change;
 	cb->truncate_cb = pg_decode_truncate;
 	cb->commit_cb = pg_decode_commit_txn;
+	cb->abort_cb = pg_decode_abort_txn;
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
@@ -116,6 +137,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
+
 }
 
 
@@ -136,11 +162,14 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
 	opt->output_type = OUTPUT_PLUGIN_TEXTUAL_OUTPUT;
 	opt->receive_rewrites = false;
+	/* this plugin supports decoding of 2pc */
+	opt->enable_twophase = true;
 
 	foreach(option, ctx->output_plugin_options)
 	{
@@ -227,6 +256,32 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid") == 0)
+		{
+			if (elem->arg)
+			{
+				errno = 0;
+				data->check_xid = (TransactionId)
+					strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno == EINVAL || errno == ERANGE)
+					ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid is not a valid number: \"%s\"",
+								strVal(elem->arg))));
+			}
+			else
+				ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid needs an input value")));
+
+			if (data->check_xid <= 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("Specify positive value for parameter \"%s\","
+								" you specified \"%s\"",
+								elem->defname, strVal(elem->arg))));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -297,6 +352,116 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* ABORT callback */
+static void
+pg_decode_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "ABORT %u", txn->xid);
+	else
+		appendStringInfoString(ctx->out, "ABORT");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +620,22 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* if check_xid is specified */
+	if (TransactionIdIsValid(data->check_xid))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid);
+		while (TransactionIdIsInProgress(data->check_xid))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid) &&
+			   !TransactionIdDidCommit(data->check_xid))
+			elog(LOG, "%u aborted", data->check_xid);
+
+		Assert(TransactionIdDidAbort(data->check_xid));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..1cddfeb 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -386,7 +386,12 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeChangeCB change_cb;
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
+    LogicalDecodeAbortCB abort_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeAbortPreparedCB abort_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
@@ -477,7 +482,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. It is possible that the transaction currently
+     being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +589,71 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The optional <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Commit Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+     <title>Rollback Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>abort_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort">
+     <title>Transaction Abort Callback</title>
+
+     <para>
+      The required <function>abort_cb</function> callback is called whenever
+      a transaction abort has to be initiated. This can happen if we are
+      decoding a transaction that has been prepared for two-phase commit and
+      a concurrent rollback happens while we are decoding it.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +663,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or any other uncommitted transaction, this
+      change callback might also error out due to a concurrent rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +746,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded at this prepare
+       stage or later, as a regular one-phase transaction, at
+       <command>COMMIT PREPARED</command> time. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +800,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or any other uncommitted transaction, this message
+      callback might also error out due to a concurrent rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
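
The documented requirement that filter_prepare_cb returns the same answer
for a given xid/gid pair on every call is easiest to meet by deciding from
the GID string alone. A minimal sketch, assuming a hypothetical convention
in which GIDs starting with "nodecode-" are left for COMMIT PREPARED; the
function name and the prefix are illustrative and not part of the patch,
and the real callback also receives ctx and txn:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

/* Hypothetical prefix; not something the patch defines. */
#define SKIP_PREFIX "nodecode-"

/*
 * Return true to skip decoding at PREPARE time (the transaction is then
 * decoded as a regular one at COMMIT PREPARED), false to decode it now.
 * The answer depends only on gid, so it is stable across calls and restarts.
 */
bool
example_filter_prepare(TransactionId xid, const char *gid)
{
	(void) xid;				/* unused: the decision is keyed on gid alone */

	assert(gid != NULL);	/* a prepared transaction always has a GID */
	return strncmp(gid, SKIP_PREFIX, strlen(SKIP_PREFIX)) == 0;
}
```

A filter that consulted mutable state (a counter, the wall clock) could
answer differently at PREPARE and at COMMIT PREPARED, which the decoding
code above is not prepared to handle.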
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f21f61d..63d5acf 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
+				/* check that output plugin is capable of twophase decoding */
+				if (!ctx->options.enable_twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -647,9 +667,82 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+	 * Regular commit simply triggers a replay of transaction changes from the
+	 * reorder buffer. For COMMIT PREPARED, however, that replay already
+	 * happened at PREPARE time, so we only need to notify the subscriber that
+	 * the GID finally committed.
+	 *
+	 * For output plugins that do not support PREPARE-time decoding of
+	 * two-phase transactions, we never even see the PREPARE and all two-phase
+	 * transactions simply fall through to the second branch.
+	 */
+	if (TransactionIdIsValid(parsed->twophase_xid) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder,
+								   parsed->twophase_xid, parsed->twophase_gid))
+	{
+		Assert(xid == parsed->twophase_xid);
+		/* we are processing COMMIT PREPARED */
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* replay actions of all transaction + subtransactions in order */
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to DecodeCommit().
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	/*
+	 * Process invalidation messages, even if we're not interested in the
+	 * transaction's contents, since the various caches need to always be
+	 * consistent.
+	 */
+	if (parsed->nmsgs > 0)
+	{
+		if (!ctx->fast_forward)
+			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+										  parsed->nmsgs, parsed->msgs);
+		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 }
 
 /*
@@ -661,6 +754,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 			xl_xact_parsed_abort *parsed, TransactionId xid)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it's ROLLBACK PREPARED then handle it via callbacks.
+	 */
+	if (TransactionIdIsValid(xid) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		parsed->dbId == ctx->slot->data.database &&
+		!FilterByOrigin(ctx, origin_id) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
 
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
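
The routing that the DecodeCommit() hunk above adds can be summarised as a
small truth table. The sketch below is a model only, not code from the
patch: the enum and function names are made up for illustration, and the
two booleans stand for TransactionIdIsValid(parsed->twophase_xid) and the
result of ReorderBufferTxnIsPrepared().

```c
#include <stdbool.h>

typedef enum
{
	ROUTE_REPLAY_CHANGES,	/* ReorderBufferCommit(): replay buffered changes */
	ROUTE_FINISH_PREPARED	/* ReorderBufferFinishPrepared(): notify only */
} CommitRoute;

/*
 * Model of the branch in DecodeCommit().
 *
 * has_twophase_xid: the commit record carries a valid two-phase XID.
 * sent_at_prepare:  the prepare filter did not ask to skip the transaction,
 *                   so its changes were already replayed at PREPARE time.
 */
CommitRoute
route_commit_record(bool has_twophase_xid, bool sent_at_prepare)
{
	if (has_twophase_xid && sent_at_prepare)
		return ROUTE_FINISH_PREPARED;
	return ROUTE_REPLAY_CHANGES;
}
```

Only the combination "two-phase record and already sent at PREPARE" takes
the notify-only path; every other case replays the changes as before.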
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 0f6af95..0ae5da7 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -58,6 +58,16 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static void abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -206,6 +216,11 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->abort = abort_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -782,6 +797,140 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then the prepare callback is mandatory */
+	if (ctx->options.enable_twophase && ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then the commit prepared callback is mandatory */
+	if (ctx->options.enable_twophase && ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then the abort prepared callback is mandatory */
+	if (ctx->options.enable_twophase && ctx->callbacks.abort_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("output plugin did not register abort_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -858,6 +1007,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of twophase at PREPARE time is not enabled. In that
+	 * case all twophase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->options.enable_twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
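
The defaults in filter_prepare_cb_wrapper() are easy to get backwards, so
here is a reduced model of its three cases. This is a sketch under
simplifying assumptions: the callback type takes only the GID, and all
names are illustrative rather than taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool (*FilterPrepareFn) (const char *gid);	/* simplified signature */

/*
 * Model of filter_prepare_cb_wrapper(): returns true when the transaction
 * must NOT be decoded at PREPARE time.
 */
bool
filter_prepare_model(bool enable_twophase, FilterPrepareFn cb, const char *gid)
{
	if (!enable_twophase)
		return true;		/* everything waits for COMMIT PREPARED */
	if (cb == NULL)
		return false;		/* no filter registered: decode every PREPARE */
	return cb(gid);			/* otherwise the plugin decides */
}

/* Hypothetical filters used only to exercise the model. */
bool
always_skip(const char *gid)
{
	(void) gid;
	return true;
}

bool
never_skip(const char *gid)
{
	(void) gid;
	return false;
}
```

Note that with enable_twophase off the registered callback is never
consulted, which matches the early return in the wrapper.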
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1975d62..d6556be 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -413,6 +413,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/* free data that's contained */
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
 
 	if (txn->tuplecid_hash != NULL)
 	{
@@ -1987,7 +1992,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2254,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					break;
 			}
 		}
-
 		/*
 		 * There's a speculative insertion remaining, just clean in up, it
 		 * can't have been successful, otherwise we'd gotten a confirmation
@@ -2278,7 +2282,24 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call abort/commit/prepare callback, depending on the transaction
+			 * state.
+			 *
+			 * If the transaction aborted during apply (which currently can happen
+			 * only for prepared transactions), simply call the abort callback.
+			 *
+			 * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_rollback(txn))
+				rb->abort(rb, txn, commit_lsn);
+			else if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2361,8 +2382,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			 * This error can only occur when we are sending the data in
 			 * streaming mode and the streaming is not finished yet.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2370,10 +2391,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+			if (streaming && stream_started)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+				rb->abort(rb, txn, commit_lsn);
+			}
 		}
 		else
 		{
@@ -2395,23 +2425,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2453,6 +2476,140 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+   ReorderBufferTXN *txn;
+
+   txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+   return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+                   XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                   TimestampTz commit_time,
+                   RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+   ReorderBufferTXN *txn;
+
+   txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+                               false);
+
+   /* unknown transaction, nothing to replay */
+   if (txn == NULL)
+       return;
+
+   ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+                               commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+                    XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                    TimestampTz commit_time,
+                    RepOriginId origin_id, XLogRecPtr origin_lsn,
+                    char *gid)
+{
+   ReorderBufferTXN *txn;
+
+   txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+                               false);
+
+   /* unknown transaction, nothing to replay */
+   if (txn == NULL)
+       return;
+
+   txn->txn_flags |= RBTXN_PREPARE;
+   txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+   strcpy(txn->gid, gid);
+
+   ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+                               commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+                          const char *gid)
+{
+   ReorderBufferTXN *txn;
+
+   txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+                               false);
+
+   /*
+    * Always call the prepare filter. It's the job of the prepare filter to
+    * give us the *same* response for a given xid across multiple calls
+    * (including ones on restart)
+    */
+   return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit)
+{
+   ReorderBufferTXN *txn;
+
+   /*
+    * The transaction may or may not exist (during restarts for example).
+    * In any case, 2PC transactions do not contain any reorderbuffers. So allow
+    * it to be created below.
+    */
+   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+                               true);
+
+   txn->final_lsn = commit_lsn;
+   txn->end_lsn = end_lsn;
+   txn->commit_time = commit_time;
+   txn->origin_id = origin_id;
+   txn->origin_lsn = origin_lsn;
+   /* this txn is obviously prepared */
+   txn->txn_flags |= RBTXN_PREPARE;
+   txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+   strcpy(txn->gid, gid);
+
+   if (is_commit)
+   {
+       txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+       rb->commit_prepared(rb, txn, commit_lsn);
+   }
+   else
+   {
+       txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+       rb->abort_prepared(rb, txn, commit_lsn);
+   }
+
+   /* cleanup: make sure there's no cache pollution */
+   ReorderBufferExecuteInvalidations(rb, txn);
+   ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2495,7 +2652,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * Remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
+
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..f6ca87f 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -27,6 +27,7 @@ typedef struct OutputPluginOptions
 {
 	OutputPluginOutputType output_type;
 	bool		receive_rewrites;
+	bool		enable_twophase;
 } OutputPluginOptions;
 
 /*
@@ -78,6 +79,46 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   XLogRecPtr commit_lsn);
 
 /*
+ * Called for an implicit ABORT of a transaction.
+ */
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
+
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
+/*
  * Called for the generic logical decoding messages.
  */
 typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
@@ -170,7 +211,12 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeChangeCB change_cb;
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
+	LogicalDecodeAbortCB abort_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeAbortPreparedCB abort_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1ae17d5..820840a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -162,9 +163,14 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
-#define RBTXN_IS_STREAMED         0x0008
-#define RBTXN_HAS_TOAST_INSERT    0x0010
-#define RBTXN_HAS_SPEC_INSERT     0x0020
+#define RBTXN_PREPARE             0x0008
+#define RBTXN_COMMIT_PREPARED     0x0010
+#define RBTXN_ROLLBACK_PREPARED   0x0020
+#define RBTXN_COMMIT              0x0040
+#define RBTXN_ROLLBACK            0x0080
+#define RBTXN_IS_STREAMED         0x0100
+#define RBTXN_HAS_TOAST_INSERT    0x0200
+#define RBTXN_HAS_SPEC_INSERT     0x0400
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -218,6 +224,17 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* is this txn prepared? */
+#define rbtxn_prepared(txn)            ((txn)->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn)     ((txn)->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn)   ((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn)              ((txn)->txn_flags & RBTXN_COMMIT)
+/* was this txn aborted in the meanwhile? */
+#define rbtxn_rollback(txn)            ((txn)->txn_flags & RBTXN_ROLLBACK)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -229,6 +246,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of 2PC we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -390,6 +410,39 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+                                     ReorderBuffer *rb,
+                                     ReorderBufferTXN *txn,
+                                     XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             TransactionId xid,
+                                             const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+                                       ReorderBuffer *rb,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+                                              ReorderBuffer *rb,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr abort_lsn);
+
+
+
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -482,6 +535,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferAbortPreparedCB abort_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -548,6 +606,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -571,6 +634,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
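
As a rough illustration of how the new txn_flags bits in the patch above combine, here is a minimal standalone sketch. It copies only the three 2PC flag values from the reorderbuffer.h hunk. DemoTxn, demo_finish_prepared() and demo_action() are hypothetical names invented for this example and are not part of the patch. The sketch mirrors just the flag updates of ReorderBufferFinishPrepared() and the order in which logicalrep_write_prepare() (patch 0003) checks the flags. No callbacks or cleanup are modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flag values copied from the reorderbuffer.h hunk in this patch. */
#define RBTXN_PREPARE             0x0008
#define RBTXN_COMMIT_PREPARED     0x0010
#define RBTXN_ROLLBACK_PREPARED   0x0020

/* Hypothetical stand-in for ReorderBufferTXN: only the flags word is modelled. */
typedef struct DemoTxn
{
	uint32_t	txn_flags;
} DemoTxn;

/* The macro argument is parenthesised so that (&t) or (p + 1) expand safely. */
#define demo_prepared(txn)          (((txn)->txn_flags & RBTXN_PREPARE) != 0)
#define demo_commit_prepared(txn)   (((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0)
#define demo_rollback_prepared(txn) (((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0)

/* Mirrors only the flag updates of ReorderBufferFinishPrepared(). */
static void
demo_finish_prepared(DemoTxn *txn, int is_commit)
{
	assert(txn != NULL);

	/* the transaction is marked prepared in either case */
	txn->txn_flags |= RBTXN_PREPARE;
	txn->txn_flags |= is_commit ? RBTXN_COMMIT_PREPARED
								: RBTXN_ROLLBACK_PREPARED;
}

/*
 * Hypothetical helper: choose the action name in the same order in which
 * logicalrep_write_prepare() tests the flags (commit, rollback, else prepare).
 */
static const char *
demo_action(const DemoTxn *txn)
{
	if (demo_commit_prepared(txn))
		return "COMMIT PREPARED";
	if (demo_rollback_prepared(txn))
		return "ROLLBACK PREPARED";
	if (demo_prepared(txn))
		return "PREPARE";
	return "NONE";
}
```

Calling demo_finish_prepared() on a zeroed DemoTxn with is_commit set makes demo_action() report COMMIT PREPARED. With is_commit cleared it reports ROLLBACK PREPARED. A transaction with only RBTXN_PREPARE set reports PREPARE.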

Attachment: 0002-Tap-test-to-test-concurrent-aborts-during-2-phase-co.patch (application/octet-stream)
From 96b856e08857a6eda57bebdd6f656e3aff7aa933 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 15 Sep 2020 07:41:38 -0400
Subject: [PATCH 2/3] Tap test to test concurrent aborts during 2 phase commits

This test is specifically for testing concurrent abort while logical decode
is ongoing. Pass in the xid of the 2PC to the plugin as an option.
On the receipt of a valid "check-xid", the change API in the test decoding
plugin will wait for it to be aborted.
---
 contrib/test_decoding/Makefile          |   2 +
 contrib/test_decoding/t/001_twophase.pl | 119 ++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f23f15b..4905a0a 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..99a9249
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
-- 
1.8.3.1
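
The TAP test above avoids a fixed sleep by polling the server log until an expected message shows up, giving up after a bounded number of attempts. For readers who prefer to see that pattern outside Perl, here is a minimal C sketch of the same idea. It is not part of the patch. demo_slurp_file() and demo_poll_file_until() are hypothetical names. Unlike poll_output_until(), which applies the expected text as a regular expression, this sketch does a plain substring match with strstr(). It assumes a POSIX system for nanosleep().

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Read a whole file into a NUL-terminated heap buffer; NULL on failure. */
static char *
demo_slurp_file(const char *path)
{
	FILE	   *fp = fopen(path, "rb");
	char	   *buf = NULL;
	size_t		len = 0;
	size_t		cap = 0;

	if (fp == NULL)
		return NULL;

	for (;;)
	{
		size_t		n;

		/* keep room for one more 4096-byte chunk plus the terminator */
		if (len + 4096 + 1 > cap)
		{
			char	   *tmp;

			cap = (cap == 0) ? 8192 : cap * 2;
			tmp = realloc(buf, cap);
			if (tmp == NULL)
			{
				free(buf);
				fclose(fp);
				return NULL;
			}
			buf = tmp;
		}

		n = fread(buf + len, 1, 4096, fp);
		len += n;
		if (n < 4096)
			break;				/* EOF or read error: stop reading */
	}

	fclose(fp);
	buf[len] = '\0';
	return buf;
}

/*
 * Poll the file until it contains needle. Returns 1 when found, 0 after
 * max_attempts unsuccessful reads, waiting delay_ms between attempts.
 */
static int
demo_poll_file_until(const char *path, const char *needle,
					 int max_attempts, long delay_ms)
{
	struct timespec delay;

	assert(path != NULL && needle != NULL);

	delay.tv_sec = delay_ms / 1000;
	delay.tv_nsec = (delay_ms % 1000) * 1000000L;

	for (int attempt = 0; attempt < max_attempts; attempt++)
	{
		char	   *contents = demo_slurp_file(path);
		int			found = (contents != NULL &&
							 strstr(contents, needle) != NULL);

		free(contents);
		if (found)
			return 1;

		nanosleep(&delay, NULL);
	}

	return 0;
}
```

With max_attempts set to 1800 and delay_ms set to 100 this gives the same 180-second budget as the Perl helper. The important property is that the wait ends as soon as the message appears, so the test depends on the log line rather than on timing.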

Attachment: 0003-pgoutput-output-plugin-support-for-logical-decoding-.patch (application/octet-stream)
From decf87e1595f9c7bc9580f7959ee40c38cd0fdfa Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 15 Sep 2020 07:43:07 -0400
Subject: [PATCH 3/3] pgoutput output plugin support for logical decoding of
 2pc

---
 src/backend/access/transam/twophase.c       |  31 ++++++
 src/backend/replication/logical/proto.c     |  90 ++++++++++++++-
 src/backend/replication/logical/worker.c    | 147 ++++++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c |  72 +++++++++++-
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++++-
 src/test/subscription/t/020_twophase.pl     | 163 ++++++++++++++++++++++++++++
 7 files changed, 532 insertions(+), 9 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ef4f998..bed87d5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,37 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check whether a prepared transaction with the given GID exists
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (!gxact->valid)
+			continue;
+		if (strcmp(gxact->gid, gid) != 0)
+			continue;
+
+		LWLockRelease(TwoPhaseStateLock);
+
+		return true;
+	}
+
+	LWLockRelease(TwoPhaseStateLock);
+
+	return false;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..291ed10 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,12 +72,17 @@ logicalrep_read_begin(StringInfo in, LogicalRepBeginData *begin_data)
  */
 void
 logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
-						XLogRecPtr commit_lsn)
+						XLogRecPtr commit_lsn, bool is_commit)
 {
 	uint8		flags = 0;
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
+	if (is_commit)
+		flags |= LOGICALREP_IS_COMMIT;
+	else
+		flags |= LOGICALREP_IS_ABORT;
+
 	/* send the flags field (unused for now) */
 	pq_sendbyte(out, flags);
 
@@ -88,16 +93,20 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 }
 
 /*
- * Read transaction COMMIT from the stream.
+ * Read transaction COMMIT|ABORT from the stream.
  */
 void
 logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 {
-	/* read flags (unused for now) */
+	/* read flags */
 	uint8		flags = pq_getmsgbyte(in);
 
-	if (flags != 0)
-		elog(ERROR, "unrecognized flags %u in commit message", flags);
+	if (!CommitFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in commit|abort message",
+			 flags);
+
+	/* the flag is either commit or abort */
+	commit_data->is_commit = (flags == LOGICALREP_IS_COMMIT);
 
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
@@ -106,6 +115,77 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for 2PC transactions, in which case we
+	 * expect to have a non-empty GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(strlen(txn->gid) > 0);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags |= LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags |= LOGICALREP_IS_PREPARE;
+
+	/* Make sure exactly one of the expected flags is set. */
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (bounded copy into the fixed-size gid[GIDSIZE] buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c37aafe..d8944a8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -729,7 +729,11 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_lsn = commit_data.end_lsn;
 		replorigin_session_origin_timestamp = commit_data.committime;
 
-		CommitTransactionCommand();
+		if (commit_data.is_commit)
+			CommitTransactionCommand();
+		else
+			AbortCurrentTransaction();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -749,6 +753,141 @@ apply_handle_commit(StringInfo s)
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * A prepared transaction can be aborted concurrently while it is still
+	 * being decoded. In that case decoding stops and the transaction is
+	 * aborted immediately, so it may never have been prepared on the apply
+	 * side. The ROLLBACK PREPARED message still reaches the subscriber,
+	 * however, so a missing GID is acceptable here.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
 /*
  * Handle ORIGIN message.
  *
@@ -1909,10 +2048,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 343f031..3dc5264 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -39,6 +39,8 @@ static void pgoutput_begin_txn(LogicalDecodingContext *ctx,
 							   ReorderBufferTXN *txn);
 static void pgoutput_commit_txn(LogicalDecodingContext *ctx,
 								ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_abort_txn(LogicalDecodingContext *ctx,
+				   ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
 static void pgoutput_change(LogicalDecodingContext *ctx,
 							ReorderBufferTXN *txn, Relation rel,
 							ReorderBufferChange *change);
@@ -47,6 +49,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+							 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -143,6 +151,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+	cb->abort_cb = pgoutput_abort_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -256,6 +269,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 
 	/* This plugin uses binary protocol. */
 	opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+	opt->enable_twophase = true;
 
 	/*
 	 * This is replication start and not slot initialization.
@@ -373,7 +387,63 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginUpdateProgress(ctx);
 
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit(ctx->out, txn, commit_lsn, true);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ABORT callback
+ */
+static void
+pgoutput_abort_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+				   XLogRecPtr abort_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit(ctx->out, txn, abort_lsn, false);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
 }
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 607a728..fb07580 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -85,20 +85,55 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
+	bool        is_commit;
 	XLogRecPtr	commit_lsn;
 	XLogRecPtr	end_lsn;
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/* types of the commit protocol message */
+#define LOGICALREP_IS_COMMIT			0x01
+#define LOGICALREP_IS_ABORT				0x02
+
+/* commit message is COMMIT or ABORT, and there is nothing else */
+#define CommitFlagsAreValid(flags) \
+	((flags == LOGICALREP_IS_COMMIT) || (flags == LOGICALREP_IS_ABORT))
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+}			LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* a prepare message is exactly one of PREPARE or [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	((flags == LOGICALREP_IS_PREPARE) || \
+	 (flags == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 (flags == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
 extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
-									XLogRecPtr commit_lsn);
+									XLogRecPtr commit_lsn, bool is_commit);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+						LogicalRepPrepareData * prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..c7f373d
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+        'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+   is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+   is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (11,12);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#12)

On Wed, Sep 9, 2020 at 3:33 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Sep 7, 2020 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Nikhil has a test for the same
(0004-Teach-test_decoding-plugin-to-work-with-2PC.Jan4) in his last
email [1]. You might want to use it to test this behavior. I think you
can also keep the tests as a separate patch as Nikhil had.

Done. I've added the tests and also tweaked code to make sure that the aborts during 2 phase commits are also handled.

I don't think it is complete yet.
*
* This error can only occur when we are sending the data in
  * streaming mode and the streaming is not finished yet.
  */
- Assert(streaming);
- Assert(stream_started);
+ Assert(streaming || rbtxn_prepared(txn));
+ Assert(stream_started  || rbtxn_prepared(txn));

Here, you have updated the code but comments are still not updated.

*
@@ -2370,10 +2391,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
errdata = NULL;
curtxn->concurrent_abort = true;

- /* Reset the TXN so that it is allowed to stream remaining data. */
- ReorderBufferResetTXN(rb, txn, snapshot_now,
-   command_id, prev_lsn,
-   specinsert);
+ /* If streaming, reset the TXN so that it is allowed to stream
remaining data. */
+ if (streaming && stream_started)
+ {
+ ReorderBufferResetTXN(rb, txn, snapshot_now,
+   command_id, prev_lsn,
+   specinsert);
+ }
+ else
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"", txn->xid);
+ rb->abort(rb, txn, commit_lsn);
+ }

I don't think we need to perform abort here. Later we will anyway
encounter the WAL for Rollback Prepared for which we will call
abort_prepared_cb. As we have set the 'concurrent_abort' flag, it will
allow us to skip all the intermediate records. Here, we need only
enough state in ReorderBufferTxn that it can be later used for
ReorderBufferFinishPrepared(). Basically, you need functionality
similar to ReorderBufferTruncateTXN where except for invalidations you
can free memory for everything else. You can either write a new
function ReorderBufferTruncatePreparedTxn or pass another bool
parameter in ReorderBufferTruncateTXN to indicate it is prepared_xact
and then clean up additional things that are not required for prepared
xact.
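
To make the truncate-versus-cleanup distinction concrete, here is a minimal standalone sketch in C. It is not PostgreSQL code: ToyTxn and the toy_* functions are hypothetical stand-ins for ReorderBufferTXN and a ReorderBufferTruncatePreparedTxn-style helper, and the integer arrays only model the per-transaction lists. Under those assumptions it shows the idea of freeing everything except the invalidations while keeping the transaction entry itself for the later finish-prepared step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for ReorderBufferTXN; all names here are hypothetical. */
typedef struct ToyTxn
{
	int		   *changes;		/* decoded row changes */
	size_t		nchanges;
	int		   *tuplecids;		/* catalog tuple cid mappings */
	size_t		ntuplecids;
	int		   *invalidations;	/* cache invalidation messages */
	size_t		ninvalidations;
} ToyTxn;

/* Allocate a transaction holding n entries in each list (n must be > 0). */
ToyTxn *
toy_txn_create(size_t n)
{
	ToyTxn	   *txn = calloc(1, sizeof(ToyTxn));

	assert(n > 0 && txn != NULL);
	txn->changes = calloc(n, sizeof(int));
	txn->tuplecids = calloc(n, sizeof(int));
	txn->invalidations = calloc(n, sizeof(int));
	assert(txn->changes && txn->tuplecids && txn->invalidations);
	txn->nchanges = txn->ntuplecids = txn->ninvalidations = n;
	return txn;
}

/* Free everything except the invalidations; the entry itself survives. */
void
toy_truncate_prepared(ToyTxn *txn)
{
	assert(txn != NULL);
	free(txn->changes);
	txn->changes = NULL;
	txn->nchanges = 0;
	free(txn->tuplecids);
	txn->tuplecids = NULL;
	txn->ntuplecids = 0;
}

/* Final cleanup, as would happen once COMMIT/ROLLBACK PREPARED is seen. */
void
toy_txn_destroy(ToyTxn *txn)
{
	assert(txn != NULL);
	free(txn->changes);
	free(txn->tuplecids);
	free(txn->invalidations);
	free(txn);
}
```

After toy_truncate_prepared() the entry still exists and still carries its invalidations, which is the only state the later prepared-transaction handling is assumed to need in this sketch.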

*
Similarly, I don't understand why we need below code:
ReorderBufferProcessTXN()
{
..
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
..
}

There is nowhere we are setting the RBTXN_ROLLBACK flag, so how will
this check be true? If we decide to remove this code then don't forget
to update the comments.

*
If my previous two comments are correct then I don't think we need the
below interface.
+    <sect3 id="logicaldecoding-output-plugin-abort">
+     <title>Transaction Abort Callback</title>
+
+     <para>
+      The required <function>abort_cb</function> callback is called whenever
+      a transaction abort has to be initiated. This can happen if we are
+      decoding a transaction that has been prepared for two-phase commit and
+      a concurrent rollback happens while we are decoding it.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr abort_lsn);

I don't know why the patch implements the option to enable two-phase
this way. Can't we do it the same way as the 'stream-changes' option
in commit 7259736a6e? Just look at how we set ctx->streaming and
set this parameter in a similar way.

Done, I've moved the checks for callbacks to inside the corresponding wrappers.

This is not what I suggested. Please study commit 7259736a6e and
see how the streaming option is implemented. I want subscribers to be
able to specify later whether they want transactions to be decoded at
prepare time, similar to what we have done for streaming. Also, search
for ctx->streaming in the code and see how it is set to get the idea.
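
As a rough illustration of the pattern being asked for, here is a standalone C sketch. It is a toy model, not the code from commit 7259736a6e: ToyContext, ToyCallbacks, the twophase field and both functions are hypothetical names. The assumption it encodes is that the capability is first derived from which callbacks the plugin registered, and the client's option can then only keep or switch off that capability.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*ToyCallback) (void);

/* Hypothetical subset of output plugin callbacks relevant to two-phase. */
typedef struct ToyCallbacks
{
	ToyCallback prepare_cb;
	ToyCallback commit_prepared_cb;
	ToyCallback rollback_prepared_cb;
} ToyCallbacks;

/* Hypothetical decoding context with a capability flag. */
typedef struct ToyContext
{
	ToyCallbacks callbacks;
	bool		twophase;
} ToyContext;

/* No-op callback, only used to mark a callback as "registered". */
void
toy_noop(void)
{
}

/* Step 1: the capability exists only if the plugin provides the callbacks. */
void
toy_startup_context(ToyContext *ctx)
{
	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) &&
		(ctx->callbacks.commit_prepared_cb != NULL) &&
		(ctx->callbacks.rollback_prepared_cb != NULL);
}

/* Step 2: the client's request can keep the capability or turn it off. */
void
toy_plugin_startup(ToyContext *ctx, bool client_requested)
{
	ctx->twophase = ctx->twophase && client_requested;
}
```

With this split, a client that never asks for prepare-time decoding keeps the old decode-at-commit behaviour, and a client that asks for it against a plugin without the callbacks does not get it either.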

Note: Please use a version number when sending patches; you can use
something like git format-patch -N -v n to do that. It makes it easier
for the reviewer to compare it with the previous version.

Few other comments:
===================
1.
ReorderBufferProcessTXN()
{
..
if (streaming)
{
ReorderBufferTruncateTXN(rb, txn);

/* Reset the CheckXidAlive */
CheckXidAlive = InvalidTransactionId;
}
else
ReorderBufferCleanupTXN(rb, txn);
..
}

I don't think we can perform ReorderBufferCleanupTXN for the prepared
transactions because if we have removed the ReorderBufferTxn before
commit, the later code might not consider such a transaction in the
system and compute the wrong value of restart_lsn for a slot.
Basically, in SnapBuildProcessRunningXacts() when we call
ReorderBufferGetOldestTXN(), it should show the ReorderBufferTxn of
the prepared transaction which is not yet committed but because we
have removed it after prepare, it won't get that TXN and then that
leads to wrong computation of restart_lsn. Once we start from a wrong
point in WAL, the snapshot built will be incorrect, which will lead to
wrong results. This is the same reason why the patch is not doing
ReorderBufferForget in DecodePrepare when we decide to skip the
transaction. Also, here, we need to set CheckXidAlive =
InvalidTransactionId; for prepared xact as well.
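
The restart_lsn concern can be illustrated with a small standalone C sketch. This is a deliberate simplification with hypothetical names (ToyTxn, toy_restart_lsn); the real computation in SnapBuildProcessRunningXacts() involves more state. It only assumes that the restart point is bounded by the first LSN of the oldest transaction still tracked in the reorder buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

/* Hypothetical, minimal view of a tracked transaction. */
typedef struct ToyTxn
{
	uint32_t	xid;
	ToyLsn		first_lsn;		/* LSN of the transaction's first change */
	bool		in_buffer;		/* still tracked by the reorder buffer? */
} ToyTxn;

/*
 * Return the oldest first_lsn among transactions still in the buffer, or
 * current_lsn if none is tracked.
 */
ToyLsn
toy_restart_lsn(const ToyTxn *txns, size_t ntxns, ToyLsn current_lsn)
{
	ToyLsn		restart = current_lsn;

	for (size_t i = 0; i < ntxns; i++)
	{
		if (txns[i].in_buffer && txns[i].first_lsn < restart)
			restart = txns[i].first_lsn;
	}
	return restart;
}
```

If a prepared but not yet committed transaction that started at LSN 100 is dropped from the buffer right after PREPARE, this toy function moves the restart point forward to the next tracked transaction, so decoding restarted from there would no longer see the prepared transaction's changes.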

2. Have you thought about the interaction of streaming with prepared
transactions? You can try writing some tests using pg_logical* APIs
and see the behaviour. For ex. there is no handling in
ReorderBufferStreamCommit for the same. I think you need to introduce
stream_prepare API similar to stream_commit and then use the same.

3.
- if (streaming)
+ if (streaming || rbtxn_prepared(change->txn))
  {
  curtxn = change->txn;
  SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2254,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
  break;
  }
  }
-
  /*

Spurious line removal.

--
With Regards,
Amit Kapila.

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#14)

On Tue, Sep 15, 2020 at 5:27 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Sat, Sep 12, 2020 at 9:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Another thing I noticed is that originally we had subscriber-side
support as well, see [1] (see *pgoutput* patch), but it was later
dropped for some reasons [2]. I think we should have pgoutput support
as well, so see what is required to get that incorporated.

I have added the rebased patch-set for pgoutput and subscriber side changes as well. This also includes a test case in subscriber.

As mentioned in my email, there were some reasons due to which the
support was left for later. Have you checked those, and if so, can
you please explain how you have addressed them, or why they are not
relevant now if that is the case?

--
With Regards,
Amit Kapila.

#17Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#15)

On Tue, Sep 15, 2020 at 10:43 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

Few other comments:
===================
1.
ReorderBufferProcessTXN()
{
..
if (streaming)
{
ReorderBufferTruncateTXN(rb, txn);

/* Reset the CheckXidAlive */
CheckXidAlive = InvalidTransactionId;
}
else
ReorderBufferCleanupTXN(rb, txn);
..
}

I don't think we can perform ReorderBufferCleanupTXN for the prepared
transactions because if we have removed the ReorderBufferTxn before
commit, the later code might not consider such a transaction in the
system and compute the wrong value of restart_lsn for a slot.
Basically, in SnapBuildProcessRunningXacts() when we call
ReorderBufferGetOldestTXN(), it should show the ReorderBufferTxn of
the prepared transaction which is not yet committed but because we
have removed it after prepare, it won't get that TXN and then that
leads to wrong computation of restart_lsn. Once we start from a wrong
point in WAL, the snapshot built will be incorrect, which will lead to
wrong results. This is the same reason why the patch is not doing
ReorderBufferForget in DecodePrepare when we decide to skip the
transaction. Also, here, we need to set CheckXidAlive =
InvalidTransactionId; for prepared xact as well.

Just to confirm what you are expecting here: after we send out the
prepared transaction to the plugin, you are suggesting NOT to do a
ReorderBufferCleanupTXN, but what should we do instead? Are you
suggesting what you proposed as part of concurrent abort handling?
Something equivalent to ReorderBufferTruncateTXN(), i.e. remove all
changes of the transaction but keep the invalidations, tuplecids, etc.?
Do you think we should have a new flag in txn to indicate that this
transaction has already been decoded (prepare_decoded)? Is there any
other special handling you think is required?

regards,
Ajin Cherian
Fujitsu Australia

#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#17)

On Thu, Sep 17, 2020 at 2:02 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Sep 15, 2020 at 10:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few other comments:
===================
1.
ReorderBufferProcessTXN()
{
..
if (streaming)
{
ReorderBufferTruncateTXN(rb, txn);

/* Reset the CheckXidAlive */
CheckXidAlive = InvalidTransactionId;
}
else
ReorderBufferCleanupTXN(rb, txn);
..
}

I don't think we can perform ReorderBufferCleanupTXN for the prepared
transactions because if we have removed the ReorderBufferTxn before
commit, the later code might not consider such a transaction in the
system and compute the wrong value of restart_lsn for a slot.
Basically, in SnapBuildProcessRunningXacts() when we call
ReorderBufferGetOldestTXN(), it should show the ReorderBufferTxn of
the prepared transaction which is not yet committed but because we
have removed it after prepare, it won't get that TXN and then that
leads to wrong computation of restart_lsn. Once we start from a wrong
point in WAL, the snapshot built will be incorrect, which will lead to
wrong results. This is the same reason why the patch is not doing
ReorderBufferForget in DecodePrepare when we decide to skip the
transaction. Also, here, we need to set CheckXidAlive =
InvalidTransactionId; for prepared xact as well.

Just to confirm what you are expecting here: after we send out the prepared transaction to the plugin, you are suggesting NOT to do a ReorderBufferCleanupTXN, but what should we do instead? Are you suggesting what you proposed as part of concurrent abort handling?

Yes.

Something equivalent to ReorderBufferTruncateTXN(), i.e. remove all changes of the transaction but keep the invalidations, tuplecids, etc.?

I don't think you need tuplecids. I have checked
ReorderBufferFinishPrepared() and that seems to require only
invalidations; please check if anything else is required.

Do you think we should have a new flag in txn to indicate that this transaction has already been decoded? (prepare_decoded?)

Yeah, I think that would be better. How about naming the new variable
cleanup_prepared?
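
A per-transaction flag of this kind could look roughly like the standalone C sketch below. Everything in it is hypothetical: the flag name TOY_TXN_PREPARE_DECODED, the ToyTxn struct and the toy_* functions are illustrations only, and the behaviour shown (skipping the replay of changes at COMMIT PREPARED once they were already sent at PREPARE) is an assumption about what such a flag would be used for, not something taken from the patch.

```c
#include <stdint.h>

/* Hypothetical flag bit: changes were already sent out at PREPARE time. */
#define TOY_TXN_PREPARE_DECODED 0x0001

/* Hypothetical, minimal transaction state. */
typedef struct ToyTxn
{
	uint32_t	flags;
	int			changes_sent;	/* number of changes sent downstream */
} ToyTxn;

/* Decode at PREPARE: send the changes and remember that we did. */
void
toy_decode_prepare(ToyTxn *txn, int nchanges)
{
	txn->changes_sent += nchanges;
	txn->flags |= TOY_TXN_PREPARE_DECODED;
}

/*
 * Decode at COMMIT PREPARED: replay the changes only if they were not
 * already sent at PREPARE. Returns the number of changes replayed now.
 */
int
toy_decode_commit_prepared(ToyTxn *txn, int nchanges)
{
	if (txn->flags & TOY_TXN_PREPARE_DECODED)
		return 0;				/* only the commit itself needs to go out */

	txn->changes_sent += nchanges;
	return nchanges;
}
```

The flag lets the commit-time path tell a transaction that was truncated after prepare-time decoding apart from one that still has to be replayed in full.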

--
With Regards,
Amit Kapila.

#19Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#15)
3 attachment(s)

On Tue, Sep 15, 2020 at 10:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I don't think it is complete yet.
*
* This error can only occur when we are sending the data in
* streaming mode and the streaming is not finished yet.
*/
- Assert(streaming);
- Assert(stream_started);
+ Assert(streaming || rbtxn_prepared(txn));
+ Assert(stream_started  || rbtxn_prepared(txn));

Here, you have updated the code but comments are still not updated.

Updated the comments.

I don't think we need to perform abort here. Later we will anyway
encounter the WAL for Rollback Prepared for which we will call
abort_prepared_cb. As we have set the 'concurrent_abort' flag, it will
allow us to skip all the intermediate records. Here, we need only
enough state in ReorderBufferTxn that it can be later used for
ReorderBufferFinishPrepared(). Basically, you need functionality
similar to ReorderBufferTruncateTXN where except for invalidations you
can free memory for everything else. You can either write a new
function ReorderBufferTruncatePreparedTxn or pass another bool
parameter in ReorderBufferTruncateTXN to indicate it is prepared_xact
and then clean up additional things that are not required for prepared
xact.

Added a new parameter to ReorderBufferTruncatePreparedTxn for
prepared transactions and cleaned up tuplecids as well; I have
left snapshots and transactions in place.
As a result of this, I also had to create a new function,
ReorderBufferCleanupPreparedTXN, which will clean up the rest as part
of FinishPrepared handling, as we can't call
ReorderBufferCleanupTXN again after this.

*
Similarly, I don't understand why we need below code:
ReorderBufferProcessTXN()
{
..
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
..
}

There is nowhere we are setting the RBTXN_ROLLBACK flag, so how will
this check be true? If we decide to remove this code then don't forget
to update the comments.

Removed.

*
If my previous two comments are correct then I don't think we need the
below interface.
+    <sect3 id="logicaldecoding-output-plugin-abort">
+     <title>Transaction Abort Callback</title>
+
+     <para>
+      The required <function>abort_cb</function> callback is called whenever
+      a transaction abort has to be initiated. This can happen if we are
+      decoding a transaction that has been prepared for two-phase commit and
+      a concurrent rollback happens while we are decoding it.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr abort_lsn);

Removed.

I don't know why the patch implements the option to enable two-phase
this way. Can't we do it the same way as the 'stream-changes' option
in commit 7259736a6e? Just look at how we set ctx->streaming and
set this parameter in a similar way.

Done, I've moved the checks for callbacks to inside the corresponding wrappers.

This is not what I suggested. Please study commit 7259736a6e and
see how the streaming option is implemented. I want subscribers to be
able to specify later whether they want transactions to be decoded at
prepare time, similar to what we have done for streaming. Also, search
for ctx->streaming in the code and see how it is set to get the idea.

Changed it similar to ctx->streaming logic.

Note: Please use a version number when sending patches; you can use
something like git format-patch -N -v n to do that. It makes it easier
for the reviewer to compare it with the previous version.

Done.

Few other comments:
===================
1.
ReorderBufferProcessTXN()
{
..
if (streaming)
{
ReorderBufferTruncateTXN(rb, txn);

/* Reset the CheckXidAlive */
CheckXidAlive = InvalidTransactionId;
}
else
ReorderBufferCleanupTXN(rb, txn);
..
}

I don't think we can perform ReorderBufferCleanupTXN for the prepared
transactions because if we have removed the ReorderBufferTxn before
commit, the later code might not consider such a transaction in the
system and compute the wrong value of restart_lsn for a slot.
Basically, in SnapBuildProcessRunningXacts() when we call
ReorderBufferGetOldestTXN(), it should show the ReorderBufferTxn of
the prepared transaction which is not yet committed but because we
have removed it after prepare, it won't get that TXN and then that
leads to wrong computation of restart_lsn. Once we start from a wrong
point in WAL, the snapshot built will be incorrect, which will lead to
wrong results. This is the same reason why the patch is not doing
ReorderBufferForget in DecodePrepare when we decide to skip the
transaction. Also, here, we need to set CheckXidAlive =
InvalidTransactionId; for prepared xact as well.

Updated as suggested above.

2. Have you thought about the interaction of streaming with prepared
transactions? You can try writing some tests using pg_logical* APIs
and see the behaviour. For ex. there is no handling in
ReorderBufferStreamCommit for the same. I think you need to introduce
stream_prepare API similar to stream_commit and then use the same.

This is pending. I will look at it in the next iteration. Also pending
is the investigation as to why the pgoutput changes were not added
initially.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v4-0002-Tap-test-to-test-concurrent-aborts-during-2-phase.patch (application/octet-stream)
From 2e778c81095acd808079379f7d51c77e5fba9c7a Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 18 Sep 2020 07:56:07 -0400
Subject: [PATCH v4] Tap test to test concurrent aborts during 2 phase commits

This test is specifically for testing concurrent abort while logical decode
is ongoing. Pass in the xid of the 2PC to the plugin as an option.
On the receipt of a valid "check-xid", the change API in the test decoding
plugin will wait for it to be aborted.
---
 contrib/test_decoding/Makefile          |   2 +
 contrib/test_decoding/t/001_twophase.pl | 119 ++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f23f15b..4905a0a 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..a7eb65e
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+ok(looks_like_number($xid2pc), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
-- 
1.8.3.1

v4-0001-Support-decoding-of-two-phase-transactions.patch (application/octet-stream)
From f175d1360bec3646315ccbf5699748448704a38b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 18 Sep 2020 07:51:49 -0400
Subject: [PATCH v4] Support decoding of two-phase transactions

Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes documentation changes.
---
 contrib/test_decoding/expected/prepared.out     | 185 ++++++++++++--
 contrib/test_decoding/sql/prepared.sql          |  77 +++++-
 contrib/test_decoding/test_decoding.c           | 154 ++++++++++++
 doc/src/sgml/logicaldecoding.sgml               | 110 ++++++++-
 src/backend/replication/logical/decode.c        | 141 ++++++++++-
 src/backend/replication/logical/logical.c       | 175 ++++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 309 +++++++++++++++++++++---
 src/include/replication/logical.h               |   5 +
 src/include/replication/output_plugin.h         |  37 +++
 src/include/replication/reorderbuffer.h         |  75 +++++-
 10 files changed, 1183 insertions(+), 85 deletions(-)

diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..94fb0c9 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,50 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
  init
 (1 row)
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (2);
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (4);
 -- test prepared xact containing ddl
 BEGIN;
@@ -26,45 +57,149 @@ INSERT INTO test_prepared1 VALUES (5);
 ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
                                   data                                   
 -------------------------------------------------------------------------
  BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:4
  COMMIT
  BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:5
  table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
  COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
  COMMIT
  BEGIN
  table public.test_prepared2: INSERT: id[integer]:9
  COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+ relation | locktype | mode 
+----------+----------+------
+(0 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                    data                                    
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
 
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..ca801e4 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -2,21 +2,25 @@
 SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 INSERT INTO test_prepared1 VALUES (2);
 
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 INSERT INTO test_prepared1 VALUES (4);
 
@@ -27,24 +31,83 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
 
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
 INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- make sure stuff still works
 INSERT INTO test_prepared1 VALUES (8);
 INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- cleanup
 DROP TABLE test_prepared1;
 DROP TABLE test_prepared2;
 
--- show results
+-- show results. There should be nothing to show
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
 
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e60ab34..185a70e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -88,6 +93,19 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
+
 
 void
 _PG_init(void)
@@ -116,6 +134,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
+
 }
 
 
@@ -136,6 +159,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +251,32 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid") == 0)
+		{
+			if (elem->arg)
+			{
+				errno = 0;
+				data->check_xid = (TransactionId)
+					strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno == EINVAL || errno == ERANGE)
+					ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid is not a valid number: \"%s\"",
+								strVal(elem->arg))));
+			}
+			else
+				ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid needs an input value")));
+
+			if (!TransactionIdIsValid(data->check_xid))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("Specify positive value for parameter \"%s\","
+								" you specified \"%s\"",
+								elem->defname, strVal(elem->arg))));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -297,6 +347,94 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple approach based on the GID: if the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +593,22 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* if check_xid is specified */
+	if (TransactionIdIsValid(data->check_xid))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid);
+		while (TransactionIdIsInProgress(data->check_xid))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid) &&
+			   !TransactionIdDidCommit(data->check_xid))
+			elog(LOG, "%u aborted", data->check_xid);
+
+		Assert(TransactionIdDidAbort(data->check_xid));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..bd4542e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,6 +387,10 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeAbortPreparedCB abort_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
@@ -477,7 +481,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin provides the callbacks needed
+     to decode it. The transaction currently being decoded may be
+     aborted concurrently by a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction is
+     aborted too.
     </para>
 
     <note>
@@ -578,6 +588,55 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The optional <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Commit Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+     <title>Rollback Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>abort_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +646,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to help it output the row modification
+      details. When decoding a prepared (but not yet committed)
+      transaction, or any other uncommitted transaction, this change
+      callback might error out because that same transaction is rolled
+      back concurrently. In that case, the logical decoding of the
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +729,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded at this prepare
+       stage, or later as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time. To signal that decoding
+       at prepare time should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      parameter carries the XID separately because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +783,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. As with the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction, or any other uncommitted transaction, this message
+      callback might error out because that same transaction is rolled
+      back concurrently. In that case, the logical decoding of the
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f21f61d..c0b0bce 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						xl_xact_parsed_prepare * parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
+				/* check that output plugin is capable of twophase decoding */
+				if (!ctx->enable_twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *)XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -647,9 +667,82 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+	 * Regular commit simply triggers a replay of transaction changes from the
+	 * reorder buffer. For COMMIT PREPARED that however already happened at
+	 * PREPARE time, and so we only need to notify the subscriber that the GID
+	 * finally committed.
+	 *
+	 * For output plugins that do not support PREPARE-time decoding of
+	 * two-phase transactions, we never even see the PREPARE and all two-phase
+	 * transactions simply fall through to the second branch.
+	 */
+	if (TransactionIdIsValid(parsed->twophase_xid) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder,
+								   parsed->twophase_xid, parsed->twophase_gid))
+	{
+		Assert(xid == parsed->twophase_xid);
+		/* we are processing COMMIT PREPARED */
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* replay actions of all transaction + subtransactions in order */
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+}
+
+/*
+ * Decode PREPARE record. The logic is similar to that of DecodeCommit.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	XLogRecPtr	origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	/*
+	 * Process invalidation messages, even if we're not interested in the
+	 * transaction's contents, since the various caches need to always be
+	 * consistent.
+	 */
+	if (parsed->nmsgs > 0)
+	{
+		if (!ctx->fast_forward)
+			ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+										  parsed->nmsgs, parsed->msgs);
+		ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 }
 
 /*
@@ -661,6 +754,30 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 			xl_xact_parsed_abort *parsed, TransactionId xid)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it's ROLLBACK PREPARED then handle it via callbacks.
+	 */
+	if (TransactionIdIsValid(xid) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		parsed->dbId == ctx->slot->data.database &&
+		!FilterByOrigin(ctx, origin_id) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
 
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 0f6af95..4e95337 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -58,6 +58,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -206,6 +214,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -225,6 +237,19 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
+	/*
+	 * To support two phase logical decoding, we require prepare/commit-prepare/abort-prepare
+	 * callbacks. The filter-prepare callback is optional. We however enable two phase logical
+	 * decoding when at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->enable_twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.abort_prepared_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
 	/*
 	 * streaming callbacks
 	 *
@@ -782,6 +807,111 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then the prepare callback is mandatory */
+	if (ctx->enable_twophase && ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then the commit prepared callback is mandatory */
+	if (ctx->enable_twophase && ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then the abort prepared callback is mandatory */
+	if (ctx->enable_twophase && ctx->callbacks.abort_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register abort_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -858,6 +988,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of twophase at PREPARE time is not enabled. In that
+	 * case all twophase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->enable_twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1975d62..d96be77 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -413,6 +414,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/* free data that's contained */
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
 
 	if (txn->tuplecid_hash != NULL)
 	{
@@ -1401,6 +1407,59 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 }
 
 /*
+ * Cleanup the leftover contents of a transaction, usually after the transaction
+ * has been COMMIT PREPARED or ROLLBACK PREPARED. This does the rest of the cleanup
+ * that was not done when the transaction was PREPARED
+ */
+static void
+ReorderBufferCleanupPreparedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	bool	found;
+
+	/*
+	 * Cleanup the base snapshot, if set.
+	 */
+	if (txn->base_snapshot != NULL)
+	{
+		SnapBuildSnapDecRefcount(txn->base_snapshot);
+		dlist_delete(&txn->base_snapshot_node);
+	}
+
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Remove TXN from its containing list.
+	 *
+	 * Note: if txn is known as subxact, we are deleting the TXN from its
+	 * parent's list of known subxacts; this leaves the parent's nsubxacts
+	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
+	 * from the LSN-ordered list of toplevel TXNs.
+	 */
+	dlist_delete(&txn->node);
+
+	/* now remove reference from buffer */
+	hash_search(rb->by_txn,
+				(void *) &txn->xid,
+				HASH_REMOVE,
+				&found);
+	Assert(found);
+
+	/* remove entries spilled to disk */
+	if (rbtxn_is_serialized(txn))
+		ReorderBufferRestoreCleanup(rb, txn);
+
+	/* deallocate */
+	ReorderBufferReturnTXN(rb, txn);
+}
+
+/*
  * Cleanup the contents of a transaction, usually after the transaction
  * committed or aborted.
  */
@@ -1502,12 +1561,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1526,7 +1587,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the toplevel txn */
@@ -1560,9 +1621,30 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, clean up the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1880,7 +1962,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1987,7 +2069,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2331,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					break;
 			}
 		}
-
 		/*
 		 * There's a speculative insertion remaining, just clean in up, it
 		 * can't have been successful, otherwise we'd gotten a confirmation
@@ -2278,7 +2359,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for twophase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2319,11 +2409,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, false);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (rbtxn_prepared(txn))
+		{
+			ReorderBufferTruncateTXN(rb, txn, true);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2352,17 +2448,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can only occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we are
+			 * sending the data out on a PREPARE during a two-phase commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2370,10 +2467,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+			if (streaming && stream_started)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2395,23 +2501,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2453,6 +2552,140 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/*
+	 * Always call the prepare filter. It's the job of the prepare filter to
+	 * give us the *same* response for a given xid across multiple calls
+	 * (including ones on restart).
+	 */
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, 2PC transactions do not contain any reorderbuffers. So allow
+	 * it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+		rb->commit_prepared(rb, txn, commit_lsn);
+	}
+	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+		rb->abort_prepared(rb, txn, commit_lsn);
+	}
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(rb, txn);
+	ReorderBufferCleanupPreparedTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2495,7 +2728,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * Remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
+
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 45abc44..ee63e7b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -84,6 +84,11 @@ typedef struct LogicalDecodingContext
 	 */
 	bool		streaming;
 
+	/*
+	 * Does the output plugin support two phase decoding, and is it enabled?
+	 */
+	bool		enable_twophase;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..96e269b 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or wait until COMMIT PREPARED
+ * and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -171,6 +204,10 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeAbortPreparedCB abort_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1ae17d5..4d4e35d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -162,9 +163,13 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
-#define RBTXN_IS_STREAMED         0x0008
-#define RBTXN_HAS_TOAST_INSERT    0x0010
-#define RBTXN_HAS_SPEC_INSERT     0x0020
+#define RBTXN_PREPARE             0x0008
+#define RBTXN_COMMIT_PREPARED     0x0010
+#define RBTXN_ROLLBACK_PREPARED   0x0020
+#define RBTXN_COMMIT              0x0040
+#define RBTXN_IS_STREAMED         0x0080
+#define RBTXN_HAS_TOAST_INSERT    0x0100
+#define RBTXN_HAS_SPEC_INSERT     0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -218,6 +223,15 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* is this txn prepared? */
+#define rbtxn_prepared(txn)            (((txn)->txn_flags & RBTXN_PREPARE) != 0)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn)     (((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn)   (((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn)              (((txn)->txn_flags & RBTXN_COMMIT) != 0)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -229,6 +243,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of 2PC we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -390,6 +407,39 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+                                     ReorderBuffer *rb,
+                                     ReorderBufferTXN *txn,
+                                     XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             TransactionId xid,
+                                             const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+                                       ReorderBuffer *rb,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+                                              ReorderBuffer *rb,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr abort_lsn);
+
+
+
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -482,6 +532,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferAbortPreparedCB abort_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -548,6 +603,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -571,6 +631,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1

Attachment: v4-0003-pgoutput-output-plugin-support-for-logical-decodi.patch (application/octet-stream)
From f0cf243ef74b28248d068f43adeed0404fa39fec Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 18 Sep 2020 08:01:14 -0400
Subject: [PATCH v4] pgoutput output plugin support for logical decoding of 2pc

---
 src/backend/access/transam/twophase.c       |  31 ++++++
 src/backend/replication/logical/proto.c     |  90 ++++++++++++++-
 src/backend/replication/logical/worker.c    | 147 ++++++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c |  54 ++++++++-
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++++-
 src/test/subscription/t/020_twophase.pl     | 163 ++++++++++++++++++++++++++++
 7 files changed, 514 insertions(+), 9 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ef4f998..bed87d5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,37 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check whether a prepared transaction with the given GID exists.
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (!gxact->valid)
+			continue;
+		if (strcmp(gxact->gid, gid) != 0)
+			continue;
+
+		LWLockRelease(TwoPhaseStateLock);
+
+		return true;
+	}
+
+	LWLockRelease(TwoPhaseStateLock);
+
+	return false;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..291ed10 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,12 +72,17 @@ logicalrep_read_begin(StringInfo in, LogicalRepBeginData *begin_data)
  */
 void
 logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
-						XLogRecPtr commit_lsn)
+						XLogRecPtr commit_lsn, bool is_commit)
 {
 	uint8		flags = 0;
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
+	if (is_commit)
+		flags |= LOGICALREP_IS_COMMIT;
+	else
+		flags |= LOGICALREP_IS_ABORT;
+
 	/* send the flags field (unused for now) */
 	pq_sendbyte(out, flags);
 
@@ -88,16 +93,20 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 }
 
 /*
- * Read transaction COMMIT from the stream.
+ * Read transaction COMMIT|ABORT from the stream.
  */
 void
 logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 {
-	/* read flags (unused for now) */
+	/* read flags */
 	uint8		flags = pq_getmsgbyte(in);
 
-	if (flags != 0)
-		elog(ERROR, "unrecognized flags %u in commit message", flags);
+	if (!CommitFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in commit|abort message",
+			 flags);
+
+	/* the flag is either commit or abort */
+	commit_data->is_commit = (flags == LOGICALREP_IS_COMMIT);
 
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
@@ -106,6 +115,77 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for 2PC transactions. In which case we
+	 * expect to have a non-empty GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(strlen(txn->gid) > 0);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags |= LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags |= LOGICALREP_IS_PREPARE;
+
+	/* Make sure exactly one of the expected flags is set. */
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (bounded copy into the pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index d239d28..62c571e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -729,7 +729,11 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_lsn = commit_data.end_lsn;
 		replorigin_session_origin_timestamp = commit_data.committime;
 
-		CommitTransactionCommand();
+		if (commit_data.is_commit)
+			CommitTransactionCommand();
+		else
+			AbortCurrentTransaction();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -749,6 +753,141 @@ apply_handle_commit(StringInfo s)
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * If the prepared transaction was aborted concurrently while it was
+	 * being decoded, decoding stops and the transaction is aborted
+	 * immediately, so the subscriber may never have prepared it. The
+	 * ROLLBACK PREPARED message still reaches the subscriber, however, so a
+	 * missing gid is acceptable here.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
 /*
  * Handle ORIGIN message.
  *
@@ -1909,10 +2048,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index eb1f230..729b655 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+							 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -143,6 +149,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -373,7 +383,49 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginUpdateProgress(ctx);
 
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit(ctx->out, txn, commit_lsn, true);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
 }
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 607a728..fb07580 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -85,20 +85,55 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
+	bool        is_commit;
 	XLogRecPtr	commit_lsn;
 	XLogRecPtr	end_lsn;
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/* types of the commit protocol message */
+#define LOGICALREP_IS_COMMIT			0x01
+#define LOGICALREP_IS_ABORT				0x02
+
+/* commit message is COMMIT or ABORT, and there is nothing else */
+#define CommitFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_COMMIT) || ((flags) == LOGICALREP_IS_ABORT))
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+}			LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
 extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
-									XLogRecPtr commit_lsn);
+									XLogRecPtr commit_lsn, bool is_commit);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+						LogicalRepPrepareData * prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..c7f373d
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+        'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+   is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+   is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1
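
For readers following the proto.c hunk in the patch above, the layout of the new 'P' message can be sketched outside PostgreSQL: a type byte, one flags byte, three 64-bit fields, then the NUL-terminated GID. This is only a rough illustration under assumptions: the buffer helpers, the SketchPrepare struct, the sketch_* function names and SKETCH_GIDSIZE are made up here (the patch itself uses the pq_send*/pq_getmsg* routines on a StringInfo and GIDSIZE), and the 64-bit fields are assumed to travel in network byte order. The flag values are the ones defined in the logicalproto.h hunk. Unlike the patch's reader, this one also checks the leading 'P', which apply_dispatch() has already consumed in the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200	/* assumption: stands in for GIDSIZE */

/* Flag values as defined in the patch's logicalproto.h hunk. */
#define SKETCH_IS_PREPARE			0x01
#define SKETCH_IS_COMMIT_PREPARED	0x02
#define SKETCH_IS_ROLLBACK_PREPARED	0x04

typedef struct SketchPrepare
{
	uint8_t		type;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Exactly one of the three flags may be set. */
static int
sketch_flags_valid(uint8_t f)
{
	return f == SKETCH_IS_PREPARE ||
		f == SKETCH_IS_COMMIT_PREPARED ||
		f == SKETCH_IS_ROLLBACK_PREPARED;
}

/* Store a 64-bit value most-significant byte first. */
static void
sketch_put_u64(uint8_t *buf, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t) (v >> (56 - 8 * i));
}

static uint64_t
sketch_get_u64(const uint8_t *buf)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Encode into buf; returns bytes written, or 0 if invalid or too large. */
static size_t
sketch_write_prepare(uint8_t *buf, size_t cap, const SketchPrepare *p)
{
	size_t		gidlen = strlen(p->gid) + 1;	/* include the NUL */
	size_t		need = 2 + 3 * 8 + gidlen;

	if (!sketch_flags_valid(p->type) || need > cap)
		return 0;

	buf[0] = 'P';
	buf[1] = p->type;
	sketch_put_u64(buf + 2, p->prepare_lsn);
	sketch_put_u64(buf + 10, p->end_lsn);
	sketch_put_u64(buf + 18, (uint64_t) p->prepare_time);
	memcpy(buf + 26, p->gid, gidlen);
	return need;
}

/* Decode; returns 1 on success, 0 on malformed or oversized input. */
static int
sketch_read_prepare(const uint8_t *buf, size_t len, SketchPrepare *p)
{
	const uint8_t *gid;
	const uint8_t *nul;

	if (len < 27 || buf[0] != 'P' || !sketch_flags_valid(buf[1]))
		return 0;

	p->type = buf[1];
	p->prepare_lsn = sketch_get_u64(buf + 2);
	p->end_lsn = sketch_get_u64(buf + 10);
	p->prepare_time = (int64_t) sketch_get_u64(buf + 18);

	/* The GID must be terminated inside the message and fit the buffer. */
	gid = buf + 26;
	nul = memchr(gid, '\0', len - 26);
	if (nul == NULL || (size_t) (nul - gid) >= SKETCH_GIDSIZE)
		return 0;
	memcpy(p->gid, gid, (size_t) (nul - gid) + 1);
	return 1;
}
```

The reader rejects a GID that is unterminated or longer than the destination buffer instead of copying it blindly, which is the same concern a bounded copy addresses in logicalrep_read_prepare().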

#20Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#18)

On Thu, Sep 17, 2020 at 10:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, I think that would be better. How about if we name the new variable
cleanup_prepared?

I haven't added a new flag to indicate that the prepare was cleaned
up, as that wasn't really necessary. Instead, I used a new function to
do a partial cleanup of whatever was not done in the truncate. If you
think using a flag and doing special handling in
ReorderBufferCleanupTXN would be a better idea, let me know.

regards,
Ajin Cherian
Fujitsu Australia

#21Dilip Kumar
dilipbalaut@gmail.com
In reply to: Ajin Cherian (#19)

On Fri, Sep 18, 2020 at 6:02 PM Ajin Cherian <itsajin@gmail.com> wrote:

I have reviewed v4-0001 patch and I have a few comments. I haven't
yet completely reviewed the patch.

1.
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+   parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+

I think we don't need to add prepare-time invalidation messages, as we are
now already logging the invalidations at the command level and adding them
to the reorder buffer.

2.

+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is setup correctly for the main transaction in case all changes
+ * happened in subtransactions
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ return;

Do we need to call ReorderBufferCommitChild if we are skipping this transaction?
I think the below check should come before calling ReorderBufferCommitChild.

3.

+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }

I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
== ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
so if those are not true then we wouldn't have prepared this
transaction, i.e. ReorderBufferTxnIsPrepared will be false. So why do we
need to recheck these conditions?

4.

+ /* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+ if (streaming && stream_started)
+ {
+ ReorderBufferResetTXN(rb, txn, snapshot_now,
+   command_id, prev_lsn,
+   specinsert);
+ }
+ else
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"", txn->xid);
+ ReorderBufferTruncateTXN(rb, txn, true);
+ }

Why is if (streaming) alone not enough? I agree that if we get here
in streaming mode then stream_started must be true,
but we already have an assert for that.
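
The reordering proposed in comment 2 can be shown in toy form: do the skip check first, and only then do the per-subtransaction work. This is a sketch under assumptions, not the decode.c code. SketchDecoder, SketchPrepareRecord and the sketch_* functions are hypothetical, the skip test is reduced to two of the conditions quoted above, and a counter stands in for the ReorderBufferCommitChild() calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parsed PREPARE record. */
typedef struct SketchPrepareRecord
{
	unsigned	db_id;			/* 0 plays the role of InvalidOid */
	size_t		nsubxacts;
} SketchPrepareRecord;

/* Hypothetical stand-in for the decoding context. */
typedef struct SketchDecoder
{
	unsigned	slot_db;
	bool		fast_forward;
	size_t		child_calls;	/* counts simulated CommitChild calls */
} SketchDecoder;

/* Reduced version of the skip condition quoted in the review. */
static bool
sketch_needs_skip(const SketchDecoder *d, const SketchPrepareRecord *r)
{
	return d->fast_forward ||
		(r->db_id != 0 && r->db_id != d->slot_db);
}

/* Returns true if the transaction was handed on for decoding. */
static bool
sketch_decode_prepare(SketchDecoder *d, const SketchPrepareRecord *r)
{
	/* Check first: a skipped transaction costs no per-subxact work. */
	if (sketch_needs_skip(d, r))
		return false;

	for (size_t i = 0; i < r->nsubxacts; i++)
		d->child_calls++;		/* stands in for ReorderBufferCommitChild */

	return true;
}
```

With the check ahead of the loop, a PREPARE for another database leaves child_calls untouched, while one for the slot's database registers every subtransaction.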

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#19)

On Fri, Sep 18, 2020 at 6:02 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Sep 15, 2020 at 10:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I don't think it is complete yet.
*
* This error can only occur when we are sending the data in
* streaming mode and the streaming is not finished yet.
*/
- Assert(streaming);
- Assert(stream_started);
+ Assert(streaming || rbtxn_prepared(txn));
+ Assert(stream_started  || rbtxn_prepared(txn));

Here, you have updated the code but comments are still not updated.

Updated the comments.

I don't think we need to perform abort here. Later we will anyway
encounter the WAL for Rollback Prepared for which we will call
abort_prepared_cb. As we have set the 'concurrent_abort' flag, it will
allow us to skip all the intermediate records. Here, we need only
enough state in ReorderBufferTxn that it can be later used for
ReorderBufferFinishPrepared(). Basically, you need functionality
similar to ReorderBufferTruncateTXN where except for invalidations you
can free memory for everything else. You can either write a new
function ReorderBufferTruncatePreparedTxn or pass another bool
parameter in ReorderBufferTruncateTXN to indicate it is prepared_xact
and then clean up additional things that are not required for prepared
xact.

Added a new parameter to ReorderBufferTruncateTXN for
prepared transactions and did cleanup of tuplecids as well; I have
left snapshots and transactions.
As a result of this, I also had to create a new function,
ReorderBufferCleanupPreparedTXN, which will clean up the rest as part
of FinishPrepared handling, as we can't call
ReorderBufferCleanupTXN again after this.

Why can't we call ReorderBufferCleanupTXN() from
ReorderBufferFinishPrepared after your changes?

+ * If streaming, keep the remaining info - transactions, tuplecids,
invalidations and
+ * snapshots.If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
bool txn_prepared)

Why do we even need the snapshot for prepared transactions? Also, note
that in the comment there is no space before the start of the new sentence.

I don't know why the patch implements the option to enable two-phase
this way. Can't we do it the way the 'stream-changes' option is
implemented in commit 7259736a6e? Just look at how we set ctx->streaming;
you can set this parameter in a similar way.

Done, I've moved the checks for callbacks to inside the corresponding wrappers.

This is not what I suggested. Please study commit 7259736a6e and
see how the streaming option is implemented. I want subscribers to
later be able to specify whether they want transactions to be decoded
at prepare time, similar to what we have done for streaming. Also,
search for ctx->streaming in the code and see how it is set to get the idea.

Changed it similar to ctx->streaming logic.

Hmm, I still don't see the relevant changes in pg_decode_startup().
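
The pattern being asked for can be sketched roughly as follows, assuming it works the way ctx->streaming appears to: the context flag starts out as "the plugin registered the needed callbacks", and the startup routine then narrows it with a boolean option supplied by the caller. Everything below is hypothetical and self-contained: SketchContext, SketchOption, sketch_startup() and the option name "two-phase-commit" are illustrative names, not taken from the patch or from test_decoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One name/value pair, as passed to the slot functions. */
typedef struct SketchOption
{
	const char *name;
	const char *value;			/* NULL means "option given without value" */
} SketchOption;

typedef struct SketchContext
{
	bool		has_prepare_cb; /* did the plugin register 2PC callbacks? */
	bool		twophase;		/* decode at prepare time? */
} SketchContext;

/* Accept a few common boolean spellings; returns false on anything else. */
static bool
sketch_parse_bool(const char *value, bool *result)
{
	if (strcmp(value, "1") == 0 || strcmp(value, "true") == 0 ||
		strcmp(value, "on") == 0)
	{
		*result = true;
		return true;
	}
	if (strcmp(value, "0") == 0 || strcmp(value, "false") == 0 ||
		strcmp(value, "off") == 0)
	{
		*result = false;
		return true;
	}
	return false;
}

/* Returns false on an unknown option or an unparsable value. */
static bool
sketch_startup(SketchContext *ctx, const SketchOption *opts, size_t nopts)
{
	bool		enable_twophase = false;

	/* Step 1: capability, derived from the registered callbacks. */
	ctx->twophase = ctx->has_prepare_cb;

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "two-phase-commit") == 0)
		{
			if (opts[i].value == NULL)
				enable_twophase = true;
			else if (!sketch_parse_bool(opts[i].value, &enable_twophase))
				return false;
		}
		else
			return false;		/* unknown option */
	}

	/* Step 2: the caller can only narrow the capability, never widen it. */
	ctx->twophase = ctx->twophase && enable_twophase;
	return true;
}
```

The point of the two steps is that the option is opt-in per consumer, so a subscriber that does not ask for prepare-time decoding keeps the existing commit-time behaviour even when the plugin supports the new callbacks.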

--
With Regards,
Amit Kapila.

#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#21)

On Sun, Sep 20, 2020 at 11:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Sep 18, 2020 at 6:02 PM Ajin Cherian <itsajin@gmail.com> wrote:

3.

+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }

I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
== ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
so if those are not true then we wouldn't have prepared this
transaction, i.e. ReorderBufferTxnIsPrepared will be false. So why do we
need to recheck these conditions?

Yeah, probably we should have an Assert for the below three conditions:
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&

Your other comments make sense to me.

--
With Regards,
Amit Kapila.

#24Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#22)

Why can't we call ReorderBufferCleanupTXN() from
ReorderBufferFinishPrepared after your changes?

Since the truncate already removed the changes, it would fail on the
below Assert in ReorderBufferCleanupTXN()
/* Check we're not mixing changes from different transactions. */
Assert(change->txn == txn);

regards.
Ajin Cherian
Fujitsu Australia

#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#24)

On Mon, Sep 21, 2020 at 12:36 PM Ajin Cherian <itsajin@gmail.com> wrote:

Why can't we call ReorderBufferCleanupTXN() from
ReorderBufferFinishPrepared after your changes?

Since the truncate already removed the changes, it would fail on the
below Assert in ReorderBufferCleanupTXN()
/* Check we're not mixing changes from different transactions. */
Assert(change->txn == txn);

The changes list should be empty by that time because we are removing each
change from the list; see the code "dlist_delete(&change->node);" in
ReorderBufferTruncateTXN. If you are hitting the Assert as you
mentioned, then I think the problem is something else.
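
To illustrate the point about the changes list, here is a minimal standalone sketch (not PostgreSQL code). It models the list as a circular doubly linked list with a sentinel node, loosely in the spirit of dlist. The names Txn, Change, txn_truncate and txn_cleanup are hypothetical stand-ins for ReorderBufferTXN, ReorderBufferChange, ReorderBufferTruncateTXN and ReorderBufferCleanupTXN. If truncation unlinks every change, the cleanup loop has nothing left to visit, so its ownership assertion cannot fire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Txn Txn;

/* One queued change; hypothetical stand-in for ReorderBufferChange. */
typedef struct Change
{
    struct Change *prev;
    struct Change *next;
    Txn           *txn;     /* owning transaction */
} Change;

/* Hypothetical stand-in for ReorderBufferTXN: just a sentinel list head. */
struct Txn
{
    Change head;
};

static void
txn_init(Txn *t)
{
    t->head.prev = t->head.next = &t->head;
    t->head.txn = NULL;
}

/* Append a new change owned by t to the tail of its list. */
static void
txn_add_change(Txn *t)
{
    Change *c = malloc(sizeof(Change));

    assert(c != NULL);
    c->txn = t;
    c->prev = t->head.prev;
    c->next = &t->head;
    t->head.prev->next = c;
    t->head.prev = c;
}

/* Unlink and free one change (the analogue of dlist_delete plus free). */
static void
change_unlink(Change *c)
{
    c->prev->next = c->next;
    c->next->prev = c->prev;
    free(c);
}

/* Truncate: drop every change but keep the transaction itself. */
static size_t
txn_truncate(Txn *t)
{
    size_t  removed = 0;
    Change *c = t->head.next;

    while (c != &t->head)
    {
        Change *next = c->next;

        change_unlink(c);
        removed++;
        c = next;
    }
    return removed;
}

/* Cleanup: same walk, but checks that each change belongs to t. */
static size_t
txn_cleanup(Txn *t)
{
    size_t  visited = 0;
    Change *c = t->head.next;

    while (c != &t->head)
    {
        Change *next = c->next;

        assert(c->txn == t);    /* not mixing changes from different txns */
        change_unlink(c);
        visited++;
        c = next;
    }
    return visited;
}
```

In this model, calling txn_cleanup() after txn_truncate() visits zero changes. A failing assertion in the real code would therefore point at a change that was never unlinked, or at a list other than the one truncated, which is consistent with the suggestion that the problem lies elsewhere.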

--
With Regards,
Amit Kapila.

#26Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#23)

On Mon, Sep 21, 2020 at 10:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Sep 20, 2020 at 11:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Sep 18, 2020 at 6:02 PM Ajin Cherian <itsajin@gmail.com> wrote:

3.

+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }

I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
== ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
so if those are not true then we wouldn't have prepared this
transaction, i.e. ReorderBufferTxnIsPrepared will be false. So why do we
need to recheck these conditions?

Yeah, we should probably have an Assert for the three conditions below:
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#27Ajin Cherian
itsajin@gmail.com
In reply to: Dilip Kumar (#21)

On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }

I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
== ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
so if those are not true then we wouldn't have prepared this
transaction, i.e. ReorderBufferTxnIsPrepared will be false. So why do we
need to recheck these conditions?

We could enter DecodeAbort even without a prepare, as the code is
common for both XLOG_XACT_ABORT and XLOG_XACT_ABORT_PREPARED. So the
conditions !SnapBuildXactNeedsSkip, parsed->dbId == ctx->slot->data.database
and !FilterByOrigin could be true while the transaction is not prepared;
in that case we don't need to call ReorderBufferFinishPrepared (with the
commit flag false) but should call ReorderBufferAbort instead. But I think
there is a problem: if those conditions are in fact false, then we should
return without trying to abort using ReorderBufferAbort. What do you think?

I agree with all your other comments.

regards,
Ajin
Fujitsu Australia

#28Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#27)

On Mon, Sep 21, 2020 at 3:45 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }

I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
== ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
so if those are not true then we wouldn't have prepared this
transaction, i.e. ReorderBufferTxnIsPrepared will be false. So why do we
need to recheck these conditions?

We could enter DecodeAbort even without a prepare, as the code is
common for both XLOG_XACT_ABORT and XLOG_XACT_ABORT_PREPARED. So the
conditions !SnapBuildXactNeedsSkip, parsed->dbId == ctx->slot->data.database
and !FilterByOrigin could be true while the transaction is not prepared;
in that case we don't need to call ReorderBufferFinishPrepared (with the
commit flag false) but should call ReorderBufferAbort instead. But I think
there is a problem: if those conditions are in fact false, then we should
return without trying to abort using ReorderBufferAbort. What do you think?

I think we need to call ReorderBufferAbort at least to clean up the
TXN. Also, if what you are saying is correct then that should be true
without this patch as well, no? If so, we don't need to worry about it
as far as this patch is concerned.

--
With Regards,
Amit Kapila.

#29Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#28)

On Mon, Sep 21, 2020 at 9:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we need to call ReorderBufferAbort at least to clean up the
TXN. Also, if what you are saying is correct then that should be true
without this patch as well, no? If so, we don't need to worry about it
as far as this patch is concerned.

Yes, that is true. So will change this check to:

if (TransactionIdIsValid(xid) &&
ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))

regards,
Ajin Cherian
Fujitsu Australia

#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#29)

On Mon, Sep 21, 2020 at 5:23 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Sep 21, 2020 at 9:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we need to call ReorderBufferAbort at least to clean up the
TXN. Also, if what you are saying is correct then that should be true
without this patch as well, no? If so, we don't need to worry about it
as far as this patch is concerned.

Yes, that is true. So will change this check to:

if (TransactionIdIsValid(xid) &&
ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))

Yeah and add the Assert for skip conditions as asked above.
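
The shape being agreed on here can be sketched as a small standalone decision function. This is only an illustration under stated assumptions, not the patch itself: AbortRecord, AbortAction and choose_abort_action are hypothetical names, and the boolean fields stand in for the results of TransactionIdIsValid, ReorderBufferTxnIsPrepared, SnapBuildXactNeedsSkip, the database comparison and FilterByOrigin. The prepared branch asserts the skip conditions rather than re-testing them, on the reasoning that a transaction which had to be skipped would never have been marked prepared.

```c
#include <assert.h>
#include <stdbool.h>

/* Which path the abort record takes; names are illustrative only. */
typedef enum
{
    ABORT_VIA_FINISH_PREPARED,      /* ROLLBACK PREPARED handled via callbacks */
    ABORT_VIA_REORDER_BUFFER_ABORT  /* plain abort, cleans up the TXN */
} AbortAction;

/* Hypothetical flattened view of the inputs to the check. */
typedef struct
{
    bool xid_valid;             /* TransactionIdIsValid(xid) */
    bool txn_is_prepared;       /* ReorderBufferTxnIsPrepared(...) */
    bool needs_skip;            /* SnapBuildXactNeedsSkip(...) */
    bool same_database;         /* parsed->dbId == ctx->slot->data.database */
    bool filtered_by_origin;    /* FilterByOrigin(...) */
} AbortRecord;

static AbortAction
choose_abort_action(const AbortRecord *rec)
{
    if (rec->xid_valid && rec->txn_is_prepared)
    {
        /*
         * A prepared transaction was already vetted at prepare time, so
         * the skip conditions are asserted instead of being rechecked.
         */
        assert(!rec->needs_skip);
        assert(rec->same_database);
        assert(!rec->filtered_by_origin);
        return ABORT_VIA_FINISH_PREPARED;
    }

    /* Everything else still goes through the ordinary abort path. */
    return ABORT_VIA_REORDER_BUFFER_ABORT;
}
```

Note that the non-prepared path is taken regardless of the skip conditions, matching the point above that ReorderBufferAbort is needed at least to clean up the TXN.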

--
With Regards,
Amit Kapila.

#31Ajin Cherian
itsajin@gmail.com
In reply to: Dilip Kumar (#21)
3 attachment(s)

On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

1.
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+   parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+

I think we don't need to add prepare-time invalidation messages, as we
are now already logging the invalidations at the command level and adding
them to the reorder buffer.

Removed.

2.

+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is setup correctly for the main transaction in case all changes
+ * happened in subtransanctions
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ return;

Do we need to call ReorderBufferCommitChild if we are skipping this transaction?
I think the below check should be before calling ReorderBufferCommitChild.

Done.

3.

+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }

I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
== ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
so if those are not true then we wouldn't have prepared this
transaction, i.e. ReorderBufferTxnIsPrepared will be false. So why do we
need to recheck these conditions?

I didn't change this, as I am seeing cases where the abort is getting
called for transactions that need to be skipped. I also see that the
same check is present in both DecodePrepare and DecodeCommit.
So, while such transactions were not getting prepared or
committed, a ROLLBACK PREPARED is still attempted for them (as part of
the finish-prepared handling). The ReorderBufferTxnIsPrepared() check is
also not proper. I will need to look at this logic again in a future
patch.

4.

+ /* If streaming, reset the TXN so that it is allowed to stream
remaining data. */
+ if (streaming && stream_started)
+ {
+ ReorderBufferResetTXN(rb, txn, snapshot_now,
+   command_id, prev_lsn,
+   specinsert);
+ }
+ else
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"", txn->xid);
+ ReorderBufferTruncateTXN(rb, txn, true);
+ }

Why is only if (streaming) not enough? I agree that if we get here
in streaming mode then stream_started must be true,
but we already have an assert for that.

Changed.

Amit,

I have also changed the test_decoding startup to support two-phase commits
only if specified, similar to how it was done for streaming. I have
also changed the test cases accordingly. However, I have not added it
to the pgoutput startup, as that would require CREATE SUBSCRIPTION
changes. I will do that in a future patch. Some other pending changes
are:

1. Remove snapshots on prepare truncate.
2. Look at why ReorderBufferCleanupTXN is failing after a
ReorderBufferTruncateTXN
3. Add prepare support to streaming
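
The opt-in behaviour described above (two-phase decoding enabled only when the option is given, as with streaming) can be sketched like this. It is a standalone approximation with assumptions called out: parse_bool_option is a simplified, hypothetical stand-in for a boolean parser and accepts only a few spellings, and resolve_twophase is a hypothetical helper that mirrors the idea of AND-ing the plugin's capability with the user's request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified, hypothetical boolean parser: accepts only a handful of
 * spellings. Returns false if the value is not recognised.
 */
static bool
parse_bool_option(const char *value, bool *result)
{
    if (strcmp(value, "1") == 0 || strcmp(value, "true") == 0 ||
        strcmp(value, "on") == 0)
    {
        *result = true;
        return true;
    }
    if (strcmp(value, "0") == 0 || strcmp(value, "false") == 0 ||
        strcmp(value, "off") == 0)
    {
        *result = false;
        return true;
    }
    return false;
}

/*
 * Work out whether two-phase decoding ends up enabled.
 *
 * plugin_supports_2pc: the plugin provides the two-phase callbacks.
 * option_value: value of the 'two-phase-commit' option, or NULL if the
 *               option was absent or given without a value.
 *
 * Returns false if the value cannot be parsed (the caller would raise
 * an error); otherwise stores the effective setting in *enabled.
 */
static bool
resolve_twophase(bool plugin_supports_2pc, const char *option_value,
                 bool *enabled)
{
    bool requested = false;     /* default: off unless asked for */

    assert(enabled != NULL);

    if (option_value != NULL &&
        !parse_bool_option(option_value, &requested))
        return false;

    /* Both the capability and the explicit request are required. */
    *enabled = plugin_supports_2pc && requested;
    return true;
}
```

With this shape, existing callers that never pass the option keep the old commit-time decoding.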

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v5-0001-Support-decoding-of-two-phase-transactions.patch (application/octet-stream)
From 996e9dbc78f61729cfb1295363620cbc95e974cc Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 22 Sep 2020 06:17:43 -0400
Subject: [PATCH v5] Support decoding of two-phase transactions

Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes documentation changes.
---
 contrib/test_decoding/expected/prepared.out     | 187 ++++++++++++--
 contrib/test_decoding/sql/prepared.sql          |  79 +++++-
 contrib/test_decoding/test_decoding.c           | 166 +++++++++++++
 doc/src/sgml/logicaldecoding.sgml               | 110 ++++++++-
 src/backend/replication/logical/decode.c        | 129 +++++++++-
 src/backend/replication/logical/logical.c       | 175 ++++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 309 +++++++++++++++++++++---
 src/include/replication/logical.h               |   5 +
 src/include/replication/output_plugin.h         |  37 +++
 src/include/replication/reorderbuffer.h         |  75 +++++-
 10 files changed, 1185 insertions(+), 87 deletions(-)

diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..fd0e8a4 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,50 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
  init
 (1 row)
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (2);
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (4);
 -- test prepared xact containing ddl
 BEGIN;
@@ -26,45 +57,149 @@ INSERT INTO test_prepared1 VALUES (5);
 ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
                                   data                                   
 -------------------------------------------------------------------------
  BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:4
  COMMIT
  BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:5
  table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
  COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
  COMMIT
  BEGIN
  table public.test_prepared2: INSERT: id[integer]:9
  COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+ relation | locktype | mode 
+----------+----------+------
+(0 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                    data                                    
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
 
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..162fe43 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -2,21 +2,25 @@
 SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 INSERT INTO test_prepared1 VALUES (2);
 
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 
 INSERT INTO test_prepared1 VALUES (4);
 
@@ -27,24 +31,83 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
 
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
 INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 
 COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- make sure stuff still works
 INSERT INTO test_prepared1 VALUES (8);
 INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- cleanup
 DROP TABLE test_prepared1;
 DROP TABLE test_prepared2;
 
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e60ab34..1bb17a6 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -88,6 +93,19 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
+
 
 void
 _PG_init(void)
@@ -116,6 +134,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
+
 }
 
 
@@ -127,6 +150,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool 		enable_2pc = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +160,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +252,42 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_2pc))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid") == 0)
+		{
+			if (elem->arg)
+			{
+				errno = 0;
+				data->check_xid = (TransactionId)
+					strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno == EINVAL || errno == ERANGE)
+					ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid is not a valid number: \"%s\"",
+								strVal(elem->arg))));
+			}
+			else
+				ereport(FATAL,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid needs an input value")));
+
+			if (data->check_xid <= 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("Specify positive value for parameter \"%s\","
+								" you specified \"%s\"",
+								elem->defname, strVal(elem->arg))));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->enable_twophase &= enable_2pc;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +359,94 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +605,22 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* if check_xid is specified */
+	if (TransactionIdIsValid(data->check_xid))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid);
+		while (TransactionIdIsInProgress(data->check_xid))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid) &&
+			!TransactionIdDidCommit(data->check_xid))
+			elog(LOG, "%u aborted", data->check_xid);
+
+		Assert(TransactionIdDidAbort(data->check_xid));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..bd4542e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,6 +387,10 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeAbortPreparedCB abort_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
@@ -477,7 +481,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. The transaction that is currently being decoded
+     may be aborted concurrently by a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction is
+     aborted as well.
     </para>
 
     <note>
@@ -578,6 +588,55 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The optional <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Commit Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+     <title>Rollback Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>abort_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +646,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction, or any other uncommitted transaction, this
+      change callback might also error out due to a concurrent rollback of
+      that very same transaction. In that case, the logical decoding of the
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +729,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded at this prepare
+       stage, or later as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time. To signal that
+       decoding at prepare time should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      is passed separately because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that is later used to refer to this
+      transaction in <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +783,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. As with the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction, or any other uncommitted transaction, this message
+      callback might also error out due to a concurrent rollback of
+      that very same transaction. In that case, the logical decoding of the
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
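The documentation above requires the prepare filter to give the same answer for a given xid and gid every time it is called. As a rough standalone illustration (not the plugin API itself), a rule that depends only on its arguments satisfies that requirement. The type xid_t and the function filter_prepare_by_gid are hypothetical names used only in this sketch; the "_nodecode" marker is the one test_decoding uses in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for PostgreSQL's TransactionId (assumption for this sketch). */
typedef uint32_t xid_t;

/*
 * Hypothetical filter rule: skip PREPARE-time decoding for GIDs carrying
 * the "_nodecode" marker. The result depends only on the arguments and on
 * no mutable state, so repeated calls for the same (xid, gid) pair always
 * agree.
 */
static bool
filter_prepare_by_gid(xid_t xid, const char *gid)
{
	(void) xid;					/* unused by this particular rule */
	assert(gid != NULL);

	return strstr(gid, "_nodecode") != NULL;
}
```

A rule that consulted a counter, the clock, or a setting that can change between calls would not meet the requirement, since the filter may be consulted again when the COMMIT PREPARED or ROLLBACK PREPARED record is decoded.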
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f21f61d..b37b62d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
+				/* check that output plugin is capable of twophase decoding */
+				if (!ctx->enable_twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -647,9 +667,69 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+	 * A regular commit simply triggers a replay of transaction changes from
+	 * the reorder buffer. For COMMIT PREPARED, however, that replay already
+	 * happened at PREPARE time, so we only need to notify the subscriber
+	 * that the GID finally committed.
+	 *
+	 * For output plugins that do not support PREPARE-time decoding of
+	 * two-phase transactions, we never even see the PREPARE and all two-phase
+	 * transactions simply fall through to the second branch.
+	 */
+	if (TransactionIdIsValid(parsed->twophase_xid) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder,
+								   parsed->twophase_xid, parsed->twophase_gid))
+	{
+		Assert(xid == parsed->twophase_xid);
+		/* we are processing COMMIT PREPARED */
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* replay actions of all transaction + subtransactions in order */
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to that of DecodeCommit.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 }
 
 /*
@@ -661,6 +741,31 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 			xl_xact_parsed_abort *parsed, TransactionId xid)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it's ROLLBACK PREPARED then handle it via callbacks.
+	 */
+	if (TransactionIdIsValid(xid) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		parsed->dbId == ctx->slot->data.database &&
+		!FilterByOrigin(ctx, origin_id) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+	{
+
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
 
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
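To summarize the decode.c changes above: whether a two-phase transaction's changes are replayed at PREPARE or deferred to COMMIT PREPARED depends on two things, namely whether the plugin enabled two-phase decoding and whether its optional filter asks to skip this GID. The following standalone model is only a sketch of that decision. ReplayPoint, filter_prepare_fn, choose_replay_point and skip_prefix_filter are hypothetical names, and the real code also consults snapshot, database and origin checks that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Where the changes of a two-phase transaction get replayed. */
typedef enum
{
	REPLAY_AT_PREPARE,			/* changes sent when PREPARE is decoded */
	REPLAY_AT_COMMIT_PREPARED	/* handled like a regular one-phase commit */
} ReplayPoint;

/* Hypothetical stand-in for the plugin's filter_prepare callback. */
typedef bool (*filter_prepare_fn) (const char *gid);

static ReplayPoint
choose_replay_point(bool enable_twophase, filter_prepare_fn filter,
					const char *gid)
{
	assert(gid != NULL);

	/* The plugin cannot decode at PREPARE time at all. */
	if (!enable_twophase)
		return REPLAY_AT_COMMIT_PREPARED;

	/* The plugin asked to skip this particular transaction. */
	if (filter != NULL && filter(gid))
		return REPLAY_AT_COMMIT_PREPARED;

	return REPLAY_AT_PREPARE;
}

/* Example filter for the sketch: skip GIDs that start with "skip_". */
static bool
skip_prefix_filter(const char *gid)
{
	return strncmp(gid, "skip_", 5) == 0;
}
```

In the REPLAY_AT_PREPARE case, the later COMMIT PREPARED or ROLLBACK PREPARED record only produces a notification for the GID, since the changes were already sent.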
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 0f6af95..4e95337 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -58,6 +58,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -206,6 +214,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -225,6 +237,19 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
+ 	/*
+	 * To support two-phase logical decoding, we require the prepare,
+	 * commit-prepared and abort-prepared callbacks. The filter-prepare callback
+	 * is optional. However, we enable two-phase logical decoding when at least
+	 * one of these callbacks is provided, so that missing ones are easy to spot.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->enable_twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.abort_prepared_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
 	/*
 	 * streaming callbacks
 	 *
@@ -782,6 +807,111 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then the prepare callback is mandatory */
+	if (ctx->enable_twophase && ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then the commit prepared callback is mandatory */
+	if (ctx->enable_twophase && ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then the abort prepared callback is mandatory */
+	if (ctx->enable_twophase && ctx->callbacks.abort_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register abort_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -858,6 +988,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of twophase at PREPARE time is not enabled. In that
+	 * case all twophase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->enable_twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
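The logical.c changes above enable two-phase decoding as soon as any one of the four callbacks is registered, and each wrapper then raises an error if its own mandatory callback is missing. The standalone model below sketches that rule under simplifying assumptions. The names TwoPhaseCallbacks, twophase_enabled, first_missing_callback and noop_cb are hypothetical, and unlike the patch, which checks lazily inside each wrapper, this sketch reports the first missing callback up front.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified callback type; the real callbacks take several arguments. */
typedef void (*generic_cb) (void);

typedef struct
{
	generic_cb	prepare_cb;
	generic_cb	commit_prepared_cb;
	generic_cb	abort_prepared_cb;
	generic_cb	filter_prepare_cb;	/* optional */
} TwoPhaseCallbacks;

typedef enum
{
	MISSING_NONE,
	MISSING_PREPARE,
	MISSING_COMMIT_PREPARED,
	MISSING_ABORT_PREPARED
} MissingCallback;

/* Two-phase decoding is enabled if at least one callback is registered. */
static bool
twophase_enabled(const TwoPhaseCallbacks *cb)
{
	assert(cb != NULL);

	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->abort_prepared_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/* Report the first mandatory callback that is missing, if any. */
static MissingCallback
first_missing_callback(const TwoPhaseCallbacks *cb)
{
	if (!twophase_enabled(cb))
		return MISSING_NONE;	/* nothing is required when disabled */
	if (cb->prepare_cb == NULL)
		return MISSING_PREPARE;
	if (cb->commit_prepared_cb == NULL)
		return MISSING_COMMIT_PREPARED;
	if (cb->abort_prepared_cb == NULL)
		return MISSING_ABORT_PREPARED;
	return MISSING_NONE;
}

/* Placeholder callback used to mark a slot as registered. */
static void
noop_cb(void)
{
}
```

Registering only filter_prepare_cb is the interesting case: it switches two-phase decoding on, and the missing prepare_cb is then reported the first time a prepared transaction needs it.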
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1975d62..d96be77 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -413,6 +414,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/* free data that's contained */
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
 
 	if (txn->tuplecid_hash != NULL)
 	{
@@ -1401,6 +1407,59 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 }
 
 /*
+ * Cleanup the leftover contents of a transaction, usually after the
+ * transaction has been COMMIT PREPARED or ROLLBACK PREPARED. This does the
+ * rest of the cleanup that was not done when the transaction was PREPARED.
+ */
+static void
+ReorderBufferCleanupPreparedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	bool	found;
+
+	/*
+	 * Cleanup the base snapshot, if set.
+	 */
+	if (txn->base_snapshot != NULL)
+	{
+		SnapBuildSnapDecRefcount(txn->base_snapshot);
+		dlist_delete(&txn->base_snapshot_node);
+	}
+
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Remove TXN from its containing list.
+	 *
+	 * Note: if txn is known as subxact, we are deleting the TXN from its
+	 * parent's list of known subxacts; this leaves the parent's nsubxacts
+	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
+	 * from the LSN-ordered list of toplevel TXNs.
+	 */
+	dlist_delete(&txn->node);
+
+	/* now remove reference from buffer */
+	hash_search(rb->by_txn,
+				(void *) &txn->xid,
+				HASH_REMOVE,
+				&found);
+	Assert(found);
+
+	/* remove entries spilled to disk */
+	if (rbtxn_is_serialized(txn))
+		ReorderBufferRestoreCleanup(rb, txn);
+
+	/* deallocate */
+	ReorderBufferReturnTXN(rb, txn);
+}
+
+/*
  * Cleanup the contents of a transaction, usually after the transaction
  * committed or aborted.
  */
@@ -1502,12 +1561,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or after a PREPARE. The flag txn_prepared indicates if this is
+ * called after a PREPARE. If streaming, keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots. If after a PREPARE,
+ * keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1526,7 +1587,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the toplevel txn */
@@ -1560,9 +1621,30 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1880,7 +1962,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1987,7 +2069,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2331,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					break;
 			}
 		}
-
 		/*
 		 * There's a speculative insertion remaining, just clean in up, it
 		 * can't have been successful, otherwise we'd gotten a confirmation
@@ -2278,7 +2359,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for twophase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2319,11 +2409,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, false);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (rbtxn_prepared(txn))
+		{
+			ReorderBufferTruncateTXN(rb, txn, true);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2352,17 +2448,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can only occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we are
+			 * sending the data out on a PREPARE during a two-phase commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2370,10 +2467,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2395,23 +2501,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+                            ReorderBuffer *rb, TransactionId xid,
+					        XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					        TimestampTz commit_time,
+					        RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2453,6 +2552,140 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+   ReorderBufferTXN *txn;
+
+   txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+   return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+                               false);
+
+	/*
+	* Always call the prepare filter. It's the job of the prepare filter to
+	* give us the *same* response for a given xid across multiple calls
+	* (including ones on restart)
+	*/
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	* The transaction may or may not exist (during restarts for example).
+	* Anyway, 2PC transactions do not contain any reorderbuffers. So allow
+	* it to be created below.
+	*/
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+		rb->commit_prepared(rb, txn, commit_lsn);
+	}
+	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+		rb->abort_prepared(rb, txn, commit_lsn);
+	}
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(rb, txn);
+	ReorderBufferCleanupPreparedTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2495,7 +2728,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
+
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 45abc44..ee63e7b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -84,6 +84,11 @@ typedef struct LogicalDecodingContext
 	 */
 	bool		streaming;
 
+	/*
+	 * Does the output plugin support two phase decoding, and is it enabled?
+	 */
+	bool		enable_twophase;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..96e269b 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/abort_prepared callbacks or wait until COMMIT PREPARED and
+ * be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -171,6 +204,10 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeAbortPreparedCB abort_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1ae17d5..4d4e35d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -162,9 +163,13 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
-#define RBTXN_IS_STREAMED         0x0008
-#define RBTXN_HAS_TOAST_INSERT    0x0010
-#define RBTXN_HAS_SPEC_INSERT     0x0020
+#define RBTXN_PREPARE             0x0008
+#define RBTXN_COMMIT_PREPARED     0x0010
+#define RBTXN_ROLLBACK_PREPARED   0x0020
+#define RBTXN_COMMIT              0x0040
+#define RBTXN_IS_STREAMED         0x0080
+#define RBTXN_HAS_TOAST_INSERT    0x0100
+#define RBTXN_HAS_SPEC_INSERT     0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -218,6 +223,15 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* is this txn prepared? */
+#define rbtxn_prepared(txn)            (((txn)->txn_flags & RBTXN_PREPARE) != 0)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn)     (((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn)   (((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn)              (((txn)->txn_flags & RBTXN_COMMIT) != 0)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -229,6 +243,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of 2PC we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -390,6 +407,39 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+                                     ReorderBuffer *rb,
+                                     ReorderBufferTXN *txn,
+                                     XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             TransactionId xid,
+                                             const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+                                       ReorderBuffer *rb,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+                                              ReorderBuffer *rb,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr abort_lsn);
+
+
+
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -482,6 +532,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferAbortPreparedCB abort_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -548,6 +603,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -571,6 +631,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1

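For readers skimming the reorderbuffer.c hunks above, the two decisions the patch adds to the end of ReorderBufferProcessTXN can be sketched roughly as follows. This is only an illustration, not code from the patch: SketchTxn, sketch_final_call() and sketch_cleanup() are made-up names, only the RBTXN_PREPARE value is taken from the reorderbuffer.h hunk, and the final-callback choice models just the non-streaming branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag value copied from the reorderbuffer.h hunk in the patch. */
#define RBTXN_PREPARE 0x0008

/* Hypothetical stand-in for ReorderBufferTXN; only txn_flags is modelled. */
typedef struct SketchTxn
{
	uint32_t	txn_flags;
} SketchTxn;

/* Which output plugin callback ends the replay (non-streaming case). */
typedef enum SketchFinalCall
{
	SKETCH_CALL_COMMIT,			/* rb->commit(rb, txn, commit_lsn) */
	SKETCH_CALL_PREPARE			/* rb->prepare(rb, txn, commit_lsn) */
} SketchFinalCall;

/* What happens to the buffered transaction afterwards. */
typedef enum SketchCleanup
{
	SKETCH_TRUNCATE_STREAMED,	/* ReorderBufferTruncateTXN(rb, txn, false) */
	SKETCH_TRUNCATE_PREPARED,	/* ReorderBufferTruncateTXN(rb, txn, true) */
	SKETCH_CLEANUP_TXN			/* ReorderBufferCleanupTXN(rb, txn) */
} SketchCleanup;

/* Same test as rbtxn_prepared(), normalised to a bool. */
static bool
sketch_is_prepared(const SketchTxn *txn)
{
	assert(txn != NULL);
	return (txn->txn_flags & RBTXN_PREPARE) != 0;
}

/* PREPARE for two-phase transactions, COMMIT for regular ones. */
static SketchFinalCall
sketch_final_call(const SketchTxn *txn)
{
	return sketch_is_prepared(txn) ? SKETCH_CALL_PREPARE : SKETCH_CALL_COMMIT;
}

/* Streaming is checked first, then the prepared flag, as in the patch. */
static SketchCleanup
sketch_cleanup(bool streaming, const SketchTxn *txn)
{
	if (streaming)
		return SKETCH_TRUNCATE_STREAMED;
	if (sketch_is_prepared(txn))
		return SKETCH_TRUNCATE_PREPARED;
	return SKETCH_CLEANUP_TXN;
}
```

The point of the ordering is that a prepared transaction is only truncated, not fully cleaned up, at PREPARE time, since the later COMMIT PREPARED or ROLLBACK PREPARED still needs to find it.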
v5-0002-Tap-test-to-test-concurrent-aborts-during-2-phase.patch (application/octet-stream)
From e8fc641a1a98c611efd469e867bdd04a1947d909 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 22 Sep 2020 06:35:11 -0400
Subject: [PATCH v5] Tap test to test concurrent aborts during 2 phase commits

This test is specifically for testing concurrent abort while logical decode
is ongoing. Pass in the xid of the 2PC to the plugin as an option.
On the receipt of a valid "check-xid", the change API in the test decoding
plugin will wait for it to be aborted.
---
 contrib/test_decoding/Makefile          |   2 +
 contrib/test_decoding/t/001_twophase.pl | 119 ++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f23f15b..4905a0a 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..d8c2c29
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+ok(looks_like_number($xid2pc), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
-- 
1.8.3.1

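The poll_output_until helper in the TAP test above is a bounded retry loop: probe the server log, wait, and give up after a fixed number of attempts. A rough C rendering of the same idea is below, in case it helps when reviewing the test logic. It is a sketch under assumptions: all names are invented, the "log" is an in-memory string that grows on each wait, and the wait hook stands in for the 0.1 second usleep so the example does not actually sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef bool (*sketch_probe_fn) (void *arg);
typedef void (*sketch_wait_fn) (void *arg);

/*
 * Probe up to max_attempts times, calling the wait hook between failed
 * probes. Returns true as soon as a probe succeeds, false on timeout.
 */
static bool
sketch_poll_until(sketch_probe_fn probe, sketch_wait_fn wait,
				  void *arg, int max_attempts)
{
	assert(probe != NULL && wait != NULL);

	for (int attempt = 0; attempt < max_attempts; attempt++)
	{
		if (probe(arg))
			return true;
		wait(arg);
	}
	return false;
}

/* Hypothetical in-memory log whose contents become visible over time. */
typedef struct SketchLog
{
	const char *contents;		/* everything that will eventually be logged */
	size_t		visible;		/* bytes "written" so far */
	size_t		step;			/* bytes that appear during each wait */
	const char *needle;			/* text we are waiting for */
} SketchLog;

/* Probe: is the needle inside the visible part of the log? */
static bool
sketch_log_contains(void *arg)
{
	SketchLog  *log = arg;
	size_t		nlen = strlen(log->needle);

	for (size_t off = 0; off + nlen <= log->visible; off++)
	{
		if (memcmp(log->contents + off, log->needle, nlen) == 0)
			return true;
	}
	return false;
}

/* Wait hook: a real test would sleep here; the sketch reveals more log. */
static void
sketch_log_advance(void *arg)
{
	SketchLog  *log = arg;
	size_t		total = strlen(log->contents);

	log->visible += log->step;
	if (log->visible > total)
		log->visible = total;
}
```

Note one difference from the Perl helper: the test matches the expected text as a regular expression, while this sketch does a plain substring search.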
v5-0003-pgoutput-output-plugin-support-for-logical-decodi.patch (application/octet-stream)
From fbcc9bbc36d002a9e20edd76a5eac0eca0434ed0 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 22 Sep 2020 07:12:45 -0400
Subject: [PATCH v5] pgoutput output plugin support for logical decoding of 2pc

---
 src/backend/access/transam/twophase.c       |  31 ++++++
 src/backend/replication/logical/proto.c     |  90 ++++++++++++++-
 src/backend/replication/logical/worker.c    | 147 ++++++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c |  54 ++++++++-
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++++-
 src/test/subscription/t/020_twophase.pl     | 163 ++++++++++++++++++++++++++++
 7 files changed, 514 insertions(+), 9 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl

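As a reading aid for the proto.c hunk in this patch: the new 'P' message is written as the type byte, one flags byte, three 64-bit fields (prepare_lsn, end_lsn, commit time) and then the NUL-terminated GID. The standalone sketch below encodes and decodes that layout in a plain byte buffer. It is not the patch's code. The SKETCH_* flag values and the 200-byte GID capacity are assumptions (the real constants are in logicalproto.h, which is not quoted here), and unlike logicalrep_read_prepare() the reader also consumes the leading 'P', which apply_dispatch() handles in the real code. The reader bounds the GID copy instead of using strcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define SKETCH_IS_PREPARE            0x01
#define SKETCH_IS_COMMIT_PREPARED    0x02
#define SKETCH_IS_ROLLBACK_PREPARED  0x04

#define SKETCH_HEADER_LEN 26	/* 'P' + flags + three 8-byte fields */

typedef struct SketchPrepareMsg
{
	uint8_t		flags;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[200];		/* assumed capacity */
} SketchPrepareMsg;

/* Store a 64-bit value in network (big-endian) byte order. */
static void
sketch_put_u64(uint8_t *out, uint64_t v)
{
	for (int i = 7; i >= 0; i--)
	{
		out[i] = (uint8_t) (v & 0xFF);
		v >>= 8;
	}
}

/* Load a 64-bit value stored in network byte order. */
static uint64_t
sketch_get_u64(const uint8_t *in)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}

/* Exactly one of the three flags must be set. */
static int
sketch_flags_valid(uint8_t flags)
{
	return flags == SKETCH_IS_PREPARE ||
		flags == SKETCH_IS_COMMIT_PREPARED ||
		flags == SKETCH_IS_ROLLBACK_PREPARED;
}

/* Returns the number of bytes written, or 0 if the message does not fit. */
static size_t
sketch_write_prepare(uint8_t *out, size_t cap, const SketchPrepareMsg *msg)
{
	size_t		gidlen = strlen(msg->gid) + 1;	/* include trailing NUL */

	if (!sketch_flags_valid(msg->flags) || SKETCH_HEADER_LEN + gidlen > cap)
		return 0;

	out[0] = 'P';
	out[1] = msg->flags;
	sketch_put_u64(out + 2, msg->prepare_lsn);
	sketch_put_u64(out + 10, msg->end_lsn);
	sketch_put_u64(out + 18, (uint64_t) msg->prepare_time);
	memcpy(out + SKETCH_HEADER_LEN, msg->gid, gidlen);
	return SKETCH_HEADER_LEN + gidlen;
}

/* Returns 1 on success, 0 if the input is malformed or the GID too long. */
static int
sketch_read_prepare(const uint8_t *in, size_t len, SketchPrepareMsg *msg)
{
	const uint8_t *gid = in + SKETCH_HEADER_LEN;
	const uint8_t *nul;
	size_t		gidlen;

	if (len <= SKETCH_HEADER_LEN || in[0] != 'P' || !sketch_flags_valid(in[1]))
		return 0;

	nul = memchr(gid, '\0', len - SKETCH_HEADER_LEN);
	if (nul == NULL)
		return 0;
	gidlen = (size_t) (nul - gid) + 1;
	if (gidlen > sizeof(msg->gid))
		return 0;

	msg->flags = in[1];
	msg->prepare_lsn = sketch_get_u64(in + 2);
	msg->end_lsn = sketch_get_u64(in + 10);
	msg->prepare_time = (int64_t) sketch_get_u64(in + 18);
	memcpy(msg->gid, gid, gidlen);
	return 1;
}
```

Because the same message type carries PREPARE, COMMIT PREPARED and ROLLBACK PREPARED, the subscriber has to switch on the flags byte, which is what apply_handle_prepare() does in the worker.c hunk.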
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ef4f998..bed87d5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,37 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (!gxact->valid)
+			continue;
+		if (strcmp(gxact->gid, gid) != 0)
+			continue;
+
+		LWLockRelease(TwoPhaseStateLock);
+
+		return true;
+	}
+
+	LWLockRelease(TwoPhaseStateLock);
+
+	return false;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..291ed10 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,12 +72,17 @@ logicalrep_read_begin(StringInfo in, LogicalRepBeginData *begin_data)
  */
 void
 logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
-						XLogRecPtr commit_lsn)
+						XLogRecPtr commit_lsn, bool is_commit)
 {
 	uint8		flags = 0;
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
+	if (is_commit)
+		flags |= LOGICALREP_IS_COMMIT;
+	else
+		flags |= LOGICALREP_IS_ABORT;
+
 	/* send the flags field (unused for now) */
 	pq_sendbyte(out, flags);
 
@@ -88,16 +93,20 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 }
 
 /*
- * Read transaction COMMIT from the stream.
+ * Read transaction COMMIT|ABORT from the stream.
  */
 void
 logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 {
-	/* read flags (unused for now) */
+	/* read flags */
 	uint8		flags = pq_getmsgbyte(in);
 
-	if (flags != 0)
-		elog(ERROR, "unrecognized flags %u in commit message", flags);
+	if (!CommitFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in commit|abort message",
+			 flags);
+
+	/* the flag is either commit or abort */
+	commit_data->is_commit = (flags == LOGICALREP_IS_COMMIT);
 
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
@@ -106,6 +115,77 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for 2PC transactions, in which case we
+	 * expect to have a non-empty GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(strlen(txn->gid) > 0);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags |= LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags |= LOGICALREP_IS_PREPARE;
+
+	/* Make sure exactly one of the expected flags is set. */
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index d239d28..62c571e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -729,7 +729,11 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_lsn = commit_data.end_lsn;
 		replorigin_session_origin_timestamp = commit_data.committime;
 
-		CommitTransactionCommand();
+		if (commit_data.is_commit)
+			CommitTransactionCommand();
+		else
+			AbortCurrentTransaction();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -749,6 +753,141 @@ apply_handle_commit(StringInfo s)
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * On the publisher, a prepared transaction may be aborted concurrently
+	 * while it is being decoded. In that case decoding stops and the
+	 * transaction is aborted immediately. The ROLLBACK PREPARED message
+	 * still reaches the subscriber, however, so it is OK for the GID to be
+	 * missing here.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
 /*
  * Handle ORIGIN message.
  *
@@ -1909,10 +2048,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index eb1f230..729b655 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+							 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -143,6 +149,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -373,7 +383,49 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginUpdateProgress(ctx);
 
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit(ctx->out, txn, commit_lsn, true);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
 }
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 607a728..fb07580 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -85,20 +85,55 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
+	bool        is_commit;
 	XLogRecPtr	commit_lsn;
 	XLogRecPtr	end_lsn;
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/* types of the commit protocol message */
+#define LOGICALREP_IS_COMMIT			0x01
+#define LOGICALREP_IS_ABORT				0x02
+
+/* commit message is COMMIT or ABORT, and there is nothing else */
+#define CommitFlagsAreValid(flags) \
+	((flags == LOGICALREP_IS_COMMIT) || (flags == LOGICALREP_IS_ABORT))
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+}			LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	((flags == LOGICALREP_IS_PREPARE) || \
+	 (flags == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 (flags == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
 extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
-									XLogRecPtr commit_lsn);
+									XLogRecPtr commit_lsn, bool is_commit);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+						LogicalRepPrepareData * prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..c7f373d
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+        'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+   is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+   is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1

#32Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#31)

On Tue, Sep 22, 2020 at 5:18 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

3.

+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }

I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
== ctx->slot->data.database and !FilterByOrigin in DecodePrepare, so
if those are not true then we wouldn't have prepared this transaction,
i.e. ReorderBufferTxnIsPrepared will be false. So why do we need to
recheck these conditions?

I didn't change this, as I am seeing cases where the Abort is getting
called for transactions that need to be skipped. I also see that the
same check is there in both DecodePrepare and DecodeCommit.
So, while those transactions were not getting prepared or
committed, a ROLLBACK PREPARED is still attempted for them (as part of
finish prepared handling). The check in ReorderBufferTxnIsPrepared() is
also not proper.

If the transaction is prepared, which you can ensure via
ReorderBufferTxnIsPrepared() (considering you have a proper check in
that function), it should not require skipping the transaction in
Abort. One way it could happen is if you clean up the ReorderBufferTxn
in Prepare, which you were doing in an earlier version of the patch and
which I pointed out was wrong. If you have changed that, then I don't
know why it could fail; maybe someplace else during prepare the patch
is freeing it. Just check that.
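
The argument above can be illustrated with a toy model. This is only a sketch and not PostgreSQL code: ToyTxn, ToyBuffer and every toy_* function are hypothetical names invented here, and the three boolean arguments merely stand in for the SnapBuildXactNeedsSkip, database and FilterByOrigin checks. The point it tries to show is that if the prepared flag is set only after the filters pass at PREPARE time, a later lookup of that flag already encodes the filter result.

```c
/* Toy model only: none of these names exist in PostgreSQL. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_TXNS 8
#define TOY_GID_LEN 64

typedef struct ToyTxn
{
	unsigned int xid;
	bool		prepared;		/* set only when PREPARE was decoded */
	char		gid[TOY_GID_LEN];
} ToyTxn;

typedef struct ToyBuffer
{
	ToyTxn		txns[TOY_MAX_TXNS];
	int			ntxns;
} ToyBuffer;

/* Stand-in for the skip/database/origin checks done at PREPARE time. */
static bool
toy_passes_filters(bool needs_skip, bool same_db, bool filtered_origin)
{
	return !needs_skip && same_db && !filtered_origin;
}

/* Decode PREPARE: remember the transaction only if every filter passes. */
static bool
toy_decode_prepare(ToyBuffer *buf, unsigned int xid, const char *gid,
				   bool needs_skip, bool same_db, bool filtered_origin)
{
	ToyTxn	   *txn;

	assert(buf != NULL && gid != NULL);

	if (!toy_passes_filters(needs_skip, same_db, filtered_origin) ||
		buf->ntxns >= TOY_MAX_TXNS)
		return false;

	txn = &buf->txns[buf->ntxns++];
	txn->xid = xid;
	txn->prepared = true;
	strncpy(txn->gid, gid, TOY_GID_LEN - 1);
	txn->gid[TOY_GID_LEN - 1] = '\0';
	return true;
}

/* Lookup for ROLLBACK PREPARED: the flag alone reflects the filter result. */
static bool
toy_txn_is_prepared(const ToyBuffer *buf, unsigned int xid, const char *gid)
{
	for (int i = 0; i < buf->ntxns; i++)
	{
		const ToyTxn *txn = &buf->txns[i];

		if (txn->xid == xid && txn->prepared && strcmp(txn->gid, gid) == 0)
			return true;
	}
	return false;
}
```

In this model a transaction that was filtered out at PREPARE is never found by toy_txn_is_prepared(), so the abort path has nothing left to recheck. If the real code frees or resets the transaction entry between PREPARE and ROLLBACK PREPARED, that invariant no longer holds, which is the case worth checking for.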

I will need to look at
this logic again in a future patch.

No problem. I think you can handle the other comments and then we can
come back to this. You might want to share the exact details of the
test (maybe a narrowed-down version of the original test) so that I or
someone else might be able to help you with that.

--
With Regards,
Amit Kapila.

#33Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#32)
4 attachment(s)

On Wed, Sep 23, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

No problem. I think you can handle the other comments and then we can
come back to this. You might want to share the exact details of the
test (maybe a narrowed-down version of the original test) so that I or
someone else might be able to help you with that.

--
With Regards,
Amit Kapila.

I have added a new patch for supporting two-phase commit semantics in
the streaming APIs for the logical decoding plugins. I have added 3
APIs:
1. stream_prepare
2. stream_commit_prepared
3. stream_abort_prepared

I have also added support for the new APIs in the test_decoding
plugin. I have not yet added it to pgoutput.
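
As a rough illustration of how a plugin might register three such callbacks, here is a self-contained toy sketch. It is not the API from the attached patch: ToyTxn, ToyCallbacks, the callback signature and all toy_* functions are hypothetical and chosen only for this example, and the all-or-nothing check in toy_supports_stream_twophase() is this sketch's own assumption rather than documented behaviour.

```c
/* Toy sketch only: these types and signatures are not the PostgreSQL API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct ToyTxn
{
	unsigned int xid;
	const char *gid;
} ToyTxn;

/* One hypothetical signature shared by the three streaming 2PC callbacks. */
typedef void (*ToyStreamTwoPhaseCB) (const ToyTxn *txn, char *out, size_t outlen);

typedef struct ToyCallbacks
{
	ToyStreamTwoPhaseCB stream_prepare_cb;
	ToyStreamTwoPhaseCB stream_commit_prepared_cb;
	ToyStreamTwoPhaseCB stream_abort_prepared_cb;
} ToyCallbacks;

static void
toy_stream_prepare(const ToyTxn *txn, char *out, size_t outlen)
{
	snprintf(out, outlen, "stream prepare '%s', xid %u", txn->gid, txn->xid);
}

static void
toy_stream_commit_prepared(const ToyTxn *txn, char *out, size_t outlen)
{
	snprintf(out, outlen, "stream commit prepared '%s', xid %u", txn->gid, txn->xid);
}

static void
toy_stream_abort_prepared(const ToyTxn *txn, char *out, size_t outlen)
{
	snprintf(out, outlen, "stream abort prepared '%s', xid %u", txn->gid, txn->xid);
}

/* Plugin init: fill in the callback table, as an output plugin would. */
static void
toy_plugin_init(ToyCallbacks *cb)
{
	assert(cb != NULL);
	cb->stream_prepare_cb = toy_stream_prepare;
	cb->stream_commit_prepared_cb = toy_stream_commit_prepared;
	cb->stream_abort_prepared_cb = toy_stream_abort_prepared;
}

/* Assumption of this sketch: all three callbacks must be present together. */
static bool
toy_supports_stream_twophase(const ToyCallbacks *cb)
{
	return cb->stream_prepare_cb != NULL &&
		cb->stream_commit_prepared_cb != NULL &&
		cb->stream_abort_prepared_cb != NULL;
}
```

A caller would zero-initialise a ToyCallbacks, run toy_plugin_init() on it, and then invoke the matching callback when a streamed transaction is prepared, committed or rolled back.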

I have also added a fix for the error I saw while calling
ReorderBufferCleanupTXN as part of FinishPrepared handling. As a
result I have removed the function I added earlier,
ReorderBufferCleanupPreparedTXN.
Please have a look at the new changes and let me know what you think.

I will continue to look at:

1. Remove snapshots on prepare truncate.
2. A bug seen during abort of a prepared transaction: the prepared flag
is lost, so it is not possible to tell that it was a previously
prepared transaction.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v6-0001-Support-decoding-of-two-phase-transactions.patch (application/octet-stream)
From 13352a817819b01e9089bfcb1eee1d62545c4c4c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 28 Sep 2020 02:38:44 -0400
Subject: [PATCH v6] Support decoding of two-phase transactions

Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes documentation changes.
---
 contrib/test_decoding/expected/prepared.out     | 187 ++++++++++++--
 contrib/test_decoding/sql/prepared.sql          |  79 +++++-
 contrib/test_decoding/test_decoding.c           | 166 +++++++++++++
 doc/src/sgml/logicaldecoding.sgml               | 110 ++++++++-
 src/backend/replication/logical/decode.c        | 129 +++++++++-
 src/backend/replication/logical/logical.c       | 175 +++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 312 +++++++++++++++++++++---
 src/include/replication/logical.h               |   5 +
 src/include/replication/output_plugin.h         |  37 +++
 src/include/replication/reorderbuffer.h         |  75 +++++-
 10 files changed, 1188 insertions(+), 87 deletions(-)

diff --git a/contrib/test_decoding/expected/prepared.out b/contrib/test_decoding/expected/prepared.out
index 46e915d..fd0e8a4 100644
--- a/contrib/test_decoding/expected/prepared.out
+++ b/contrib/test_decoding/expected/prepared.out
@@ -6,19 +6,50 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_d
  init
 (1 row)
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ PREPARE TRANSACTION 'test_prepared#1'
+(3 rows)
+
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (2);
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:2
+ COMMIT
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(6 rows)
+
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
 INSERT INTO test_prepared1 VALUES (4);
 -- test prepared xact containing ddl
 BEGIN;
@@ -26,45 +57,149 @@ INSERT INTO test_prepared1 VALUES (5);
 ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
-INSERT INTO test_prepared2 VALUES (7);
-COMMIT PREPARED 'test_prepared#3';
--- make sure stuff still works
-INSERT INTO test_prepared1 VALUES (8);
-INSERT INTO test_prepared2 VALUES (9);
--- cleanup
-DROP TABLE test_prepared1;
-DROP TABLE test_prepared2;
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
                                   data                                   
 -------------------------------------------------------------------------
  BEGIN
- table public.test_prepared1: INSERT: id[integer]:1
- COMMIT
- BEGIN
- table public.test_prepared1: INSERT: id[integer]:2
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:4
  COMMIT
  BEGIN
- table public.test_prepared2: INSERT: id[integer]:7
- COMMIT
- BEGIN
  table public.test_prepared1: INSERT: id[integer]:5
  table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(7 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
  COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (8);
+INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:8 data[text]:null
  COMMIT
  BEGIN
  table public.test_prepared2: INSERT: id[integer]:9
  COMMIT
-(22 rows)
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+ relation | locktype | mode 
+----------+----------+------
+(0 rows)
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                    data                                    
+----------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:12
+ PREPARE TRANSACTION 'test_prepared_lock2'
+ COMMIT PREPARED 'test_prepared_lock2'
+(8 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
 
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/prepared.sql b/contrib/test_decoding/sql/prepared.sql
index e726397..162fe43 100644
--- a/contrib/test_decoding/sql/prepared.sql
+++ b/contrib/test_decoding/sql/prepared.sql
@@ -2,21 +2,25 @@
 SET synchronous_commit = on;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
-CREATE TABLE test_prepared1(id int);
-CREATE TABLE test_prepared2(id int);
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
 
 -- test simple successful use of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (1);
 PREPARE TRANSACTION 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 INSERT INTO test_prepared1 VALUES (2);
 
 -- test abort of a prepared xact
 BEGIN;
 INSERT INTO test_prepared1 VALUES (3);
 PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 
 INSERT INTO test_prepared1 VALUES (4);
 
@@ -27,24 +31,83 @@ ALTER TABLE test_prepared1 ADD COLUMN data text;
 INSERT INTO test_prepared1 VALUES (6, 'frakbar');
 PREPARE TRANSACTION 'test_prepared#3';
 
--- test that we decode correctly while an uncommitted prepared xact
--- with ddl exists.
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 
--- separate table because of the lock from the ALTER
--- this will come before the '5' row above, as this commits before it.
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+-- We should see '7' before '5' in our results since it commits first.
+--
 INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 
 COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- make sure stuff still works
 INSERT INTO test_prepared1 VALUES (8);
 INSERT INTO test_prepared2 VALUES (9);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (10, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (11, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+BEGIN;
+insert into test_prepared2 values (12);
+PREPARE TRANSACTION 'test_prepared_lock2';
+COMMIT PREPARED 'test_prepared_lock2';
+
+SELECT 'pg_class' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'pg_class'::regclass;
+
+-- Shouldn't timeout on 2pc decoding.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- will work normally after we commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints
+BEGIN;
+SAVEPOINT test_savepoint;
+CREATE TABLE test_prepared_savepoint (a int);
+PREPARE TRANSACTION 'test_prepared_savepoint';
+COMMIT PREPARED 'test_prepared_savepoint';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 
 -- cleanup
 DROP TABLE test_prepared1;
 DROP TABLE test_prepared2;
 
--- show results
-SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
 
 SELECT pg_drop_replication_slot('regression_slot');
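
As a reading aid for the "nodecode" test above, here is a minimal standalone sketch of the GID rule that the test_decoding filter applies. It assumes only the substring check shown in the patch. The name gid_is_filtered is hypothetical and not part of the patch; the real callback also receives ctx, txn and xid.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for the plugin's prepare filter.
 * Returns true when decoding should be skipped at PREPARE time, so that
 * the transaction is decoded at COMMIT PREPARED instead.
 */
bool
gid_is_filtered(const char *gid)
{
	return strstr(gid, "_nodecode") != NULL;
}
```

With this rule, 'test_prepared_nodecode' is deferred to COMMIT PREPARED, while 'test_prepared_lock' is decoded at PREPARE time. Because the answer depends only on the GID, it stays the same every time the filter is asked about a given transaction.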
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e60ab34..1bb17a6 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -88,6 +93,19 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
+
 
 void
 _PG_init(void)
@@ -116,6 +134,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
+
 }
 
 
@@ -127,6 +150,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_2pc = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +160,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +252,42 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_2pc))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid") == 0)
+		{
+			if (elem->arg)
+			{
+				errno = 0;
+				data->check_xid = (TransactionId)
+					strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno == EINVAL || errno == ERANGE)
+					ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid is not a valid number: \"%s\"",
+								strVal(elem->arg))));
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid needs an input value")));
+
+			if (!TransactionIdIsValid(data->check_xid))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("Specify positive value for parameter \"%s\","
+								" you specified \"%s\"",
+								elem->defname, strVal(elem->arg))));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->enable_twophase &= enable_2pc;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +359,94 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we
+ * demonstrate a simple rule based on the GID: if it contains the
+ * "_nodecode" substring, we skip decoding at PREPARE time and the
+ * transaction is decoded at COMMIT PREPARED instead.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +605,22 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* if check_xid is specified */
+	if (TransactionIdIsValid(data->check_xid))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid);
+		while (TransactionIdIsInProgress(data->check_xid))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid) &&
+			   !TransactionIdDidCommit(data->check_xid))
+			elog(LOG, "%u aborted", data->check_xid);
+
+		Assert(TransactionIdDidAbort(data->check_xid));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
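
The check-xid option above relies on errno alone to detect bad input, which does not catch trailing garbage or an empty string on every platform. Below is a minimal standalone sketch of a stricter parse, under the assumption that only plain unsigned numbers should be accepted. parse_check_xid is a hypothetical helper, not part of the patch, and uint32_t stands in for TransactionId.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse an XID given as a decimal, octal or hex string.
 * Returns true and stores the value in *xid on success. Rejects NULL or
 * empty input, leading signs or whitespace, trailing garbage, out-of-range
 * values and 0 (the invalid XID).
 */
bool
parse_check_xid(const char *str, uint32_t *xid)
{
	char	   *end;
	unsigned long val;

	/* strtoul() would accept leading whitespace and a sign; we do not */
	if (str == NULL || !isdigit((unsigned char) str[0]))
		return false;

	errno = 0;
	val = strtoul(str, &end, 0);

	/* conversion error, no digits consumed, or trailing characters */
	if (errno != 0 || end == str || *end != '\0')
		return false;

	/* 0 is the invalid XID; anything above 32 bits cannot be an XID */
	if (val == 0 || val > UINT32_MAX)
		return false;

	*xid = (uint32_t) val;
	return true;
}
```

A caller in the option-parsing loop could then report a single ERROR when the helper returns false, instead of testing errno and the value separately.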
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..bd4542e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,6 +387,10 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeAbortPreparedCB abort_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
@@ -477,7 +481,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. The transaction currently being decoded may be
+     aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +588,55 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The optional <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Commit Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+     <title>Rollback Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>abort_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +646,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction, or any other uncommitted transaction, this
+      change callback might also error out due to a concurrent rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +729,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded at this prepare
+       stage or later, as a regular one-phase transaction, at
+       <command>COMMIT PREPARED</command> time. To signal that decoding
+       at prepare time should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      parameter carries the XID separately because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +783,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction, or any other uncommitted transaction, this message
+      callback might also error out due to a concurrent rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f21f61d..b37b62d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
+				/* check that output plugin is capable of twophase decoding */
+				if (!ctx->enable_twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -647,9 +667,69 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Decide if we're processing COMMIT PREPARED, or a regular COMMIT.
+	 * Regular commit simply triggers a replay of transaction changes from the
+	 * reorder buffer. For COMMIT PREPARED that however already happened at
+	 * PREPARE time, and so we only need to notify the subscriber that the GID
+	 * finally committed.
+	 *
+	 * For output plugins that do not support PREPARE-time decoding of
+	 * two-phase transactions, we never even see the PREPARE and all two-phase
+	 * transactions simply fall through to the second branch.
+	 */
+	if (TransactionIdIsValid(parsed->twophase_xid) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder,
+								   parsed->twophase_xid, parsed->twophase_gid))
+	{
+		Assert(xid == parsed->twophase_xid);
+		/* we are processing COMMIT PREPARED */
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* replay actions of all transaction + subtransactions in order */
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+}
+
+/*
+ * Decode PREPARE record. The logic is similar to DecodeCommit().
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 }
 
 /*
@@ -661,6 +741,31 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 			xl_xact_parsed_abort *parsed, TransactionId xid)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it's ROLLBACK PREPARED then handle it via callbacks.
+	 */
+	if (TransactionIdIsValid(xid) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		parsed->dbId == ctx->slot->data.database &&
+		!FilterByOrigin(ctx, origin_id) &&
+		ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+	{
+
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
 
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
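
The comment added to DecodeCommit() describes a two-way decision at commit time. Here is a minimal standalone model of that decision, assuming the two conditions are exactly the ones named in the patch: a valid two-phase XID, and a transaction already decoded at PREPARE. CommitPath and choose_commit_path are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	PATH_REPLAY_AT_COMMIT,		/* replay buffered changes now */
	PATH_FINISH_PREPARED		/* changes were already sent at PREPARE */
} CommitPath;

/*
 * Hypothetical model of the branch in DecodeCommit(): only a transaction
 * that carries a two-phase XID and was decoded at PREPARE time takes the
 * "finish prepared" path. Everything else is replayed as a regular commit.
 */
CommitPath
choose_commit_path(bool has_twophase_xid, bool decoded_at_prepare)
{
	if (has_twophase_xid && decoded_at_prepare)
		return PATH_FINISH_PREPARED;

	return PATH_REPLAY_AT_COMMIT;
}
```

A two-phase transaction that was filtered out at PREPARE, or one handled by a plugin without two-phase support, falls into the regular replay path in this model. That matches the "second branch" mentioned in the comment.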
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 0f6af95..4e95337 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -58,6 +58,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -206,6 +214,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->abort_prepared = abort_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -225,6 +237,19 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
+	/*
+	 * Two-phase logical decoding requires the prepare, commit-prepared and
+	 * abort-prepared callbacks; the filter-prepare callback is optional.
+	 * However, we enable two-phase decoding as soon as any one of them is
+	 * set, so that missing callbacks can be reported easily.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->enable_twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.abort_prepared_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
 	/*
 	 * streaming callbacks
 	 *
@@ -782,6 +807,111 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits, the prepare callback is mandatory */
+	if (ctx->enable_twophase && ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits, the commit prepared callback is mandatory */
+	if (ctx->enable_twophase && ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "abort_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits, the abort prepared callback is mandatory */
+	if (ctx->enable_twophase && ctx->callbacks.abort_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register abort_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.abort_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -858,6 +988,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of twophase at PREPARE time is not enabled. In that
+	 * case all twophase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->enable_twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1975d62..5ff920b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -413,6 +414,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/* free data that's contained */
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
 
 	if (txn->tuplecid_hash != NULL)
 	{
@@ -1401,6 +1407,59 @@ ReorderBufferIterTXNFinish(ReorderBuffer *rb,
 }
 
 /*
+ * Clean up the leftover contents of a transaction, usually after it has been
+ * finished by COMMIT PREPARED or ROLLBACK PREPARED. This does the rest of the
+ * cleanup that was not done when the transaction was prepared.
+ */
+static void
+ReorderBufferCleanupPreparedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+	bool	found;
+
+	/*
+	 * Cleanup the base snapshot, if set.
+	 */
+	if (txn->base_snapshot != NULL)
+	{
+		SnapBuildSnapDecRefcount(txn->base_snapshot);
+		dlist_delete(&txn->base_snapshot_node);
+	}
+
+	/*
+	 * Cleanup the snapshot for the last streamed run.
+	 */
+	if (txn->snapshot_now != NULL)
+	{
+		Assert(rbtxn_is_streamed(txn));
+		ReorderBufferFreeSnap(rb, txn->snapshot_now);
+	}
+
+	/*
+	 * Remove TXN from its containing list.
+	 *
+	 * Note: if txn is known as subxact, we are deleting the TXN from its
+	 * parent's list of known subxacts; this leaves the parent's nsubxacts
+	 * count too high, but we don't care.  Otherwise, we are deleting the TXN
+	 * from the LSN-ordered list of toplevel TXNs.
+	 */
+	dlist_delete(&txn->node);
+
+	/* now remove reference from buffer */
+	hash_search(rb->by_txn,
+				(void *) &txn->xid,
+				HASH_REMOVE,
+				&found);
+	Assert(found);
+
+	/* remove entries spilled to disk */
+	if (rbtxn_is_serialized(txn))
+		ReorderBufferRestoreCleanup(rb, txn);
+
+	/* deallocate */
+	ReorderBufferReturnTXN(rb, txn);
+}
+
+/*
  * Cleanup the contents of a transaction, usually after the transaction
  * committed or aborted.
  */
@@ -1502,12 +1561,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming them or after a PREPARE.  The txn_prepared flag indicates
+ * whether this is called after a PREPARE.  If streaming, keep the remaining
+ * info - transactions, tuplecids, invalidations and snapshots.  If after a
+ * PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1526,7 +1587,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the toplevel txn */
@@ -1560,9 +1621,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, clean up the tuplecids we stored for
+		 * decoding catalog snapshot access.  They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* remove the change from its containing list */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1880,7 +1965,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					break;
 			}
 		}
-
 		/*
 		 * There's a speculative insertion remaining, just clean in up, it
 		 * can't have been successful, otherwise we'd gotten a confirmation
@@ -2278,7 +2362,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for twophase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2319,11 +2412,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, false);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (rbtxn_prepared(txn))
+		{
+			ReorderBufferTruncateTXN(rb, txn, true);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2352,17 +2451,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur only when we are sending the data in
+			 * streaming mode and the streaming is not finished yet, or when we
+			 * are sending the data out on a PREPARE during a two-phase commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2370,10 +2470,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2395,23 +2504,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2453,6 +2555,140 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/*
+	 * Always call the prepare filter. It's the job of the prepare filter to
+	 * give us the *same* response for a given xid across multiple calls
+	 * (including ones on restart).
+	 */
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ABORT PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts, for example).
+	 * In any case, 2PC transactions hold no reorder buffer contents at this
+	 * point, so allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+		rb->commit_prepared(rb, txn, commit_lsn);
+	}
+	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+		rb->abort_prepared(rb, txn, commit_lsn);
+	}
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(rb, txn);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2495,7 +2731,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * Remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
+
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 45abc44..ee63e7b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -84,6 +84,11 @@ typedef struct LogicalDecodingContext
 	 */
 	bool		streaming;
 
+ 	/*
+	 * Does the output plugin support two phase decoding, and is it enabled?
+	 */
+	bool		enable_twophase;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..96e269b 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/abort_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -171,6 +204,10 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeAbortPreparedCB abort_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1ae17d5..4d4e35d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -162,9 +163,13 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT          0x0002
 #define RBTXN_IS_SERIALIZED       0x0004
-#define RBTXN_IS_STREAMED         0x0008
-#define RBTXN_HAS_TOAST_INSERT    0x0010
-#define RBTXN_HAS_SPEC_INSERT     0x0020
+#define RBTXN_PREPARE             0x0008
+#define RBTXN_COMMIT_PREPARED     0x0010
+#define RBTXN_ROLLBACK_PREPARED   0x0020
+#define RBTXN_COMMIT              0x0040
+#define RBTXN_IS_STREAMED         0x0080
+#define RBTXN_HAS_TOAST_INSERT    0x0100
+#define RBTXN_HAS_SPEC_INSERT     0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -218,6 +223,15 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* is this txn prepared? */
+#define rbtxn_prepared(txn)            (((txn)->txn_flags & RBTXN_PREPARE) != 0)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn)     (((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn)   (((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn)              (((txn)->txn_flags & RBTXN_COMMIT) != 0)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -229,6 +243,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of 2PC we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -390,6 +407,39 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+                                     ReorderBuffer *rb,
+                                     ReorderBufferTXN *txn,
+                                     XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             TransactionId xid,
+                                             const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+                                       ReorderBuffer *rb,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+                                              ReorderBuffer *rb,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr commit_lsn);
+
+/* abort prepared callback signature */
+typedef void (*ReorderBufferAbortPreparedCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr abort_lsn);
+
+
+
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -482,6 +532,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferAbortPreparedCB abort_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -548,6 +603,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -571,6 +631,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
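
To make the control flow in the reorderbuffer.c changes above easier to
follow, here is a minimal standalone sketch of how the RBTXN_PREPARE bit
selects the callback. It is not part of the patch: DemoTXN, DemoCallback and
the demo_* functions are hypothetical stand-ins, and only the three flag
values are taken from the reorderbuffer.h hunk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the reorderbuffer.h hunk in the patch. */
#define RBTXN_PREPARE             0x0008
#define RBTXN_COMMIT_PREPARED     0x0010
#define RBTXN_ROLLBACK_PREPARED   0x0020

/* Hypothetical stand-in for ReorderBufferTXN; only the flags matter here. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
} DemoTXN;

/* Same test as rbtxn_prepared(), written against the stand-in struct. */
#define demo_rbtxn_prepared(txn)  (((txn)->txn_flags & RBTXN_PREPARE) != 0)

/* Hypothetical labels for the four output plugin callbacks involved. */
typedef enum DemoCallback
{
	DEMO_CB_COMMIT,
	DEMO_CB_PREPARE,
	DEMO_CB_COMMIT_PREPARED,
	DEMO_CB_ABORT_PREPARED
} DemoCallback;

/*
 * Mirrors the branch at the end of replay: a prepared transaction gets the
 * prepare callback, everything else gets the regular commit callback.
 */
static DemoCallback
demo_end_of_replay_callback(const DemoTXN *txn)
{
	return demo_rbtxn_prepared(txn) ? DEMO_CB_PREPARE : DEMO_CB_COMMIT;
}

/*
 * Mirrors the flag handling for COMMIT PREPARED / ROLLBACK PREPARED: the
 * transaction is marked as prepared, then marked with the outcome, and the
 * matching callback is chosen.
 */
static DemoCallback
demo_finish_prepared(DemoTXN *txn, bool is_commit)
{
	txn->txn_flags |= RBTXN_PREPARE;

	if (is_commit)
	{
		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
		return DEMO_CB_COMMIT_PREPARED;
	}

	txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
	return DEMO_CB_ABORT_PREPARED;
}
```

In this sketch a transaction whose flags are zero maps to the commit
callback, and setting RBTXN_PREPARE switches it to the prepare callback,
which is the behaviour the rb->prepare / rb->commit branch relies on.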

Attachment: v6-0002-Tap-test-to-test-concurrent-aborts-during-2-phase.patch (application/octet-stream)
From 5fe74742178c021384591b4898174698df80fd06 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 28 Sep 2020 02:41:25 -0400
Subject: [PATCH v6] Tap test to test concurrent aborts during 2 phase commits

This test specifically covers a concurrent abort while logical decoding
is ongoing. The xid of the 2PC is passed to the plugin as an option.
On receipt of a valid "check-xid", the change API in the test decoding
plugin waits for that transaction to be aborted.
---
 contrib/test_decoding/Makefile          |   2 +
 contrib/test_decoding/t/001_twophase.pl | 119 ++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f23f15b..4905a0a 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..d8c2c29
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+ok(looks_like_number($xid2pc), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
-- 
1.8.3.1
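
For readers who want a picture of the "wait for it to be aborted" step that
the test relies on, below is a rough standalone sketch of a bounded polling
loop. It is an illustration only, not the test_decoding code: DemoXid,
DemoInProgressFn, demo_wait_for_abort and demo_countdown_in_progress are
hypothetical names, and the in-progress check is injected as a function
pointer so the sketch runs outside a backend.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for TransactionId. */
typedef uint32_t DemoXid;

/* Hypothetical predicate: returns true while the xid is still in progress. */
typedef bool (*DemoInProgressFn) (DemoXid xid, void *arg);

/*
 * Poll until the transaction is no longer in progress.
 *
 * Returns the number of polls that reported "still in progress" before the
 * transaction finished, or -1 if max_polls was exhausted first. A backend
 * version would presumably sleep and check for interrupts between polls;
 * that is left out here to keep the sketch self-contained.
 */
static int
demo_wait_for_abort(DemoXid xid, DemoInProgressFn in_progress, void *arg,
					int max_polls)
{
	for (int polls = 0; polls < max_polls; polls++)
	{
		if (!in_progress(xid, arg))
			return polls;
	}

	return -1;
}

/*
 * Fake predicate for demonstration: reports "in progress" until the counter
 * passed in arg reaches zero, imitating a ROLLBACK PREPARED issued from
 * another session after a few polls.
 */
static bool
demo_countdown_in_progress(DemoXid xid, void *arg)
{
	int		   *remaining = arg;

	(void) xid;					/* unused in the fake */

	if (*remaining > 0)
	{
		(*remaining)--;
		return true;
	}

	return false;
}
```

Bounding the loop is a choice made for the sketch; it keeps a missing
rollback from hanging the caller forever, much like poll_output_until() in
the TAP test gives up after a fixed number of attempts.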

Attachment: v6-0004-Support-two-phase-commits-in-streaming-mode-in-lo.patch (application/octet-stream)
From 3fb6ae7c4aeae53fae8bf9304878f144e39cb6cf Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 28 Sep 2020 03:08:30 -0400
Subject: [PATCH v6] Support two phase commits in streaming mode in logical
 decoding

Add streaming callbacks for PREPARE, COMMIT PREPARED and ROLLBACK PREPARED.
---
 contrib/test_decoding/test_decoding.c           |  84 ++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml               |  56 +++++++++++-
 src/backend/replication/logical/logical.c       | 111 ++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  41 +++++++--
 src/include/replication/output_plugin.h         |  30 +++++++
 src/include/replication/reorderbuffer.h         |  21 +++++
 6 files changed, 331 insertions(+), 12 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 1bb17a6..bb9f787 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -78,6 +78,15 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_commit_prepared(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_abort_prepared(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -130,6 +139,9 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
+	cb->stream_commit_prepared_cb = pg_decode_stream_commit_prepared;
+	cb->stream_abort_prepared_cb = pg_decode_stream_abort_prepared;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
@@ -812,6 +824,78 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit_prepared(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "commit prepared streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "commit prepared streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_abort_prepared(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "abort prepared streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "abort prepared streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index bd4542e..a8be9bf 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -396,6 +396,9 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
+    LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+    LogicalDecodeStreamAbortPreparedCB stream_abort_prepared_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -418,7 +421,9 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
      <function>stream_commit_cb</function> and <function>stream_change_cb</function>
-     are required, while <function>stream_message_cb</function> and
+     are required, while <function>stream_message_cb</function>,
+     <function>stream_prepare_cb</function>, <function>stream_commit_prepared_cb</function>,
+     <function>stream_abort_prepared_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
    </sect2>
@@ -839,6 +844,45 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit-prepared">
+     <title>Stream Commit Prepared Callback</title>
+     <para>
+      The <function>stream_commit_prepared_cb</function> callback is called to commit
+      a previously streamed transaction that was prepared as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                     ReorderBufferTXN *txn,
+                                                     XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort-prepared">
+     <title>Stream Abort Prepared Callback</title>
+     <para>
+      The <function>stream_abort_prepared_cb</function> callback is called to abort
+      a previously streamed transaction that was prepared as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                    ReorderBufferTXN *txn,
+                                                    XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -1017,9 +1061,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, and then committed using the
+    <function>stream_commit_prepared_cb</function> callback or aborted using the
+    <function>stream_abort_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4e95337..47968cb 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -81,6 +81,12 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -232,6 +238,9 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
 		(ctx->callbacks.stream_stop_cb != NULL) ||
 		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.stream_commit_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_abort_prepared_cb != NULL) ||
 		(ctx->callbacks.stream_commit_cb != NULL) ||
 		(ctx->callbacks.stream_change_cb != NULL) ||
 		(ctx->callbacks.stream_message_cb != NULL) ||
@@ -261,6 +270,9 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
+	ctx->reorder->stream_commit_prepared = stream_commit_prepared_cb_wrapper;
+	ctx->reorder->stream_abort_prepared = stream_abort_prepared_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -1231,6 +1243,105 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								  XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit_prepared";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	ctx->callbacks.stream_commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_abort_prepared";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	ctx->callbacks.stream_abort_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5ff920b..e124c35 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1834,9 +1834,18 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
-
-	ReorderBufferCleanupTXN(rb, txn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -2672,15 +2681,31 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
 	strcpy(txn->gid, gid);
 
-	if (is_commit)
+	if (rbtxn_is_streamed(txn))
 	{
-		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
-		rb->commit_prepared(rb, txn, commit_lsn);
+		if (is_commit)
+		{
+			txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+			rb->stream_commit_prepared(rb, txn, commit_lsn);
+		}
+		else
+		{
+			txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+			rb->stream_abort_prepared(rb, txn, commit_lsn);
+		}
 	}
 	else
 	{
-		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
-		rb->abort_prepared(rb, txn, commit_lsn);
+		if (is_commit)
+		{
+			txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+			rb->commit_prepared(rb, txn, commit_lsn);
+		}
+		else
+		{
+			txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+			rb->abort_prepared(rb, txn, commit_lsn);
+		}
 	}
 
 	/* cleanup: make sure there's no cache pollution */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 96e269b..6000096 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -157,6 +157,33 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit and only when
+ * two-phase commits are supported.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called to commit prepared changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit and only when
+ * two-phase commits are supported.
+ */
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called to abort/rollback prepared changes streamed to remote node from
+ * in-progress transaction. This is called as part of a two-phase commit and
+ * only when two-phase commits are supported.
+ */
+typedef void (*LogicalDecodeStreamAbortPreparedCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -214,6 +241,9 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
+	LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+	LogicalDecodeStreamAbortPreparedCB stream_abort_prepared_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 4d4e35d..a4dc509 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -466,6 +466,24 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* commit prepared streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitPreparedCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* abort prepared streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortPreparedCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -545,6 +563,9 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
+	ReorderBufferStreamCommitPreparedCB stream_commit_prepared;
+	ReorderBufferStreamAbortPreparedCB stream_abort_prepared;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
-- 
1.8.3.1
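
As a reading aid for the reorderbuffer.c hunk above, here is a minimal standalone sketch of the four-way choice that ReorderBufferFinishPrepared() makes after this patch: streamed vs. non-streamed, and commit vs. rollback. The enum and the pick_finish_callback() helper are hypothetical names used only for illustration. They are not part of the patch, and the real function also sets txn_flags and invokes the callback through the ReorderBuffer struct.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the four ReorderBuffer callbacks used by the patch. */
typedef enum FinishCallback
{
	CB_COMMIT_PREPARED,			/* rb->commit_prepared */
	CB_ABORT_PREPARED,			/* rb->abort_prepared */
	CB_STREAM_COMMIT_PREPARED,	/* rb->stream_commit_prepared */
	CB_STREAM_ABORT_PREPARED	/* rb->stream_abort_prepared */
} FinishCallback;

/*
 * Mirror the branch structure of the patched ReorderBufferFinishPrepared():
 * the outer test is "was the transaction streamed?" (rbtxn_is_streamed in
 * the patch), the inner test is "is this COMMIT PREPARED?" (is_commit).
 */
static FinishCallback
pick_finish_callback(bool streamed, bool is_commit)
{
	if (streamed)
		return is_commit ? CB_STREAM_COMMIT_PREPARED : CB_STREAM_ABORT_PREPARED;

	return is_commit ? CB_COMMIT_PREPARED : CB_ABORT_PREPARED;
}
```

For example, pick_finish_callback(true, false) selects the stream_abort_prepared path, which is the case a ROLLBACK PREPARED of a previously streamed transaction would take.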

Attachment: v6-0003-pgoutput-output-plugin-support-for-logical-decodi.patch (application/octet-stream)
From 1f98e665c2bb04b621cbc16b386e4c1544aa180f Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 28 Sep 2020 02:48:53 -0400
Subject: [PATCH v6] pgoutput output plugin support for logical decoding of 2pc

Support decoding of two phase commit in pgoutput and on subscriber side.
---
 src/backend/access/transam/twophase.c       |  31 ++++++
 src/backend/replication/logical/proto.c     |  90 ++++++++++++++-
 src/backend/replication/logical/worker.c    | 147 ++++++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c |  54 ++++++++-
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++++-
 src/test/subscription/t/020_twophase.pl     | 163 ++++++++++++++++++++++++++++
 7 files changed, 514 insertions(+), 9 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..f470210 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,37 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (!gxact->valid)
+			continue;
+		if (strcmp(gxact->gid, gid) != 0)
+			continue;
+
+		LWLockRelease(TwoPhaseStateLock);
+
+		return true;
+	}
+
+	LWLockRelease(TwoPhaseStateLock);
+
+	return false;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..291ed10 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,12 +72,17 @@ logicalrep_read_begin(StringInfo in, LogicalRepBeginData *begin_data)
  */
 void
 logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
-						XLogRecPtr commit_lsn)
+						XLogRecPtr commit_lsn, bool is_commit)
 {
 	uint8		flags = 0;
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
+	if (is_commit)
+		flags |= LOGICALREP_IS_COMMIT;
+	else
+		flags |= LOGICALREP_IS_ABORT;
+
 	/* send the flags field (unused for now) */
 	pq_sendbyte(out, flags);
 
@@ -88,16 +93,20 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 }
 
 /*
- * Read transaction COMMIT from the stream.
+ * Read transaction COMMIT|ABORT from the stream.
  */
 void
 logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 {
-	/* read flags (unused for now) */
+	/* read flags */
 	uint8		flags = pq_getmsgbyte(in);
 
-	if (flags != 0)
-		elog(ERROR, "unrecognized flags %u in commit message", flags);
+	if (!CommitFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in commit|abort message",
+			 flags);
+
+	/* the flag is either commit or abort */
+	commit_data->is_commit = (flags == LOGICALREP_IS_COMMIT);
 
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
@@ -106,6 +115,77 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for 2PC transactions, in which case we
+	 * expect to have a non-empty GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(strlen(txn->gid) > 0);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags |= LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags |= LOGICALREP_IS_PREPARE;
+
+	/* Make sure exactly one of the expected flags is set. */
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (bounded copy into the fixed-size, pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9c6fdee..a08da85 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -729,7 +729,11 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_lsn = commit_data.end_lsn;
 		replorigin_session_origin_timestamp = commit_data.committime;
 
-		CommitTransactionCommand();
+		if (commit_data.is_commit)
+			CommitTransactionCommand();
+		else
+			AbortCurrentTransaction();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -749,6 +753,141 @@ apply_handle_commit(StringInfo s)
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * During logical decoding, a prepared transaction may be aborted while it
+	 * is still being decoded. In that case, we stop decoding and abort the
+	 * transaction immediately. However, the ROLLBACK PREPARED message still
+	 * reaches the subscriber, so it is OK for the GID to be missing on the
+	 * apply side.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
 /*
  * Handle ORIGIN message.
  *
@@ -1909,10 +2048,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..174af7f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+							 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -143,6 +149,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->abort_prepared_cb = pgoutput_abort_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -373,7 +383,49 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginUpdateProgress(ctx);
 
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit(ctx->out, txn, commit_lsn, true);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
 }
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0c2cda2..33d719c 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -87,20 +87,55 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
+	bool		is_commit;
 	XLogRecPtr	commit_lsn;
 	XLogRecPtr	end_lsn;
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/* types of the commit protocol message */
+#define LOGICALREP_IS_COMMIT			0x01
+#define LOGICALREP_IS_ABORT				0x02
+
+/* commit message is COMMIT or ABORT, and there is nothing else */
+#define CommitFlagsAreValid(flags) \
+	((flags == LOGICALREP_IS_COMMIT) || (flags == LOGICALREP_IS_ABORT))
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
 extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
-									XLogRecPtr commit_lsn);
+									XLogRecPtr commit_lsn, bool is_commit);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+						LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..c7f373d
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+        'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+   is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+   is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1
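
As an aside on the PrepareFlagsAreValid() check in the header hunk above: the intent is that a prepare message carries exactly one of the three flag values. A minimal standalone sketch of the same check follows, written as an inline function rather than a macro, so that the argument is evaluated once and operator precedence cannot change its meaning. The name prepare_flags_are_valid and the uint8_t parameter type are assumptions made for illustration; they are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bit values as the header in the patch. */
#define LOGICALREP_IS_PREPARE            0x01
#define LOGICALREP_IS_COMMIT_PREPARED    0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED  0x04

/*
 * Hypothetical function form of PrepareFlagsAreValid(): returns true only
 * when the value is exactly one of the three known flags. Zero, unknown
 * bits, and combinations of flags are all rejected.
 */
static inline bool
prepare_flags_are_valid(uint8_t flags)
{
	return flags == LOGICALREP_IS_PREPARE ||
		flags == LOGICALREP_IS_COMMIT_PREPARED ||
		flags == LOGICALREP_IS_ROLLBACK_PREPARED;
}
```

With a macro, a call such as PrepareFlagsAreValid(a | b) only behaves as intended if every use of the argument is parenthesized, because == binds tighter than |. A function sidesteps that concern.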

#34Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#33)

On Mon, Sep 28, 2020 at 1:13 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Wed, Sep 23, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have added a new patch for supporting 2 phase commit semantics in
the streaming APIs for the logical decoding plugins. I have added 3
APIs
1. stream_prepare
2. stream_commit_prepared
3. stream_abort_prepared

I have also added the support for the new APIs in test_decoding
plugin. I have not yet added it to pgoutput.

I have also added a fix for the error I saw while calling
ReorderBufferCleanupTXN as part of FinishPrepared handling. As a
result I have removed the function I added earlier,
ReorderBufferCleanupPreparedTXN.

Can you explain what was the problem and how you fixed it?

Please have a look at the new changes and let me know what you think.

I will continue to look at:

1. Remove snapshots on prepare truncate.
2. Bug seen on abort of a prepared transaction: the prepared flag is
lost, so it is not possible to tell that it was a previously prepared
transaction.

And the support of new APIs in pgoutput, right?

--
With Regards,
Amit Kapila.

#35Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#34)

On Mon, Sep 28, 2020 at 6:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Sep 28, 2020 at 1:13 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Wed, Sep 23, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have added a new patch for supporting 2 phase commit semantics in
the streaming APIs for the logical decoding plugins. I have added 3
APIs
1. stream_prepare
2. stream_commit_prepared
3. stream_abort_prepared

I have also added the support for the new APIs in test_decoding
plugin. I have not yet added it to pgoutput.

I have also added a fix for the error I saw while calling
ReorderBufferCleanupTXN as part of FinishPrepared handling. As a
result I have removed the function I added earlier,
ReorderBufferCleanupPreparedTXN.

Can you explain what was the problem and how you fixed it?

When I added the changes for cleaning up tuplecids in
ReorderBufferTruncateTXN, I was not deleting each entry from the list
(dlist_delete), only calling ReorderBufferReturnChange to free its
memory. This logic was copied from ReorderBufferCleanupTXN; there the
lists are all cleaned up at the end, so the per-entry dlist_delete was
not present in each list's cleanup logic.
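
The difference between freeing a list entry and unlinking it first can be sketched in standalone C. This is only an illustration under assumptions: the node, change, list_* and truncate_changes names below are hypothetical stand-ins, not PostgreSQL's dlist API or the patch's code. The point is that when the list head outlives the cleanup (as the transaction does on truncate), each entry has to be removed from the list before its memory is released.

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly linked list; a stand-in for an intrusive list. */
typedef struct node
{
	struct node *prev;
	struct node *next;
} node;

typedef struct change
{
	node		link;		/* first member, so a node * is also a change * */
	int			payload;
} change;

static void
list_init(node *head)
{
	head->prev = head->next = head;
}

static void
list_push_tail(node *head, node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void
list_delete(node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static int
list_is_empty(const node *head)
{
	return head->next == head;
}

/*
 * Free every entry but keep the head usable afterwards. Each node is
 * unlinked before it is freed; freeing alone would leave the head pointing
 * at released memory.
 */
static void
truncate_changes(node *head)
{
	node	   *cur = head->next;

	while (cur != head)
	{
		node	   *next = cur->next;	/* remember successor before unlinking */

		list_delete(cur);
		free((change *) cur);
		cur = next;
	}
}
```

If the whole owner is destroyed right after the loop, skipping the unlink step is harmless, which is presumably why the copied code did not need it.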

Please have a look at the new changes and let me know what you think.

I will continue to look at:

1. Remove snapshots on prepare truncate.
2. Bug seen on abort of a prepared transaction: the prepared flag is
lost, so it is not possible to tell that it was a previously prepared
transaction.

And the support of new APIs in pgoutput, right?

Yes, that also.

regards,
Ajin Cherian
Fujitsu Australia

#36Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#32)

On Wed, Sep 23, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

If the transaction is prepared, which you can ensure via
ReorderBufferTxnIsPrepared() (considering you have a proper check in
that function), it should not require skipping the transaction in
Abort. One way it could happen is if you clean up the ReorderBufferTxn
in Prepare, which you were doing in an earlier version of the patch and
which I pointed out was wrong. If you have changed that, then I don't
know why it could fail; maybe someplace else during prepare the patch
is freeing it. Just check that.

I had a look at this problem. The problem happens when decoding is
done after a prepare but before the corresponding rollback
prepared/commit prepared.
For example:

Begin;
<change 1>
<change 2>
PREPARE TRANSACTION '<prepare#1>';
SELECT data FROM pg_logical_slot_get_changes(...);
:
:
ROLLBACK PREPARED '<prepare#1>';
SELECT data FROM pg_logical_slot_get_changes(...);

Since the prepare is consumed in the first call to
pg_logical_slot_get_changes, when it is encountered again in
the second call it is skipped (as already decoded) in DecodePrepare,
and txn->flags is not set to reflect the fact that the transaction
was prepared. The same behaviour is seen when
it is commit prepared after the original prepare was consumed.
Initially I was thinking about the following approach to fix it in DecodePrepare.
Approach 1:
1. Break the big skip check in DecodePrepare into 2 parts.
Return if the following condition is true:
if ((parsed->dbId != InvalidOid && parsed->dbId !=
ctx->slot->data.database) ||
ctx->fast_forward || FilterByOrigin(ctx, origin_id))

2. Check if this condition is true:
SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr)

If so, we are skipping because this has already
been decoded. Instead of returning, call a new function
ReorderBufferMarkPrepare(), which only updates the flags in the txn
to indicate that the transaction is prepared.
Then later, in DecodeAbort or DecodeCommit, we can confirm
that the transaction has been prepared by checking if the flag is set
and call ReorderBufferFinishPrepared appropriately.

But then, thinking about this some more, I thought of a second approach.
Approach 2:
If the only purpose of all this is to differentiate between
Abort vs Rollback Prepared and Commit vs Commit Prepared, then we don't
need this. We already know the exact operation
in DecodeXactOp and can differentiate there. We only
overloaded DecodeAbort and DecodeCommit for convenience; we can always
call these functions with an extra flag to denote that we are either
committing or aborting a previously prepared transaction and call
ReorderBufferFinishPrepared accordingly.
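
Approach 2 can be sketched in standalone C as follows. This is a simplified model under assumptions, not the patch's code: the enum values and the decode_* functions are hypothetical stand-ins for the record types seen by DecodeXactOp and for DecodeCommit/DecodeAbort. What it shows is that the dispatcher already knows whether the record is the "prepared" variant and can pass that down as a flag, so nothing depends on txn->flags having survived from an earlier decoding pass.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the transaction record types. */
typedef enum xact_op
{
	OP_COMMIT,
	OP_COMMIT_PREPARED,
	OP_ABORT,
	OP_ABORT_PREPARED
} xact_op;

/* What the decoder ends up doing with the record. */
typedef enum decode_action
{
	ACT_COMMIT,
	ACT_ABORT,
	ACT_FINISH_PREPARED_COMMIT,
	ACT_FINISH_PREPARED_ROLLBACK
} decode_action;

/* Shared commit path; the flag selects the "finish prepared" branch. */
static decode_action
decode_commit(bool is_prepared)
{
	return is_prepared ? ACT_FINISH_PREPARED_COMMIT : ACT_COMMIT;
}

/* Shared abort path; the flag selects the "finish prepared" branch. */
static decode_action
decode_abort(bool is_prepared)
{
	return is_prepared ? ACT_FINISH_PREPARED_ROLLBACK : ACT_ABORT;
}

/* The dispatcher knows the exact record type, so it supplies the flag. */
static decode_action
decode_xact_op(xact_op op)
{
	switch (op)
	{
		case OP_COMMIT:
			return decode_commit(false);
		case OP_COMMIT_PREPARED:
			return decode_commit(true);
		case OP_ABORT:
			return decode_abort(false);
		case OP_ABORT_PREPARED:
			return decode_abort(true);
	}
	return ACT_ABORT;			/* not reached for valid input */
}
```

The decision is a function of the record type alone, so it gives the same answer whether or not the PREPARE was consumed by a previous pg_logical_slot_get_changes call.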

Let me know your thoughts on these two approaches or any other
suggestions on this.

regards,
Ajin Cherian
Fujitsu Australia

#37Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#36)

On Tue, Sep 29, 2020 at 5:08 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Wed, Sep 23, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

If the transaction is prepared, which you can ensure via
ReorderBufferTxnIsPrepared() (considering you have a proper check in
that function), it should not require skipping the transaction in
Abort. One way it could happen is if you clean up the ReorderBufferTxn
in Prepare, which you were doing in an earlier version of the patch and
which I pointed out was wrong. If you have changed that, then I don't
know why it could fail; maybe someplace else during prepare the patch
is freeing it. Just check that.

I had a look at this problem. The problem happens when decoding is
done after a prepare but before the corresponding rollback
prepared/commit prepared.
For example:

Begin;
<change 1>
<change 2>
PREPARE TRANSACTION '<prepare#1>';
SELECT data FROM pg_logical_slot_get_changes(...);
:
:
ROLLBACK PREPARED '<prepare#1>';
SELECT data FROM pg_logical_slot_get_changes(...);

Since the prepare is consumed in the first call to
pg_logical_slot_get_changes, when it is encountered again in
the second call it is skipped (as already decoded) in DecodePrepare,
and txn->flags is not set to reflect the fact that the transaction
was prepared. The same behaviour is seen when
it is commit prepared after the original prepare was consumed.
Initially I was thinking about the following approach to fix it in DecodePrepare.
Approach 1:
1. Break the big skip check in DecodePrepare into 2 parts.
Return if the following condition is true:
if ((parsed->dbId != InvalidOid && parsed->dbId !=
ctx->slot->data.database) ||
ctx->fast_forward || FilterByOrigin(ctx, origin_id))

2. Check if this condition is true:
SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr)

If so, we are skipping because this has already
been decoded. Instead of returning, call a new function
ReorderBufferMarkPrepare(), which only updates the flags in the txn
to indicate that the transaction is prepared.
Then later, in DecodeAbort or DecodeCommit, we can confirm
that the transaction has been prepared by checking if the flag is set
and call ReorderBufferFinishPrepared appropriately.

But then, thinking about this some more, I thought of a second approach.
Approach 2:
If the only purpose of all this is to differentiate between
Abort vs Rollback Prepared and Commit vs Commit Prepared, then we don't
need this. We already know the exact operation
in DecodeXactOp and can differentiate there. We only
overloaded DecodeAbort and DecodeCommit for convenience; we can always
call these functions with an extra flag to denote that we are either
committing or aborting a previously prepared transaction and call
ReorderBufferFinishPrepared accordingly.

The second approach sounds better. If there is not much
you want to reuse from DecodeCommit/DecodeAbort, you could even
write new functions DecodeCommitPrepared/DecodeAbortPrepared. OTOH, if
there is common code among them, then passing the flag would be the
better way.

--
With Regards,
Amit Kapila.

#38Dilip Kumar
dilipbalaut@gmail.com
In reply to: Ajin Cherian (#33)

On Mon, Sep 28, 2020 at 1:13 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Wed, Sep 23, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

No problem. I think you can handle the other comments and then we can
come back to this. You might want to share the exact details of the
test (maybe a narrowed-down version of the original test) and I or
someone else might be able to help you with that.

--
With Regards,
Amit Kapila.

I have added a new patch for supporting 2 phase commit semantics in
the streaming APIs for the logical decoding plugins. I have added 3
APIs
1. stream_prepare
2. stream_commit_prepared
3. stream_abort_prepared

I have also added the support for the new APIs in test_decoding
plugin. I have not yet added it to pgoutput.

I have also added a fix for the error I saw while calling
ReorderBufferCleanupTXN as part of FinishPrepared handling. As a
result I have removed the function I added earlier,
ReorderBufferCleanupPreparedTXN.
Please have a look at the new changes and let me know what you think.

I will continue to look at:

1. Remove snapshots on prepare truncate.
2. Bug seen on abort of a prepared transaction: the prepared flag is
lost, so it is not possible to tell that it was a previously prepared
transaction.

I have started looking into your latest patches, as of now I have a
few comments.

v6-0001

@@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
prev_lsn = change->lsn;

  /* Set the current xid to detect concurrent aborts. */
- if (streaming)
+ if (streaming || rbtxn_prepared(change->txn))
  {
  curtxn = change->txn;
  SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
  break;
  }
  }
-

For a streaming transaction we need to check the xid every time because
there could be a concurrent subtransaction abort, but
for two-phase we don't need to call SetupCheckXidLive every time,
because we are sure that the transaction is going to be
the same throughout the processing.

Apart from this I have also noticed a couple of cosmetic changes

+ {
+ xl_xact_parsed_prepare parsed;
+ xl_xact_prepare *xlrec;
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }

One blank line after variable declarations

- /* remove potential on-disk data, and deallocate */
+    /*
+     * remove potential on-disk data, and deallocate.
+     *
+     * We remove it even for prepared transactions (GID is enough to
+     * commit/abort those later).
+     */
+
  ReorderBufferCleanupTXN(rb, txn);

Comment not aligned properly

v6-0003

+LookupGXact(const char *gid)
+{
+ int i;
+
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction gxact = TwoPhaseState->prepXacts[i];

I think we should take the lock in LW_SHARED mode here, no?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#39Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#38)

On Tue, Sep 29, 2020 at 8:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have started looking into your latest patches, as of now I have a
few comments.

v6-0001

@@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
prev_lsn = change->lsn;

/* Set the current xid to detect concurrent aborts. */
- if (streaming)
+ if (streaming || rbtxn_prepared(change->txn))
{
curtxn = change->txn;
SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
break;
}
}
-

For a streaming transaction we need to check the xid every time because
there could be a concurrent subtransaction abort, but
for two-phase we don't need to call SetupCheckXidLive every time,
because we are sure that the transaction is going to be
the same throughout the processing.

While decoding transactions at 'prepare' time there could be multiple
sub-transactions like in the case below. Won't that be impacted if we
follow your suggestion here?

postgres=# Begin;
BEGIN
postgres=*# insert into t1 values(1,'aaa');
INSERT 0 1
postgres=*# savepoint s1;
SAVEPOINT
postgres=*# insert into t1 values(2,'aaa');
INSERT 0 1
postgres=*# savepoint s2;
SAVEPOINT
postgres=*# insert into t1 values(3,'aaa');
INSERT 0 1
postgres=*# Prepare Transaction 'foo';
PREPARE TRANSACTION

--
With Regards,
Amit Kapila.

#40Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#39)

On Wed, Sep 30, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 29, 2020 at 8:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have started looking into your latest patches, as of now I have a
few comments.

v6-0001

@@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
prev_lsn = change->lsn;

/* Set the current xid to detect concurrent aborts. */
- if (streaming)
+ if (streaming || rbtxn_prepared(change->txn))
{
curtxn = change->txn;
SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
break;
}
}
-

For a streaming transaction we need to check the xid every time because
there could be a concurrent subtransaction abort, but
for two-phase we don't need to call SetupCheckXidLive every time,
because we are sure that the transaction is going to be
the same throughout the processing.

While decoding transactions at 'prepare' time there could be multiple
sub-transactions like in the case below. Won't that be impacted if we
follow your suggestion here?

postgres=# Begin;
BEGIN
postgres=*# insert into t1 values(1,'aaa');
INSERT 0 1
postgres=*# savepoint s1;
SAVEPOINT
postgres=*# insert into t1 values(2,'aaa');
INSERT 0 1
postgres=*# savepoint s2;
SAVEPOINT
postgres=*# insert into t1 values(3,'aaa');
INSERT 0 1
postgres=*# Prepare Transaction 'foo';
PREPARE TRANSACTION

But once we prepare the transaction, we cannot roll back an individual
subtransaction. We can only roll back the main transaction, so instead
of setting each individual subxact as CheckXidLive, we can just set the
main XID; there is no need to set it on every change. Just set it
before starting to process.
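
The suggestion can be sketched as a loop in standalone C. This is a simplified model under assumptions: xid_t, change_rec, setup_check_xid and process_changes are hypothetical names used only for illustration, and the real ReorderBufferProcessTXN does far more per change. For a streamed transaction the watched xid is re-armed for every change, because each change may belong to a different subtransaction that can abort concurrently; for a prepared transaction it is armed once with the top-level xid before the loop.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t xid_t;

/* One decoded change, tagged with the (sub)transaction that produced it. */
typedef struct change_rec
{
	xid_t		xid;
} change_rec;

static xid_t check_xid = 0;		/* stand-in for the xid being watched */
static int	setup_calls = 0;	/* how many times the watch was (re)armed */

static void
setup_check_xid(xid_t xid)
{
	check_xid = xid;
	setup_calls++;
}

/*
 * Walk the changes of one top-level transaction and return how many times
 * the watched xid was set.
 */
static int
process_changes(const change_rec *changes, size_t nchanges,
				xid_t top_xid, bool streaming, bool prepared)
{
	setup_calls = 0;

	/* Two-phase, not streaming: arm once with the top-level xid. */
	if (prepared && !streaming)
		setup_check_xid(top_xid);

	for (size_t i = 0; i < nchanges; i++)
	{
		/* Streaming: a subtransaction may abort concurrently, so re-arm. */
		if (streaming)
			setup_check_xid(changes[i].xid);

		/* ... the change would be handed to the output plugin here ... */
	}

	return setup_calls;
}
```

For a prepared transaction with three changes this arms the check once instead of three times. It only reduces the number of calls; it is not needed for correctness.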

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#41Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#40)

On Wed, Sep 30, 2020 at 2:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Sep 30, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 29, 2020 at 8:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have started looking into your latest patches, as of now I have a
few comments.

v6-0001

@@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
prev_lsn = change->lsn;

/* Set the current xid to detect concurrent aborts. */
- if (streaming)
+ if (streaming || rbtxn_prepared(change->txn))
{
curtxn = change->txn;
SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
break;
}
}
-

For a streaming transaction we need to check the xid every time because
there could be a concurrent subtransaction abort, but
for two-phase we don't need to call SetupCheckXidLive every time,
because we are sure that the transaction is going to be
the same throughout the processing.

While decoding transactions at 'prepare' time there could be multiple
sub-transactions like in the case below. Won't that be impacted if we
follow your suggestion here?

postgres=# Begin;
BEGIN
postgres=*# insert into t1 values(1,'aaa');
INSERT 0 1
postgres=*# savepoint s1;
SAVEPOINT
postgres=*# insert into t1 values(2,'aaa');
INSERT 0 1
postgres=*# savepoint s2;
SAVEPOINT
postgres=*# insert into t1 values(3,'aaa');
INSERT 0 1
postgres=*# Prepare Transaction 'foo';
PREPARE TRANSACTION

But once we prepare the transaction, we cannot roll back an individual
subtransaction.

Sure, but a Rollback can come before the prepare, as in the case below. It
will appear as a concurrent abort (assume there is some DDL which
changes the table before the Rollback statement) because it has
already been done by the backend, and that needs to be caught by this
mechanism only.

Begin;
insert into t1 values(1,'aaa');
savepoint s1;
insert into t1 values(2,'aaa');
savepoint s2;
insert into t1 values(3,'aaa');
Rollback to savepoint s2;
insert into t1 values(4,'aaa');
Prepare Transaction 'foo';

--
With Regards,
Amit Kapila.

#42Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#41)

On Wed, Sep 30, 2020 at 3:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 30, 2020 at 2:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Sep 30, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 29, 2020 at 8:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have started looking into your latest patches, as of now I have a
few comments.

v6-0001

@@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
prev_lsn = change->lsn;

/* Set the current xid to detect concurrent aborts. */
- if (streaming)
+ if (streaming || rbtxn_prepared(change->txn))
{
curtxn = change->txn;
SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
break;
}
}
-

For a streaming transaction we need to check the xid every time because
there could be a concurrent subtransaction abort, but
for two-phase we don't need to call SetupCheckXidLive every time,
because we are sure that the transaction is going to be
the same throughout the processing.

While decoding transactions at 'prepare' time there could be multiple
sub-transactions like in the case below. Won't that be impacted if we
follow your suggestion here?

postgres=# Begin;
BEGIN
postgres=*# insert into t1 values(1,'aaa');
INSERT 0 1
postgres=*# savepoint s1;
SAVEPOINT
postgres=*# insert into t1 values(2,'aaa');
INSERT 0 1
postgres=*# savepoint s2;
SAVEPOINT
postgres=*# insert into t1 values(3,'aaa');
INSERT 0 1
postgres=*# Prepare Transaction 'foo';
PREPARE TRANSACTION

But once we prepare the transaction, we cannot roll back an individual
subtransaction.

Sure, but a Rollback can come before the prepare, as in the case below. It
will appear as a concurrent abort (assume there is some DDL which
changes the table before the Rollback statement) because it has
already been done by the backend, and that needs to be caught by this
mechanism only.

Begin;
insert into t1 values(1,'aaa');
savepoint s1;
insert into t1 values(2,'aaa');
savepoint s2;
insert into t1 values(3,'aaa');
Rollback to savepoint s2;
insert into t1 values(4,'aaa');
Prepare Transaction 'foo';

If we are streaming on the prepare, that means we must have decoded
that rollback WAL, which means we should have removed the
ReorderBufferTXN for those subxacts.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#43Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#42)

On Wed, Sep 30, 2020 at 3:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Sep 30, 2020 at 3:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 30, 2020 at 2:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Sep 30, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 29, 2020 at 8:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have started looking into your latest patches, as of now I have a
few comments.

v6-0001

@@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
prev_lsn = change->lsn;

/* Set the current xid to detect concurrent aborts. */
- if (streaming)
+ if (streaming || rbtxn_prepared(change->txn))
{
curtxn = change->txn;
SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
break;
}
}
-

For a streaming transaction we need to check the xid every time because
there could be a concurrent subtransaction abort, but
for two-phase we don't need to call SetupCheckXidLive every time,
because we are sure that the transaction is going to be
the same throughout the processing.

While decoding transactions at 'prepare' time there could be multiple
sub-transactions like in the case below. Won't that be impacted if we
follow your suggestion here?

postgres=# Begin;
BEGIN
postgres=*# insert into t1 values(1,'aaa');
INSERT 0 1
postgres=*# savepoint s1;
SAVEPOINT
postgres=*# insert into t1 values(2,'aaa');
INSERT 0 1
postgres=*# savepoint s2;
SAVEPOINT
postgres=*# insert into t1 values(3,'aaa');
INSERT 0 1
postgres=*# Prepare Transaction 'foo';
PREPARE TRANSACTION

But once we prepare the transaction, we cannot roll back an individual
subtransaction.

Sure, but a Rollback can come before the prepare, as in the case below. It
will appear as a concurrent abort (assume there is some DDL which
changes the table before the Rollback statement) because it has
already been done by the backend, and that needs to be caught by this
mechanism only.

Begin;
insert into t1 values(1,'aaa');
savepoint s1;
insert into t1 values(2,'aaa');
savepoint s2;
insert into t1 values(3,'aaa');
Rollback to savepoint s2;
insert into t1 values(4,'aaa');
Prepare Transaction 'foo';

If we are streaming on the prepare, that means we must have decoded
that rollback WAL, which means we should have removed the
ReorderBufferTXN for those subxacts.

Okay, valid point. We can avoid setting it for each sub-transaction in
that case, but OTOH even if we do set it there shouldn't be any
bug.

--
With Regards,
Amit Kapila.

#44Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#43)

On Wed, Sep 30, 2020 at 3:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 30, 2020 at 3:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Sep 30, 2020 at 3:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 30, 2020 at 2:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Sep 30, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 29, 2020 at 8:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have started looking into your latest patches, as of now I have a
few comments.

v6-0001

@@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
prev_lsn = change->lsn;

/* Set the current xid to detect concurrent aborts. */
- if (streaming)
+ if (streaming || rbtxn_prepared(change->txn))
{
curtxn = change->txn;
SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
break;
}
}
-

For a streaming transaction we need to check the xid every time because
there could be a concurrent subtransaction abort, but
for two-phase we don't need to call SetupCheckXidLive every time,
because we are sure that the transaction is going to be
the same throughout the processing.

While decoding transactions at 'prepare' time there could be multiple
sub-transactions like in the case below. Won't that be impacted if we
follow your suggestion here?

postgres=# Begin;
BEGIN
postgres=*# insert into t1 values(1,'aaa');
INSERT 0 1
postgres=*# savepoint s1;
SAVEPOINT
postgres=*# insert into t1 values(2,'aaa');
INSERT 0 1
postgres=*# savepoint s2;
SAVEPOINT
postgres=*# insert into t1 values(3,'aaa');
INSERT 0 1
postgres=*# Prepare Transaction 'foo';
PREPARE TRANSACTION

But once we prepare the transaction, we cannot roll back an individual
subtransaction.

Sure, but a Rollback can come before the prepare, as in the case below. It
will appear as a concurrent abort (assume there is some DDL which
changes the table before the Rollback statement) because it has
already been done by the backend, and that needs to be caught by this
mechanism only.

Begin;
insert into t1 values(1,'aaa');
savepoint s1;
insert into t1 values(2,'aaa');
savepoint s2;
insert into t1 values(3,'aaa');
Rollback to savepoint s2;
insert into t1 values(4,'aaa');
Prepare Transaction 'foo';

If we are streaming on the prepare, that means we must have decoded
that rollback WAL, which means we should have removed the
ReorderBufferTXN for those subxacts.

Okay, valid point. We can avoid setting it for each sub-transaction in
that case, but OTOH even if we do set it there shouldn't be any
bug.

Right, there will not be any bug, just an optimization.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#45Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#33)
1 attachment(s)

Hello Ajin.

I have done some review of the v6 patches.

I had some difficulty posting my review comments to the OSS list, so
I am putting them as an attachment here.

Kind Regards,
Peter Smith
Fujitsu Australia

Attachments:

OSS-List-v6-review-comments-20201006.txt (text/plain; charset=US-ASCII)
Hello Ajin.

I have gone through the v6 patch changes and have a list of review
comments below.

Apologies for the length of this email - I know that many of the
following comments are trivial, but I figured I should either just
ignore everything cosmetic, or list everything regardless. I chose the
latter.

There may be some duplication where the same review comment is written
for multiple files and/or where the same file is in your multiple
patches.

Kind Regards.
Peter Smith
Fujitsu Australia

[BEGIN]

==========
Patch V6-0001, File: contrib/test_decoding/expected/prepared.out (so
prepared.sql also)
==========

COMMENT
Line 30 - The INSERT INTO test_prepared1 VALUES (2); is kind of
strange because it is not really part of the prior test nor the
following test. Maybe it would be better to have a comment describing
the purpose of this isolated INSERT and to also consume the data from
the slot so it does not get jumbled with the data of the following
(abort) test.

;

COMMENT
Line 53 - Same comment for this test INSERT INTO test_prepared1 VALUES
(4); It kind of has nothing really to do with either the prior (abort)
test nor the following (ddl) test.

;

COMMENT
Line 60 - Seems to check which locks are held for the test_prepared1
table while the transaction is in progress. Maybe it would be better
to have more comments describing what is expected here and why.

;

COMMENT
Line 88 - There is a comment in the test saying "-- We should see '7'
before '5' in our results since it commits first." but I did not see
any test code that actually verifies that happens.

;

QUESTION
Line 120 - I did not really understand the SQL checking the pg_class.
I expected this would be checking table 'test_prepared1' instead. Can
you explain it?
SELECT 'pg_class' AS relation, locktype, mode
FROM pg_locks
WHERE locktype = 'relation'
AND relation = 'pg_class'::regclass;
relation | locktype | mode
----------+----------+------
(0 rows)

;

QUESTION
Line 139 - SET statement_timeout = '1s'; Is 1 second short enough
here for this test, or might these statements complete in less than
one second anyhow?

;

QUESTION
Line 163 - How is this testing a SAVEPOINT? Or is it only to check
that the SAVEPOINT command is not part of the replicated changes?

;

COMMENT
Line 175 - Missing underscore in comment. Code requires also underscore:
"nodecode" --> "_nodecode"

==========
Patch V6-0001, File: contrib/test_decoding/test_decoding.c
==========

COMMENT
Line 43
@@ -36,6 +40,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ TransactionId check_xid; /* track abort of this txid */
} TestDecodingData;

The "check_xid" seems a meaningless name. Check what?
IIUC maybe should be something like "check_xid_aborted"

;

COMMENT
Line 105
@@ -88,6 +93,19 @@ static void
pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 ReorderBufferTXN *txn,
 int nrelations, Relation relations[],
 ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,

Remove extra blank line after these functions

;

COMMENT
Line 149
@@ -116,6 +134,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 cb->stream_change_cb = pg_decode_stream_change;
 cb->stream_message_cb = pg_decode_stream_message;
 cb->stream_truncate_cb = pg_decode_stream_truncate;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
+
 }

There is a confusing mix of terminology where sometimes things are
referred to as ROLLBACK/rollback and other times apparently the same
operation is referred to as ABORT/abort. I do not know the root cause of
this mixture. IIUC maybe the internal functions and protocol generally
use the term "abort", whereas the SQL syntax is "ROLLBACK"... but
where those two terms collide in the middle it gets quite confusing.

At least I thought the names of the "callbacks" which get exposed to
the user (e.g. in the help) might be better if they would match the
SQL.
"abort_prepared_cb" --> "rollback_prepared_cb"

There are similar review comments like this below where the
alternating terms caused me some confusion.

~

Also, Remove the extra blank line before the end of the function.

;

COMMENT
Line 267
@@ -227,6 +252,42 @@ pg_decode_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 strVal(elem->arg), elem->defname)));
 }
+ else if (strcmp(elem->defname, "two-phase-commit") == 0)
+ {
+ if (elem->arg == NULL)
+ continue;

IMO the "check-xid" code might be better rearranged so the NULL check
is first instead of if/else.
e.g.
if (elem->arg == NULL)
    ereport(FATAL,
        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
        errmsg("check-xid needs an input value")));
~

Also, is it really supposed to be FATAL instead of ERROR? That is not
the same as the other surrounding code.

;

COMMENT
Line 296
if (data->check_xid <= 0)
 ereport(ERROR,
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 errmsg("Specify positive value for parameter \"%s\","
 " you specified \"%s\"",
 elem->defname, strVal(elem->arg))));

The code checking for <= 0 seems over-complicated. Because conversion
was using strtoul() I fail to see how this can ever be < 0. Wouldn't
it be easier to simply test the result of the strtoul() function?

BEFORE: if (errno == EINVAL || errno == ERANGE)
AFTER: if (data->check_xid == 0)
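
For illustration only, and not code from the patch: a minimal C sketch of
validating an option value parsed with strtoul(). The helper name
parse_xid_option and the use of 0 as the "invalid" value are assumptions made
for this example. Because the result is unsigned, a "< 0" test can never be
true; the checks that matter are the end pointer, errno, and the zero/range
test.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the patch): parse a decimal xid option.
 * Returns true and stores the value in *xid only for a non-zero value
 * that fits in 32 bits; 0 stands in for an invalid transaction id here.
 */
bool
parse_xid_option(const char *str, uint32_t *xid)
{
	char	   *end;
	unsigned long val;

	if (str == NULL || *str == '\0')
		return false;

	errno = 0;
	val = strtoul(str, &end, 10);

	/* Reject overflow, "no digits at all", and trailing garbage. */
	if (errno != 0 || end == str || *end != '\0')
		return false;

	/* val is unsigned, so only zero and too-large values need rejecting. */
	if (val == 0 || val > UINT32_MAX)
		return false;

	*xid = (uint32_t) val;
	return true;
}
```

A caller would raise its parameter error whenever the helper returns false,
for example for "0", "12ab", or an empty string.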

~

Also, should this be FATAL? Everything else similar is ERROR.

;

COMMENT
(general)
I don't recall seeing any of these decoding options (e.g.
"two-phase-commit", "check-xid") documented anywhere.
So how can a user even know these options exist so they can use them?
Perhaps options should be described on this page?
https://www.postgresql.org/docs/13/functions-admin.html#FUNCTIONS-REPLICATION

;

COMMENT
(general)
"check-xid" is a meaningless option name. Maybe something like
"check-xid-aborted" is more useful?
Suggest changing the member, the option, and the error messages to
match some better name.

;

COMMENT
Line 314
@@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
 }

 ctx->streaming &= enable_streaming;
+ ctx->enable_twophase &= enable_2pc;
 }

The "ctx->enable_twophase" is inconsistent naming with the
"ctx->streaming" member.
"enable_twophase" --> "twophase"

;

COMMENT
Line 374
@@ -297,6 +359,94 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
 OutputPluginWrite(ctx, true);
 }

+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Remove the extra preceding blank line.

~

I did not find anything in the help about "_nodecode". Should it be
there, or is this a deliberately undocumented feature?
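
As a rough sketch of the rule described in the quoted comment (the function
name demo_filter_prepare is hypothetical, and this is not the plugin's actual
code), a substring test on the GID could look like this in C:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the filtering rule from the quoted comment: return true
 * (meaning "filter this prepared transaction out") when the GID
 * contains the "_nodecode" substring. A NULL GID is never filtered.
 */
bool
demo_filter_prepare(const char *gid)
{
	return gid != NULL && strstr(gid, "_nodecode") != NULL;
}
```

Under this rule a GID such as "test_prepared_nodecode" is filtered out, while
a GID containing only "nodecode" without the leading underscore is still
decoded.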

;

QUESTION
Line 440
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,

Is this comment wrong?
"ABORT PREPARED" --> "ROLLBACK PREPARED" ??

;

COMMENT
Line 620
@@ -455,6 +605,22 @@ pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
 }
 data->xact_wrote_changes = true;

+ /* if check_xid is specified */
+ if (TransactionIdIsValid(data->check_xid))
+ {
+ elog(LOG, "waiting for %u to abort", data->check_xid);
+ while (TransactionIdIsInProgress(dat

The check_xid seems a meaningless name, and the comment "/* if
check_xid is specified */" was not helpful either.
IIUC the purpose of this is to check that the nominated xid is always rolled back.
So the appropriate name may be more like "check-xid-aborted".

;

==========
Patch V6-0001, File: doc/src/sgml/logicaldecoding.sgml
==========

COMMENT/QUESTION
Section 48.6.1
@@ -387,6 +387,10 @@ typedef struct OutputPluginCallbacks
 LogicalDecodeTruncateCB truncate_cb;
 LogicalDecodeCommitCB commit_cb;
 LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;

Confused by the mixing of terminologies "abort" and "rollback".
Why is it LogicalDecodeAbortPreparedCB instead of
LogicalDecodeRollbackPreparedCB?
Why is it abort_prepared_cb instead of rollback_prepared_cb?

I thought everything the user sees should be ROLLBACK/rollback (like
the SQL) regardless of what the internal functions might be called.

;

COMMENT
Section 48.6.1
The begin_cb, change_cb and commit_cb callbacks are required, while
startup_cb, filter_by_origin_cb, truncate_cb, and shutdown_cb are
optional. If truncate_cb is not set but a TRUNCATE is to be decoded,
the action will be ignored.

The 1st paragraph beneath the typedef does not mention the newly added
callbacks to say if they are required or optional.

;

COMMENT
Section 48.6.4.5
Section 48.6.4.6
Section 48.6.4.7
@@ -578,6 +588,55 @@ typedef void (*LogicalDecodeCommitCB) (struct
LogicalDecodingContext *ctx,
 </para>
 </sect3>

+ <sect3 id="logicaldecoding-output-plugin-prepare">
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+<programlisting>

The wording and titles are a bit backwards compared to the others.
e.g. previously was "Transaction Begin" (not "Begin Transaction") and
"Transaction End" (not "End Transaction").

So, for consistency with the existing titles, IMO these new
titles (and wording) should change to:
- "Commit Prepared Transaction Callback" --> "Transaction Commit
Prepared Callback"
- "Rollback Prepared Transaction Callback" --> "Transaction Rollback
Prepared Callback"
- "whenever a commit prepared transaction has been decoded" -->
"whenever a transaction commit prepared has been decoded"
- "whenever a rollback prepared transaction has been decoded." -->
"whenever a transaction rollback prepared has been decoded."

;

==========
Patch V6-0001, File: src/backend/replication/logical/decode.c
==========

COMMENT
Line 74
@@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext
*ctx, XLogRecordBuffer *buf,
 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);

The 2nd line of DecodePrepare is misaligned by one space.

;

COMMENT
Line 321
@@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
 }
 break;
 case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
+ xl_xact_prepare *xlrec;
+ /* check that output plugin is capable of twophase decoding */

"twophase" --> "two-phase"

~

Also, add a blank line after the declarations.

;

==========
Patch V6-0001, File: src/backend/replication/logical/logical.c
==========

COMMENT
Line 249
@@ -225,6 +237,19 @@ StartupDecodingContext(List *output_plugin_options,
 (ctx->callbacks.stream_message_cb != NULL) ||
 (ctx->callbacks.stream_truncate_cb != NULL);

+ /*
+ * To support two phase logical decoding, we require
prepare/commit-prepare/abort-prepare
+ * callbacks. The filter-prepare callback is optional. We however
enable two phase logical
+ * decoding when at least one of the methods is enabled so that we
can easily identify
+ * missing methods.

The terminology is generally well known as "two-phase" (with the
hyphen) https://en.wikipedia.org/wiki/Two-phase_commit_protocol so
let's be consistent for all the patch code comments. Please search the
code and correct this in all places, even where I might have failed to
identify it.

"two phase" --> "two-phase"

;

COMMENT
Line 822
@@ -782,6 +807,111 @@ commit_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
 }

 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)

"support 2 phase" --> "supports two-phase" in the comment

;

COMMENT
Line 844
Code condition seems strange and/or broken.
if (ctx->enable_twophase && ctx->callbacks.prepare_cb == NULL)
Because if the flag is false then this condition is skipped.
But then if the callback was also NULL, attempting to call it to
"do the actual work" will dereference a NULL pointer.

~

Also, I wonder whether this check should be the first thing in this function.
Because if it fails, does it even make sense that all the errcallback
code was set up?
E.g errcallback.arg potentially is left pointing to a stack variable
on a stack that no longer exists.
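
To make the ordering concern concrete, here is a small C sketch using
stand-in types. DemoContext, DemoResult, demo_prepare_wrapper and
demo_noop_prepare are all hypothetical names and do not come from
PostgreSQL. It shows one possible arrangement in which both the flag and the
callback pointer are validated before any other work is done; whether the
real wrapper can be reached with the flag off is not something this sketch
establishes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the decoding context; not PostgreSQL types. */
typedef void (*prepare_cb_fn) (int xid);

typedef struct DemoContext
{
	bool		twophase;		/* two-phase decoding enabled? */
	prepare_cb_fn prepare_cb;	/* NULL if the plugin did not register one */
	int			calls;			/* number of times the callback was invoked */
} DemoContext;

typedef enum DemoResult
{
	DEMO_OK,
	DEMO_SKIPPED,				/* two-phase decoding is off */
	DEMO_MISSING_CALLBACK		/* enabled, but no callback registered */
} DemoResult;

/* Trivial callback used only to exercise the wrapper. */
void
demo_noop_prepare(int xid)
{
	(void) xid;
}

DemoResult
demo_prepare_wrapper(DemoContext *ctx, int xid)
{
	/* Validate first, before any other state is set up. */
	if (!ctx->twophase)
		return DEMO_SKIPPED;
	if (ctx->prepare_cb == NULL)
		return DEMO_MISSING_CALLBACK;

	/* Only now is it safe to call through the pointer. */
	ctx->prepare_cb(xid);
	ctx->calls++;
	return DEMO_OK;
}
```

With the checks first, a missing callback is reported without ever calling
through a NULL function pointer, and no error-context state has been set up
at the point of failure.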

;

COMMENT
Line 857
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

"support 2 phase" --> "supports two-phase" in the comment

~

Also, Same potential trouble with the condition:
if (ctx->enable_twophase && ctx->callbacks.commit_prepared_cb == NULL)
Same as previously asked. Should this check be first thing in this function?

;

COMMENT
Line 892
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

"support 2 phase" --> "supports two-phase" in the comment

~

Same potential trouble with the condition:
if (ctx->enable_twophase && ctx->callbacks.abort_prepared_cb == NULL)
Same as previously asked. Should this check be the first thing in this function?

;

COMMENT
Line 1013
@@ -858,6 +988,51 @@ truncate_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
 error_context_stack = errcallback.previous;
 }

+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)

Fix wording in comment:
"twophase" --> "two-phase transactions"
"twophase transactions" --> "two-phase transactions"

==========
Patch V6-0001, File: src/backend/replication/logical/reorderbuffer.c
==========

COMMENT
Line 255
@@ -251,7 +251,8 @@ static Size
ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb,
ReorderBufferTXN *txn,
 char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb,
ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+ bool txn_prepared);

The alignment is inconsistent. One more space needed before "bool txn_prepared"

;

COMMENT
Line 417
@@ -413,6 +414,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
 }

 /* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }

Should add the blank line before this new code, as it was before.

;

COMMENT
Line 1564
@@ -1502,12 +1561,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
 }

 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them. Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either
after streaming or
+ * after a PREPARE.

typo "snapshots.If" -> "snapshots. If"

;

COMMENT/QUESTION
Line 1590
@@ -1526,7 +1587,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
 Assert(rbtxn_is_known_subxact(subtxn));
 Assert(subtxn->nsubtxns == 0);

- ReorderBufferTruncateTXN(rb, subtxn);
+ ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 }

There are some code paths here I did not understand how they match the comments.
Because this function is recursive it seems that it may be called
where the 2nd parameter txn is a sub-transaction.

But then this seems at odds with some of the other code comments of
this function, which process the txn without ever testing whether it is
really toplevel or not:

e.g. Line 1593 "/* cleanup changes in the toplevel txn */"
e.g. Line 1632 "They are always stored in the toplevel transaction."

;

COMMENT
Line 1644
@@ -1560,9 +1621,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
 * about the toplevel xact (we send the XID in all messages), but we never
 * stream XIDs of empty subxacts.
 */
- if ((!txn->toptxn) || (txn->nentries_mem != 0))
+ if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 txn->txn_flags |= RBTXN_IS_STREAMED;

+ if (txn_prepared)

/* remove the change from it's containing list */
typo "it's" --> "its"

;

QUESTION
Line 1977
@@ -1880,7 +1965,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
 ReorderBufferChange *specinsert)
 {
 /* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn);
+ ReorderBufferTruncateTXN(rb, txn, false);

How do you know the 3rd parameter - i.e. txn_prepared - should be
hardwired false here?
e.g. I thought that maybe rbtxn_prepared(txn) can be true here.

;

COMMENT
Line 2345
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
 break;
 }
 }
-
 /*

Looks like accidental blank line deletion. This should be put back how it was

;

COMMENT/QUESTION
Line 2374
@@ -2278,7 +2362,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
 }
 }
 else
- rb->commit(rb, txn, commit_lsn);
+ {
+ /*
+ * Call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).

"twophase" --> "two-phase"

~

Also, I was confused by the apparent assumption of exclusiveness of
streaming and 2PC...
e.g. what if streaming AND 2PC then it won't do rb->prepare()

;

QUESTION
Line 2424
@@ -2319,11 +2412,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
 */
 if (streaming)
 {
- ReorderBufferTruncateTXN(rb, txn);
+ ReorderBufferTruncateTXN(rb, txn, false);

 /* Reset the CheckXidAlive */
 CheckXidAlive = InvalidTransactionId;
 }
+ else if (rbtxn_prepared(txn))

I was confused by the exclusiveness of streaming/2PC.
e.g. what if streaming AND 2PC at same time - how can you pass false
as 3rd param to ReorderBufferTruncateTXN?

;

COMMENT
Line 2463
@@ -2352,17 +2451,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,

 /*
 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
- * abort of the (sub)transaction we are streaming. We need to do the
+ * abort of the (sub)transaction we are streaming or preparing. We
need to do the
 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 */

"twoi phase" --> "two-phase"

;

QUESTIONS
Line 2482
@@ -2370,10 +2470,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
 errdata = NULL;
 curtxn->concurrent_abort = true;

- /* Reset the TXN so that it is allowed to stream remaining data. */
- ReorderBufferResetTXN(rb, txn, snapshot_now,
- command_id, prev_lsn,
- specinsert);
+ /* If streaming, reset the TXN so that it is allowed to stream
remaining data. */
+ if (streaming)

Re: /* If streaming, reset the TXN so that it is allowed to stream
remaining data. */
I was confused by the exclusiveness of streaming/2PC.
Is it not possible for the streaming flag and rbtxn_prepared(txn) to both
be true at the same time?

~

elog(LOG, "stopping decoding of %s (%u)",
 txn->gid[0] != '\0'? txn->gid:"", txn->xid);

Is this a safe operation, or do you also need to test txn->gid is not NULL?
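
For illustration, a NULL-safe variant of that expression could be factored
into a small helper. This is a sketch in C only; gid_for_log is a hypothetical
name, and the patch may guarantee a non-NULL gid at this point by other means.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return a printable GID, treating both a NULL
 * pointer and an empty string as "no GID".
 */
const char *
gid_for_log(const char *gid)
{
	/* Test the pointer itself before looking at its first character. */
	return (gid != NULL && gid[0] != '\0') ? gid : "";
}
```

The same pointer-before-contents ordering applies to any assertion that calls
strlen() on the gid.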

;

COMMENT
Line 2606
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,

"twophase" --> "two-phase"

;

QUESTION
Line 2655
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,

"This is used to handle COMMIT/ABORT PREPARED"
Should that say "COMMIT/ROLLBACK PREPARED"?

;

COMMENT
Line 2668

"Anyways, 2PC transactions" --> "Anyway, two-phase transactions"

;

COMMENT
Line 2765
@@ -2495,7 +2731,13 @@ ReorderBufferAbort(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
 /* cosmetic... */
 txn->final_lsn = lsn;

- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *

Remove the blank between the comment and code.

==========
Patch V6-0001, File: src/include/replication/logical.h
==========

COMMENT
Line 89

"two phase" -> "two-phase"

;

COMMENT
Line 89

For consistency with the previous member naming really the new member
should just be called "twophase" rather than "enable_twophase"

;

==========
Patch V6-0001, File: src/include/replication/output_plugin.h
==========

QUESTION
Line 106

As previously asked, why is the callback function/typedef referred to as
AbortPrepared instead of RollbackPrepared?
It does not match the SQL and the function comment, and seems only to
add some unnecessary confusion.

;

==========
Patch V6-0001, File: src/include/replication/reorderbuffer.h
==========

QUESTION
Line 116
@@ -162,9 +163,13 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT 0x0002
 #define RBTXN_IS_SERIALIZED 0x0004
-#define RBTXN_IS_STREAMED 0x0008
-#define RBTXN_HAS_TOAST_INSERT 0x0010
-#define RBTXN_HAS_SPEC_INSERT 0x0020
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_IS_STREAMED 0x0080
+#define RBTXN_HAS_TOAST_INSERT 0x0100
+#define RBTXN_HAS_SPEC_INSERT 0x0200

I was wondering why when adding new flags, some of the existing flag
masks were also altered.
I am assuming this is ok because they are never persisted but are only
used in the protocol (??)

;

COMMENT
Line 226
@@ -218,6 +223,15 @@ typedef struct ReorderBufferChange
 ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )

+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+

Probably all the "txn->txn_flags" here might be more safely written
with parentheses in the macro like "(txn)->txn_flags".

~

Also, start all comments with a capital letter. And what does "in the
meanwhile" mean?

;

COMMENT
Line 410
@@ -390,6 +407,39 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 ReorderBufferTXN *txn,
 XLogRecPtr commit_lsn);

The format is inconsistent with all other callback signatures here,
where the 1st arg was on the same line as the typedef.

;

COMMENT
Line 440-442

Excessive blank lines following this change?

;

COMMENT
Line 638
@@ -571,6 +631,15 @@ void
ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid,
XLog
 bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);

+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);

Not aligned consistently with other function prototypes.

;

==========
Patch V6-0003, File: src/backend/access/transam/twophase.c
==========

COMMENT
Line 551
@@ -548,6 +548,37 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }

 /*
+ * LookupGXact
+ * Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)

There is potential to refactor/simplify this code:
e.g.

bool
LookupGXact(const char *gid)
{
 int i;
 bool found = false;

 LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 {
  GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
  /* Ignore not-yet-valid GIDs */
  if (gxact->valid && strcmp(gxact->gid, gid) == 0)
  {
   found = true;
   break;
  }
 }
 LWLockRelease(TwoPhaseStateLock);
 return found;
}

;

==========
Patch V6-0003, File: src/backend/replication/logical/proto.c
==========

COMMENT
Line 86
@@ -72,12 +72,17 @@ logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data)
 */
 void
 logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
- XLogRecPtr commit_lsn)

Since now the flags are used the code comment is wrong.
"/* send the flags field (unused for now) */"

;

COMMENT
Line 129
@@ -106,6 +115,77 @@ logicalrep_read_commit(StringInfo in,
LogicalRepCommitData *commit_data)
 }

 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,

"2PC transactions" --> "two-phase commit transactions"

;

COMMENT
Line 133

Assert(strlen(txn->gid) > 0);
Shouldn't that assertion also check txn->gid is not NULL (to prevent
a NULL pointer dereference in case gid was NULL)?

;

COMMENT
Line 177
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)

prepare_data->prepare_type = flags;
This code may be OK but it does seem a bit of an abuse of the flags.

e.g. Are they flags or are they really enum values?
e.g. And if they are effectively enums (it appears they are) then
it seems inconsistent that |= was used when they were previously
assigned.
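
One way to express "exactly one of these values" is an enum with plain
assignment on the writer side and an explicit check on the reader side. The
following C sketch uses hypothetical names (DemoPrepareType, demo_encode,
demo_decode) and illustrative values; it is an alternative shape for
discussion, not what the patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical enum; the values are illustrative only. */
typedef enum DemoPrepareType
{
	DEMO_PREPARE = 1,
	DEMO_COMMIT_PREPARED,
	DEMO_ROLLBACK_PREPARED
} DemoPrepareType;

/* Writer side: plain assignment makes the "exactly one" intent explicit. */
uint8_t
demo_encode(DemoPrepareType type)
{
	return (uint8_t) type;
}

/* Reader side: accept only the known values, reject everything else. */
bool
demo_decode(uint8_t wire, DemoPrepareType *type)
{
	switch (wire)
	{
		case DEMO_PREPARE:
		case DEMO_COMMIT_PREPARED:
		case DEMO_ROLLBACK_PREPARED:
			*type = (DemoPrepareType) wire;
			return true;
		default:
			return false;
	}
}
```

With this shape there is no way to accumulate two values with |=, and an
unknown byte on the wire is rejected rather than stored.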

;

==========
Patch V6-0003, File: src/backend/replication/logical/worker.c
==========

COMMENT
Line 757
@@ -749,6 +753,141 @@ apply_handle_commit(StringInfo s)
 pgstat_report_activity(STATE_IDLE, NULL);
 }

+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+ Assert(prepare_data->prepare_lsn == remote_final_lsn);

Missing function comment to say this is called from apply_handle_prepare.

;

COMMENT
Line 798
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)

Missing function comment to say this is called from apply_handle_prepare.

;

COMMENT
Line 824
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)

Missing function comment to say this is called from apply_handle_prepare.

==========
Patch V6-0003, File: src/backend/replication/pgoutput/pgoutput.c
==========

COMMENT
Line 50
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);

The parameter indentation (2nd lines) does not match everything else
in this context.

;

COMMENT
Line 152
@@ -143,6 +149,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 cb->change_cb = pgoutput_change;
 cb->truncate_cb = pgoutput_truncate;
 cb->commit_cb = pgoutput_commit_txn;
+
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;

Remove the unnecessary blank line.

;

QUESTION
Line 386
@@ -373,7 +383,49 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
 OutputPluginUpdateProgress(ctx);

 OutputPluginPrepareWrite(ctx, true);
- logicalrep_write_commit(ctx->out, txn, commit_lsn);
+ logicalrep_write_commit(ctx->out, txn, commit_lsn, true);

Is the is_commit parameter of logicalrep_write_commit ever passed as false?
If yes, where?
If no, then what is the point of it?

;

COMMENT
Line 408
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,

Since all this function is identical to pg_output_prepare it might be
better to either
1. just leave this as a wrapper to delegate to that function
2. remove this one entirely and assign the callback to the common
pgoutput_prepare_txn

;

COMMENT
Line 419
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Since all this function is identical to pg_output_prepare it might be
better to either
1. just leave this as a wrapper to delegate to that function
2. remove this one entirely and assign the callback to the common
pgoutput_prepare_txn

;

COMMENT
Line 419
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Shouldn't this comment say "ROLLBACK PREPARED"?

;

==========
Patch V6-0003, File: src/include/replication/logicalproto.h
==========

QUESTION
Line 101
@@ -87,20 +87,55 @@ typedef struct LogicalRepBeginData
 TransactionId xid;
 } LogicalRepBeginData;

+/* Commit (and abort) information */

#define LOGICALREP_IS_ABORT 0x02
Is there a good reason why this is not called:
#define LOGICALREP_IS_ROLLBACK 0x02

;

COMMENT
Line 105

((flags == LOGICALREP_IS_COMMIT) || (flags == LOGICALREP_IS_ABORT))

Macros would be safer if flags are in parentheses
(((flags) == LOGICALREP_IS_COMMIT) || ((flags) == LOGICALREP_IS_ABORT))

;

COMMENT
Line 115

Unexpected whitespace for the typedef
"} LogicalRepPrepareData;"

;

COMMENT
Line 122
/* prepare can be exactly one of PREPARE, [COMMIT|ABORT] PREPARED*/
#define PrepareFlagsAreValid(flags) \
 ((flags == LOGICALREP_IS_PREPARE) || \
 (flags == LOGICALREP_IS_COMMIT_PREPARED) || \
 (flags == LOGICALREP_IS_ROLLBACK_PREPARED))

There is a confusing mixture of ABORT and ROLLBACK terms in the macros and comments
"[COMMIT|ABORT] PREPARED" --> "[COMMIT|ROLLBACK] PREPARED"

~

Also, it would be safer if flags are in parentheses
 (((flags) == LOGICALREP_IS_PREPARE) || \
 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
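
The hazard behind this suggestion can be shown with a short C sketch. The
DEMO_* names and values below are illustrative assumptions, not the real
constants. When the macro argument is an expression, the unparenthesized form
can be re-associated by operator precedence, while the parenthesized form
always evaluates the argument as one unit.

```c
#include <stdbool.h>

/* Illustrative values only; the real constants live in logicalproto.h. */
#define DEMO_IS_PREPARE            0x01
#define DEMO_IS_COMMIT_PREPARED    0x02
#define DEMO_IS_ROLLBACK_PREPARED  0x04

/* Argument used bare: an expression argument can re-associate. */
#define DEMO_VALID_UNSAFE(flags) \
	((flags == DEMO_IS_PREPARE) || \
	 (flags == DEMO_IS_COMMIT_PREPARED) || \
	 (flags == DEMO_IS_ROLLBACK_PREPARED))

/* Argument parenthesized: it is always compared as a single value. */
#define DEMO_VALID_SAFE(flags) \
	(((flags) == DEMO_IS_PREPARE) || \
	 ((flags) == DEMO_IS_COMMIT_PREPARED) || \
	 ((flags) == DEMO_IS_ROLLBACK_PREPARED))

/*
 * With the bare macro, "pick_first ? a : b == X" parses as
 * "pick_first ? a : (b == X)", so any non-zero a looks "valid".
 */
bool
demo_unsafe(bool pick_first, int a, int b)
{
	return DEMO_VALID_UNSAFE(pick_first ? a : b);
}

bool
demo_safe(bool pick_first, int a, int b)
{
	return DEMO_VALID_SAFE(pick_first ? a : b);
}
```

Here demo_unsafe(true, 0x08, 0) accepts the unknown value 0x08, whereas
demo_safe(true, 0x08, 0) rejects it.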

;

==========
Patch V6-0003, File: src/test/subscription/t/020_twophase.pl
==========

COMMENT
Line 131 - # check inserts are visible

Isn't this supposed to be checking for rows 12 and 13, instead of 11 and 12?

;

==========
Patch V6-0004, File: contrib/test_decoding/test_decoding.c
==========

COMMENT
Line 81
@@ -78,6 +78,15 @@ static void
pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 ReorderBufferTXN *txn,
 XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static

All these functions have a 3rd parameter called commit_lsn. Even
though the functions are not commit related. It seems like a cut/paste
error.

;

COMMENT
Line 142
@@ -130,6 +139,9 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 cb->stream_start_cb = pg_decode_stream_start;
 cb->stream_stop_cb = pg_decode_stream_stop;
 cb->stream_abort_cb = pg_decode_stream_abort;
+ cb->stream_prepare_cb = pg_decode_stream_prepare;
+ cb->stream_commit_prepared_cb = pg_decode_stream_commit_prepared;
+ cb->stream_abort_prepared_cb = pg_decode_stream_abort_prepared;
 cb->stream_commit_cb = pg_decode_stream_commit;

Can the "cb->stream_abort_prepared_cb" be changed to
"cb->stream_rollback_prepared_cb"?

;

COMMENT
Line 827
@@ -812,6 +824,78 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }

 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_pr

The commit_lsn (3rd parameter) is unused and seems like a cut/paste name error.

;

COMMENT
Line 875
+pg_decode_stream_abort_prepared(LogicalDecodingContext *ctx,

The commit_lsn (3rd parameter) is unused and seems like a cut/paste name error.

;

==========
Patch V6-0004, File: doc/src/sgml/logicaldecoding.sgml
==========

COMMENT
48.6.1
@@ -396,6 +396,9 @@ typedef struct OutputPluginCallbacks
 LogicalDecodeStreamStartCB stream_start_cb;
 LogicalDecodeStreamStopCB stream_stop_cb;
 LogicalDecodeStreamAbortCB stream_abort_cb;
+ LogicalDecodeStreamPrepareCB stream_prepare_cb;
+ LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+ LogicalDecodeStreamAbortPreparedCB stream_abort_prepared_cb;

Same question as in the previous review comments: why use the
terminology "abort" instead of "rollback"?

;

COMMENT
48.6.1
@@ -418,7 +421,9 @@ typedef void (*LogicalOutputPluginInit) (struct
OutputPluginCallbacks *cb);
 in-progress transactions. The <function>stream_start_cb</function>,
 <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
 <function>stream_commit_cb</function> and <function>stream_change_cb</function>
- are required, while <function>stream_message_cb</function> and
+ are required, while <function>stream_message_cb</function>,
+ <function>stream_prepare_cb</function>,
<function>stream_commit_prepared_cb</function>,
+ <function>stream_abort_prepared_cb</function>,

Missing "and".
... "stream_abort_prepared_cb, stream_truncate_cb are optional." -->
"stream_abort_prepared_cb, and stream_truncate_cb are optional."

;

COMMENT
Section 48.6.4.16
Section 48.6.4.17
Section 48.6.4.18
@@ -839,6 +844,45 @@ typedef void (*LogicalDecodeStreamAbortCB)
(struct LogicalDecodingContext *ctx,
 </para>
 </sect3>

+ <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+ <title>Stream Prepare Callback</title>
+ <para>
+ The <function>stream_prepare_cb</function> callback is called to prepare
+ a previously streamed transaction as part of a two phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct
LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-stream-commit-prepared">
+ <title>Stream Commit Prepared Callback</title>
+ <para>
+ The <function>stream_commit_prepared_cb</function> callback is
called to commit prepared
+ a previously streamed transaction as part of a two phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct
LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-stream-abort-prepared">
+ <title>Stream Abort Prepared Callback</title>
+ <para>
+ The <function>stream_abort_prepared_cb</function> callback is called
to abort prepared
+ a previously streamed transaction as part of a two phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortPreparedCB) (struct
LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>

1. Everywhere it says "two phase" commit should be consistently
replaced to say "two-phase" commit (with the hyphen)

2. Search for "abort_lsn" parameter. It seems to be overused
(cut/paste error) even when the API is unrelated to abort

3. 48.6.4.17 and 48.6.4.18
Is this wording ok? Is the word "prepared" even necessary here?
- "... called to commit prepared a previously streamed transaction ..."
- "... called to abort prepared a previously streamed transaction ..."

;

COMMENT
Section 48.9
@@ -1017,9 +1061,13 @@ OutputPluginWrite(ctx, true);
 When streaming an in-progress transaction, the changes (and messages) are
 streamed in blocks demarcated by <function>stream_start_cb</function>
 and <function>stream_stop_cb</function> callbacks. Once all the decoded
- changes are transmitted, the transaction is committed using the
- <function>stream_commit_cb</function> callback (or possibly aborted using
- the <function>stream_abort_cb</function> callback).
+ changes are transmitted, the transaction can be committed using the
+ the <function>stream_commit_cb</function> callback

"two phase" --> "two-phase"

~

Also, Missing period on end of sentence.
"or aborted using the stream_abort_prepared_cb" --> "or aborted using
the stream_abort_prepared_cb."

;

==========
Patch V6-0004, File: src/backend/replication/logical/logical.c
==========

COMMENT
Line 84
@@ -81,6 +81,12 @@ static void stream_stop_cb_wrapper(ReorderBuffer
*cache, ReorderBufferTXN *txn,
 XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
 XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void stream_commit_prepared_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void stream_abort_prepared_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);

The 3rd parameter is always "commit_lsn", even for APIs unrelated to
commit, so this seems like a cut/paste error.

;

COMMENT
Line 1246
@@ -1231,6 +1243,105 @@ stream_abort_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
 }

 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;

Misnamed parameter "commit_lsn" ?

~

Also, Line 1272
There seems to be some integrity checking missing to make sure the
callback is not NULL.
A NULL callback will cause a NULL pointer dereference when the wrapper
attempts to call it.

;

COMMENT
Line 1305
+static void
+stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

There seems to be some integrity checking missing to make sure the
callback is not NULL.
A NULL callback will cause a NULL pointer dereference when the wrapper
attempts to call it.

;

COMMENT
Line 1312
+static void
+stream_abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

Misnamed parameter "commit_lsn" ?

~

Also, Line 1338
There seems to be some integrity checking missing to make sure the
callback is not NULL.
A NULL callback will cause a NULL pointer dereference when the wrapper
attempts to call it.


==========
Patch V6-0004, File: src/backend/replication/logical/reorderbuffer.c
==========

COMMENT
Line 2684
@@ -2672,15 +2681,31 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb,
TransactionId xid,
 txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
 strcpy(txn->gid, gid);

- if (is_commit)
+ if (rbtxn_is_streamed(txn))
 {
- txn->txn_flags |= RBTXN_COMMIT_PREPARED;
- rb->commit_prepared(rb, txn, commit_lsn);
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;

The setting/checking of the flags could be refactored if you wanted to
write less code:
e.g.
if (is_commit)
 txn->txn_flags |= RBTXN_COMMIT_PREPARED;
else
 txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;

if (rbtxn_is_streamed(txn) && rbtxn_commit_prepared(txn))
 rb->stream_commit_prepared(rb, txn, commit_lsn);
else if (rbtxn_is_streamed(txn) && rbtxn_rollback_prepared(txn))
 rb->stream_abort_prepared(rb, txn, commit_lsn);
else if (rbtxn_commit_prepared(txn))
 rb->commit_prepared(rb, txn, commit_lsn);
else if (rbtxn_rollback_prepared(txn))
 rb->abort_prepared(rb, txn, commit_lsn);
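
For illustration only, here is a self-contained sketch of the same idea using stand-in types instead of the real ReorderBuffer structures. The MOCK_* flag values, the MockTxn struct and the finish_prepared() helper are all hypothetical; the sketch only shows that setting the flag once and then choosing on (streamed, commit/rollback) selects exactly one of the four callbacks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values; not the ones defined in reorderbuffer.h. */
#define MOCK_COMMIT_PREPARED   0x0010
#define MOCK_ROLLBACK_PREPARED 0x0020
#define MOCK_IS_STREAMED       0x0080

/* Hypothetical stand-in for ReorderBufferTXN. */
typedef struct MockTxn
{
    uint32_t txn_flags;
} MockTxn;

/* Which callback the dispatch would invoke. */
typedef enum MockCallback
{
    CB_COMMIT_PREPARED,
    CB_ABORT_PREPARED,
    CB_STREAM_COMMIT_PREPARED,
    CB_STREAM_ABORT_PREPARED
} MockCallback;

static MockCallback
finish_prepared(MockTxn *txn, bool is_commit)
{
    bool streamed = (txn->txn_flags & MOCK_IS_STREAMED) != 0;

    /* Record the outcome once, independent of streaming. */
    txn->txn_flags |= is_commit ? MOCK_COMMIT_PREPARED
                                : MOCK_ROLLBACK_PREPARED;

    /* Then pick the streamed or non-streamed variant. */
    if (streamed)
        return is_commit ? CB_STREAM_COMMIT_PREPARED
                         : CB_STREAM_ABORT_PREPARED;
    return is_commit ? CB_COMMIT_PREPARED : CB_ABORT_PREPARED;
}
```

Whether this reads better than the nested if/else in the patch is a matter of taste; the behaviour is intended to be the same.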

;

==========
Patch V6-0004, File: src/include/replication/output_plugin.h
==========

COMMENT
Line 171
@@ -157,6 +157,33 @@ typedef void (*LogicalDecodeStreamAbortCB)
(struct LogicalDecodingContext *ctx,
 XLogRecPtr abort_lsn);

 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit and only when
+ * two-phased commits are supported
+ */

1. Missing period in all these comments.

2. Is the part that says "and only where two-phased commits are
supported" necessary to say? It seems redundant since the comment already
says it is called as part of a two-phase commit.

;

==========
Patch V6-0004, File: src/include/replication/reorderbuffer.h
==========

COMMENT
Line 467
@@ -466,6 +466,24 @@ typedef void (*ReorderBufferStreamAbortCB) (
 ReorderBufferTXN *txn,
 XLogRecPtr abort_lsn);

+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);

Cut/paste error - repeated same comment 3 times?


[END]
#46Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#45)

Hello Ajin.

I have gone through the v6 patch changes and have a list of review
comments below.

Apologies for the length of this email - I know that many of the
following comments are trivial, but I figured I should either just
ignore everything cosmetic, or list everything regardless. I chose the
latter.

There may be some duplication where the same review comment is written
for multiple files and/or where the same file is in your multiple
patches.

Kind Regards.
Peter Smith
Fujitsu Australia

[BEGIN]

==========
Patch V6-0001, File: contrib/test_decoding/expected/prepared.out (so
prepared.sql also)
==========

COMMENT
Line 30 - The INSERT INTO test_prepared1 VALUES (2); is kind of
strange because it is not really part of the prior test nor the
following test. Maybe it would be better to have a comment describing
the purpose of this isolated INSERT and to also consume the data from
the slot so it does not get jumbled with the data of the following
(abort) test.

;

COMMENT
Line 53 - Same comment for this test INSERT INTO test_prepared1 VALUES
(4); It has nothing really to do with either the prior (abort)
test or the following (ddl) test.

;

COMMENT
Line 60 - Seems to check which locks are held for the test_prepared_1
table while the transaction is in progress. Maybe it would be better
to have more comments describing what is expected here and why.

;

COMMENT
Line 88 - There is a comment in the test saying "-- We should see '7'
before '5' in our results since it commits first." but I did not see
any test code that actually verifies that happens.

;

QUESTION
Line 120 - I did not really understand the SQL checking the pg_class.
I expected this would be checking table 'test_prepared1' instead. Can
you explain it?
SELECT 'pg_class' AS relation, locktype, mode
FROM pg_locks
WHERE locktype = 'relation'
AND relation = 'pg_class'::regclass;
relation | locktype | mode
----------+----------+------
(0 rows)

;

QUESTION
Line 139 - SET statement_timeout = '1s'; Is 1 second short enough
here for this test, or might it be that these statements would be
completed in less than one second anyhow?

;

QUESTION
Line 163 - How is this testing a SAVEPOINT? Or is it only to check
that the SAVEPOINT command is not part of the replicated changes?

;

COMMENT
Line 175 - Missing underscore in comment. Code requires also underscore:
"nodecode" --> "_nodecode"

==========
Patch V6-0001, File: contrib/test_decoding/test_decoding.c
==========

COMMENT
Line 43
@@ -36,6 +40,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ TransactionId check_xid; /* track abort of this txid */
} TestDecodingData;

The "check_xid" seems a meaningless name. Check what?
IIUC maybe should be something like "check_xid_aborted"

;

COMMENT
Line 105
@@ -88,6 +93,19 @@ static void
pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 ReorderBufferTXN *txn,
 int nrelations, Relation relations[],
 ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,

Remove extra blank line after these functions

;

COMMENT
Line 149
@@ -116,6 +134,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 cb->stream_change_cb = pg_decode_stream_change;
 cb->stream_message_cb = pg_decode_stream_message;
 cb->stream_truncate_cb = pg_decode_stream_truncate;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
+
 }

There is a confusing mix of terminology where sometimes things are
referred to as ROLLBACK/rollback and other times apparently the same
operation is referred to as ABORT/abort. I do not know the root cause of
this mixture. IIUC maybe the internal functions and protocol generally
use the term "abort", whereas the SQL syntax is "ROLLBACK"... but
where those two terms collide in the middle it gets quite confusing.

At least I thought the names of the "callbacks" which get exposed to
the user (e.g. in the help) might be better if they would match the
SQL.
"abort_prepared_cb" --> "rollback_prepared_cb"

There are similar review comments like this below where the
alternating terms caused me some confusion.

~

Also, Remove the extra blank line before the end of the function.

;

COMMENT
Line 267
@@ -227,6 +252,42 @@ pg_decode_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 strVal(elem->arg), elem->defname)));
 }
+ else if (strcmp(elem->defname, "two-phase-commit") == 0)
+ {
+ if (elem->arg == NULL)
+ continue;

IMO the "check-xid" code might be better rearranged so the NULL check
is first instead of if/else.
e.g.
if (elem->arg == NULL)
ereport(FATAL,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("check-xid needs an input value")));
~

Also, is it really supposed to be FATAL instead of ERROR? That is not
the same as the other surrounding code.

;

COMMENT
Line 296
if (data->check_xid <= 0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("Specify positive value for parameter \"%s\","
" you specified \"%s\"",
elem->defname, strVal(elem->arg))));

The code checking for <= 0 seems over-complicated. Because conversion
was using strtoul() I fail to see how this can ever be < 0. Wouldn't
it be easier to simply test the result of the strtoul() function?

BEFORE: if (errno == EINVAL || errno == ERANGE)
AFTER: if (data->check_xid == 0)

~

Also, should this be FATAL? Everything else similar is ERROR.
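
To make the strtoul() point concrete, here is a rough standalone sketch of how such an option value could be validated. The parse_xid_option() helper is hypothetical and is not code from the patch; it assumes a 32-bit xid and treats 0 (the invalid transaction id) as "not specified". Note that strtoul() itself accepts a leading minus sign and wraps the result, which is why the sketch rejects anything that does not start with a digit instead of testing for a negative number afterwards.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a "check-xid"-style option value.
 * Returns true and stores the xid on success, false otherwise.
 */
static bool
parse_xid_option(const char *value, uint32_t *xid_out)
{
    char          *end;
    unsigned long  parsed;

    /* Missing value, empty string, sign, or leading whitespace. */
    if (value == NULL || !isdigit((unsigned char) value[0]))
        return false;

    errno = 0;
    parsed = strtoul(value, &end, 10);

    /* Reject overflow, trailing garbage, and values wider than 32 bits. */
    if (errno != 0 || *end != '\0' || parsed > UINT32_MAX)
        return false;

    /* 0 is the invalid transaction id, so it cannot be a real xid. */
    if (parsed == 0)
        return false;

    *xid_out = (uint32_t) parsed;
    return true;
}
```

In the plugin a false return would presumably become an ereport(ERROR, ...) like the surrounding option handling.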

;

COMMENT
(general)
I don't recall seeing any of these decoding options (e.g.
"two-phase-commit", "check-xid") documented anywhere.
So how can a user even know these options exist so they can use them?
Perhaps options should be described on this page?
https://www.postgresql.org/docs/13/functions-admin.html#FUNCTIONS-REPLICATION

;

COMMENT
(general)
"check-xid" is a meaningless option name. Maybe something like
"check-xid-aborted" is more useful?
Suggest changing the member, the option, and the error messages to
match some better name.

;

COMMENT
Line 314
@@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
}

ctx->streaming &= enable_streaming;
+ ctx->enable_twophase &= enable_2pc;
}

The "ctx->enable_twophase" is inconsistent naming with the
"ctx->streaming" member.
"enable_twophase" --> "twophase"

;

COMMENT
Line 374
@@ -297,6 +359,94 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}

+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Remove the extra preceding blank line.

~

I did not find anything in the help about "_nodecode". Should it be
there or is this a deliberately undocumented feature?
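
For readers who have not opened the patch, the filtering rule described in the function comment amounts to a substring test on the GID. A minimal sketch, assuming that is all the filter does (gid_is_filtered() is a hypothetical name, and the NULL guard is my addition):

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for the plugin's filter decision.
 * Returns true when the prepared transaction should NOT be decoded.
 */
static bool
gid_is_filtered(const char *gid)
{
    /* The leading underscore is part of the marker. */
    return gid != NULL && strstr(gid, "_nodecode") != NULL;
}
```

So a GID such as 'test_prepared_nodecode' would be skipped, while 'nodecode' on its own would not, which is why the missing underscore noted for Line 175 of prepared.out matters.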

;

QUESTION
Line 440
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,

Is this a wrong comment
"ABORT PREPARED" --> "ROLLBACK PREPARED" ??

;

COMMENT
Line 620
@@ -455,6 +605,22 @@ pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;

+ /* if check_xid is specified */
+ if (TransactionIdIsValid(data->check_xid))
+ {
+ elog(LOG, "waiting for %u to abort", data->check_xid);
+ while (TransactionIdIsInProgress(dat

The check_xid seems a meaningless name, and the comment "/* if
check_xid is specified */" was not helpful either.
IIUC the purpose of this is to check that the nominated xid is always
rolled back.
So the appropriate name may be more like "check-xid-aborted".

;

==========
Patch V6-0001, File: doc/src/sgml/logicaldecoding.sgml
==========

COMMENT/QUESTION
Section 48.6.1
@@ -387,6 +387,10 @@ typedef struct OutputPluginCallbacks
 LogicalDecodeTruncateCB truncate_cb;
 LogicalDecodeCommitCB commit_cb;
 LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;

Confused by the mixing of terminologies "abort" and "rollback".
Why is it LogicalDecodeAbortPreparedCB instead of
LogicalDecodeRollbackPreparedCB?
Why is it abort_prepared_cb instead of rollback_prepared_cb?

I thought everything the user sees should be ROLLBACK/rollback (like
the SQL) regardless of what the internal functions might be called.

;

COMMENT
Section 48.6.1
The begin_cb, change_cb and commit_cb callbacks are required, while
startup_cb, filter_by_origin_cb, truncate_cb, and shutdown_cb are
optional. If truncate_cb is not set but a TRUNCATE is to be decoded,
the action will be ignored.

The 1st paragraph beneath the typedef does not mention the newly added
callbacks to say if they are required or optional.

;

COMMENT
Section 48.6.4.5
Section 48.6.4.6
Section 48.6.4.7
@@ -578,6 +588,55 @@ typedef void (*LogicalDecodeCommitCB) (struct
LogicalDecodingContext *ctx,
</para>
</sect3>

+ <sect3 id="logicaldecoding-output-plugin-prepare">
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+<programlisting>

The wording and titles are a bit backwards compared to the others.
e.g. previously was "Transaction Begin" (not "Begin Transaction") and
"Transaction End" (not "End Transaction").

So, to follow the existing style consistently, IMO these new
titles (and wording) should change to:
- "Commit Prepared Transaction Callback" --> "Transaction Commit
Prepared Callback"
- "Rollback Prepared Transaction Callback" --> "Transaction Rollback
Prepared Callback"
- "whenever a commit prepared transaction has been decoded" -->
"whenever a transaction commit prepared has been decoded"
- "whenever a rollback prepared transaction has been decoded." -->
"whenever a transaction rollback prepared has been decoded."

;

==========
Patch V6-0001, File: src/backend/replication/logical/decode.c
==========

COMMENT
Line 74
@@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext
*ctx, XLogRecordBuffer *buf,
 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);

The 2nd line of DecodePrepare is misaligned by one space.

;

COMMENT
Line 321
@@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
 }
 break;
 case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
+ xl_xact_prepare *xlrec;
+ /* check that output plugin is capable of twophase decoding */

"twophase" --> "two-phase"

~

Also, add a blank line after the declarations.

;

==========
Patch V6-0001, File: src/backend/replication/logical/logical.c
==========

COMMENT
Line 249
@@ -225,6 +237,19 @@ StartupDecodingContext(List *output_plugin_options,
(ctx->callbacks.stream_message_cb != NULL) ||
(ctx->callbacks.stream_truncate_cb != NULL);

+ /*
+ * To support two phase logical decoding, we require
prepare/commit-prepare/abort-prepare
+ * callbacks. The filter-prepare callback is optional. We however
enable two phase logical
+ * decoding when at least one of the methods is enabled so that we
can easily identify
+ * missing methods.

The terminology is generally well known as "two-phase" (with the
hyphen) https://en.wikipedia.org/wiki/Two-phase_commit_protocol so
let's be consistent for all the patch code comments. Please search the
code and correct this in all places, even where I might have failed to
identify it.

"two phase" --> "two-phase"

;

COMMENT
Line 822
@@ -782,6 +807,111 @@ commit_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
}

 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)

"support 2 phase" --> "supports two-phase" in the comment

;

COMMENT
Line 844
Code condition seems strange and/or broken.
if (ctx->enable_twophase && ctx->callbacks.prepare_cb == NULL)
Because if the flag is false then this condition is skipped.
But then if the callback was also NULL then attempting to call it to
"do the actual work" will dereference a NULL pointer.
~

Also, I wonder should this check be the first thing in this function?
Because if it fails does it even make sense that all the errcallback
code was set up?
E.g errcallback.arg potentially is left pointing to a stack variable
on a stack that no longer exists.
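
A rough sketch of the ordering I have in mind, using mock types only. Nothing here is the real logical.c code: MockCtx, MockErrContext, mock_context_stack and mock_prepare_wrapper() are hypothetical, and a bool return stands in for ereport(ERROR), which in the backend does not return. The point is just that the callback is validated unconditionally and before anything is pushed that points at a local variable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of an error-context stack entry. */
typedef struct MockErrContext
{
    struct MockErrContext *previous;
    const char            *name;
} MockErrContext;

static MockErrContext *mock_context_stack = NULL;

typedef void (*MockPrepareCB) (int xid);

/* Hypothetical stand-in for the decoding context. */
typedef struct MockCtx
{
    MockPrepareCB prepare_cb;
} MockCtx;

/* Sample callback so the sketch can be exercised. */
static int mock_last_xid = 0;

static void
mock_record_xid(int xid)
{
    mock_last_xid = xid;
}

static bool
mock_prepare_wrapper(MockCtx *ctx, int xid)
{
    MockErrContext errcallback;

    /* Validate first: nothing has been pushed yet, so bailing out is safe. */
    if (ctx->prepare_cb == NULL)
        return false;

    /* Only now push an entry that lives in this stack frame. */
    errcallback.name = "prepare";
    errcallback.previous = mock_context_stack;
    mock_context_stack = &errcallback;

    ctx->prepare_cb(xid);

    /* Pop before returning so the stack never points at a dead frame. */
    mock_context_stack = errcallback.previous;
    return true;
}
```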

;

COMMENT
Line 857
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

"support 2 phase" --> "supports two-phase" in the comment

~

Also, Same potential trouble with the condition:
if (ctx->enable_twophase && ctx->callbacks.commit_prepared_cb == NULL)
Same as previously asked. Should this check be first thing in this function?

;

COMMENT
Line 892
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

"support 2 phase" --> "supports two-phase" in the comment

~

Same potential trouble with the condition:
if (ctx->enable_twophase && ctx->callbacks.abort_prepared_cb == NULL)
Same as previously asked. Should this check be the first thing in this function?

;

COMMENT
Line 1013
@@ -858,6 +988,51 @@ truncate_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}

+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)

Fix wording in comment:
"twophase" --> "two-phase transactions"
"twophase transactions" --> "two-phase transactions"

==========
Patch V6-0001, File: src/backend/replication/logical/reorderbuffer.c
==========

COMMENT
Line 255
@@ -251,7 +251,8 @@ static Size
ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb,
ReorderBufferTXN *txn,
 char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb,
ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+ bool txn_prepared);

The alignment is inconsistent. One more space needed before "bool txn_prepared"

;

COMMENT
Line 417
@@ -413,6 +414,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
}

 /* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }

Should add the blank line before this new code, as it was before.

;

COMMENT
Line 1564
@@ -1502,12 +1561,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
}

 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them. Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either
after streaming or
+ * after a PREPARE.

typo "snapshots.If" -> "snapshots. If"

;

COMMENT/QUESTION
Line 1590
@@ -1526,7 +1587,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);

- ReorderBufferTruncateTXN(rb, subtxn);
+ ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 }

There are some code paths here where I did not understand how they match
the comments.
Because this function is recursive it seems that it may be called
where the 2nd parameter txn is a sub-transaction.

But then this seems at odds with some of the other code comments of
this function which are processing the txn without ever testing whether
it is really toplevel or not:

e.g. Line 1593 "/* cleanup changes in the toplevel txn */"
e.g. Line 1632 "They are always stored in the toplevel transaction."

;

COMMENT
Line 1644
@@ -1560,9 +1621,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
 * about the toplevel xact (we send the XID in all messages), but we never
 * stream XIDs of empty subxacts.
 */
- if ((!txn->toptxn) || (txn->nentries_mem != 0))
+ if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 txn->txn_flags |= RBTXN_IS_STREAMED;

+ if (txn_prepared)

/* remove the change from it's containing list */
typo "it's" --> "its"

;

QUESTION
Line 1977
@@ -1880,7 +1965,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
 ReorderBufferChange *specinsert)
 {
 /* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn);
+ ReorderBufferTruncateTXN(rb, txn, false);

How do you know the 3rd parameter - i.e. txn_prepared - should be
hardwired false here?
e.g. I thought that maybe rbtxn_prepared(txn) can be true here.

;

COMMENT
Line 2345
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
break;
}
}
-
/*

Looks like accidental blank line deletion. This should be put back how it was

;

COMMENT/QUESTION
Line 2374
@@ -2278,7 +2362,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
 }
 }
 else
- rb->commit(rb, txn, commit_lsn);
+ {
+ /*
+ * Call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).

"twophase" --> "two-phase"

~

Also, I was confused by the apparent assumption of exclusiveness of
streaming and 2PC...
e.g. what if streaming AND 2PC then it won't do rb->prepare()

;

QUESTION
Line 2424
@@ -2319,11 +2412,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
 */
 if (streaming)
 {
- ReorderBufferTruncateTXN(rb, txn);
+ ReorderBufferTruncateTXN(rb, txn, false);

/* Reset the CheckXidAlive */
CheckXidAlive = InvalidTransactionId;
}
+ else if (rbtxn_prepared(txn))

I was confused by the exclusiveness of streaming/2PC.
e.g. what if streaming AND 2PC at same time - how can you pass false
as 3rd param to ReorderBufferTruncateTXN?

;

COMMENT
Line 2463
@@ -2352,17 +2451,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,

 /*
 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
- * abort of the (sub)transaction we are streaming. We need to do the
+ * abort of the (sub)transaction we are streaming or preparing. We
need to do the
 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 */

"twoi phase" --> "two-phase"

;

QUESTIONS
Line 2482
@@ -2370,10 +2470,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
errdata = NULL;
curtxn->concurrent_abort = true;

- /* Reset the TXN so that it is allowed to stream remaining data. */
- ReorderBufferResetTXN(rb, txn, snapshot_now,
- command_id, prev_lsn,
- specinsert);
+ /* If streaming, reset the TXN so that it is allowed to stream
remaining data. */
+ if (streaming)

Re: /* If streaming, reset the TXN so that it is allowed to stream
remaining data. */
I was confused by the exclusiveness of streaming/2PC.
Is it not possible for the streaming flag and rbtxn_prepared(txn) to be
true at the same time?

~

elog(LOG, "stopping decoding of %s (%u)",
txn->gid[0] != '\0'? txn->gid:"", txn->xid);

Is this a safe operation, or do you also need to test txn->gid is not NULL?
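
If gid can be NULL at this point (I have not confirmed whether it can), a small helper along these lines would make the log call safe either way. gid_for_log() is a hypothetical name, not something in the patch:

```c
#include <stddef.h>

/*
 * Hypothetical helper: return a printable GID for log messages.
 * Handles both a NULL pointer and an empty string.
 */
static const char *
gid_for_log(const char *gid)
{
    /* Test the pointer before looking at the first character. */
    return (gid != NULL && gid[0] != '\0') ? gid : "";
}
```

The elog() could then pass gid_for_log(txn->gid) for the %s argument.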

;

COMMENT
Line 2606
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,

"twophase" --> "two-phase"

;

QUESTION
Line 2655
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,

"This is used to handle COMMIT/ABORT PREPARED"
Should that say "COMMIT/ROLLBACK PREPARED"?

;

COMMENT
Line 2668

"Anyways, 2PC transactions" --> "Anyway, two-phase transactions"

;

COMMENT
Line 2765
@@ -2495,7 +2731,13 @@ ReorderBufferAbort(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
/* cosmetic... */
txn->final_lsn = lsn;

- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *

Remove the blank between the comment and code.

==========
Patch V6-0001, File: src/include/replication/logical.h
==========

COMMENT
Line 89

"two phase" -> "two-phase"

;

COMMENT
Line 89

For consistency with the previous member naming really the new member
should just be called "twophase" rather than "enable_twophase"

;

==========
Patch V6-0001, File: src/include/replication/output_plugin.h
==========

QUESTION
Line 106

As previously asked, why is the callback function/typedef referred to as
AbortPrepared instead of RollbackPrepared?
It does not match the SQL and the function comment, and seems only to
add some unnecessary confusion.

;

==========
Patch V6-0001, File: src/include/replication/reorderbuffer.h
==========

QUESTION
Line 116
@@ -162,9 +163,13 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT 0x0002
 #define RBTXN_IS_SERIALIZED 0x0004
-#define RBTXN_IS_STREAMED 0x0008
-#define RBTXN_HAS_TOAST_INSERT 0x0010
-#define RBTXN_HAS_SPEC_INSERT 0x0020
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_IS_STREAMED 0x0080
+#define RBTXN_HAS_TOAST_INSERT 0x0100
+#define RBTXN_HAS_SPEC_INSERT 0x0200

I was wondering why when adding new flags, some of the existing flag
masks were also altered.
I am assuming this is ok because they are never persisted but are only
used in the protocol (??)

;

COMMENT
Line 226
@@ -218,6 +223,15 @@ typedef struct ReorderBufferChange
((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
)

+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+

Probably all the "txn->txn_flags" here might be more safely written
with parentheses in the macro like "(txn)->txn_flags".
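
A small standalone sketch of why the parentheses matter. The MOCK_PREPARE value, MockTxn and mock_rbtxn_prepared() are hypothetical stand-ins. With the unparenthesized form, an argument such as "txns + 1" would expand to "txns + 1->txn_flags", which does not compile; the parenthesized form accepts any pointer expression. Comparing against 0 also gives a clean boolean, in the same style as the existing rbtxn_is_streamed().

```c
#include <stdint.h>

/* Hypothetical flag value; not the one defined in reorderbuffer.h. */
#define MOCK_PREPARE 0x0008

/* Hypothetical stand-in for ReorderBufferTXN. */
typedef struct MockTxn
{
    uint32_t txn_flags;
} MockTxn;

/* Argument is parenthesized and the result is normalized to 0 or 1. */
#define mock_rbtxn_prepared(txn) \
( \
    (((txn)->txn_flags) & MOCK_PREPARE) != 0 \
)
```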

~

Also, start all comments with a capital letter. And what is the meaning
of "in the meanwhile"?

;

COMMENT
Line 410
@@ -390,6 +407,39 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);

The format is inconsistent with all other callback signatures here,
where the 1st arg was on the same line as the typedef.

;

COMMENT
Line 440-442

Excessive blank lines following this change?

;

COMMENT
Line 638
@@ -571,6 +631,15 @@ void
ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid,
XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);

+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);

Not aligned consistently with other function prototypes.

;

==========
Patch V6-0003, File: src/backend/access/transam/twophase.c
==========

COMMENT
Line 551
@@ -548,6 +548,37 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
}

 /*
+ * LookupGXact
+ * Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)

There is potential to refactor/simplify this code:
e.g.

bool
LookupGXact(const char *gid)
{
int i;
bool found = false;

LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
{
GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
/* Ignore not-yet-valid GIDs */
if (gxact->valid && strcmp(gxact->gid, gid) == 0)
{
found = true;
break;
}
}
LWLockRelease(TwoPhaseStateLock);
return found;
}

;

==========
Patch V6-0003, File: src/backend/replication/logical/proto.c
==========

COMMENT
Line 86
@@ -72,12 +72,17 @@ logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data)
*/
void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
- XLogRecPtr commit_lsn)

Since now the flags are used the code comment is wrong.
"/* send the flags field (unused for now) */"

;

COMMENT
Line 129
@@ -106,6 +115,77 @@ logicalrep_read_commit(StringInfo in,
LogicalRepCommitData *commit_data)
}

 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,

"2PC transactions" --> "two-phase commit transactions"

;

COMMENT
Line 133

Assert(strlen(txn->gid) > 0);
Shouldn't that assertion also check that txn->gid is not NULL (to
prevent a NULL pointer dereference in case gid was NULL)?

;
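
A minimal sketch of the stricter check, written as standalone C rather than PostgreSQL code. Both gid_is_usable and demo_gid_length are hypothetical names used only for illustration; in the patch the same condition would presumably go inside the existing Assert().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): true only when gid is a
 * non-NULL, non-empty string. The pointer is tested first, so the string
 * is never read through a NULL pointer.
 */
static bool
gid_is_usable(const char *gid)
{
    return gid != NULL && gid[0] != '\0';
}

/*
 * Hypothetical caller, standing in for a function that is about to write
 * the GID: assert the precondition once, then use the string freely.
 */
static size_t
demo_gid_length(const char *gid)
{
    assert(gid_is_usable(gid));
    return strlen(gid);
}
```

With this shape a NULL gid trips the assertion in a debug build instead of crashing inside strlen().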

COMMENT
Line 177
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)

prepare_data->prepare_type = flags;
This code may be OK but it does seem a bit of an abuse of the flags.

e.g. Are they flags or are they really enum values?
e.g. And if they are effectively enums (it appears they are) then it
seems inconsistent that |= was used when they were previously
assigned.

;

==========
Patch V6-0003, File: src/backend/replication/logical/worker.c
==========

COMMENT
Line 757
@@ -749,6 +753,141 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}

+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+ Assert(prepare_data->prepare_lsn == remote_final_lsn);

Missing function comment to say this is called from apply_handle_prepare.

;

COMMENT
Line 798
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)

Missing function comment to say this is called from apply_handle_prepare.

;

COMMENT
Line 824
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)

Missing function comment to say this is called from apply_handle_prepare.

==========
Patch V6-0003, File: src/backend/replication/pgoutput/pgoutput.c
==========

COMMENT
Line 50
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);

The parameter indentation (2nd lines) does not match everything else
in this context.

;

COMMENT
Line 152
@@ -143,6 +149,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 cb->change_cb = pgoutput_change;
 cb->truncate_cb = pgoutput_truncate;
 cb->commit_cb = pgoutput_commit_txn;
+
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;

Remove the unnecessary blank line.

;

QUESTION
Line 386
@@ -373,7 +383,49 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
OutputPluginUpdateProgress(ctx);

 OutputPluginPrepareWrite(ctx, true);
- logicalrep_write_commit(ctx->out, txn, commit_lsn);
+ logicalrep_write_commit(ctx->out, txn, commit_lsn, true);

Is the is_commit parameter of logicalrep_write_commit ever passed as false?
If yes, where?
If no, then what is the point of it?

;

COMMENT
Line 408
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,

Since this function is identical to pgoutput_prepare_txn it might be
better to either
1. just leave this as a wrapper to delegate to that function
2. remove this one entirely and assign the callback to the common
pgoutput_prepare_txn

;

COMMENT
Line 419
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Since this function is identical to pgoutput_prepare_txn it might be
better to either
1. just leave this as a wrapper to delegate to that function
2. remove this one entirely and assign the callback to the common
pgoutput_prepare_txn

;

COMMENT
Line 419
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Shouldn't this comment say "ROLLBACK PREPARED"?

;

==========
Patch V6-0003, File: src/include/replication/logicalproto.h
==========

QUESTION
Line 101
@@ -87,20 +87,55 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;

+/* Commit (and abort) information */

#define LOGICALREP_IS_ABORT 0x02
Is there a good reason why this is not called:
#define LOGICALREP_IS_ROLLBACK 0x02

;

COMMENT
Line 105

((flags == LOGICALREP_IS_COMMIT) || (flags == LOGICALREP_IS_ABORT))

Macros would be safer if flags are in parentheses
(((flags) == LOGICALREP_IS_COMMIT) || ((flags) == LOGICALREP_IS_ABORT))

;

COMMENT
Line 115

Unexpected whitespace for the typedef
"} LogicalRepPrepareData;"

;

COMMENT
Line 122
/* prepare can be exactly one of PREPARE, [COMMIT|ABORT] PREPARED*/
#define PrepareFlagsAreValid(flags) \
((flags == LOGICALREP_IS_PREPARE) || \
(flags == LOGICALREP_IS_COMMIT_PREPARED) || \
(flags == LOGICALREP_IS_ROLLBACK_PREPARED))

There is a confusing mixture of ABORT and ROLLBACK terms in the macros and comments
"[COMMIT|ABORT] PREPARED" --> "[COMMIT|ROLLBACK] PREPARED"

~

Also, it would be safer if flags are in parentheses
(((flags) == LOGICALREP_IS_PREPARE) || \
((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))

;
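
To illustrate why the parentheses matter, here is a small standalone sketch. The DEMO_* names and their values are assumptions made up for this example (the real definitions are in logicalproto.h and may differ), and demo_low_bits_valid is a hypothetical caller that passes a compound expression to the macro.

```c
#include <assert.h>

/* Illustrative values only; not the definitions from the patch. */
#define DEMO_IS_PREPARE            0x01
#define DEMO_IS_COMMIT_PREPARED    0x02
#define DEMO_IS_ROLLBACK_PREPARED  0x04
#define DEMO_PREPARE_MASK          0x07

/*
 * Each use of the argument is parenthesized. Without that, an argument
 * such as "raw & DEMO_PREPARE_MASK" would expand to
 * "raw & DEMO_PREPARE_MASK == DEMO_IS_PREPARE", and because == binds
 * tighter than &, the comparison would be evaluated first.
 */
#define DemoPrepareFlagsAreValid(flags) \
    (((flags) == DEMO_IS_PREPARE) || \
     ((flags) == DEMO_IS_COMMIT_PREPARED) || \
     ((flags) == DEMO_IS_ROLLBACK_PREPARED))

/* Validate only the low three bits of a raw value. */
static int
demo_low_bits_valid(unsigned int raw)
{
    return DemoPrepareFlagsAreValid(raw & DEMO_PREPARE_MASK);
}
```

The macro accepts exactly one of the three values, so a masked value of 0x03 (two bits set) is rejected, which is also why plain assignment reads more naturally than |= for these fields.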

==========
Patch V6-0003, File: src/test/subscription/t/020_twophase.pl
==========

COMMENT
Line 131 - # check inserts are visible

Isn't this supposed to be checking for rows 12 and 13, instead of 11 and 12?

;

==========
Patch V6-0004, File: contrib/test_decoding/test_decoding.c
==========

COMMENT
Line 81
@@ -78,6 +78,15 @@ static void
pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 ReorderBufferTXN *txn,
 XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static

All these functions have a 3rd parameter called commit_lsn. Even
though the functions are not commit related. It seems like a cut/paste
error.

;

COMMENT
Line 142
@@ -130,6 +139,9 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 cb->stream_start_cb = pg_decode_stream_start;
 cb->stream_stop_cb = pg_decode_stream_stop;
 cb->stream_abort_cb = pg_decode_stream_abort;
+ cb->stream_prepare_cb = pg_decode_stream_prepare;
+ cb->stream_commit_prepared_cb = pg_decode_stream_commit_prepared;
+ cb->stream_abort_prepared_cb = pg_decode_stream_abort_prepared;
 cb->stream_commit_cb = pg_decode_stream_commit;

Can the "cb->stream_abort_prepared_cb" be changed to
"cb->stream_rollback_prepared_cb"?

;

COMMENT
Line 827
@@ -812,6 +824,78 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
}

 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_pr

The commit_lsn (3rd parameter) is unused and seems like a cut/paste name error.

;

COMMENT
Line 875
+pg_decode_stream_abort_prepared(LogicalDecodingContext *ctx,

The commit_lsn (3rd parameter) is unused and seems like a cut/paste name error.

;

==========
Patch V6-0004, File: doc/src/sgml/logicaldecoding.sgml
==========

COMMENT
48.6.1
@@ -396,6 +396,9 @@ typedef struct OutputPluginCallbacks
 LogicalDecodeStreamStartCB stream_start_cb;
 LogicalDecodeStreamStopCB stream_stop_cb;
 LogicalDecodeStreamAbortCB stream_abort_cb;
+ LogicalDecodeStreamPrepareCB stream_prepare_cb;
+ LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+ LogicalDecodeStreamAbortPreparedCB stream_abort_prepared_cb;

Same question as in the previous review comments: why use the
terminology "abort" instead of "rollback"?

;

COMMENT
48.6.1
@@ -418,7 +421,9 @@ typedef void (*LogicalOutputPluginInit) (struct
OutputPluginCallbacks *cb);
 in-progress transactions. The <function>stream_start_cb</function>,
 <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
 <function>stream_commit_cb</function> and <function>stream_change_cb</function>
- are required, while <function>stream_message_cb</function> and
+ are required, while <function>stream_message_cb</function>,
+ <function>stream_prepare_cb</function>,
<function>stream_commit_prepared_cb</function>,
+ <function>stream_abort_prepared_cb</function>,

Missing "and".
... "stream_abort_prepared_cb, stream_truncate_cb are optional." -->
"stream_abort_prepared_cb, and stream_truncate_cb are optional."

;

COMMENT
Section 48.6.4.16
Section 48.6.4.17
Section 48.6.4.18
@@ -839,6 +844,45 @@ typedef void (*LogicalDecodeStreamAbortCB)
(struct LogicalDecodingContext *ctx,
</para>
</sect3>

+ <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+ <title>Stream Prepare Callback</title>
+ <para>
+ The <function>stream_prepare_cb</function> callback is called to prepare
+ a previously streamed transaction as part of a two phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct
LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-stream-commit-prepared">
+ <title>Stream Commit Prepared Callback</title>
+ <para>
+ The <function>stream_commit_prepared_cb</function> callback is
called to commit prepared
+ a previously streamed transaction as part of a two phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct
LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-stream-abort-prepared">
+ <title>Stream Abort Prepared Callback</title>
+ <para>
+ The <function>stream_abort_prepared_cb</function> callback is called
to abort prepared
+ a previously streamed transaction as part of a two phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortPreparedCB) (struct
LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>

1. Everywhere it says "two phase" commit should be consistently
replaced to say "two-phase" commit (with the hyphen)

2. Search for "abort_lsn" parameter. It seems to be overused
(cut/paste error) even when the API is unrelated to abort

3. 48.6.4.17 and 48.6.4.18
Is this wording ok? Is the word "prepared" even necessary here?
- "... called to commit prepared a previously streamed transaction ..."
- "... called to abort prepared a previously streamed transaction ..."

;

COMMENT
Section 48.9
@@ -1017,9 +1061,13 @@ OutputPluginWrite(ctx, true);
 When streaming an in-progress transaction, the changes (and messages) are
 streamed in blocks demarcated by <function>stream_start_cb</function>
 and <function>stream_stop_cb</function> callbacks. Once all the decoded
- changes are transmitted, the transaction is committed using the
- <function>stream_commit_cb</function> callback (or possibly aborted using
- the <function>stream_abort_cb</function> callback).
+ changes are transmitted, the transaction can be committed using the
+ the <function>stream_commit_cb</function> callback

"two phase" --> "two-phase"

~

Also, Missing period on end of sentence.
"or aborted using the stream_abort_prepared_cb" --> "or aborted using
the stream_abort_prepared_cb."

;

==========
Patch V6-0004, File: src/backend/replication/logical/logical.c
==========

COMMENT
Line 84
@@ -81,6 +81,12 @@ static void stream_stop_cb_wrapper(ReorderBuffer
*cache, ReorderBufferTXN *txn,
 XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
 XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void stream_commit_prepared_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void stream_abort_prepared_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);

The 3rd parameter is always "commit_lsn" even for API unrelated to
commit, so seems like cut/paste error.

;

COMMENT
Line 1246
@@ -1231,6 +1243,105 @@ stream_abort_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
}

 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;

Misnamed parameter "commit_lsn" ?

~

Also, Line 1272
There seems to be some integrity checking missing to make sure the
callback is not NULL.
A NULL callback will give NPE when the wrapper attempts to call it.

;

COMMENT
Line 1305
+static void
+stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

There seems to be some integrity checking missing to make sure the
callback is not NULL.
A NULL callback will give NPE when the wrapper attempts to call it.

;

COMMENT
Line 1312
+static void
+stream_abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

Misnamed parameter "commit_lsn" ?

~

Also, Line 1338
There seems to be some integrity checking missing to make sure the
callback is not NULL.
A NULL callback will give NPE when the wrapper attempts to call it.

==========
Patch V6-0004, File: src/backend/replication/logical/reorderbuffer.c
==========

COMMENT
Line 2684
@@ -2672,15 +2681,31 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb,
TransactionId xid,
txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
strcpy(txn->gid, gid);

- if (is_commit)
+ if (rbtxn_is_streamed(txn))
 {
- txn->txn_flags |= RBTXN_COMMIT_PREPARED;
- rb->commit_prepared(rb, txn, commit_lsn);
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;

The setting/checking of the flags could be refactored if you wanted to
write less code:
e.g.
if (is_commit)
    txn->txn_flags |= RBTXN_COMMIT_PREPARED;
else
    txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;

if (rbtxn_is_streamed(txn) && rbtxn_commit_prepared(txn))
    rb->stream_commit_prepared(rb, txn, commit_lsn);
else if (rbtxn_is_streamed(txn) && rbtxn_rollback_prepared(txn))
    rb->stream_abort_prepared(rb, txn, commit_lsn);
else if (rbtxn_commit_prepared(txn))
    rb->commit_prepared(rb, txn, commit_lsn);
else if (rbtxn_rollback_prepared(txn))
    rb->abort_prepared(rb, txn, commit_lsn);

;

==========
Patch V6-0004, File: src/include/replication/output_plugin.h
==========

COMMENT
Line 171
@@ -157,6 +157,33 @@ typedef void (*LogicalDecodeStreamAbortCB)
(struct LogicalDecodingContext *ctx,
XLogRecPtr abort_lsn);

 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit and only when
+ * two-phased commits are supported
+ */

1. Missing period in all these comments.

2. Is the part that says "and only when two-phased commits are
supported" necessary? It seems redundant since the comment already
says it is called as part of a two-phase commit.

;

==========
Patch V6-0004, File: src/include/replication/reorderbuffer.h
==========

COMMENT
Line 467
@@ -466,6 +466,24 @@ typedef void (*ReorderBufferStreamAbortCB) (
ReorderBufferTXN *txn,
XLogRecPtr abort_lsn);

+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);

Cut/paste error - repeated same comment 3 times?

[END]

#47Amit Kapila
amit.kapila16@gmail.com
In reply to: Nikhil Sontakke (#1)

On Tue, Oct 6, 2020 at 10:23 AM Peter.B.Smith@fujitsu.com
<Peter.B.Smith@fujitsu.com> wrote:

[BEGIN]

==========
Patch V6-0001, File: contrib/test_decoding/expected/prepared.out (so
prepared.sql also)
==========

COMMENT
Line 30 - The INSERT INTO test_prepared1 VALUES (2); is kind of
strange because it is not really part of the prior test nor the
following test. Maybe it would be better to have a comment describing
the purpose of this isolated INSERT and to also consume the data from
the slot so it does not get jumbled with the data of the following
(abort) test.

;

COMMENT
Line 53 - Same comment for this test INSERT INTO test_prepared1 VALUES
(4); It kind of has nothing really to do with either the prior (abort)
test nor the following (ddl) test.

;

COMMENT
Line 60 - Seems to check which locks are held for the test_prepared_1
table while the transaction is in progress. Maybe it would be better
to have more comments describing what is expected here and why.

;

COMMENT
Line 88 - There is a comment in the test saying "-- We should see '7'
before '5' in our results since it commits first." but I did not see
any test code that actually verifies that happens.

;

All the above comments are genuine and I think it is mostly because
the author has blindly modified the existing tests without completely
understanding the intent of the test. I suggest we write a completely
new regression file (decode_prepared.sql) for these and just copy
whatever is required from prepared.sql. Once we do that we might also
want to rename existing prepared.sql to decode_commit_prepared.sql or
something like that. I think modifying the existing test appears to be
quite ugly and also it is changing the intent of the existing tests.

QUESTION
Line 120 - I did not really understand the SQL checking the pg_class.
I expected this would be checking table 'test_prepared1' instead. Can
you explain it?
SELECT 'pg_class' AS relation, locktype, mode
FROM pg_locks
WHERE locktype = 'relation'
AND relation = 'pg_class'::regclass;
relation | locktype | mode
----------+----------+------
(0 rows)

;

Yes, I also think your expectation is correct and this should be on
'test_prepared_1'.

QUESTION
Line 139 - SET statement_timeout = '1s'; is 1 second short enough
here for this test, or might these statements complete in less than
one second anyhow?

;

Good question. I think we have to mention the reason why logical
decoding is not blocked while it needs to acquire a shared lock on the
table and the previous commands already held an exclusive lock on the
table. I am not sure if I am missing something but like you, it is not
clear to me as well what this test intends to do, so surely more
commentary is required.

QUESTION
Line 163 - How is this testing a SAVEPOINT? Or is it only to check
that the SAVEPOINT command is not part of the replicated changes?

;

It is more about testing that subtransactions will not create a problem
while decoding.

COMMENT
Line 175 - Missing underscore in comment. Code requires also underscore:
"nodecode" --> "_nodecode"

makes sense.

==========
Patch V6-0001, File: contrib/test_decoding/test_decoding.c
==========

COMMENT
Line 43
@@ -36,6 +40,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ TransactionId check_xid; /* track abort of this txid */
} TestDecodingData;

The "check_xid" seems a meaningless name. Check what?
IIUC maybe should be something like "check_xid_aborted"

;

COMMENT
Line 105
@@ -88,6 +93,19 @@ static void
pg_decode_stream_truncate(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
int nrelations, Relation relations[],
ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,

Remove extra blank line after these functions

;

The above two sound like reasonable suggestions.

COMMENT
Line 149
@@ -116,6 +134,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->stream_change_cb = pg_decode_stream_change;
cb->stream_message_cb = pg_decode_stream_message;
cb->stream_truncate_cb = pg_decode_stream_truncate;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
+
}

There is a confusing mix of terminology where sometimes things are
referred as ROLLBACK/rollback and other times apparently the same
operation is referred as ABORT/abort. I do not know the root cause of
this mixture. IIUC maybe the internal functions and protocol generally
use the term "abort", whereas the SQL syntax is "ROLLBACK"... but
where those two terms collide in the middle it gets quite confusing.

At least I thought the names of the "callbacks" which get exposed to
the user (e.g. in the help) might be better if they would match the
SQL.
"abort_prepared_cb" --> "rollback_prepared_cb"

This suggestion sounds reasonable. I think it is to entertain the case
where due to an error we need to roll back the transaction. I think it
is better if we use 'rollback' terminology in the exposed functions. We
already have a function with the name stream_abort_cb in the code
which we also might want to rename but that is a separate thing and we
can deal with it in a separate patch.

There are similar review comments like this below where the
alternating terms caused me some confusion.

~

Also, Remove the extra blank line before the end of the function.

;

COMMENT
Line 267
@@ -227,6 +252,42 @@ pg_decode_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "two-phase-commit") == 0)
+ {
+ if (elem->arg == NULL)
+ continue;

IMO the "check-xid" code might be better rearranged so the NULL check
is first instead of if/else.
e.g.
if (elem->arg == NULL)
ereport(FATAL,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("check-xid needs an input value")));
~

Also, is it really supposed to be FATAL instead of ERROR? That is not
the same as the other surrounding code.

;

+1.

COMMENT
Line 296
if (data->check_xid <= 0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("Specify positive value for parameter \"%s\","
" you specified \"%s\"",
elem->defname, strVal(elem->arg))));

The code checking for <= 0 seems over-complicated. Because conversion
was using strtoul() I fail to see how this can ever be < 0. Wouldn't
it be easier to simply test the result of the strtoul() function?

BEFORE: if (errno == EINVAL || errno == ERANGE)
AFTER: if (data->check_xid == 0)

Better to use TransactionIdIsValid(data->check_xid) here.

~

Also, should this be FATAL? Everything else similar is ERROR.

;

It should be an error.
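
A rough standalone sketch of the validation being discussed, assuming the option value arrives as a decimal string. DemoTransactionId, DEMO_INVALID_XID and demo_parse_xid are hypothetical stand-ins for this example; in the plugin the final test would be TransactionIdIsValid(), and a failure would be reported with ereport(ERROR, ...).

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins for PostgreSQL's TransactionId and InvalidTransactionId (0). */
typedef uint32_t DemoTransactionId;
#define DEMO_INVALID_XID ((DemoTransactionId) 0)

/*
 * Hypothetical helper: parse a decimal xid option value. Returns true and
 * stores the xid on success. Returns false, leaving *xid untouched, for
 * NULL or empty input, a leading sign or space, trailing garbage,
 * out-of-range values, and the invalid xid 0.
 */
static bool
demo_parse_xid(const char *value, DemoTransactionId *xid)
{
    char          *end;
    unsigned long  parsed;

    /* strtoul() would accept "-5" or " 7"; require a leading digit. */
    if (value == NULL || !isdigit((unsigned char) value[0]))
        return false;

    errno = 0;
    parsed = strtoul(value, &end, 10);
    if (errno != 0 || *end != '\0' || parsed > UINT32_MAX)
        return false;

    if ((DemoTransactionId) parsed == DEMO_INVALID_XID)
        return false;

    *xid = (DemoTransactionId) parsed;
    return true;
}
```

Since strtoul() returns an unsigned value, a "<= 0" test can only ever match zero, which is the point made above; checking errno and the end pointer is what catches overflow and malformed input.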

COMMENT
(general)
I don't recall seeing any of these decoding options (e.g.
"two-phase-commit", "check-xid") documented anywhere.
So how can a user even know these options exist so they can use them?
Perhaps options should be described on this page?
https://www.postgresql.org/docs/13/functions-admin.html#FUNCTIONS-REPLICATION

;

I think we should do what we are doing for other options; if they are
not documented then why document this one separately? I guess we
can make a case to document all the existing options and write a
separate patch for that.

COMMENT
(general)
"check-xid" is a meaningless option name. Maybe something like
"checked-xid-aborted" is more useful?
Suggest changing the member, the option, and the error messages to
match some better name.

;

COMMENT
Line 314
@@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
}

ctx->streaming &= enable_streaming;
+ ctx->enable_twophase &= enable_2pc;
}

The "ctx->enable_twophase" is inconsistent naming with the
"ctx->streaming" member.
"enable_twophase" --> "twophase"

;

+1.

COMMENT
Line 374
@@ -297,6 +359,94 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}

+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Remove the extra preceding blank line.

~

I did not find anything in the help about "_nodecode". Should it be
there or is this deliberately not documented feature?

;

I guess we can document it along with filter_prepare API, if not
already documented.
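
For reference, the filtering rule described in that function comment can be sketched in standalone C as below. demo_filter_prepare is a hypothetical reduction of the callback to just the GID test; the real callback also receives the decoding context and the transaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the plugin's filter callback, reduced to the
 * GID test: returning true means "filter this prepared transaction out".
 */
static bool
demo_filter_prepare(const char *gid)
{
    return gid != NULL && strstr(gid, "_nodecode") != NULL;
}
```

Note that the match includes the leading underscore, so a GID containing only "nodecode" would still be decoded.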

QUESTION
Line 440
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,

Is this a wrong comment
"ABORT PREPARED" --> "ROLLBACK PREPARED" ??

;

COMMENT
Line 620
@@ -455,6 +605,22 @@ pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;

+ /* if check_xid is specified */
+ if (TransactionIdIsValid(data->check_xid))
+ {
+ elog(LOG, "waiting for %u to abort", data->check_xid);
+ while (TransactionIdIsInProgress(dat

The check_xid seems a meaningless name, and the comment "/* if
check_xid is specified */" was not helpful either.
IIUC purpose of this is to check that the nominated xid always is rolled back.
So the appropriate name may be more like "check-xid-aborted".

;

Yeah, this part deserves better comments.

--
With Regards,
Amit Kapila.

#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Nikhil Sontakke (#1)

On Tue, Oct 6, 2020 at 10:23 AM Peter.B.Smith@fujitsu.com
<Peter.B.Smith@fujitsu.com> wrote:

==========
Patch V6-0001, File: doc/src/sgml/logicaldecoding.sgml
==========

COMMENT/QUESTION
Section 48.6.1
@@ -387,6 +387,10 @@ typedef struct OutputPluginCallbacks
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;

Confused by the mixing of terminologies "abort" and "rollback".
Why is it LogicalDecodeAbortPreparedCB instead of
LogicalDecodeRollbackPreparedCB?
Why is it abort_prepared_cb instead of rollback_prepared_cb;?

I thought everything the user sees should be ROLLBACK/rollback (like
the SQL) regardless of what the internal functions might be called.

;

Fair enough.

COMMENT
Section 48.6.1
The begin_cb, change_cb and commit_cb callbacks are required, while
startup_cb, filter_by_origin_cb, truncate_cb, and shutdown_cb are
optional. If truncate_cb is not set but a TRUNCATE is to be decoded,
the action will be ignored.

The 1st paragraph beneath the typedef does not mention the newly added
callbacks to say if they are required or optional.

;

Yeah, in code comments it was mentioned but is missed here, see the
comment "To support two phase logical decoding, we require
prepare/commit-prepare/abort-prepare callbacks. The filter-prepare
callback is optional.". I think instead of directly editing the above
paragraph we can write a new one similar to what we have done for
streaming of large in-progress transactions (Refer <para> An output
plugin may also define functions to support streaming of large,
in-progress transactions.).

COMMENT
Section 48.6.4.5
Section 48.6.4.6
Section 48.6.4.7
@@ -578,6 +588,55 @@ typedef void (*LogicalDecodeCommitCB) (struct
LogicalDecodingContext *ctx,
</para>
</sect3>

+ <sect3 id="logicaldecoding-output-plugin-prepare">
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+<programlisting>

The wording and titles are a bit backwards compared to the others.
e.g. previously was "Transaction Begin" (not "Begin Transaction") and
"Transaction End" (not "End Transaction").

So for consistently following the existing IMO should change these new
titles (and wording) to:
- "Commit Prepared Transaction Callback" --> "Transaction Commit
Prepared Callback"
- "Rollback Prepared Transaction Callback" --> "Transaction Rollback
Prepared Callback"

makes sense.

- "whenever a commit prepared transaction has been decoded" -->
"whenever a transaction commit prepared has been decoded"
- "whenever a rollback prepared transaction has been decoded." -->
"whenever a transaction rollback prepared has been decoded."

;

I don't find the above suggestions much better than the current
wording. How about the below instead?

"whenever we decode a transaction which is prepared for two-phase
commit is committed"
"whenever we decode a transaction which is prepared for two-phase
commit is rolled back"

Also, related to this:
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Commit Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>commit_prepared_cb</function> callback
is called whenever
+      a commit prepared transaction has been decoded. The
<parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can
be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct
LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+     <title>Rollback Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>abort_prepared_cb</function> callback is
called whenever
+      a rollback prepared transaction has been decoded. The
<parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can
be used in this
+      callback.
+<programlisting>

Both the above are not optional as per code and I think code is
correct. I think the documentation is wrong here.

==========
Patch V6-0001, File: src/backend/replication/logical/decode.c
==========

COMMENT
Line 74
@@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext
*ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);

The 2nd line of DecodePrepare is misaligned by one space.

;

Yeah, probably pgindent is the answer. Ajin, can you please run
pgindent on all the patches?

COMMENT
Line 321
@@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
}
break;
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
+ xl_xact_prepare *xlrec;
+ /* check that output plugin is capable of twophase decoding */

"twophase" --> "two-phase"

~

Also, add a blank line after the declarations.

;

==========
Patch V6-0001, File: src/backend/replication/logical/logical.c
==========

COMMENT
Line 249
@@ -225,6 +237,19 @@ StartupDecodingContext(List *output_plugin_options,
(ctx->callbacks.stream_message_cb != NULL) ||
(ctx->callbacks.stream_truncate_cb != NULL);

+ /*
+ * To support two phase logical decoding, we require
prepare/commit-prepare/abort-prepare
+ * callbacks. The filter-prepare callback is optional. We however
enable two phase logical
+ * decoding when at least one of the methods is enabled so that we
can easily identify
+ * missing methods.

The terminology is generally well known as "two-phase" (with the
hyphen) https://en.wikipedia.org/wiki/Two-phase_commit_protocol so
let's be consistent for all the patch code comments. Please search the
code and correct this in all places, even where I might have missed to
identify it.

"two phase" --> "two-phase"

;

COMMENT
Line 822
@@ -782,6 +807,111 @@ commit_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
}

static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)

"support 2 phase" --> "supports two-phase" in the comment

;

COMMENT
Line 844
Code condition seems strange and/or broken.
if (ctx->enable_twophase && ctx->callbacks.prepare_cb == NULL)
Because if the flag is false then this condition is skipped.
But then if the callback was also NULL then attempting to call it to
"do the actual work" will give NPE.

~

Also, I wonder should this check be the first thing in this function?
Because if it fails does it even make sense that all the errcallback
code was set up?
E.g. errcallback.arg potentially is left pointing to a stack variable
on a stack that no longer exists.

;

Right, I think we should have an Assert(ctx->enable_twophase) in the
beginning and then have the check (ctx->callbacks.prepare_cb == NULL)
at its current place. Refer to any of the streaming APIs (for ex.
stream_stop_cb_wrapper).
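
The ordering suggested above can be sketched in isolation. This is only an illustration under stated assumptions: SketchDecodingContext, SketchResult and the sketch_* functions are hypothetical stand-ins, and a return code replaces ereport(ERROR, ...) so the behaviour can be checked outside the server.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a plugin's prepare callback. */
typedef void (*sketch_prepare_cb_fn) (int *ncalls);

/* Hypothetical stand-in for the parts of LogicalDecodingContext used here. */
typedef struct SketchDecodingContext
{
	bool		enable_twophase;	/* did the plugin opt in to two-phase? */
	sketch_prepare_cb_fn prepare_cb;	/* NULL if the plugin omitted it */
} SketchDecodingContext;

/* Result codes used instead of ereport(ERROR, ...) to keep this testable. */
typedef enum SketchResult
{
	SKETCH_OK,
	SKETCH_MISSING_CALLBACK
} SketchResult;

/* Example callback: just counts how often it was invoked. */
static void
sketch_prepare_cb(int *ncalls)
{
	(*ncalls)++;
}

static SketchResult
sketch_prepare_cb_wrapper(SketchDecodingContext *ctx, int *ncalls)
{
	/* Callers should only get here when two-phase decoding is enabled. */
	assert(ctx->enable_twophase);

	/* (the real wrapper would set up its error context around here) */

	/* Report a missing callback instead of calling through a NULL pointer. */
	if (ctx->prepare_cb == NULL)
		return SKETCH_MISSING_CALLBACK;

	ctx->prepare_cb(ncalls);
	return SKETCH_OK;
}
```

With this shape the flag is an invariant of the caller, and the NULL check alone decides whether the callback is invoked.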

COMMENT
Line 857
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

"support 2 phase" --> "supports two-phase" in the comment

~

Also, Same potential trouble with the condition:
if (ctx->enable_twophase && ctx->callbacks.commit_prepared_cb == NULL)
Same as previously asked. Should this check be first thing in this function?

;

Yeah, so the same solution as mentioned above can be used.

COMMENT
Line 892
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

"support 2 phase" --> "supports two-phase" in the comment

~

Same potential trouble with the condition:
if (ctx->enable_twophase && ctx->callbacks.abort_prepared_cb == NULL)
Same as previously asked. Should this check be the first thing in this function?

;

Again the same solution can be used.

COMMENT
Line 1013
@@ -858,6 +988,51 @@ truncate_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}

+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)

Fix wording in comment:
"twophase" --> "two-phase transactions"
"twophase transactions" --> "two-phase transactions"

==========
Patch V6-0001, File: src/backend/replication/logical/reorderbuffer.c
==========

COMMENT
Line 255
@@ -251,7 +251,8 @@ static Size
ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
static void ReorderBufferRestoreChange(ReorderBuffer *rb,
ReorderBufferTXN *txn,
char *change);
static void ReorderBufferRestoreCleanup(ReorderBuffer *rb,
ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+ bool txn_prepared);

The alignment is inconsistent. One more space needed before "bool txn_prepared"

;

COMMENT
Line 417
@@ -413,6 +414,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
}

/* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }

Should add the blank line before this new code, as it was before.

;

COMMENT
Line 1564
@@ -1502,12 +1561,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
}

/*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them. Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either
after streaming or
+ * after a PREPARE.

typo "snapshots.If" -> "snapshots. If"

;

COMMENT/QUESTION
Line 1590
@@ -1526,7 +1587,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);

- ReorderBufferTruncateTXN(rb, subtxn);
+ ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
}

There are some code paths here I did not understand how they match the comments.
Because this function is recursive it seems that it may be called
where the 2nd parameter txn is a sub-transaction.

But then this seems at odds with some of the other code comments of
this function which are processing the txn without ever testing is it
really toplevel or not:

e.g. Line 1593 "/* cleanup changes in the toplevel txn */"

I think this comment is wrong but this is not the fault of this patch.

e.g. Line 1632 "They are always stored in the toplevel transaction."

;

This seems to be correct and we probably need an Assert that the
transaction is a top-level transaction.

COMMENT
Line 1644
@@ -1560,9 +1621,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
* about the toplevel xact (we send the XID in all messages), but we never
* stream XIDs of empty subxacts.
*/
- if ((!txn->toptxn) || (txn->nentries_mem != 0))
+ if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
txn->txn_flags |= RBTXN_IS_STREAMED;

+ if (txn_prepared)

/* remove the change from it's containing list */
typo "it's" --> "its"

;

QUESTION
Line 1977
@@ -1880,7 +1965,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
ReorderBufferChange *specinsert)
{
/* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn);
+ ReorderBufferTruncateTXN(rb, txn, false);

How do you know the 3rd parameter - i.e. txn_prepared - should be
hardwired false here?
e.g. I thought that maybe rbtxn_prepared(txn) can be true here.

;

COMMENT
Line 2345
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
break;
}
}
-
/*

Looks like accidental blank line deletion. This should be put back how it was

;

COMMENT/QUESTION
Line 2374
@@ -2278,7 +2362,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
}
}
else
- rb->commit(rb, txn, commit_lsn);
+ {
+ /*
+ * Call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).

"twophase" --> "two-phase"

~

Also, I was confused by the apparent assumption of exclusiveness of
streaming and 2PC...
e.g. what if streaming AND 2PC then it won't do rb->prepare()

;

QUESTION
Line 2424
@@ -2319,11 +2412,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
*/
if (streaming)
{
- ReorderBufferTruncateTXN(rb, txn);
+ ReorderBufferTruncateTXN(rb, txn, false);

/* Reset the CheckXidAlive */
CheckXidAlive = InvalidTransactionId;
}
+ else if (rbtxn_prepared(txn))

I was confused by the exclusiveness of streaming/2PC.
e.g. what if streaming AND 2PC at same time - how can you pass false
as 3rd param to ReorderBufferTruncateTXN?

;

Yeah, this and another handling wherever it is assumed that both can't
be true together is wrong.

COMMENT
Line 2463
@@ -2352,17 +2451,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,

/*
* The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
- * abort of the (sub)transaction we are streaming. We need to do the
+ * abort of the (sub)transaction we are streaming or preparing. We
need to do the
* cleanup and return gracefully on this error, see SetupCheckXidLive.
*/

"twoi phase" --> "two-phase"

;

QUESTIONS
Line 2482
@@ -2370,10 +2470,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
errdata = NULL;
curtxn->concurrent_abort = true;

- /* Reset the TXN so that it is allowed to stream remaining data. */
- ReorderBufferResetTXN(rb, txn, snapshot_now,
- command_id, prev_lsn,
- specinsert);
+ /* If streaming, reset the TXN so that it is allowed to stream
remaining data. */
+ if (streaming)

Re: /* If streaming, reset the TXN so that it is allowed to stream
remaining data. */
I was confused by the exclusiveness of streaming/2PC.
Is it not possible for streaming flags and rbtxn_prepared(txn) true at
the same time?

Yeah, I think it is not correct to assume that both can't be true at
the same time. But when prepared is true irrespective of whether
streaming is true or not we can use ReorderBufferTruncateTXN() API
instead of Reset API.
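
As a rough sketch of that decision, assuming only what is said above (a prepared transaction is truncated whether or not it is also being streamed), the choice could be written as a small pure function. SketchCleanup and sketch_cleanup_after_concurrent_abort are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the cleanup paths discussed in the review. */
typedef enum SketchCleanup
{
	SKETCH_TRUNCATE_AS_PREPARED,	/* i.e. ReorderBufferTruncateTXN(rb, txn, true) */
	SKETCH_RESET_FOR_STREAMING,		/* i.e. ReorderBufferResetTXN(...) */
	SKETCH_NO_CLEANUP				/* neither case; outside this sketch */
} SketchCleanup;

static SketchCleanup
sketch_cleanup_after_concurrent_abort(bool streaming, bool prepared)
{
	/* Prepared wins, irrespective of whether streaming is also true. */
	if (prepared)
		return SKETCH_TRUNCATE_AS_PREPARED;

	/* Only a purely streamed transaction is reset to stream remaining data. */
	if (streaming)
		return SKETCH_RESET_FOR_STREAMING;

	return SKETCH_NO_CLEANUP;
}
```

Testing "prepared" first means the combination streaming AND prepared no longer falls into the streaming-only branch.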

~

elog(LOG, "stopping decoding of %s (%u)",
txn->gid[0] != '\0'? txn->gid:"", txn->xid);

Is this a safe operation, or do you also need to test txn->gid is not NULL?

;

I think if 'prepared' is true then we can assume it to be non-NULL,
otherwise, not.
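
If the log call can also be reached for a transaction that was never prepared, a NULL-safe helper is one way to avoid dereferencing txn->gid. The following is a minimal sketch; SketchTXN and sketch_gid_for_log are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two ReorderBufferTXN fields used here. */
typedef struct SketchTXN
{
	unsigned int xid;
	char	   *gid;			/* assumed NULL unless the txn was prepared */
} SketchTXN;

/* Return a printable GID: the GID itself, or "" when it is NULL or empty. */
static const char *
sketch_gid_for_log(const SketchTXN *txn)
{
	if (txn->gid != NULL && txn->gid[0] != '\0')
		return txn->gid;
	return "";
}
```

The elog call quoted above could then pass sketch_gid_for_log(txn) instead of testing txn->gid[0] directly.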

I am responding to your email in phases so that we can have a
discussion on specific points if required, and because I am slightly
afraid that the email might bounce, as happened in your case when you
sent such a long email.

--
With Regards,
Amit Kapila.

#49Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#47)

On Wed, Oct 7, 2020 at 1:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

There is a confusing mix of terminology where sometimes things are
referred as ROLLBACK/rollback and other times apparently the same
operation is referred as ABORT/abort. I do not know the root cause of
this mixture. IIUC maybe the internal functions and protocol generally
use the term "abort", whereas the SQL syntax is "ROLLBACK"... but
where those two terms collide in the middle it gets quite confusing.

At least I thought the names of the "callbacks" which get exposed to
the user (e.g. in the help) might be better if they would match the
SQL.
"abort_prepared_cb" --> "rollback_prepared_cb"

This suggestion sounds reasonable. I think it is to entertain the case
where due to error we need to rollback the transaction. I think it is
better if we use 'rollback' terminology in the exposed functions. We
already have a function with the name stream_abort_cb in the code
which we also might want to rename but that is a separate thing and we
can deal with it in a separate patch.

So, for an ordinary transaction, rollback implies an explicit user
action, but an abort could either be an explicit user action (ABORT;
or ROLLBACK;) or an error. I agree that calling that case "abort"
rather than "rollback" is better. However, the situation is a bit
different for a prepared transaction: no error can prevent such a
transaction from being committed. That is the whole point of being
able to prepare transactions. So it is not unreasonable to think of
using "rollback" rather than "abort" for prepared transactions, but I
think it would be wrong in other cases. On the other hand, using
"abort" for all the cases also doesn't seem bad to me. It's true that
there is no ABORT PREPARED command at the SQL level, but I don't think
that is very important. I don't feel wrong saying that ROLLBACK
PREPARED causes a transaction abort.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#50Ajin Cherian
itsajin@gmail.com
In reply to: Robert Haas (#49)

On Thu, Oct 8, 2020 at 6:14 AM Robert Haas <robertmhaas@gmail.com> wrote:

So, for an ordinary transaction, rollback implies an explicit user
action, but an abort could either be an explicit user action (ABORT;
or ROLLBACK;) or an error. I agree that calling that case "abort"
rather than "rollback" is better. However, the situation is a bit
different for a prepared transaction: no error can prevent such a
transaction from being committed. That is the whole point of being
able to prepare transactions. So it is not unreasonable to think of
using "rollback" rather than "abort" for prepared transactions, but I
think it would be wrong in other cases. On the other hand, using
"abort" for all the cases also doesn't seem bad to me. It's true that
there is no ABORT PREPARED command at the SQL level, but I don't think
that is very important. I don't feel wrong saying that ROLLBACK
PREPARED causes a transaction abort.

So, as I understand you don't object to renaming the callback APIs for
ROLLBACK PREPARED transactions to "rollback_prepared_cb" but keeping
the "stream_abort" as such. This was what I was planning on doing.
I was just writing this up, so wanted to confirm.

regards,
Ajin Cherian
Fujitsu Australia

#51Amit Kapila
amit.kapila16@gmail.com
In reply to: Nikhil Sontakke (#1)

On Tue, Oct 6, 2020 at 10:23 AM Peter.B.Smith@fujitsu.com
<Peter.B.Smith@fujitsu.com> wrote:

==========
Patch V6-0001, File: src/include/replication/reorderbuffer.h
==========

QUESTION
Line 116
@@ -162,9 +163,13 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
-#define RBTXN_IS_STREAMED 0x0008
-#define RBTXN_HAS_TOAST_INSERT 0x0010
-#define RBTXN_HAS_SPEC_INSERT 0x0020
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_IS_STREAMED 0x0080
+#define RBTXN_HAS_TOAST_INSERT 0x0100
+#define RBTXN_HAS_SPEC_INSERT 0x0200

I was wondering why when adding new flags, some of the existing flag
masks were also altered.
I am assuming this is ok because they are never persisted but are only
used in the protocol (??)

;

This is bad even though there is no direct problem. I don't think we
need to change the existing ones, we can add the new ones at the end
with the number starting where the last one ends.
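
A sketch of what that would look like: the six existing masks keep their pre-patch values and the four new flags continue from the next free bit. The specific values given to the new flags here are an assumption for illustration, and SKETCH_ALL_RBTXN_FLAGS is a hypothetical helper used only to check that no two flags share a bit.

```c
#include <assert.h>

/* Existing flags keep the values they had before the patch. */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT          0x0002
#define RBTXN_IS_SERIALIZED       0x0004
#define RBTXN_IS_STREAMED         0x0008
#define RBTXN_HAS_TOAST_INSERT    0x0010
#define RBTXN_HAS_SPEC_INSERT     0x0020

/* New two-phase flags appended after the last existing bit (assumed values). */
#define RBTXN_PREPARE             0x0040
#define RBTXN_COMMIT_PREPARED     0x0080
#define RBTXN_ROLLBACK_PREPARED   0x0100
#define RBTXN_COMMIT              0x0200

/* OR of all ten flags; it equals 0x03FF only if every flag has its own bit. */
#define SKETCH_ALL_RBTXN_FLAGS \
	(RBTXN_HAS_CATALOG_CHANGES | RBTXN_IS_SUBXACT | RBTXN_IS_SERIALIZED | \
	 RBTXN_IS_STREAMED | RBTXN_HAS_TOAST_INSERT | RBTXN_HAS_SPEC_INSERT | \
	 RBTXN_PREPARE | RBTXN_COMMIT_PREPARED | RBTXN_ROLLBACK_PREPARED | \
	 RBTXN_COMMIT)
```

Appending keeps the diff to the header purely additive, so anything that already refers to the existing masks is unaffected.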

COMMENT
Line 133

Assert(strlen(txn->gid) > 0);
Shouldn't that assertion also check txn->gid is not NULL (to prevent
NPE in case gid was NULL)

;

I think that would be better and a stronger Assertion than the current one.

COMMENT
Line 177
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)

prepare_data->prepare_type = flags;
This code may be OK but it does seem a bit of an abuse of the flags.

e.g. Are they flags or are they really enum values?
e.g. And if they are effectively enums (it appears they are) then
seemed inconsistent that |= was used when they were previously
assigned.

;

I don't understand this point. As far as I can see at the time of
write (logicalrep_write_prepare()), the patch has used |=, and at the
time of reading (logicalrep_read_prepare()) it has used assignment
which seems correct from the code perspective. Do you have a better
proposal?

COMMENT
Line 408
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,

Since all this function is identical to pg_output_prepare it might be
better to either
1. just leave this as a wrapper to delegate to that function
2. remove this one entirely and assign the callback to the common
pgoutput_prepare_txn

;

I think this is because as of now the patch uses the same function and
protocol message to send both Prepare and Commit/Rollback Prepare
which I am not sure is the right thing. I suggest keeping that code as
it is for now. Let's first try to figure out if it is a good idea to
overload the same protocol message and use flags to distinguish the
actual message. Also, I don't know whether prepare_lsn is required
during commit time?

COMMENT
Line 419
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Since all this function is identical to pg_output_prepare it might be
better to either
1. just leave this as a wrapper to delegate to that function
2. remove this one entirely and assign the callback to the common
pgoutput_prepare_txn

;

Due to reasons mentioned for the previous comment, let's keep this
also as it is for now.

--
With Regards,
Amit Kapila.

#52Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#51)

On Thu, Oct 8, 2020 at 5:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

COMMENT
Line 177
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)

prepare_data->prepare_type = flags;
This code may be OK but it does seem a bit of an abuse of the flags.

e.g. Are they flags or are they really enum values?
e.g. And if they are effectively enums (it appears they are) then
seemed inconsistent that |= was used when they were previously
assigned.

;

I don't understand this point. As far as I can see at the time of
write (logicalrep_write_prepare()), the patch has used |=, and at the
time of reading (logicalrep_read_prepare()) it has used assignment
which seems correct from the code perspective. Do you have a better
proposal?

OK. I will explain my thinking when I wrote that review comment.

I agree all is "correct" from a code perspective.

But IMO using bit arithmetic implies that different combinations are
also possible, whereas in current code they are not.
So code is kind of having a bet each-way - sometimes treating "flags"
as bit flags and sometimes as enums.

e.g. If these flags are not really bit flags at all then the
logicalrep_write_prepare() code might just as well be written as
below:

BEFORE
if (rbtxn_commit_prepared(txn))
flags |= LOGICALREP_IS_COMMIT_PREPARED;
else if (rbtxn_rollback_prepared(txn))
flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
else
flags |= LOGICALREP_IS_PREPARE;

/* Make sure exactly one of the expected flags is set. */
if (!PrepareFlagsAreValid(flags))
elog(ERROR, "unrecognized flags %u in prepare message", flags);

AFTER
if (rbtxn_commit_prepared(txn))
flags = LOGICALREP_IS_COMMIT_PREPARED;
else if (rbtxn_rollback_prepared(txn))
flags = LOGICALREP_IS_ROLLBACK_PREPARED;
else
flags = LOGICALREP_IS_PREPARE;

~

OTOH, if you really do want to anticipate having future flag bit
combinations then maybe the PrepareFlagsAreValid() macro ought to
be tweaked accordingly, and the logicalrep_read_prepare() code maybe
should look more like below:

BEFORE
/* set the action (reuse the constants used for the flags) */
prepare_data->prepare_type = flags;

AFTER
/* set the action (reuse the constants used for the flags) */
prepare_data->prepare_type =
flags & LOGICALREP_IS_COMMIT_PREPARED ? LOGICALREP_IS_COMMIT_PREPARED :
flags & LOGICALREP_IS_ROLLBACK_PREPARED ? LOGICALREP_IS_ROLLBACK_PREPARED :
LOGICALREP_IS_PREPARE;
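
For the "flags are really enum values" reading, a validity check would accept exactly one known bit and nothing else. The sketch below shows one way to express that. The SKETCH_* bit values and sketch_prepare_flags_are_valid() are hypothetical; the patch's actual LOGICALREP_* definitions and PrepareFlagsAreValid() macro are not shown in this thread excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this illustration. */
#define SKETCH_IS_PREPARE            0x01
#define SKETCH_IS_COMMIT_PREPARED    0x02
#define SKETCH_IS_ROLLBACK_PREPARED  0x04
#define SKETCH_PREPARE_MASK \
	(SKETCH_IS_PREPARE | SKETCH_IS_COMMIT_PREPARED | SKETCH_IS_ROLLBACK_PREPARED)

/* True only when exactly one known bit is set and no unknown bit is. */
static bool
sketch_prepare_flags_are_valid(uint8_t flags)
{
	if (flags == 0 || (flags & ~SKETCH_PREPARE_MASK) != 0)
		return false;

	/* A value with a single bit set has no bits in common with value - 1. */
	return (flags & (flags - 1)) == 0;
}
```

Under this check a combination such as PREPARE together with COMMIT PREPARED is rejected, which matches code that assigns the value with = rather than |=.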

Kind Regards.
Peter Smith
Fujitsu Australia

#53Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#52)

On Fri, Oct 9, 2020 at 5:45 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Oct 8, 2020 at 5:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

COMMENT
Line 177
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)

prepare_data->prepare_type = flags;
This code may be OK but it does seem a bit of an abuse of the flags.

e.g. Are they flags or are they really enum values?
e.g. And if they are effectively enums (it appears they are) then
seemed inconsistent that |= was used when they were previously
assigned.

;

I don't understand this point. As far as I can see at the time of
write (logicalrep_write_prepare()), the patch has used |=, and at the
time of reading (logicalrep_read_prepare()) it has used assignment
which seems correct from the code perspective. Do you have a better
proposal?

OK. I will explain my thinking when I wrote that review comment.

I agree all is "correct" from a code perspective.

But IMO using bit arithmetic implies that different combinations are
also possible, whereas in current code they are not.
So code is kind of having a bet each-way - sometimes treating "flags"
as bit flags and sometimes as enums.

e.g. If these flags are not really bit flags at all then the
logicalrep_write_prepare() code might just as well be written as
below:

BEFORE
if (rbtxn_commit_prepared(txn))
flags |= LOGICALREP_IS_COMMIT_PREPARED;
else if (rbtxn_rollback_prepared(txn))
flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
else
flags |= LOGICALREP_IS_PREPARE;

/* Make sure exactly one of the expected flags is set. */
if (!PrepareFlagsAreValid(flags))
elog(ERROR, "unrecognized flags %u in prepare message", flags);

AFTER
if (rbtxn_commit_prepared(txn))
flags = LOGICALREP_IS_COMMIT_PREPARED;
else if (rbtxn_rollback_prepared(txn))
flags = LOGICALREP_IS_ROLLBACK_PREPARED;
else
flags = LOGICALREP_IS_PREPARE;

~

OTOH, if you really do want to anticipate having future flag bit
combinations

I don't anticipate more combinations rather I am not yet sure whether
we want to distinguish these operations with flags or have separate
messages for each of these operations. I think for now we can go with
your proposal above.

--
With Regards,
Amit Kapila.

#54Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#47)
4 attachment(s)

On Wed, Oct 7, 2020 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

All the above comments are genuine and I think it is mostly because
the author has blindly modified the existing tests without completely
understanding the intent of the test. I suggest we write a completely
new regression file (decode_prepared.sql) for these and just copy
whatever is required from prepared.sql. Once we do that we might also
want to rename existing prepared.sql to decode_commit_prepared.sql or
something like that. I think modifying the existing test appears to be
quite ugly and also it is changing the intent of the existing tests.

Updated this. Kept the original prepared.sql untouched and added a new
regression file called two_phase.sql
which is specific to test cases with the new flag two-phase-commit.

QUESTION
Line 120 - I did not really understand the SQL checking the pg_class.
I expected this would be checking table 'test_prepared1' instead. Can
you explain it?
SELECT 'pg_class' AS relation, locktype, mode
FROM pg_locks
WHERE locktype = 'relation'
AND relation = 'pg_class'::regclass;
relation | locktype | mode
----------+----------+------
(0 rows)

;

Yes, I also think your expectation is correct and this should be on
'test_prepared_1'.

Updated

QUESTION
Line 139 - SET statement_timeout = '1s'; is 1 second short enough
here for this test, or might it be that these statements would be
completed in less than one second anyhow?

;

Good question. I think we have to mention the reason why logical
decoding is not blocked while it needs to acquire a shared lock on the
table and the previous commands already held an exclusive lock on the
table. I am not sure if I am missing something but like you, it is not
clear to me as well what this test intends to do, so surely more
commentary is required.

Updated.

QUESTION
Line 163 - How is this testing a SAVEPOINT? Or is it only to check
that the SAVEPOINT command is not part of the replicated changes?

;

It is more of testing that subtransactions will not create a problem
while decoding.

Updated with a testcase that actually does a rollback to a savepoint

COMMENT
Line 175 - Missing underscore in comment. Code requires also underscore:
"nodecode" --> "_nodecode"

makes sense.

Updated.

==========
Patch V6-0001, File: contrib/test_decoding/test_decoding.c
==========

COMMENT
Line 43
@@ -36,6 +40,7 @@ typedef struct
bool skip_empty_xacts;
bool xact_wrote_changes;
bool only_local;
+ TransactionId check_xid; /* track abort of this txid */
} TestDecodingData;

The "check_xid" seems a meaningless name. Check what?
IIUC maybe should be something like "check_xid_aborted"

Updated.

;

COMMENT
Line 105
@@ -88,6 +93,19 @@ static void
pg_decode_stream_truncate(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
int nrelations, Relation relations[],
ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,

Remove extra blank line after these functions

;

The above two sound like reasonable suggestions.

Updated.

COMMENT
Line 149
@@ -116,6 +134,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->stream_change_cb = pg_decode_stream_change;
cb->stream_message_cb = pg_decode_stream_message;
cb->stream_truncate_cb = pg_decode_stream_truncate;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
+
}

There is a confusing mix of terminology where sometimes things are
referred as ROLLBACK/rollback and other times apparently the same
operation is referred as ABORT/abort. I do not know the root cause of
this mixture. IIUC maybe the internal functions and protocol generally
use the term "abort", whereas the SQL syntax is "ROLLBACK"... but
where those two terms collide in the middle it gets quite confusing.

At least I thought the names of the "callbacks" which get exposed to
the user (e.g. in the help) might be better if they would match the
SQL.
"abort_prepared_cb" --> "rollback_prepared_cb"

This suggestion sounds reasonable. I think it is to entertain the case
where due to error we need to rollback the transaction. I think it is
better if we use 'rollback' terminology in the exposed functions. We
already have a function with the name stream_abort_cb in the code
which we also might want to rename but that is a separate thing and we
can deal with it in a separate patch.

Changed the callback names from abort_prepared to rollback_prepared
and stream_abort_prepared to stream_rollback_prepared.

There are similar review comments like this below where the
alternating terms caused me some confusion.

~

Also, Remove the extra blank line before the end of the function.

;

COMMENT
Line 267
@@ -227,6 +252,42 @@ pg_decode_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
errmsg("could not parse value \"%s\" for parameter \"%s\"",
strVal(elem->arg), elem->defname)));
}
+ else if (strcmp(elem->defname, "two-phase-commit") == 0)
+ {
+ if (elem->arg == NULL)
+ continue;

IMO the "check-xid" code might be better rearranged so the NULL check
is first instead of if/else.
e.g.
if (elem->arg == NULL)
ereport(FATAL,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("check-xid needs an input value")));
~

Also, is it really supposed to be FATAL instead or ERROR. That is not
the same as the other surrounding code.

;

+1.

Updated.

COMMENT
Line 296
if (data->check_xid <= 0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("Specify positive value for parameter \"%s\","
" you specified \"%s\"",
elem->defname, strVal(elem->arg))));

The code checking for <= 0 seems over-complicated. Because conversion
was using strtoul() I fail to see how this can ever be < 0. Wouldn't
it be easier to simply test the result of the strtoul() function?

BEFORE: if (errno == EINVAL || errno == ERANGE)
AFTER: if (data->check_xid == 0)

Better to use TransactionIdIsValid(data->check_xid) here.

Updated.
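
For reference, option parsing along these lines might look roughly like the sketch below. It is not the patch's code: SketchTransactionId, SKETCH_INVALID_XID and sketch_parse_xid() are hypothetical stand-ins, and a bool return replaces ereport(ERROR, ...). It checks the missing-value case first, then the strtoul() result, then validity in the sense of TransactionIdIsValid() (that is, not the invalid XID 0).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t SketchTransactionId;	/* stand-in for TransactionId */
#define SKETCH_INVALID_XID ((SketchTransactionId) 0)

/* Parse an XID option value; returns false where the plugin would ERROR. */
static bool
sketch_parse_xid(const char *value, SketchTransactionId *xid)
{
	char	   *end;
	unsigned long parsed;

	/* The option needs an input value. */
	if (value == NULL)
		return false;

	/* strtoul() reports range errors only through errno, so clear it first. */
	errno = 0;
	parsed = strtoul(value, &end, 0);

	/* Reject range errors, empty input and trailing garbage. */
	if (errno != 0 || end == value || *end != '\0')
		return false;

	/* Reject values that do not fit in a 32-bit XID. */
	if (parsed > UINT32_MAX)
		return false;

	*xid = (SketchTransactionId) parsed;

	/* Equivalent of TransactionIdIsValid(): 0 is the invalid XID. */
	return *xid != SKETCH_INVALID_XID;
}
```

A single validity test on the parsed value covers the "<= 0" case, while the errno and end-pointer checks catch input that strtoul() could not convert cleanly.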

~

Also, should this be FATAL? Everything else similar is ERROR.

;

It should be an error.

Updated

COMMENT
(general)
I don't recall seeing any of these decoding options (e.g.
"two-phase-commit", "check-xid") documented anywhere.
So how can a user even know these options exist so they can use them?
Perhaps options should be described on this page?
https://www.postgresql.org/docs/13/functions-admin.html#FUNCTIONS-REPLICATION

;

I think we should do what we are doing for other options, if they are
not documented then why to document this one separately. I guess we
can make a case to document all the existing options and write a
separate patch for that.

I didn't see any of the test_decoding options updated in the
documentation as these seem specific to the test_decoder used in
testing.
https://www.postgresql.org/docs/13/test-decoding.html

COMMENT
(general)
"check-xid" is a meaningless option name. Maybe something like
"checked-xid-aborted" is more useful?
Suggest changing the member, the option, and the error messages to
match some better name.

Updated.

;

COMMENT
Line 314
@@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
}

ctx->streaming &= enable_streaming;
+ ctx->enable_twophase &= enable_2pc;
}

The "ctx->enable_twophase" is inconsistent naming with the
"ctx->streaming" member.
"enable_twophase" --> "twophase"

;

+1.

Updated

COMMENT
Line 374
@@ -297,6 +359,94 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
OutputPluginWrite(ctx, true);
}

+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Remove the extra preceding blank line.

Updated.

~

I did not find anything in the help about "_nodecode". Should it be
there or is this deliberately not documented feature?

;

I guess we can document it along with filter_prepare API, if not
already documented.

Again, this seems to be specific to test_decoder and an example of a
way to create a filter_prepare.
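
The filtering rule described in the quoted comment (skip a prepared transaction whose GID contains "_nodecode") reduces to a substring test. A minimal standalone sketch follows; sketch_filter_prepare() is a hypothetical name and takes only the GID rather than the full callback argument list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when the transaction should be filtered out at PREPARE time. */
static bool
sketch_filter_prepare(const char *gid)
{
	/* A missing GID is never filtered; otherwise look for the marker. */
	return gid != NULL && strstr(gid, "_nodecode") != NULL;
}
```

Note the leading underscore in the marker: a GID containing only "nodecode" would not match.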

QUESTION
Line 440
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,

Is this a wrong comment
"ABORT PREPARED" --> "ROLLBACK PREPARED" ??

;

COMMENT
Line 620
@@ -455,6 +605,22 @@ pg_decode_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
}
data->xact_wrote_changes = true;

+ /* if check_xid is specified */
+ if (TransactionIdIsValid(data->check_xid))
+ {
+ elog(LOG, "waiting for %u to abort", data->check_xid);
+ while (TransactionIdIsInProgress(dat

The check_xid seems a meaningless name, and the comment "/* if
check_xid is specified */" was not helpful either.
IIUC purpose of this is to check that the nominated xid always is rolled back.
So the appropriate name may be more like "check-xid-aborted".

;

Yeah, this part deserves better comments.

Updated.

Other than this first batch of review comments from Peter Smith, I've
also updated new functions in decode.c for DecodeCommitPrepared
and DecodeAbortPrepared as agreed in a previous review comment by
Amit and Dilip.
I've also incorporated Dilip's comment on acquiring SHARED lock rather
than EXCLUSIVE lock while looking for transaction matching Gid.
Since Peter's comments are many, I'll be sending patch updates in
parts addressing his comments.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v7-0003-pgoutput-support-for-logical-decoding-of-2pc.patch (application/octet-stream)
From 8ac357f16a4e0306e7252a1deac4488a0b1d6286 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 9 Oct 2020 02:02:08 -0400
Subject: [PATCH v7] pgoutput support for logical decoding of 2pc

Support decoding of two phase commit in pgoutput and on subscriber side.
---
 src/backend/access/transam/twophase.c       |  31 ++++++
 src/backend/replication/logical/proto.c     |  90 ++++++++++++++-
 src/backend/replication/logical/worker.c    | 147 ++++++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c |  54 ++++++++-
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++++-
 src/test/subscription/t/020_twophase.pl     | 163 ++++++++++++++++++++++++++++
 7 files changed, 514 insertions(+), 9 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..1200bf0 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,37 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID is	around
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (!gxact->valid)
+			continue;
+		if (strcmp(gxact->gid, gid) != 0)
+			continue;
+
+		LWLockRelease(TwoPhaseStateLock);
+
+		return true;
+	}
+
+	LWLockRelease(TwoPhaseStateLock);
+
+	return false;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..291ed10 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -72,12 +72,17 @@ logicalrep_read_begin(StringInfo in, LogicalRepBeginData *begin_data)
  */
 void
 logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
-						XLogRecPtr commit_lsn)
+						XLogRecPtr commit_lsn, bool is_commit)
 {
 	uint8		flags = 0;
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
+	if (is_commit)
+		flags |= LOGICALREP_IS_COMMIT;
+	else
+		flags |= LOGICALREP_IS_ABORT;
+
 	/* send the flags field (unused for now) */
 	pq_sendbyte(out, flags);
 
@@ -88,16 +93,20 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 }
 
 /*
- * Read transaction COMMIT from the stream.
+ * Read transaction COMMIT|ABORT from the stream.
  */
 void
 logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 {
-	/* read flags (unused for now) */
+	/* read flags */
 	uint8		flags = pq_getmsgbyte(in);
 
-	if (flags != 0)
-		elog(ERROR, "unrecognized flags %u in commit message", flags);
+	if (!CommitFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in commit|abort message",
+			 flags);
+
+	/* the flag is either commit or abort */
+	commit_data->is_commit = (flags == LOGICALREP_IS_COMMIT);
 
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
@@ -106,6 +115,77 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for 2PC transactions, in which case we
+	 * expect to have a non-empty GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(strlen(txn->gid) > 0);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags |= LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags |= LOGICALREP_IS_PREPARE;
+
+	/* Make sure exactly one of the expected flags is set. */
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (bounded copy into the fixed-size buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9c6fdee..a08da85 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -729,7 +729,11 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_lsn = commit_data.end_lsn;
 		replorigin_session_origin_timestamp = commit_data.committime;
 
-		CommitTransactionCommand();
+		if (commit_data.is_commit)
+			CommitTransactionCommand();
+		else
+			AbortCurrentTransaction();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -749,6 +753,141 @@ apply_handle_commit(StringInfo s)
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * On the publisher, a prepared transaction may be rolled back while it
+	 * is still being decoded. Decoding then stops and the transaction is
+	 * aborted on the apply side immediately, but the ROLLBACK PREPARED
+	 * message still reaches the subscriber afterwards. A missing GID is
+	 * therefore expected here and is not an error.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
 /*
  * Handle ORIGIN message.
  *
@@ -1909,10 +2048,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..a078c50 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+					 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+							 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -143,6 +149,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -373,7 +383,49 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginUpdateProgress(ctx);
 
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit(ctx->out, txn, commit_lsn, true);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
 }
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0c2cda2..33d719c 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -87,20 +87,55 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
+	bool		is_commit;
 	XLogRecPtr	commit_lsn;
 	XLogRecPtr	end_lsn;
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/* types of the commit protocol message */
+#define LOGICALREP_IS_COMMIT			0x01
+#define LOGICALREP_IS_ABORT				0x02
+
+/* commit message is COMMIT or ABORT, and there is nothing else */
+#define CommitFlagsAreValid(flags) \
+	((flags == LOGICALREP_IS_COMMIT) || (flags == LOGICALREP_IS_ABORT))
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+}			LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	((flags == LOGICALREP_IS_PREPARE) || \
+	 (flags == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 (flags == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
 extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
-									XLogRecPtr commit_lsn);
+									XLogRecPtr commit_lsn, bool is_commit);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+						LogicalRepPrepareData * prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..c7f373d
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+        'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+   is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+   is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1
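
For anyone following the protocol change in the patch above, here is a minimal standalone sketch of the 'P' message as logicalrep_write_prepare() and logicalrep_read_prepare() lay it out: message type byte, one flags byte, prepare_lsn, end_lsn, a timestamp, then the NUL-terminated GID. It is not the patch's code. Everything prefixed demo_/Demo is hypothetical, DEMO_GIDSIZE is an assumed stand-in for GIDSIZE, and the encoding conversion that pq_sendstring() performs is left out. The reader rejects a GID that does not fit the buffer instead of copying it unchecked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flag values as defined in the patch's logicalproto.h hunk. */
#define LOGICALREP_IS_PREPARE			0x01
#define LOGICALREP_IS_COMMIT_PREPARED	0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04

#define DEMO_GIDSIZE 200		/* assumption: stand-in for GIDSIZE */
#define DEMO_HDRLEN  (1 + 1 + 8 * 3)	/* 'P', flags, three 64-bit fields */

/* Hypothetical stand-in for LogicalRepPrepareData. */
typedef struct DemoPrepareData
{
	uint8_t		prepare_type;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		preparetime;
	char		gid[DEMO_GIDSIZE];
} DemoPrepareData;

/* Exactly one of the three flags must be set (cf. PrepareFlagsAreValid). */
int
demo_prepare_flags_are_valid(uint8_t flags)
{
	return flags == LOGICALREP_IS_PREPARE ||
		flags == LOGICALREP_IS_COMMIT_PREPARED ||
		flags == LOGICALREP_IS_ROLLBACK_PREPARED;
}

/* Store a 64-bit value in network byte order; returns the new offset. */
size_t
demo_put_u64(unsigned char *buf, size_t off, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[off + i] = (unsigned char) (v >> (56 - 8 * i));
	return off + 8;
}

/* Load a 64-bit value stored in network byte order. */
uint64_t
demo_get_u64(const unsigned char *buf, size_t off)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[off + i];
	return v;
}

/* Encode one message; returns bytes written, or 0 if invalid or too large. */
size_t
demo_write_prepare(unsigned char *buf, size_t buflen, const DemoPrepareData *d)
{
	size_t		gidlen;
	size_t		off = 0;

	assert(buf != NULL && d != NULL);
	gidlen = strlen(d->gid);
	if (!demo_prepare_flags_are_valid(d->prepare_type) ||
		gidlen == 0 || DEMO_HDRLEN + gidlen + 1 > buflen)
		return 0;

	buf[off++] = 'P';
	buf[off++] = d->prepare_type;
	off = demo_put_u64(buf, off, d->prepare_lsn);
	off = demo_put_u64(buf, off, d->end_lsn);
	off = demo_put_u64(buf, off, (uint64_t) d->preparetime);
	memcpy(buf + off, d->gid, gidlen + 1);	/* includes the trailing NUL */
	return off + gidlen + 1;
}

/* Decode one message; returns 1 on success, 0 if it is malformed. */
int
demo_read_prepare(const unsigned char *buf, size_t len, DemoPrepareData *d)
{
	const unsigned char *gid;
	const unsigned char *nul;
	size_t		gidlen;

	if (len <= DEMO_HDRLEN || buf[0] != 'P' ||
		!demo_prepare_flags_are_valid(buf[1]))
		return 0;

	gid = buf + DEMO_HDRLEN;
	nul = memchr(gid, '\0', len - DEMO_HDRLEN);
	if (nul == NULL)
		return 0;				/* GID is not terminated inside the message */
	gidlen = (size_t) (nul - gid);
	if (gidlen == 0 || gidlen >= sizeof(d->gid))
		return 0;				/* empty, or would overflow the buffer */

	d->prepare_type = buf[1];
	d->prepare_lsn = demo_get_u64(buf, 2);
	d->end_lsn = demo_get_u64(buf, 10);
	d->preparetime = (int64_t) demo_get_u64(buf, 18);
	memcpy(d->gid, gid, gidlen + 1);
	return 1;
}
```

Because all three kinds of message share the 'P' type byte, the flags byte is the only thing that tells the subscriber whether to prepare, commit or roll back. That is why both ends insist on exactly one flag being set.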

Attachment: v7-0001-Support-decoding-of-two-phase-transactions.patch (application/octet-stream)
From 0091729a100f47494c3138ea4daf8add73e20f53 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 9 Oct 2020 01:45:06 -0400
Subject: [PATCH v7] Support decoding of two-phase transactions

Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes documentation changes.
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/two_phase.out    | 219 ++++++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql         | 117 +++++++++++
 contrib/test_decoding/test_decoding.c           | 159 +++++++++++++++
 doc/src/sgml/logicaldecoding.sgml               | 110 +++++++++-
 src/backend/replication/logical/decode.c        | 250 +++++++++++++++++++++--
 src/backend/replication/logical/logical.c       | 175 ++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 259 ++++++++++++++++++++----
 src/include/replication/logical.h               |   5 +
 src/include/replication/output_plugin.h         |  37 ++++
 src/include/replication/reorderbuffer.h         |  66 ++++++
 11 files changed, 1348 insertions(+), 51 deletions(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f23f15b..36b3f9a 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,7 +4,7 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase messages \
 	spill slot truncate stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..3ac01a4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,219 @@
+-- Test two-phased transactions, when two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time. 
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- 
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- test abort of a prepared xact
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- test prepared xact containing ddl
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints and sub-xacts as a result
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..e3e2690
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,117 @@
+-- Test two-phased transactions, when two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time. 
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- 
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test abort of a prepared xact
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+
+-- test prepared xact containing ddl
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints and sub-xacts as a result
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e60ab34..6b8e502 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid_aborted; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -88,6 +93,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -116,6 +133,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +148,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_2pc = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +158,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +250,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_2pc))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)
+					strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno != 0 || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+								strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +290,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_2pc;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +350,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +595,25 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/*
+	 * If check_xid_aborted is a valid xid, it was passed in as an option so
+	 * that we wait here for that transaction to abort. This tests concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			   !TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..f24470a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,6 +387,10 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
@@ -477,7 +481,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin provides the callbacks needed
+     for decoding it. The transaction currently being decoded may be
+     aborted concurrently by a <command>ROLLBACK PREPARED</command>
+     command. In that case, logical decoding of this transaction is
+     aborted as well.
     </para>
 
     <note>
@@ -578,6 +588,55 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The optional <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Commit Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> of a prepared transaction has been decoded.
+      The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Rollback Prepared Transaction Callback</title>
+
+     <para>
+      The optional <function>rollback_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> of a prepared transaction has been decoded.
+      The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +646,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction, or any other uncommitted transaction, this
+      change callback might also error out because that same transaction
+      is rolled back concurrently. In that case, logical decoding of the
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +729,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded at this prepare
+       stage, or later as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time. To signal that
+       decoding at prepare time should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +783,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction, or any other uncommitted transaction, this message
+      callback might also error out because that same transaction is
+      rolled back concurrently. In that case, logical decoding of the
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..19fd92a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -68,8 +68,15 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
+static void DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -239,7 +246,6 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	switch (info)
 	{
 		case XLOG_XACT_COMMIT:
-		case XLOG_XACT_COMMIT_PREPARED:
 			{
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
@@ -256,8 +262,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeCommit(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_COMMIT_PREPARED:
+			{
+				xl_xact_commit *xlrec;
+				xl_xact_parsed_commit parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_commit *) XLogRecGetData(r);
+				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeCommitPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ABORT:
-		case XLOG_XACT_ABORT_PREPARED:
 			{
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
@@ -274,6 +296,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_ABORT_PREPARED:
+			{
+				xl_xact_abort *xlrec;
+				xl_xact_parsed_abort parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_abort *) XLogRecGetData(r);
+				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeAbortPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ASSIGNMENT:
 
 			/*
@@ -312,17 +351,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that the output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -659,6 +716,131 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Handle COMMIT PREPARED records. This mirrors DecodeCommit(), except that
+ * a transaction already decoded at PREPARE time is only finished here.
+ */
+static void
+DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+             xl_xact_parsed_commit *parsed, TransactionId xid)
+{
+    XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+    TimestampTz commit_time = parsed->xact_time;
+    RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+    int         i;
+
+    if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+    {
+        origin_lsn = parsed->origin_lsn;
+        commit_time = parsed->origin_timestamp;
+    }
+
+    SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
+                       parsed->nsubxacts, parsed->subxacts);
+
+    /* ----
+     * Check whether we are interested in this specific transaction, and tell
+     * the reorderbuffer to forget the content of the (sub-)transactions
+     * if not.
+     *
+     * There can be several reasons we might not be interested in this
+     * transaction:
+     * 1) We might not be interested in decoding transactions up to this
+     *    LSN. This can happen because we previously decoded it and now just
+     *    are restarting or if we haven't assembled a consistent snapshot yet.
+     * 2) The transaction happened in another database.
+     * 3) The output plugin is not interested in the origin.
+     * 4) We are doing fast-forwarding
+     *
+     * We can't just use ReorderBufferAbort() here, because we need to execute
+     * the transaction's invalidations.  This currently won't be needed if
+     * we're just skipping over the transaction because currently we only do
+     * so during startup, to get to the first transaction the client needs. As
+     * we have reset the catalog caches before starting to read WAL, and we
+     * haven't yet touched any catalogs, there can't be anything to invalidate.
+     * But if we're "forgetting" this commit because it happened in
+     * another database, the invalidations might be important, because they
+     * could be for shared catalogs and we might have loaded data into the
+     * relevant syscaches.
+     * ---
+     */
+    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+        (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+        ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+    {
+        for (i = 0; i < parsed->nsubxacts; i++)
+        {
+            ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+        }
+        ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+        return;
+    }
+
+    /* tell the reorderbuffer about the surviving subtransactions */
+    for (i = 0; i < parsed->nsubxacts; i++)
+    {
+        ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+                                 buf->origptr, buf->endptr);
+    }
+
+    /*
+     * For COMMIT PREPARED, the changes have already been replayed at
+     * PREPARE time, so we only need to notify the subscriber that the GID
+     * finally committed. If the prepare filter asked to skip this
+     * transaction at PREPARE time, replay it now as a regular commit.
+     */
+    if (ctx->callbacks.filter_prepare_cb &&
+            ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+    {
+        ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+                            commit_time, origin_id, origin_lsn);
+    }
+    else
+    {
+        ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+                                        commit_time, origin_id, origin_lsn,
+                                        parsed->twophase_gid, true);
+    }
+
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+              xl_xact_parsed_prepare *parsed)
+{
+    XLogRecPtr  origin_lsn = parsed->origin_lsn;
+    TimestampTz commit_time = parsed->origin_timestamp;
+    RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+    int         i;
+    TransactionId xid = parsed->twophase_xid;
+
+    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+        (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+        ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+        return;
+
+    /*
+     * Tell the reorderbuffer about the surviving subtransactions. We need to
+     * do this because the main transaction itself has not committed since we
+     * are in the prepare phase right now. So we need to be sure the snapshot
+     * is setup correctly for the main transaction in case all changes
+     * happened in subtransactions.
+     */
+    for (i = 0; i < parsed->nsubxacts; i++)
+    {
+        ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+                                 buf->origptr, buf->endptr);
+    }
+
+    /* replay actions of all transaction + subtransactions in order */
+    ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+                         commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+}
+
+/*
  * Get the data from the various forms of abort records and pass it on to
  * snapbuild.c and reorderbuffer.c
  */
@@ -681,6 +863,50 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Get the data from the various forms of abort records and pass it on to
+ * snapbuild.c and reorderbuffer.c
+ */
+static void
+DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			xl_xact_parsed_abort *parsed, TransactionId xid)
+{
+	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it passes through the filters, handle the ROLLBACK via callbacks.
+	 */
+	if (!FilterByOrigin(ctx, origin_id) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		!ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		Assert(TransactionIdIsValid(xid));
+		Assert(parsed->dbId == ctx->slot->data.database);
+
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
+
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+						   buf->record->EndRecPtr);
+	}
+
+	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+}
+
+/*
  * Parse XLOG_HEAP_INSERT (not MULTI_INSERT!) records into tuplebufs.
  *
  * Deletes can contain the new tuple.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8675832..8aeb648 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +215,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -227,6 +239,19 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require the prepare, commit-prepared
+	 * and rollback-prepared callbacks. The filter-prepare callback is optional. However,
+	 * we enable two-phase logical decoding when at least one of these callbacks is set,
+	 * so that we can easily identify missing callbacks.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -783,6 +808,111 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits, the prepare callback is mandatory */
+	if (ctx->twophase && ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits, the commit prepared callback is mandatory */
+	if (ctx->twophase && ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits, the rollback prepared callback is mandatory */
+	if (ctx->twophase && ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +989,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of twophase at PREPARE time is not enabled. In that
+	 * case all twophase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4cb27f2..4da840c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -417,6 +418,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	}
 
 	/* free data that's contained */
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
 
 	if (txn->tuplecid_hash != NULL)
 	{
@@ -1506,12 +1512,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1530,7 +1538,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1564,9 +1572,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, clean up the tuplecids we stored for decoding
+		 * catalog snapshot access.
+		 * They are always stored in the toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* remove the change from its containing list */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1891,7 +1923,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1998,7 +2030,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2260,7 +2292,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					break;
 			}
 		}
-
 		/*
 		 * There's a speculative insertion remaining, just clean in up, it
 		 * can't have been successful, otherwise we'd gotten a confirmation
@@ -2289,7 +2320,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for twophase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2330,11 +2370,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, false);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (rbtxn_prepared(txn))
+		{
+			ReorderBufferTruncateTXN(rb, txn, true);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2363,17 +2409,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can only occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we are
+			 * sending the data out on a PREPARE during a two-phase commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2381,10 +2428,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+						txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2406,23 +2462,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2464,6 +2513,140 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+   ReorderBufferTXN *txn;
+
+   txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+   return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a twophase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+                               false);
+
+	/*
+	 * Always call the prepare filter. It's the job of the prepare filter to
+	 * give us the *same* response for a given xid across multiple calls
+	 * (including ones on restart).
+	*/
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, 2PC transactions do not contain any reorderbuffers. So allow
+	 * it to be created below.
+	*/
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+		rb->commit_prepared(rb, txn, commit_lsn);
+	}
+	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+		rb->rollback_prepared(rb, txn, commit_lsn);
+	}
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(rb, txn);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2506,7 +2689,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
+
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..b4592da 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -84,6 +84,11 @@ typedef struct LogicalDecodingContext
 	 */
 	bool		streaming;
 
+	/*
+	 * Does the output plugin support two phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..96acd01 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -171,6 +204,10 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0cc3aeb..e060107 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -166,6 +167,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -225,6 +229,13 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* is this txn prepared? */
+#define rbtxn_prepared(txn)            ((txn)->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn)     ((txn)->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn)   ((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -236,6 +247,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of 2PC we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -397,6 +411,39 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+                                     ReorderBuffer *rb,
+                                     ReorderBufferTXN *txn,
+                                     XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             TransactionId xid,
+                                             const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+                                       ReorderBuffer *rb,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+                                              ReorderBuffer *rb,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr abort_lsn);
+
+
+
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -489,6 +536,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -566,6 +618,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -589,6 +646,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
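
For readers following the callback rules in the patch above, here is a minimal standalone sketch of how the two-phase flag and the filter decision interact. It does not use the real PostgreSQL structs: MockContext, MockCallbacks and every mock_* function are hypothetical stand-ins, and only the boolean logic is taken from StartupDecodingContext() and filter_prepare_cb_wrapper() as posted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the output plugin callback types. */
typedef bool (*MockFilterPrepareCB) (const char *gid);
typedef void (*MockTxnCB) (const char *gid);

typedef struct MockCallbacks
{
	MockFilterPrepareCB filter_prepare_cb;	/* optional */
	MockTxnCB	prepare_cb;
	MockTxnCB	commit_prepared_cb;
	MockTxnCB	rollback_prepared_cb;
} MockCallbacks;

typedef struct MockContext
{
	MockCallbacks callbacks;
	bool		twophase;
} MockContext;

/* As in the patch: any one of the 2PC callbacks switches two-phase mode on. */
static void
mock_startup(MockContext *ctx)
{
	assert(ctx != NULL);
	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
		(ctx->callbacks.commit_prepared_cb != NULL) ||
		(ctx->callbacks.rollback_prepared_cb != NULL) ||
		(ctx->callbacks.filter_prepare_cb != NULL);
}

/* Returns true when the transaction should NOT be decoded at PREPARE time. */
static bool
mock_filter_prepare(const MockContext *ctx, const char *gid)
{
	/* two-phase decoding disabled: everything waits for COMMIT PREPARED */
	if (!ctx->twophase)
		return true;

	/* the filter is optional: without it, every PREPARE goes through */
	if (ctx->callbacks.filter_prepare_cb == NULL)
		return false;

	return ctx->callbacks.filter_prepare_cb(gid);
}

/* True in the situation where prepare_cb_wrapper would raise an ERROR. */
static bool
mock_missing_prepare_cb(const MockContext *ctx)
{
	return ctx->twophase && ctx->callbacks.prepare_cb == NULL;
}

/* Example callbacks, made up for this sketch. */
static void
mock_noop(const char *gid)
{
	(void) gid;
}

static bool
mock_skip_nodecode(const char *gid)
{
	return strncmp(gid, "nodecode", 8) == 0;
}
```

In this model a plugin that registers only a filter callback still turns two-phase mode on, so the missing prepare callback is reported at the first PREPARE rather than being silently ignored, which seems to be the intent of the "decide here, check later in the wrappers" comment.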

Attachment: v7-0004-Support-two-phase-commits-in-streaming-mode-of-lo.patch (application/octet-stream)
From 16ac52c5e944a8122c589c322a1bcaac035bc4e3 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 9 Oct 2020 02:38:46 -0400
Subject: [PATCH v7] Support two phase commits in streaming mode of logical
 decoding

Add streaming APIs for PREPARE, COMMIT PREPARED and ROLLBACK PREPARED
---
 contrib/test_decoding/test_decoding.c           |  84 ++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml               |  56 +++++++++++-
 src/backend/replication/logical/logical.c       | 111 ++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  41 +++++++--
 src/include/replication/output_plugin.h         |  30 +++++++
 src/include/replication/reorderbuffer.h         |  21 +++++
 6 files changed, 331 insertions(+), 12 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 6b8e502..4508917 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -78,6 +78,15 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_commit_prepared(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_rollback_prepared(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -129,6 +138,9 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
+	cb->stream_commit_prepared_cb = pg_decode_stream_commit_prepared;
+	cb->stream_rollback_prepared_cb = pg_decode_stream_rollback_prepared;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
@@ -805,6 +817,78 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit_prepared(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "commit prepared streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "commit prepared streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_rollback_prepared(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "abort prepared streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "abort prepared streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index f24470a..3769896 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -396,6 +396,9 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
+    LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+    LogicalDecodeStreamRollbackPreparedCB stream_rollback_prepared_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -418,7 +421,9 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
      <function>stream_commit_cb</function> and <function>stream_change_cb</function>
-     are required, while <function>stream_message_cb</function> and
+     are required, while <function>stream_message_cb</function>,
+     <function>stream_prepare_cb</function>, <function>stream_commit_prepared_cb</function>,
+     <function>stream_rollback_prepared_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
    </sect2>
@@ -839,6 +844,45 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit-prepared">
+     <title>Stream Commit Prepared Callback</title>
+     <para>
+      The <function>stream_commit_prepared_cb</function> callback is called to commit
+      a previously streamed and prepared transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-abort-prepared">
+     <title>Stream Rollback Prepared Callback</title>
+     <para>
+      The <function>stream_rollback_prepared_cb</function> callback is called to roll back
+      a previously streamed and prepared transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -1017,9 +1061,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, committed using the
+    <function>stream_commit_prepared_cb</function> callback, or rolled back using the
+    <function>stream_rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8aeb648..6b01e2a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -82,6 +82,12 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
+static void stream_rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									 XLogRecPtr commit_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -233,6 +239,9 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
 		(ctx->callbacks.stream_stop_cb != NULL) ||
 		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.stream_commit_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_rollback_prepared_cb != NULL) ||
 		(ctx->callbacks.stream_commit_cb != NULL) ||
 		(ctx->callbacks.stream_change_cb != NULL) ||
 		(ctx->callbacks.stream_message_cb != NULL) ||
@@ -262,6 +271,9 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
+	ctx->reorder->stream_commit_prepared = stream_commit_prepared_cb_wrapper;
+	ctx->reorder->stream_rollback_prepared = stream_rollback_prepared_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -1232,6 +1244,105 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								  XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit_prepared";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	ctx->callbacks.stream_commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_rollback_prepared";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	ctx->callbacks.stream_rollback_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4da840c..a472d08 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1792,9 +1792,18 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
-
-	ReorderBufferCleanupTXN(rb, txn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -2630,15 +2639,31 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
 	strcpy(txn->gid, gid);
 
-	if (is_commit)
+	if (rbtxn_is_streamed(txn))
 	{
-		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
-		rb->commit_prepared(rb, txn, commit_lsn);
+		if (is_commit)
+		{
+			txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+			rb->stream_commit_prepared(rb, txn, commit_lsn);
+		}
+		else
+		{
+			txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+			rb->stream_rollback_prepared(rb, txn, commit_lsn);
+		}
 	}
 	else
 	{
-		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
-		rb->rollback_prepared(rb, txn, commit_lsn);
+		if (is_commit)
+		{
+			txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+			rb->commit_prepared(rb, txn, commit_lsn);
+		}
+		else
+		{
+			txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+			rb->rollback_prepared(rb, txn, commit_lsn);
+		}
 	}
 
 	/* cleanup: make sure there's no cache pollution */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 96acd01..29d3ffd 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -157,6 +157,33 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit and only when
+ * two-phase commits are supported.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called to commit prepared changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit and only when
+ * two-phase commits are supported.
+ */
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called to abort/rollback prepared changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit and only when
+ * two-phase commits are supported.
+ */
+typedef void (*LogicalDecodeStreamRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -214,6 +241,9 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
+	LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+	LogicalDecodeStreamRollbackPreparedCB stream_rollback_prepared_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e060107..94303a3 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -470,6 +470,24 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* commit prepared streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitPreparedCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* rollback prepared streamed transaction callback signature */
+typedef void (*ReorderBufferStreamRollbackPreparedCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -549,6 +567,9 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
+	ReorderBufferStreamCommitPreparedCB stream_commit_prepared;
+	ReorderBufferStreamRollbackPreparedCB stream_rollback_prepared;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
-- 
1.8.3.1

Attachment: v7-0002-Tap-test-to-test-concurrent-aborts-during-2-phase.patch (application/octet-stream)
From aeee11f7fc41fbc20d2afd4a49f60e7a89c42457 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 9 Oct 2020 01:46:45 -0400
Subject: [PATCH v7] Tap test to test concurrent aborts during 2 phase commits

This test is specifically for testing concurrent abort while logical decode
is ongoing. Pass in the xid of the 2PC to the plugin as an option.
On the receipt of a valid "check-xid-aborted", the change API in the test decoding
plugin will wait for it to be aborted.
---
 contrib/test_decoding/Makefile          |   2 +
 contrib/test_decoding/t/001_twophase.pl | 119 ++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 36b3f9a..3fb7dac 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..51c6a35
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
-- 
1.8.3.1

#55Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#46)
4 attachment(s)

On Wed, Oct 7, 2020 at 9:36 AM Peter Smith <smithpb2250@gmail.com> wrote:

==========
Patch V6-0001, File: doc/src/sgml/logicaldecoding.sgml
==========

COMMENT/QUESTION
Section 48.6.1
@@ -387,6 +387,10 @@ typedef struct OutputPluginCallbacks
LogicalDecodeTruncateCB truncate_cb;
LogicalDecodeCommitCB commit_cb;
LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;

Confused by the mixing of terminologies "abort" and "rollback".
Why is it LogicalDecodeAbortPreparedCB instead of
LogicalDecodeRollbackPreparedCB?
Why is it abort_prepared_cb instead of rollback_prepared_cb?

I thought everything the user sees should be ROLLBACK/rollback (like
the SQL) regardless of what the internal functions might be called.

;

Modified.

COMMENT
Section 48.6.1
The begin_cb, change_cb and commit_cb callbacks are required, while
startup_cb, filter_by_origin_cb, truncate_cb, and shutdown_cb are
optional. If truncate_cb is not set but a TRUNCATE is to be decoded,
the action will be ignored.

The 1st paragraph beneath the typedef does not mention the newly added
callbacks to say if they are required or optional.

Added a new para for this.

;

COMMENT
Section 48.6.4.5
Section 48.6.4.6
Section 48.6.4.7
@@ -578,6 +588,55 @@ typedef void (*LogicalDecodeCommitCB) (struct
LogicalDecodingContext *ctx,
</para>
</sect3>

+ <sect3 id="logicaldecoding-output-plugin-prepare">
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+    <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+<programlisting>

The wording and titles are a bit backwards compared to the others.
e.g. previously was "Transaction Begin" (not "Begin Transaction") and
"Transaction End" (not "End Transaction").

So, for consistency with the existing ones, IMO these new titles (and
wording) should change to:
- "Commit Prepared Transaction Callback" --> "Transaction Commit
Prepared Callback"
- "Rollback Prepared Transaction Callback" --> "Transaction Rollback
Prepared Callback"
- "whenever a commit prepared transaction has been decoded" -->
"whenever a transaction commit prepared has been decoded"
- "whenever a rollback prepared transaction has been decoded." -->
"whenever a transaction rollback prepared has been decoded."

;

Updated to this

==========
Patch V6-0001, File: src/backend/replication/logical/decode.c
==========

COMMENT
Line 74
@@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext
*ctx, XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);

The 2nd line of DecodePrepare is misaligned by one space.

;

COMMENT
Line 321
@@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
}
break;
case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
+ xl_xact_prepare *xlrec;
+ /* check that output plugin is capable of twophase decoding */

"twophase" --> "two-phase"

~

Also, add a blank line after the declarations.

;

==========
Patch V6-0001, File: src/backend/replication/logical/logical.c
==========

COMMENT
Line 249
@@ -225,6 +237,19 @@ StartupDecodingContext(List *output_plugin_options,
(ctx->callbacks.stream_message_cb != NULL) ||
(ctx->callbacks.stream_truncate_cb != NULL);

+ /*
+ * To support two phase logical decoding, we require
prepare/commit-prepare/abort-prepare
+ * callbacks. The filter-prepare callback is optional. We however
enable two phase logical
+ * decoding when at least one of the methods is enabled so that we
can easily identify
+ * missing methods.

The terminology is generally well known as "two-phase" (with the
hyphen) https://en.wikipedia.org/wiki/Two-phase_commit_protocol so
let's be consistent for all the patch code comments. Please search the
code and correct this in all places, even where I might have failed to
identify it.

"two phase" --> "two-phase"

;

COMMENT
Line 822
@@ -782,6 +807,111 @@ commit_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
}

static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)

"support 2 phase" --> "supports two-phase" in the comment

;

COMMENT
Line 844
Code condition seems strange and/or broken.
if (ctx->enable_twophase && ctx->callbacks.prepare_cb == NULL)
Because if the flag is false then this condition is skipped.
But then if the callback is also NULL, attempting to call it to
"do the actual work" will dereference a NULL pointer.

~

Also, I wonder should this check be the first thing in this function?
Because if it fails does it even make sense that all the errcallback
code was set up?
E.g. errcallback.arg potentially is left pointing to a stack variable
on a stack that no longer exists.

Updated accordingly.
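
To make the ordering point above concrete, here is a minimal standalone
sketch in plain C. It is not backend code: DecodingCtx, ErrorContext,
count_prepare and prepare_wrapper are hypothetical stand-ins invented for
illustration, and returning false stands in for raising an ERROR. It only
shows the shape of "validate the callback first, then push the error
context", and makes no claim about what the patch itself ends up doing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the backend structures; not PostgreSQL types. */
typedef struct ErrorContext
{
	struct ErrorContext *previous;
	const char *callback_name;
} ErrorContext;

typedef struct DecodingCtx
{
	bool		twophase;		/* models ctx->enable_twophase */
	void		(*prepare_cb) (struct DecodingCtx *ctx);
	int			prepare_calls;	/* how often the callback actually ran */
} DecodingCtx;

/* Models the global error context stack. */
ErrorContext *error_context_stack = NULL;

/* Example plugin callback: it only counts invocations. */
void
count_prepare(DecodingCtx *ctx)
{
	ctx->prepare_calls++;
}

/*
 * Returns false (standing in for an ERROR) when two-phase decoding is off
 * or the callback is missing. The check comes before the error context is
 * pushed, so a failure never leaves the stack pointing at a local variable.
 */
bool
prepare_wrapper(DecodingCtx *ctx)
{
	ErrorContext errcallback;

	/* Validate first: nothing has been pushed yet. */
	if (!ctx->twophase || ctx->prepare_cb == NULL)
		return false;

	/* Push callback info on the error context stack. */
	errcallback.callback_name = "prepare";
	errcallback.previous = error_context_stack;
	error_context_stack = &errcallback;

	/* Do the actual work. */
	ctx->prepare_cb(ctx);

	/* Pop the error context stack. */
	error_context_stack = errcallback.previous;
	return true;
}
```

With this shape a missing callback is rejected before anything is pushed,
and the flag being false can no longer lead to a call through a NULL
function pointer.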

;

COMMENT
Line 857
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

"support 2 phase" --> "supports two-phase" in the comment

~

Also, Same potential trouble with the condition:
if (ctx->enable_twophase && ctx->callbacks.commit_prepared_cb == NULL)
Same as previously asked. Should this check be first thing in this function?

;

COMMENT
Line 892
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

"support 2 phase" --> "supports two-phase" in the comment

~

Same potential trouble with the condition:
if (ctx->enable_twophase && ctx->callbacks.abort_prepared_cb == NULL)
Same as previously asked. Should this check be the first thing in this function?

;

COMMENT
Line 1013
@@ -858,6 +988,51 @@ truncate_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
error_context_stack = errcallback.previous;
}

+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)

Fix wording in comment:
"twophase" --> "two-phase transactions"
"twophase transactions" --> "two-phase transactions"

Updated accordingly.

==========
Patch V6-0001, File: src/backend/replication/logical/reorderbuffer.c
==========

COMMENT
Line 255
@@ -251,7 +251,8 @@ static Size
ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
static void ReorderBufferRestoreChange(ReorderBuffer *rb,
ReorderBufferTXN *txn,
char *change);
static void ReorderBufferRestoreCleanup(ReorderBuffer *rb,
ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+ bool txn_prepared);

The alignment is inconsistent. One more space needed before "bool txn_prepared"

;

COMMENT
Line 417
@@ -413,6 +414,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
}

/* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }

Should add the blank line before this new code, as it was before.

;

COMMENT
Line 1564
@@ -1502,12 +1561,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
}

/*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them. Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either
after streaming or
+ * after a PREPARE.

typo "snapshots.If" -> "snapshots. If"

;

Updated Accordingly.

COMMENT/QUESTION
Line 1590
@@ -1526,7 +1587,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
Assert(rbtxn_is_known_subxact(subtxn));
Assert(subtxn->nsubtxns == 0);

- ReorderBufferTruncateTXN(rb, subtxn);
+ ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
}

There are some code paths here I did not understand how they match the comments.
Because this function is recursive it seems that it may be called
where the 2nd parameter txn is a sub-transaction.

But then this seems at odds with some of the other code comments of
this function which are processing the txn without ever testing whether
it is really toplevel or not:

e.g. Line 1593 "/* cleanup changes in the toplevel txn */"
e.g. Line 1632 "They are always stored in the toplevel transaction."

;

I see that another commit in between has updated this now.

COMMENT
Line 1644
@@ -1560,9 +1621,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
* about the toplevel xact (we send the XID in all messages), but we never
* stream XIDs of empty subxacts.
*/
- if ((!txn->toptxn) || (txn->nentries_mem != 0))
+ if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
txn->txn_flags |= RBTXN_IS_STREAMED;

+ if (txn_prepared)

/* remove the change from it's containing list */
typo "it's" --> "its"

Updated.

;

QUESTION
Line 1977
@@ -1880,7 +1965,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
ReorderBufferChange *specinsert)
{
/* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn);
+ ReorderBufferTruncateTXN(rb, txn, false);

How do you know the 3rd parameter - i.e. txn_prepared - should be
hardwired false here?
e.g. I thought that maybe rbtxn_prepared(txn) can be true here.

;

This particular function is only called when streaming and not when
handling a prepared transaction.

COMMENT
Line 2345
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
break;
}
}
-
/*

Looks like accidental blank line deletion. This should be put back how it was.

;

COMMENT/QUESTION
Line 2374
@@ -2278,7 +2362,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
}
}
else
- rb->commit(rb, txn, commit_lsn);
+ {
+ /*
+ * Call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).

"twophase" --> "two-phase"

~

Updated.

Also, I was confused by the apparent assumption of exclusiveness of
streaming and 2PC...
e.g. what if streaming AND 2PC then it won't do rb->prepare()

;

QUESTION
Line 2424
@@ -2319,11 +2412,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
*/
if (streaming)
{
- ReorderBufferTruncateTXN(rb, txn);
+ ReorderBufferTruncateTXN(rb, txn, false);

/* Reset the CheckXidAlive */
CheckXidAlive = InvalidTransactionId;
}
+ else if (rbtxn_prepared(txn))

I was confused by the exclusiveness of streaming/2PC.
e.g. what if streaming AND 2PC at same time - how can you pass false
as 3rd param to ReorderBufferTruncateTXN?

ReorderBufferProcessTXN can only be called when streaming individual
commands and not for streaming a prepare or a commit. Streaming of
prepare and commit would be handled as part of
ReorderBufferStreamCommit.

;

COMMENT
Line 2463
@@ -2352,17 +2451,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,

/*
* The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
- * abort of the (sub)transaction we are streaming. We need to do the
+ * abort of the (sub)transaction we are streaming or preparing. We
need to do the
* cleanup and return gracefully on this error, see SetupCheckXidLive.
*/

"twoi phase" --> "two-phase"

;

QUESTIONS
Line 2482
@@ -2370,10 +2470,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
errdata = NULL;
curtxn->concurrent_abort = true;

- /* Reset the TXN so that it is allowed to stream remaining data. */
- ReorderBufferResetTXN(rb, txn, snapshot_now,
- command_id, prev_lsn,
- specinsert);
+ /* If streaming, reset the TXN so that it is allowed to stream
remaining data. */
+ if (streaming)

Re: /* If streaming, reset the TXN so that it is allowed to stream
remaining data. */
I was confused by the exclusiveness of streaming/2PC.
Is it not possible for streaming flags and rbtxn_prepared(txn) true at
the same time?

Same as above.

~

elog(LOG, "stopping decoding of %s (%u)",
txn->gid[0] != '\0'? txn->gid:"", txn->xid);

Is this a safe operation, or do you also need to test txn->gid is not NULL?

This code is only reached when not streaming, in which case
rbtxn_prepared(txn) is true, so gid cannot be NULL.

;
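
As a minimal sketch of the NULL-safe variant the question above hints at: the
helper name gid_or_empty is hypothetical and not part of the patch. It only
illustrates that testing the pointer before reading gid[0] keeps the log call
safe even if the "gid is always set here" assumption were ever violated.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not in the patch): return the GID if it is set
 * and non-empty, otherwise an empty string that is safe to print.
 */
static const char *
gid_or_empty(const char *gid)
{
	/* Test the pointer first; gid[0] is only read when gid != NULL. */
	return (gid != NULL && gid[0] != '\0') ? gid : "";
}
```

Under that assumption the log call would read
elog(LOG, "stopping decoding of %s (%u)", gid_or_empty(txn->gid), txn->xid);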

COMMENT
Line 2606
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,

"twophase" --> "two-phase"

;

QUESTION
Line 2655
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,

"This is used to handle COMMIT/ABORT PREPARED"
Should that say "COMMIT/ROLLBACK PREPARED"?

;

COMMENT
Line 2668

"Anyways, 2PC transactions" --> "Anyway, two-phase transactions"

;

COMMENT
Line 2765
@@ -2495,7 +2731,13 @@ ReorderBufferAbort(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
/* cosmetic... */
txn->final_lsn = lsn;

- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *

Remove the blank between the comment and code.

==========
Patch V6-0001, File: src/include/replication/logical.h
==========

COMMENT
Line 89

"two phase" -> "two-phase"

;

COMMENT
Line 89

For consistency with the previous member naming really the new member
should just be called "twophase" rather than "enable_twophase"

;

Updated accordingly.

==========
Patch V6-0001, File: src/include/replication/output_plugin.h
==========

QUESTION
Line 106

As previously asked, why is the callback function/typedef referred to as
AbortPrepared instead of RollbackPrepared?
It does not match the SQL and the function comment, and seems only to
add some unnecessary confusion.

;

==========
Patch V6-0001, File: src/include/replication/reorderbuffer.h
==========

QUESTION
Line 116
@@ -162,9 +163,13 @@ typedef struct ReorderBufferChange
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT 0x0002
#define RBTXN_IS_SERIALIZED 0x0004
-#define RBTXN_IS_STREAMED 0x0008
-#define RBTXN_HAS_TOAST_INSERT 0x0010
-#define RBTXN_HAS_SPEC_INSERT 0x0020
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_IS_STREAMED 0x0080
+#define RBTXN_HAS_TOAST_INSERT 0x0100
+#define RBTXN_HAS_SPEC_INSERT 0x0200

I was wondering why when adding new flags, some of the existing flag
masks were also altered.
I am assuming this is ok because they are never persisted but are only
used in the protocol (??)

;

COMMENT
Line 226
@@ -218,6 +223,15 @@ typedef struct ReorderBufferChange
((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
)

+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+

Probably all the "txn->txn_flags" here might be more safely written
with parentheses in the macro like "(txn)->txn_flags".

~

Also, start all comments with a capital letter. And what is the meaning of
"in the meanwhile"?

;
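
To illustrate the parenthesization point, here is a small sketch. DemoTXN,
demo_rbtxn_prepared and count_prepared are stand-in names invented for this
example; only the RBTXN_PREPARE value is taken from the quoted hunk. It shows
a macro argument that is an expression rather than a plain identifier.

```c
#include <stdint.h>

/* Flag value as proposed in the quoted hunk. */
#define RBTXN_PREPARE 0x0008

/* Stand-in for ReorderBufferTXN with only the field used here. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
} DemoTXN;

/*
 * Parenthesized argument: works for any pointer expression.  With the
 * unparenthesized form "txn->txn_flags", passing "txns + i" would expand
 * to "txns + i->txn_flags", which does not compile.
 */
#define demo_rbtxn_prepared(txn) \
	(((txn)->txn_flags & RBTXN_PREPARE) != 0)

/* Return how many of the n transactions are flagged as prepared. */
static int
count_prepared(const DemoTXN *txns, int n)
{
	int			count = 0;

	for (int i = 0; i < n; i++)
	{
		if (demo_rbtxn_prepared(txns + i))	/* expression argument */
			count++;
	}
	return count;
}
```

Comparing against zero in the macro also makes the result a clean boolean
rather than the raw masked bit.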

COMMENT
Line 410
@@ -390,6 +407,39 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);

The format is inconsistent with all other callback signatures here,
where the 1st arg was on the same line as the typedef.

;

COMMENT
Line 440-442

Excessive blank lines following this change?

;

COMMENT
Line 638
@@ -571,6 +631,15 @@ void
ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid,
XLog
bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);

+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);

Not aligned consistently with other function prototypes.

;

Updated

==========
Patch V6-0003, File: src/backend/access/transam/twophase.c
==========

COMMENT
Line 551
@@ -548,6 +548,37 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
}

/*
+ * LookupGXact
+ * Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)

There is potential to refactor/simplify this code:
e.g.

bool
LookupGXact(const char *gid)
{
	int			i;
	bool		found = false;

	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
	{
		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
		/* Ignore not-yet-valid GIDs */
		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
		{
			found = true;
			break;
		}
	}
	LWLockRelease(TwoPhaseStateLock);
	return found;
}

;

Updated accordingly.

==========
Patch V6-0003, File: src/backend/replication/logical/proto.c
==========

COMMENT
Line 86
@@ -72,12 +72,17 @@ logicalrep_read_begin(StringInfo in,
LogicalRepBeginData *begin_data)
*/
void
logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
- XLogRecPtr commit_lsn)

Since the flags are now used, the code comment is wrong:
"/* send the flags field (unused for now) */"

;

COMMENT
Line 129
@ -106,6 +115,77 @@ logicalrep_read_commit(StringInfo in,
LogicalRepCommitData *commit_data)
}

/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,

"2PC transactions" --> "two-phase commit transactions"

;

Updated

COMMENT
Line 133

Assert(strlen(txn->gid) > 0);
Shouldn't that assertion also check txn->gid is not NULL (to prevent
NPE in case gid was NULL)

In this case txn->gid has to be non NULL.

;

COMMENT
Line 177
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)

prepare_data->prepare_type = flags;
This code may be OK but it does seem a bit of an abuse of the flags.

e.g. Are they flags or are they really enum values?
e.g. And if they are effectively enums (it appears they are) then
it seems inconsistent that |= was used when they were previously
assigned.

;

I have not updated this as according to Amit this might require
refactoring again.

==========
Patch V6-0003, File: src/backend/replication/logical/worker.c
==========

COMMENT
Line 757
@@ -749,6 +753,141 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
}

+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+ Assert(prepare_data->prepare_lsn == remote_final_lsn);

Missing function comment to say this is called from apply_handle_prepare.

;

COMMENT
Line 798
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)

Missing function comment to say this is called from apply_handle_prepare.

;

COMMENT
Line 824
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)

Missing function comment to say this is called from apply_handle_prepare.

Updated.

==========
Patch V6-0003, File: src/backend/replication/pgoutput/pgoutput.c
==========

COMMENT
Line 50
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
ReorderBufferChange *change);
static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);

The parameter indentation (2nd lines) does not match everything else
in this context.

;

COMMENT
Line 152
@@ -143,6 +149,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->change_cb = pgoutput_change;
cb->truncate_cb = pgoutput_truncate;
cb->commit_cb = pgoutput_commit_txn;
+
+ cb->prepare_cb = pgoutput_prepare_txn;
+ cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+ cb->abort_prepared_cb = pgoutput_abort_prepared_txn;

Remove the unnecessary blank line.

;

QUESTION
Line 386
@@ -373,7 +383,49 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
OutputPluginUpdateProgress(ctx);

OutputPluginPrepareWrite(ctx, true);
- logicalrep_write_commit(ctx->out, txn, commit_lsn);
+ logicalrep_write_commit(ctx->out, txn, commit_lsn, true);

Is the is_commit parameter of logicalrep_write_commit ever passed as false?
If yes, where?
If no, then what is the point of it?

It was dead code from an earlier version. I have removed it, updated
accordingly.

;

COMMENT
Line 408
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,

Since all this function is identical to pg_output_prepare it might be
better to either
1. just leave this as a wrapper to delegate to that function
2. remove this one entirely and assign the callback to the common
pgoutput_prepare_txn

;

I have not changed this as this might require re-factoring according to Amit.

COMMENT
Line 419
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Since all this function is identical to pg_output_prepare it might be
better to either
1. just leave this as a wrapper to delegate to that function
2. remove this one entirely and assign the callback to the common
pgoutput_prepare_tx

;

Same as above.

COMMENT
Line 419
+pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Shouldn't this comment say "ROLLBACK PREPARED"?

;

Updated.

==========
Patch V6-0003, File: src/include/replication/logicalproto.h
==========

QUESTION
Line 101
@@ -87,20 +87,55 @@ typedef struct LogicalRepBeginData
TransactionId xid;
} LogicalRepBeginData;

+/* Commit (and abort) information */

#define LOGICALREP_IS_ABORT 0x02
Is there a good reason why this is not called:
#define LOGICALREP_IS_ROLLBACK 0x02

;

Removed.

COMMENT
Line 105

((flags == LOGICALREP_IS_COMMIT) || (flags == LOGICALREP_IS_ABORT))

Macros would be safer if flags are in parentheses
(((flags) == LOGICALREP_IS_COMMIT) || ((flags) == LOGICALREP_IS_ABORT))

;

COMMENT
Line 115

Unexpected whitespace for the typedef
"} LogicalRepPrepareData;"

;

COMMENT
Line 122
/* prepare can be exactly one of PREPARE, [COMMIT|ABORT] PREPARED*/
#define PrepareFlagsAreValid(flags) \
((flags == LOGICALREP_IS_PREPARE) || \
(flags == LOGICALREP_IS_COMMIT_PREPARED) || \
(flags == LOGICALREP_IS_ROLLBACK_PREPARED))

There is a confusing mixture of ABORT and ROLLBACK terms in the macros and comments.
"[COMMIT|ABORT] PREPARED" --> "[COMMIT|ROLLBACK] PREPARED"

~

Also, it would be safer if flags are in parentheses
(((flags) == LOGICALREP_IS_PREPARE) || \
((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))

;

updated.
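
A small sketch of why the unparenthesized form is risky. The DEMO_* names and
values below are illustrative assumptions, not the definitions from
logicalproto.h. Because == binds tighter than |, an argument such as "a | b"
is silently regrouped inside the unsafe macro.

```c
#include <stdbool.h>

/* Illustrative values only; not copied from logicalproto.h. */
#define DEMO_IS_PREPARE				0x01
#define DEMO_IS_COMMIT_PREPARED		0x02
#define DEMO_IS_ROLLBACK_PREPARED	0x04

/* Unparenthesized argument, as in the version under review. */
#define DEMO_FLAGS_VALID_UNSAFE(flags) \
	((flags == DEMO_IS_PREPARE) || \
	 (flags == DEMO_IS_COMMIT_PREPARED) || \
	 (flags == DEMO_IS_ROLLBACK_PREPARED))

/* Parenthesized argument, as suggested in the review. */
#define DEMO_FLAGS_VALID(flags) \
	(((flags) == DEMO_IS_PREPARE) || \
	 ((flags) == DEMO_IS_COMMIT_PREPARED) || \
	 ((flags) == DEMO_IS_ROLLBACK_PREPARED))

/* First test expands to "a | b == 0x01", which is "a | (b == 0x01)". */
static bool
unsafe_valid(int a, int b)
{
	return DEMO_FLAGS_VALID_UNSAFE(a | b);
}

/* Here the whole expression "a | b" is compared against each value. */
static bool
safe_valid(int a, int b)
{
	return DEMO_FLAGS_VALID(a | b);
}
```

With a = DEMO_IS_PREPARE and b = DEMO_IS_COMMIT_PREPARED the combined value
0x03 is not one of the three allowed types, yet the unsafe macro accepts it
while the parenthesized one rejects it. Compilers typically flag the unsafe
expansion with a parentheses warning.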

==========
Patch V6-0003, File: src/test/subscription/t/020_twophase.pl
==========

COMMENT
Line 131 - # check inserts are visible

Isn't this supposed to be checking for rows 12 and 13, instead of 11 and 12?

;

Updated.

==========
Patch V6-0004, File: contrib/test_decoding/test_decoding.c
==========

COMMENT
Line 81
@@ -78,6 +78,15 @@ static void
pg_decode_stream_stop(LogicalDecodingContext *ctx,
static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static

All these functions have a 3rd parameter called commit_lsn, even
though the functions are not commit related. It seems like a cut/paste
error.

;

COMMENT
Line 142
@@ -130,6 +139,9 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
cb->stream_start_cb = pg_decode_stream_start;
cb->stream_stop_cb = pg_decode_stream_stop;
cb->stream_abort_cb = pg_decode_stream_abort;
+ cb->stream_prepare_cb = pg_decode_stream_prepare;
+ cb->stream_commit_prepared_cb = pg_decode_stream_commit_prepared;
+ cb->stream_abort_prepared_cb = pg_decode_stream_abort_prepared;
cb->stream_commit_cb = pg_decode_stream_commit;

Can the "cb->stream_abort_prepared_cb" be changed to
"cb->stream_rollback_prepared_cb"?

;

COMMENT
Line 827
@@ -812,6 +824,78 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
}

static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_pr

The commit_lsn (3rd parameter) is unused and seems like a cut/paste name error.

;

COMMENT
Line 875
+pg_decode_stream_abort_prepared(LogicalDecodingContext *ctx,

The commit_lsn (3rd parameter) is unused and seems like a cut/paste name error.

;

Updated.

==========
Patch V6-0004, File: doc/src/sgml/logicaldecoding.sgml
==========

COMMENT
48.6.1
@@ -396,6 +396,9 @@ typedef struct OutputPluginCallbacks
LogicalDecodeStreamStartCB stream_start_cb;
LogicalDecodeStreamStopCB stream_stop_cb;
LogicalDecodeStreamAbortCB stream_abort_cb;
+ LogicalDecodeStreamPrepareCB stream_prepare_cb;
+ LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+ LogicalDecodeStreamAbortPreparedCB stream_abort_prepared_cb;

Same question as in previous review comments - why use the
terminology "abort" instead of "rollback"?

;

COMMENT
48.6.1
@@ -418,7 +421,9 @@ typedef void (*LogicalOutputPluginInit) (struct
OutputPluginCallbacks *cb);
in-progress transactions. The <function>stream_start_cb</function>,
<function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
<function>stream_commit_cb</function> and <function>stream_change_cb</function>
- are required, while <function>stream_message_cb</function> and
+ are required, while <function>stream_message_cb</function>,
+ <function>stream_prepare_cb</function>,
<function>stream_commit_prepared_cb</function>,
+ <function>stream_abort_prepared_cb</function>,

Missing "and".
... "stream_abort_prepared_cb, stream_truncate_cb are optional." -->
"stream_abort_prepared_cb, and stream_truncate_cb are optional."

;

COMMENT
Section 48.6.4.16
Section 48.6.4.17
Section 48.6.4.18
@@ -839,6 +844,45 @@ typedef void (*LogicalDecodeStreamAbortCB)
(struct LogicalDecodingContext *ctx,
</para>
</sect3>

+ <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+ <title>Stream Prepare Callback</title>
+ <para>
+ The <function>stream_prepare_cb</function> callback is called to prepare
+ a previously streamed transaction as part of a two phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct
LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-stream-commit-prepared">
+ <title>Stream Commit Prepared Callback</title>
+ <para>
+ The <function>stream_commit_prepared_cb</function> callback is
called to commit prepared
+ a previously streamed transaction as part of a two phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct
LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-stream-abort-prepared">
+ <title>Stream Abort Prepared Callback</title>
+ <para>
+ The <function>stream_abort_prepared_cb</function> callback is called
to abort prepared
+ a previously streamed transaction as part of a two phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamAbortPreparedCB) (struct
LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);
+</programlisting>
+ </para>
+ </sect3>

1. Everywhere it says "two phase" commit, it should consistently be
replaced with "two-phase" commit (with the hyphen)

2. Search for "abort_lsn" parameter. It seems to be overused
(cut/paste error) even when the API is unrelated to abort

3. 48.6.4.17 and 48.6.4.18
Is this wording ok? Is the word "prepared" even necessary here?
- "... called to commit prepared a previously streamed transaction ..."
- "... called to abort prepared a previously streamed transaction ..."

;

Updated accordingly.

COMMENT
Section 48.9
@@ -1017,9 +1061,13 @@ OutputPluginWrite(ctx, true);
When streaming an in-progress transaction, the changes (and messages) are
streamed in blocks demarcated by <function>stream_start_cb</function>
and <function>stream_stop_cb</function> callbacks. Once all the decoded
- changes are transmitted, the transaction is committed using the
- <function>stream_commit_cb</function> callback (or possibly aborted using
- the <function>stream_abort_cb</function> callback).
+ changes are transmitted, the transaction can be committed using the
+ the <function>stream_commit_cb</function> callback

"two phase" --> "two-phase"

~

Also, Missing period on end of sentence.
"or aborted using the stream_abort_prepared_cb" --> "or aborted using
the stream_abort_prepared_cb."

;

Updated accordingly.

==========
Patch V6-0004, File: src/backend/replication/logical/logical.c
==========

COMMENT
Line 84
@@ -81,6 +81,12 @@ static void stream_stop_cb_wrapper(ReorderBuffer
*cache, ReorderBufferTXN *txn,
XLogRecPtr last_lsn);
static void stream_abort_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void stream_commit_prepared_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+static void stream_abort_prepared_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);

The 3rd parameter is always "commit_lsn" even for API unrelated to
commit, so seems like cut/paste error.

;

COMMENT
Line 1246
@@ -1231,6 +1243,105 @@ stream_abort_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn,
}

static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;

Misnamed parameter "commit_lsn" ?

~

Also, Line 1272
There seems to be some integrity checking missing to make sure the
callback is not NULL.
A NULL callback will cause a NULL pointer dereference when the wrapper
attempts to call it.
;

COMMENT
Line 1305
+static void
+stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

There seems to be some integrity checking missing to make sure the
callback is not NULL.
A NULL callback will cause a NULL pointer dereference when the wrapper
attempts to call it.

;

COMMENT
Line 1312
+static void
+stream_abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

Misnamed parameter "commit_lsn" ?

~

Also, Line 1338
There seems to be some integrity checking missing to make sure the
callback is not NULL.
A NULL callback will cause a NULL pointer dereference when the wrapper
attempts to call it.

Updated accordingly.
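
For illustration, the kind of guard being asked for could look like the sketch
below. DemoCallbacks, demo_stream_prepare_wrapper and demo_record_lsn are
hypothetical stand-ins, not the structures or wrappers in logical.c, and the
sketch simply returns false where a real wrapper would more likely raise an
error.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type; the real ones take a decoding context. */
typedef void (*DemoPrepareCB) (void *ctx, unsigned long prepare_lsn);

typedef struct DemoCallbacks
{
	DemoPrepareCB stream_prepare_cb;	/* optional; may be NULL */
} DemoCallbacks;

/*
 * Hypothetical wrapper: refuse to call through a NULL pointer and tell
 * the caller whether the callback was invoked.
 */
static bool
demo_stream_prepare_wrapper(const DemoCallbacks *cb, void *ctx,
							unsigned long prepare_lsn)
{
	if (cb->stream_prepare_cb == NULL)
		return false;

	cb->stream_prepare_cb(ctx, prepare_lsn);
	return true;
}

/* Sample callback that records the LSN it was given. */
static void
demo_record_lsn(void *ctx, unsigned long prepare_lsn)
{
	*(unsigned long *) ctx = prepare_lsn;
}
```

Checking in the wrapper keeps the decision in one place, so individual call
sites do not need to know which callbacks a plugin left unset.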

==========
Patch V6-0004, File: src/backend/replication/logical/reorderbuffer.c
==========

COMMENT
Line 2684
@@ -2672,15 +2681,31 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb,
TransactionId xid,
txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
strcpy(txn->gid, gid);

- if (is_commit)
+ if (rbtxn_is_streamed(txn))
{
- txn->txn_flags |= RBTXN_COMMIT_PREPARED;
- rb->commit_prepared(rb, txn, commit_lsn);
+ if (is_commit)
+ {
+ txn->txn_flags |= RBTXN_COMMIT_PREPARED;

The setting/checking of the flags could be refactored if you wanted to
write less code:
e.g.
if (is_commit)
	txn->txn_flags |= RBTXN_COMMIT_PREPARED;
else
	txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;

if (rbtxn_is_streamed(txn) && rbtxn_commit_prepared(txn))
	rb->stream_commit_prepared(rb, txn, commit_lsn);
else if (rbtxn_is_streamed(txn) && rbtxn_rollback_prepared(txn))
	rb->stream_abort_prepared(rb, txn, commit_lsn);
else if (rbtxn_commit_prepared(txn))
	rb->commit_prepared(rb, txn, commit_lsn);
else if (rbtxn_rollback_prepared(txn))
	rb->abort_prepared(rb, txn, commit_lsn);

;

Updated accordingly.
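
The shape of that refactoring can be checked in isolation with a sketch like
the following. The DEMO_* names and the DemoAction enum are invented for the
example; the flag values mirror the ones quoted from reorderbuffer.h earlier
in this mail, and the returned enum stands in for the callback that would be
invoked.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values mirroring the quoted reorderbuffer.h hunk. */
#define DEMO_RBTXN_COMMIT_PREPARED		0x0010
#define DEMO_RBTXN_ROLLBACK_PREPARED	0x0020
#define DEMO_RBTXN_IS_STREAMED			0x0080

/* Stand-in for "which callback would be invoked". */
typedef enum DemoAction
{
	DEMO_CALL_COMMIT_PREPARED,
	DEMO_CALL_ABORT_PREPARED,
	DEMO_CALL_STREAM_COMMIT_PREPARED,
	DEMO_CALL_STREAM_ABORT_PREPARED
} DemoAction;

/* Set the outcome flag once, then pick the callback from the flags. */
static DemoAction
demo_finish_prepared(uint32_t *txn_flags, bool is_commit)
{
	bool		streamed;

	*txn_flags |= is_commit ? DEMO_RBTXN_COMMIT_PREPARED
		: DEMO_RBTXN_ROLLBACK_PREPARED;
	streamed = (*txn_flags & DEMO_RBTXN_IS_STREAMED) != 0;

	if (streamed)
		return is_commit ? DEMO_CALL_STREAM_COMMIT_PREPARED
			: DEMO_CALL_STREAM_ABORT_PREPARED;
	return is_commit ? DEMO_CALL_COMMIT_PREPARED
		: DEMO_CALL_ABORT_PREPARED;
}
```

Every combination of streamed/not streamed and commit/rollback maps to exactly
one callback, which is the property the suggested if/else chain relies on.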

==========
Patch V6-0004, File: src/include/replication/output_plugin.h
==========

COMMENT
Line 171
@@ -157,6 +157,33 @@ typedef void (*LogicalDecodeStreamAbortCB)
(struct LogicalDecodingContext *ctx,
XLogRecPtr abort_lsn);

/*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit and only when
+ * two-phased commits are supported
+ */

1. Missing period in all these comments.

2. Is the part that says "and only when two-phased commits are
supported" necessary? It seems redundant since the comment already
says it is called as part of a two-phase commit.

;

==========
Patch V6-0004, File: src/include/replication/reorderbuffer.h
==========

COMMENT
Line 467
@@ -466,6 +466,24 @@ typedef void (*ReorderBufferStreamAbortCB) (
ReorderBufferTXN *txn,
XLogRecPtr abort_lsn);

+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamAbortPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);

Cut/paste error - repeated same comment 3 times?

Updated Accordingly.

[END]

I believe I have addressed all of Peter's comments. Peter, do have a
look and let me know if I missed anything or if you find anything
else. Thanks for your comments, much appreciated.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v8-0001-Support-decoding-of-two-phase-transactions.patch (application/octet-stream)
From 4b19f77075bcccb526ac7aa622b2c37bbcd0f702 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 14 Oct 2020 01:24:51 -0400
Subject: [PATCH v8] Support decoding of two-phase transactions

Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes documentation changes.
---
 contrib/test_decoding/Makefile                  |   2 +-
 contrib/test_decoding/expected/two_phase.out    | 219 ++++++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql         | 117 +++++++++++
 contrib/test_decoding/test_decoding.c           | 159 +++++++++++++++
 doc/src/sgml/logicaldecoding.sgml               | 117 ++++++++++-
 src/backend/replication/logical/decode.c        | 250 +++++++++++++++++++++--
 src/backend/replication/logical/logical.c       | 184 +++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c | 258 ++++++++++++++++++++----
 src/include/replication/logical.h               |   5 +
 src/include/replication/output_plugin.h         |  37 ++++
 src/include/replication/reorderbuffer.h         |  74 +++++++
 11 files changed, 1372 insertions(+), 50 deletions(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..f0e4dbd 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,7 +4,7 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..3ac01a4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,219 @@
+-- Test two-phased transactions, when two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time. 
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- 
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- test abort of a prepared xact
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- test prepared xact containing ddl
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints and sub-xacts as a result
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..e3e2690
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,117 @@
+-- Test two-phased transactions, when two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time. 
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- 
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test abort of a prepared xact
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+
+-- test prepared xact containing ddl
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints and sub-xacts as a result
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
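The "_nodecode" test above depends on the plugin's prepare filter. As a rough standalone illustration (not the patch's code, and using no PostgreSQL headers), the sketch below models that decision: a GID containing "_nodecode" is skipped at PREPARE, so its changes only appear with the later COMMIT PREPARED, which is also what happens when two-phase decoding is not enabled. The names gid_is_filtered, DecodePoint and decode_point_for_gid are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the plugin's filter: true means "skip at PREPARE". */
static bool
gid_is_filtered(const char *gid)
{
	assert(gid != NULL);
	return strstr(gid, "_nodecode") != NULL;
}

/* Where the changes of a prepared transaction show up in the decoded output. */
typedef enum
{
	DECODE_AT_PREPARE,
	DECODE_AT_COMMIT_PREPARED
} DecodePoint;

/*
 * Assumed model of the behaviour exercised by the tests: without two-phase
 * decoding, or when the filter rejects the GID, the transaction is decoded
 * like a regular one once COMMIT PREPARED is seen.
 */
static DecodePoint
decode_point_for_gid(bool twophase_enabled, const char *gid)
{
	if (!twophase_enabled || gid_is_filtered(gid))
		return DECODE_AT_COMMIT_PREPARED;
	return DECODE_AT_PREPARE;
}
```

Because the answer depends only on the GID string, asking again at COMMIT PREPARED time gives the same result as at PREPARE time.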
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e60ab34..6b8e502 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid_aborted; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -88,6 +93,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -116,6 +133,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +148,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_2pc = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +158,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +250,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_2pc))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)
+					strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+								strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +290,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_2pc;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +350,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple approach based on the GID: if the
+ * GID contains the "_nodecode" substring, the transaction
+ * is filtered out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +595,25 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/*
+	 * A valid check_xid_aborted was passed in as an option: wait until that
+	 * transaction is no longer in progress, to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			   !TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
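The check-xid-aborted option above converts its argument with strtoul() and a NULL end pointer, so trailing garbage in the value goes unnoticed. A stricter conversion could look like the following standalone sketch. It is only an illustration under stated assumptions: parse_xid_option is a hypothetical helper, uint32_t stands in for TransactionId, and 0 stands in for InvalidTransactionId.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the patch): parse a transaction id from
 * an option string. Returns true and stores the value on success.
 */
static bool
parse_xid_option(const char *str, uint32_t *xid)
{
	char	   *end;
	unsigned long val;

	assert(str != NULL && xid != NULL);

	errno = 0;
	val = strtoul(str, &end, 0);

	/* reject range errors, input without digits, and trailing garbage */
	if (errno != 0 || end == str || *end != '\0')
		return false;

	/* must be nonzero (0 models InvalidTransactionId) and fit in 32 bits */
	if (val == 0 || val > UINT32_MAX)
		return false;

	*xid = (uint32_t) val;
	return true;
}
```

Base 0 is kept from the patch, so hexadecimal input such as "0x10" is still accepted.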
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..4ad5dca 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,6 +387,10 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
@@ -417,6 +421,13 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +488,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. The transaction currently being decoded may be
+     aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +595,55 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The optional <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The optional <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The optional <function>rollback_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +653,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or an uncommitted transaction, this
+      change callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +736,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decoding
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +790,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or an uncommitted transaction, this message
+      callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..e011fd9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -68,8 +68,15 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
+static void DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -239,7 +246,6 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	switch (info)
 	{
 		case XLOG_XACT_COMMIT:
-		case XLOG_XACT_COMMIT_PREPARED:
 			{
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
@@ -256,8 +262,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeCommit(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_COMMIT_PREPARED:
+			{
+				xl_xact_commit *xlrec;
+				xl_xact_parsed_commit parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_commit *) XLogRecGetData(r);
+				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeCommitPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ABORT:
-		case XLOG_XACT_ABORT_PREPARED:
 			{
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
@@ -274,6 +296,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_ABORT_PREPARED:
+			{
+				xl_xact_abort *xlrec;
+				xl_xact_parsed_abort parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_abort *) XLogRecGetData(r);
+				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeAbortPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ASSIGNMENT:
 
 			/*
@@ -312,17 +351,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -659,6 +716,131 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Handle a COMMIT PREPARED record. This mirrors DecodeCommit(), except that
+ * the changes may already have been replayed at PREPARE time.
+ */
+static void
+DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+					 xl_xact_parsed_commit *parsed, TransactionId xid)
+{
+	XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
+					   parsed->nsubxacts, parsed->subxacts);
+
+	/* ----
+	 * Check whether we are interested in this specific transaction, and tell
+	 * the reorderbuffer to forget the content of the (sub-)transactions
+	 * if not.
+	 *
+	 * There can be several reasons we might not be interested in this
+	 * transaction:
+	 * 1) We might not be interested in decoding transactions up to this
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
+	 * 2) The transaction happened in another database.
+	 * 3) The output plugin is not interested in the origin.
+	 * 4) We are doing fast-forwarding
+	 *
+	 * We can't just use ReorderBufferAbort() here, because we need to execute
+	 * the transaction's invalidations.  This currently won't be needed if
+	 * we're just skipping over the transaction because currently we only do
+	 * so during startup, to get to the first transaction the client needs. As
+	 * we have reset the catalog caches before starting to read WAL, and we
+	 * haven't yet touched any catalogs, there can't be anything to invalidate.
+	 * But if we're "forgetting" this commit because it happened in
+	 * another database, the invalidations might be important, because they
+	 * could be for shared catalogs and we might have loaded data into the
+	 * relevant syscaches.
+	 * ---
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+		}
+		ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+		return;
+	}
+
+	/* tell the reorderbuffer about the surviving subtransactions */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/*
+	 * For COMMIT PREPARED, the changes have already been replayed at
+	 * PREPARE time, so we only need to notify the subscriber that the GID
+	 * finally committed. If a prepare filter is registered and it skipped
+	 * this transaction at PREPARE time, replay it now as a regular commit.
+	 */
+	if (ctx->callbacks.filter_prepare_cb &&
+			ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+	else
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr  origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		 ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/* replay actions of all transaction + subtransactions in order */
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+}
+
+/*
  * Get the data from the various forms of abort records and pass it on to
  * snapbuild.c and reorderbuffer.c
  */
@@ -681,6 +863,50 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Get the data from the various forms of abort records and pass it on to
+ * snapbuild.c and reorderbuffer.c
+ */
+static void
+DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			xl_xact_parsed_abort *parsed, TransactionId xid)
+{
+	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId	origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it passes through the filters, handle the ROLLBACK via callbacks.
+	 */
+	if (!FilterByOrigin(ctx, origin_id) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		!ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		Assert(TransactionIdIsValid(xid));
+		Assert(parsed->dbId == ctx->slot->data.database);
+
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
+
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+						   buf->record->EndRecPtr);
+	}
+
+	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+}
+
+/*
  * Parse XLOG_HEAP_INSERT (not MULTI_INSERT!) records into tuplebufs.
  *
  * Deletes can contain the new tuple.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8675832..1c0f70d 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +215,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -227,6 +239,19 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require the prepare,
+	 * commit-prepared and rollback-prepared callbacks; filter-prepare is
+	 * optional. However, we enable two-phase decoding as soon as any one of
+	 * them is set, so that missing callbacks are easy to identify.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -783,6 +808,120 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then prepare callback is mandatory */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then commit prepared callback is mandatory */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of rollback record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then rollback prepared callback is mandatory */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +998,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase at PREPARE time is not enabled. In that
+	 * case all two-phase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4cb27f2..8fc1301 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -418,6 +419,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1506,12 +1513,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1530,7 +1539,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1564,9 +1573,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, clean up the tuplecids we stored for decoding
+		 * catalog snapshot access.
+		 * They are always stored in the toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* remove the change from its containing list */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1891,7 +1924,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1998,7 +2031,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2289,7 +2322,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2330,11 +2372,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, false);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (rbtxn_prepared(txn))
+		{
+			ReorderBufferTruncateTXN(rb, txn, true);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2363,17 +2411,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can only occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we are
+			 * sending the data out on a PREPARE during a two-phase commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2381,10 +2430,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2406,23 +2464,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2464,6 +2515,140 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/*
+	 * Always call the prepare filter. It's the job of the prepare filter to
+	 * give us the *same* response for a given xid across multiple calls
+	 * (including ones on restart).
+	 */
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, a two-phase transaction has no buffered changes left at this
+	 * point, so allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+		rb->commit_prepared(rb, txn, commit_lsn);
+	}
+	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+		rb->rollback_prepared(rb, txn, commit_lsn);
+	}
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(rb, txn);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2506,7 +2691,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..a191e82 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -84,6 +84,11 @@ typedef struct LogicalDecodingContext
 	 */
 	bool		streaming;
 
+	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..96acd01 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -171,6 +204,10 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0cc3aeb..ca823e3 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -166,6 +167,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -225,6 +229,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -236,6 +258,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of 2PC we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -397,6 +422,36 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+                                     ReorderBuffer *rb,
+                                     ReorderBufferTXN *txn,
+                                     XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             TransactionId xid,
+                                             const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+                                       ReorderBuffer *rb,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+                                              ReorderBuffer *rb,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -489,6 +544,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -566,6 +626,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -589,6 +654,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool 		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool 		ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void 		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
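
Aside on the callback table in the patch above: a rough standalone model (not PostgreSQL code) of how an optional filter callback can sit next to a required prepare callback. All Demo* names are hypothetical. The rule that a true return from the filter means "skip decoding at PREPARE time" is an assumption on my part, inferred from the name ReorderBufferPrepareNeedSkip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for TransactionId; not the PostgreSQL type. */
typedef unsigned int DemoXid;

/* Simplified callback signatures, modelled loosely on the typedefs above. */
typedef bool (*DemoFilterPrepareCB) (DemoXid xid, const char *gid);
typedef void (*DemoPrepareCB) (DemoXid xid, const char *gid);

typedef struct DemoCallbacks
{
	DemoFilterPrepareCB filter_prepare;	/* optional, may be NULL */
	DemoPrepareCB prepare;				/* required */
} DemoCallbacks;

/* Counts how often the prepare callback ran, for demonstration only. */
static int	demo_prepare_calls = 0;

static void
demo_count_prepare(DemoXid xid, const char *gid)
{
	(void) xid;
	(void) gid;
	demo_prepare_calls++;
}

/* Example filter: skip any GID that starts with "skip_". */
static bool
demo_skip_gid_prefix(DemoXid xid, const char *gid)
{
	(void) xid;
	return strncmp(gid, "skip_", 5) == 0;
}

/*
 * Returns true if the prepare callback was invoked. A missing filter
 * means "never skip" (assumption, see the note above).
 */
static bool
demo_decode_prepare(const DemoCallbacks *cb, DemoXid xid, const char *gid)
{
	assert(cb != NULL && cb->prepare != NULL);

	if (cb->filter_prepare != NULL && cb->filter_prepare(xid, gid))
		return false;

	cb->prepare(xid, gid);
	return true;
}
```

With filter_prepare left NULL every prepared transaction reaches the prepare callback. Installing demo_skip_gid_prefix makes a GID such as "skip_me" bypass it.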

v8-0002-Tap-test-to-test-concurrent-aborts-during-2-phase.patch
From 51b58178a02261f787790f93b0ce6e76c857f349 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 14 Oct 2020 01:26:00 -0400
Subject: [PATCH v8] Tap test to test concurrent aborts during 2 phase commits

This test is specifically for testing concurrent abort while logical decode
is ongoing. Pass in the xid of the 2PC to the plugin as an option.
On the receipt of a valid "check-xid-aborted", the change API in the test decoding
plugin will wait for it to be aborted.
---
 contrib/test_decoding/Makefile          |   2 +
 contrib/test_decoding/t/001_twophase.pl | 119 ++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f0e4dbd..2abf3ce 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..51c6a35
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
-- 
1.8.3.1
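
For readers who do not read Perl: the bounded polling done by poll_output_until() in the test above can be modelled in C roughly as below. This is only a sketch of the same loop shape (check, sleep, count the attempt, give up after a fixed number of tries). The predicate type and all demo_* / Demo* names are hypothetical, and nothing here reads a server log.

```c
#define _POSIX_C_SOURCE 200809L	/* for nanosleep() */

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical predicate: returns true once the awaited condition holds. */
typedef bool (*DemoPredicate) (void *arg);

/*
 * Call pred up to max_attempts times, sleeping interval_ms between
 * attempts. Returns true as soon as pred succeeds, false on timeout.
 */
static bool
demo_poll_until(DemoPredicate pred, void *arg, int max_attempts, long interval_ms)
{
	assert(pred != NULL && max_attempts >= 0 && interval_ms >= 0);

	for (int attempt = 0; attempt < max_attempts; attempt++)
	{
		if (pred(arg))
			return true;

		if (interval_ms > 0)
		{
			struct timespec ts;

			ts.tv_sec = interval_ms / 1000;
			ts.tv_nsec = (interval_ms % 1000) * 1000000L;
			nanosleep(&ts, NULL);
		}
	}

	return false;
}

/* Example predicate state: becomes ready on the ready_after-th call. */
typedef struct DemoCountdown
{
	int			calls;
	int			ready_after;
} DemoCountdown;

static bool
demo_countdown_ready(void *arg)
{
	DemoCountdown *c = arg;

	c->calls++;
	return c->calls >= c->ready_after;
}
```

The test's settings correspond to max_attempts = 180 * 10 and interval_ms = 100, i.e. about 180 seconds before giving up.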

v8-0004-Support-two-phase-commits-in-streaming-mode-in-lo.patch
From cd7c441e0c07e3a1f8428fed3111ef3a5aa50f75 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 14 Oct 2020 07:48:50 -0400
Subject: [PATCH v8] Support two phase commits in streaming mode in logical
 decoding

Add streaming callbacks for PREPARE, COMMIT PREPARED and ROLLBACK PREPARED.
---
 contrib/test_decoding/test_decoding.c           |  84 +++++++++++++++
 doc/src/sgml/logicaldecoding.sgml               |  62 +++++++++--
 src/backend/replication/logical/logical.c       | 132 +++++++++++++++++++++++-
 src/backend/replication/logical/reorderbuffer.c |  29 ++++--
 src/include/replication/output_plugin.h         |  27 +++++
 src/include/replication/reorderbuffer.h         |  21 ++++
 6 files changed, 339 insertions(+), 16 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 6b8e502..9f0cf10 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -78,6 +78,15 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr prepare_lsn);
+static void pg_decode_stream_commit_prepared(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_rollback_prepared(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr rollback_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -129,6 +138,9 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
+	cb->stream_commit_prepared_cb = pg_decode_stream_commit_prepared;
+	cb->stream_rollback_prepared_cb = pg_decode_stream_rollback_prepared;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
@@ -805,6 +817,78 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit_prepared(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "commit prepared streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "commit prepared streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_rollback_prepared(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr rollback_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "abort prepared streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "abort prepared streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 4ad5dca..824c55a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -396,6 +396,9 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
+    LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+    LogicalDecodeStreamRollbackPreparedCB stream_rollback_prepared_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -418,14 +421,16 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
      <function>stream_commit_cb</function> and <function>stream_change_cb</function>
-     are required, while <function>stream_message_cb</function> and
+     are required, while <function>stream_message_cb</function>,
+     <function>stream_prepare_cb</function>, <function>stream_commit_prepared_cb</function>,
+     <function>stream_rollback_prepared_cb</function> and 
      <function>stream_truncate_cb</function> are optional.
     </para>
 
     <para>
     An output plugin may also define functions to support two-phase commits, which are
     decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
-    <function>commit_prepared_cb</function> and <function>abort_prepared_cb</function>
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
     callbacks are required, while <function>filter_prepare_cb</function> is optional.
     </para>
    </sect2>
@@ -638,8 +643,8 @@ typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ct
       callback.
 <programlisting>
 typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
-                                              ReorderBufferTXN *txn,
-                                              XLogRecPtr abort_lsn);
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr abort_lsn);
 </programlisting>
      </para>
     </sect3>
@@ -846,6 +851,45 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit-prepared">
+     <title>Stream Commit Prepared Callback</title>
+     <para>
+      The <function>stream_commit_prepared_cb</function> callback is called to commit
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                     ReorderBufferTXN *txn,
+                                                     XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-rollback-prepared">
+     <title>Stream Rollback Prepared Callback</title>
+     <para>
+      The <function>stream_rollback_prepared_cb</function> callback is called to abort
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -1024,9 +1068,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, committed using the
+    <function>stream_commit_prepared_cb</function> callback, or rolled back using the
+    <function>stream_rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 1c0f70d..84a4751 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -82,6 +82,12 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
+static void stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+											  XLogRecPtr commit_lsn);
+static void stream_rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+												XLogRecPtr rollback_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -233,6 +239,9 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
 		(ctx->callbacks.stream_stop_cb != NULL) ||
 		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.stream_commit_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_rollback_prepared_cb != NULL) ||
 		(ctx->callbacks.stream_commit_cb != NULL) ||
 		(ctx->callbacks.stream_change_cb != NULL) ||
 		(ctx->callbacks.stream_message_cb != NULL) ||
@@ -262,6 +271,9 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
+	ctx->reorder->stream_commit_prepared = stream_commit_prepared_cb_wrapper;
+	ctx->reorder->stream_rollback_prepared = stream_rollback_prepared_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -885,7 +897,7 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						  XLogRecPtr abort_lsn)
+							 XLogRecPtr abort_lsn)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -1241,6 +1253,124 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming and two-phase commits are supported. */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								  XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit_prepared";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_commit_prepared_cb is required */
+	if (ctx->callbacks.stream_commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_prepared_cb callback")));
+
+	ctx->callbacks.stream_commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr rollback_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_rollback_prepared";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_rollback_prepared_cb is required */
+	if (ctx->callbacks.stream_rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_rollback_prepared_cb callback")));
+
+	ctx->callbacks.stream_rollback_prepared_cb(ctx, txn, rollback_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8fc1301..d569d46 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1793,9 +1793,18 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
-
-	ReorderBufferCleanupTXN(rb, txn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -2633,15 +2642,19 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	strcpy(txn->gid, gid);
 
 	if (is_commit)
-	{
 		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
-		rb->commit_prepared(rb, txn, commit_lsn);
-	}
 	else
-	{
 		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_is_streamed(txn) && rbtxn_commit_prepared(txn))
+		rb->stream_commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_is_streamed(txn) && rbtxn_rollback_prepared(txn))
+		rb->stream_rollback_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
 		rb->rollback_prepared(rb, txn, commit_lsn);
-	}
+
 
 	/* cleanup: make sure there's no cache pollution */
 	ReorderBufferExecuteInvalidations(rb, txn);
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 96acd01..dfcb577 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -157,6 +157,30 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr prepare_lsn);
+
+/*
+ * Called to commit prepared changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called to abort/rollback prepared changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr rollback_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -214,6 +238,9 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
+	LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+	LogicalDecodeStreamRollbackPreparedCB stream_rollback_prepared_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ca823e3..84e4840 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -478,6 +478,24 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr prepare_lsn);
+
+/* commit prepared streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitPreparedCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* rollback prepared streamed transaction callback signature */
+typedef void (*ReorderBufferStreamRollbackPreparedCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr rollback_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -557,6 +575,9 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
+	ReorderBufferStreamCommitPreparedCB stream_commit_prepared;
+	ReorderBufferStreamRollbackPreparedCB stream_rollback_prepared;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
-- 
1.8.3.1
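
To make the new branching in ReorderBufferFinishPrepared() easier to follow, here is a small standalone model of the decision it encodes: which of the four callbacks runs for a given combination of "was streamed" and "is commit". It is a sketch only. The enum and function names are hypothetical; the strings returned match the ReorderBuffer field names used in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the four callbacks the patch chooses between. */
typedef enum DemoFinishCallback
{
	DEMO_STREAM_COMMIT_PREPARED,
	DEMO_STREAM_ROLLBACK_PREPARED,
	DEMO_COMMIT_PREPARED,
	DEMO_ROLLBACK_PREPARED
} DemoFinishCallback;

/*
 * Mirror of the if/else chain in the patch: streamed transactions use the
 * stream_* variants, everything else uses the plain two-phase callbacks.
 */
static DemoFinishCallback
demo_pick_finish_callback(bool is_streamed, bool is_commit)
{
	if (is_streamed)
		return is_commit ? DEMO_STREAM_COMMIT_PREPARED
			: DEMO_STREAM_ROLLBACK_PREPARED;

	return is_commit ? DEMO_COMMIT_PREPARED : DEMO_ROLLBACK_PREPARED;
}

/* Map a label to the ReorderBuffer field name used in the patch. */
static const char *
demo_finish_callback_name(DemoFinishCallback which)
{
	switch (which)
	{
		case DEMO_STREAM_COMMIT_PREPARED:
			return "stream_commit_prepared";
		case DEMO_STREAM_ROLLBACK_PREPARED:
			return "stream_rollback_prepared";
		case DEMO_COMMIT_PREPARED:
			return "commit_prepared";
		case DEMO_ROLLBACK_PREPARED:
			return "rollback_prepared";
	}

	assert(false);				/* unreachable for valid enum values */
	return "";
}
```

Since the two flags are set just before the chain, exactly one of the four branches is taken for any prepared transaction that reaches this point.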

v8-0003-pgoutput-output-plugin-support-for-logical-decodi.patch
From de50d2567bb7c8e593224875d2ed4b7f648e0b46 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 14 Oct 2020 06:04:59 -0400
Subject: [PATCH v8] pgoutput output plugin support for logical decoding of 2pc

Support decoding of two phase commit in pgoutput and on subscriber side.
---
 src/backend/access/transam/twophase.c       |  27 +++++
 src/backend/replication/logical/proto.c     |  74 ++++++++++++-
 src/backend/replication/logical/worker.c    | 153 +++++++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c |  51 +++++++++
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  27 +++++
 src/test/subscription/t/020_twophase.pl     | 163 ++++++++++++++++++++++++++++
 7 files changed, 494 insertions(+), 2 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..36ba21f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..1bd3987 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -99,6 +99,7 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	if (flags != 0)
 		elog(ERROR, "unrecognized flags %u in commit message", flags);
 
+
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
 	commit_data->end_lsn = pq_getmsgint64(in);
@@ -106,6 +107,77 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a non-empty GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(strlen(txn->gid) > 0);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags |= LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags |= LOGICALREP_IS_PREPARE;
+
+	/* Make sure exactly one of the expected flags is set. */
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (bounded copy into the fixed-size, pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 8d5d9e0..b417b16 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -722,6 +722,7 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_timestamp = commit_data.committime;
 
 		CommitTransactionCommand();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -742,6 +743,152 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * A prepared transaction may be aborted concurrently while it is being
+	 * decoded. In that case decoding stops and the transaction is aborted
+	 * immediately, but the ROLLBACK PREPARED message still reaches the
+	 * subscriber. It is therefore OK for the GID to be missing here; there
+	 * is then nothing to roll back.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1913,10 +2060,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..4078cab 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -143,6 +149,9 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -378,6 +387,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0c2cda2..2ca6d74 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -87,6 +87,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -94,6 +95,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -101,6 +124,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 			 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData * prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..3feb2c3
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+        'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+   is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+   is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1
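
For readers following the protocol change in the patch above, here is a rough standalone sketch of the 'P' message layout that logicalrep_write_prepare() and logicalrep_read_prepare() agree on: a type byte, a flags byte, three 64-bit fields in network byte order, then the NUL-terminated GID. This is not the patch's code. PrepareMsg, encode_prepare(), decode_prepare() and the plain byte buffer are hypothetical stand-ins for LogicalRepPrepareData and the StringInfo/pq_* routines, chosen so the sketch compiles outside the PostgreSQL tree.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GIDSIZE 200                 /* same value as PostgreSQL's GIDSIZE */

/* Message types, mirroring the LOGICALREP_IS_* constants in the patch. */
#define IS_PREPARE            0x01
#define IS_COMMIT_PREPARED    0x02
#define IS_ROLLBACK_PREPARED  0x04

#define HEADER_LEN (1 + 1 + 3 * 8)  /* 'P', flags, three 64-bit fields */

/* Hypothetical stand-in for LogicalRepPrepareData. */
typedef struct PrepareMsg
{
    uint8_t  type;
    uint64_t prepare_lsn;
    uint64_t end_lsn;
    int64_t  prepare_time;
    char     gid[GIDSIZE];
} PrepareMsg;

/* Exactly one of the three types must be set, as in PrepareFlagsAreValid. */
static int
flags_valid(uint8_t f)
{
    return f == IS_PREPARE || f == IS_COMMIT_PREPARED ||
           f == IS_ROLLBACK_PREPARED;
}

/* Write a 64-bit value in network (big-endian) byte order. */
static void
put_u64(uint8_t *buf, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        buf[i] = (uint8_t) (v >> (56 - 8 * i));
}

/* Read a 64-bit value in network (big-endian) byte order. */
static uint64_t
get_u64(const uint8_t *buf)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | buf[i];
    return v;
}

/* Returns the number of bytes written, or 0 if the message is invalid. */
static size_t
encode_prepare(uint8_t *buf, size_t cap, const PrepareMsg *m)
{
    size_t gidlen = strlen(m->gid) + 1;     /* include the NUL */

    if (!flags_valid(m->type) || gidlen < 2 || HEADER_LEN + gidlen > cap)
        return 0;

    buf[0] = 'P';
    buf[1] = m->type;
    put_u64(buf + 2, m->prepare_lsn);
    put_u64(buf + 10, m->end_lsn);
    put_u64(buf + 18, (uint64_t) m->prepare_time);
    memcpy(buf + HEADER_LEN, m->gid, gidlen);
    return HEADER_LEN + gidlen;
}

/* Returns 0 on success, -1 if the message is malformed. */
static int
decode_prepare(const uint8_t *buf, size_t len, PrepareMsg *m)
{
    const uint8_t *gid;
    const void *nul;
    size_t      gidlen;

    if (len < HEADER_LEN + 2 || buf[0] != 'P' || !flags_valid(buf[1]))
        return -1;

    /* The GID must be terminated inside the message and fit the buffer. */
    gid = buf + HEADER_LEN;
    nul = memchr(gid, '\0', len - HEADER_LEN);
    if (nul == NULL)
        return -1;
    gidlen = (size_t) ((const uint8_t *) nul - gid);
    if (gidlen == 0 || gidlen >= GIDSIZE)
        return -1;

    m->type = buf[1];
    m->prepare_lsn = get_u64(buf + 2);
    m->end_lsn = get_u64(buf + 10);
    m->prepare_time = (int64_t) get_u64(buf + 18);
    memcpy(m->gid, gid, gidlen + 1);
    return 0;
}
```

The decoder checks the GID length before copying, which is the same concern that applies to copying the GID into the fixed-size gid[GIDSIZE] field on the subscriber side.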

#56Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#55)

On Wed, Oct 14, 2020 at 6:15 PM Ajin Cherian <itsajin@gmail.com> wrote:

I think it will be easier to review this work if we can split the
patches according to the changes made in different layers. The first
patch could be changes made in output plugin and the corresponding
changes in test_decoding, see the similar commit of in-progress
transactions [1]. So you need to move corresponding changes from
v8-0001-Support-decoding-of-two-phase-transactions and
v8-0004-Support-two-phase-commits-in-streaming-mode-in-lo for this.
The second patch could be changes made in ReorderBuffer to support
this feature, see [2]. The third patch could be changes made to
support pgoutput and subscriber-side stuff, see [3]. What do you
think?

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=45fdc9738b36d1068d3ad8fdb06436d6fd14436b
[2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7259736a6e5b7c7588fff9578370736a6648acbb
[3]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=464824323e57dc4b397e8b05854d779908b55304

--
With Regards,
Amit Kapila.

#57Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#56)
3 attachment(s)

On Thu, Oct 15, 2020 at 2:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 14, 2020 at 6:15 PM Ajin Cherian <itsajin@gmail.com> wrote:

I think it will be easier to review this work if we can split the
patches according to the changes made in different layers. The first
patch could be changes made in output plugin and the corresponding
changes in test_decoding, see the similar commit of in-progress
transactions [1]. So you need to move corresponding changes from
v8-0001-Support-decoding-of-two-phase-transactions and
v8-0004-Support-two-phase-commits-in-streaming-mode-in-lo for this.
The second patch could be changes made in ReorderBuffer to support
this feature, see [2]. The third patch could be changes made to
support pgoutput and subscriber-side stuff, see [3]. What do you
think?

I agree. I have split the patches accordingly. Do have a look.
Pending work is:
1. Add pgoutput support for the new streaming two-phase commit APIs
2. Add test cases for two-phase commits with streaming for pub/sub and
test_decoding
3. Add CREATE SUBSCRIPTION command option to specify two-phase commits
rather than having it turned on by default.
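
As a rough illustration of item 3 (not part of the attached patches): the test_decoding change in v9-0001 already makes two-phase decoding opt-in by and-ing the core capability with the parsed option (ctx->twophase &= enable_2pc). The same gating could be sketched standalone as below. DecodingCtx, parse_flag() and apply_twophase_option() are hypothetical names used only for this sketch, and parse_flag() accepts far fewer spellings than PostgreSQL's parse_bool().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the decoding context; only the flag matters. */
typedef struct DecodingCtx
{
    bool twophase;      /* true if core and the plugin can decode at PREPARE */
} DecodingCtx;

/* Simplified boolean parser (hypothetical; not PostgreSQL's parse_bool). */
static bool
parse_flag(const char *s, bool *out)
{
    if (strcmp(s, "1") == 0 || strcmp(s, "true") == 0 || strcmp(s, "on") == 0)
    {
        *out = true;
        return true;
    }
    if (strcmp(s, "0") == 0 || strcmp(s, "false") == 0 || strcmp(s, "off") == 0)
    {
        *out = false;
        return true;
    }
    return false;
}

/*
 * Apply the user's two-phase option. A missing value leaves the feature
 * off, matching the "continue" branch in the test_decoding option loop.
 * Returns false if the value cannot be parsed, leaving ctx untouched.
 */
static bool
apply_twophase_option(DecodingCtx *ctx, const char *value)
{
    bool requested = false;

    if (value != NULL && !parse_flag(value, &requested))
        return false;

    /* Enabled only when supported AND explicitly requested. */
    ctx->twophase = ctx->twophase && requested;
    return true;
}
```

With this shape the default stays off: a subscription that never mentions the option keeps decoding two-phase transactions at COMMIT PREPARED, as before.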

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v9-0001-Support-decoding-of-two-phase-transactions.patch (application/octet-stream)
From b09f3bfb86aab07dfd378c2f3615a5a8c2adea91 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 15 Oct 2020 02:19:59 -0400
Subject: [PATCH v9] Support decoding of two-phase transactions

Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 243 +++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 173 +++++++++++++++-
 src/backend/replication/logical/logical.c | 314 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  64 ++++++
 src/include/replication/reorderbuffer.h   |  95 +++++++++
 6 files changed, 887 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e60ab34..9f0cf10 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid_aborted; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -73,6 +78,15 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr prepare_lsn);
+static void pg_decode_stream_commit_prepared(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_rollback_prepared(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr rollback_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -88,6 +102,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -112,10 +138,17 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
+	cb->stream_commit_prepared_cb = pg_decode_stream_commit_prepared;
+	cb->stream_rollback_prepared_cb = pg_decode_stream_rollback_prepared;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +160,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_2pc = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +170,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +262,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_2pc))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)
+					strtoul(strVal(elem->arg), NULL, 0);
+
+				if (!TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+								strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +302,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_2pc;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +362,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +607,25 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* If check_xid_aborted is a valid XID, it was passed in as an option
+	 * so that we wait here until that transaction has been aborted.
+	 * This is used only to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			   !TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -646,6 +817,78 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit_prepared(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "commit prepared streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "commit prepared streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_rollback_prepared(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr rollback_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "abort prepared streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "abort prepared streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
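
For anyone reading the check-xid handling in pg_decode_change() above without a
tree at hand: the wait is just a poll on the transaction's state until it is no
longer in progress. Below is a rough standalone sketch of that idea, not part
of the patch. All Sketch*/sketch_* names are hypothetical, the status source is
a fake, and the max_polls bound is an assumption of the sketch (the patch loops
without a bound, calling CHECK_FOR_INTERRUPTS() and pg_usleep() between polls).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-state model of a transaction's fate. */
typedef enum SketchXactStatus
{
	SKETCH_IN_PROGRESS,
	SKETCH_COMMITTED,
	SKETCH_ABORTED
} SketchXactStatus;

/* Hypothetical status source; stands in for the transam/procarray lookups. */
typedef SketchXactStatus (*SketchStatusFn) (void *arg);

/*
 * Poll until the watched transaction is no longer in progress, or until
 * max_polls further polls have been made. Returns the last status seen, so
 * the caller can tell an abort from a commit or from giving up.
 */
static SketchXactStatus
sketch_wait_for_xact_end(SketchStatusFn get_status, void *arg, int max_polls)
{
	SketchXactStatus status = get_status(arg);

	while (status == SKETCH_IN_PROGRESS && max_polls-- > 0)
		status = get_status(arg);

	return status;
}

/* Fake status source: in progress for *arg more polls, then aborted. */
static SketchXactStatus
sketch_fake_status(void *arg)
{
	int		   *remaining = (int *) arg;

	if (*remaining > 0)
	{
		(*remaining)--;
		return SKETCH_IN_PROGRESS;
	}
	return SKETCH_ABORTED;
}
```

In the test plugin the equivalent of SKETCH_ABORTED is what the Assert after
the loop expects; a second backend issuing ROLLBACK PREPARED is what ends the
wait.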
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..824c55a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,18 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
+    LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+    LogicalDecodeStreamRollbackPreparedCB stream_rollback_prepared_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -414,9 +421,18 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
      <function>stream_commit_cb</function> and <function>stream_change_cb</function>
-     are required, while <function>stream_message_cb</function> and
+     are required, while <function>stream_message_cb</function>,
+     <function>stream_prepare_cb</function>, <function>stream_commit_prepared_cb</function>,
+     <function>stream_rollback_prepared_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +493,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. The transaction currently being decoded may be
+     aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +600,55 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +658,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or any other uncommitted transaction, this
+      change callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +741,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded at this prepare
+       stage, or later as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time. To signal that
+       decoding at prepare time should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      is passed separately because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback must return the same answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +795,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or any other uncommitted transaction, this message
+      callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +851,45 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit-prepared">
+     <title>Stream Commit Prepared Callback</title>
+     <para>
+      The <function>stream_commit_prepared_cb</function> callback is called to commit
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                     ReorderBufferTXN *txn,
+                                                     XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-rollback-prepared">
+     <title>Stream Rollback Prepared Callback</title>
+     <para>
+      The <function>stream_rollback_prepared_cb</function> callback is called to abort
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1068,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, committed after prepare using the
+    <function>stream_commit_prepared_cb</function> callback, or rolled back using the
+    <function>stream_rollback_prepared_cb</function> callback.
    </para>
 
    <para>
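
To make the required/optional split described in the documentation hunks above
concrete, here is a small standalone sketch, not part of the patch. The
Sketch*/sketch_* names are hypothetical stand-ins for the two-phase members of
OutputPluginCallbacks. It assumes the rule used by the patch: two-phase
decoding counts as enabled once any of the four callbacks is set, and the three
non-filter callbacks are then mandatory. Checking for the missing one up front
is an assumption of the sketch; the patch reports it lazily from the wrapper of
whichever callback is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, signature-less stand-in for a plugin callback pointer. */
typedef void (*SketchCallback) (void);

typedef struct SketchTwoPhaseCallbacks
{
	SketchCallback filter_prepare_cb;	/* optional */
	SketchCallback prepare_cb;			/* required */
	SketchCallback commit_prepared_cb;	/* required */
	SketchCallback rollback_prepared_cb;	/* required */
} SketchTwoPhaseCallbacks;

/* Two-phase decoding is switched on as soon as any of the four is set. */
static bool
sketch_twophase_enabled(const SketchTwoPhaseCallbacks *cb)
{
	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/*
 * Name of the first required callback that is missing, or NULL when the
 * required set is complete.
 */
static const char *
sketch_first_missing_required(const SketchTwoPhaseCallbacks *cb)
{
	if (cb->prepare_cb == NULL)
		return "prepare_cb";
	if (cb->commit_prepared_cb == NULL)
		return "commit_prepared_cb";
	if (cb->rollback_prepared_cb == NULL)
		return "rollback_prepared_cb";
	return NULL;
}

/* Placeholder used only to obtain a non-NULL callback pointer. */
static void
sketch_noop(void)
{
}
```

A plugin that registers only filter_prepare_cb therefore ends up with two-phase
decoding enabled but incomplete, which is the case the ERRORs in the wrappers
are meant to surface.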
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8675832..84a4751 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,12 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
+static void stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+											  XLogRecPtr commit_lsn);
+static void stream_rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+												XLogRecPtr rollback_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +221,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -221,12 +239,28 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
 		(ctx->callbacks.stream_stop_cb != NULL) ||
 		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.stream_commit_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_rollback_prepared_cb != NULL) ||
 		(ctx->callbacks.stream_commit_cb != NULL) ||
 		(ctx->callbacks.stream_change_cb != NULL) ||
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require the prepare,
+	 * commit-prepared and rollback-prepared callbacks. The filter-prepare callback
+	 * is optional. We however enable two-phase logical decoding when at least one
+	 * of the callbacks is provided, so that we can easily identify missing ones.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +271,9 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
+	ctx->reorder->stream_commit_prepared = stream_commit_prepared_cb_wrapper;
+	ctx->reorder->stream_rollback_prepared = stream_rollback_prepared_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +820,120 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then prepare callback is mandatory */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then the commit prepared callback is mandatory */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then the rollback prepared callback is mandatory */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1010,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase at PREPARE time is not enabled. In that
+	 * case all two-phase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1253,124 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming and two-phase commits are supported. */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								  XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit_prepared";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_commit_prepared_cb is required */
+	if (ctx->callbacks.stream_commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_prepared_cb callback")));
+
+	ctx->callbacks.stream_commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr rollback_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_rollback_prepared";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_rollback_prepared_cb is required */
+	if (ctx->callbacks.stream_rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_rollback_prepared_cb callback")));
+
+	ctx->callbacks.stream_rollback_prepared_cb(ctx, txn, rollback_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
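
The decision made by filter_prepare_cb_wrapper() above has three cases, which
the following standalone sketch spells out. It is not part of the patch: the
sketch_* names and the one-argument filter signature are hypothetical
simplifications (the real callback also receives ctx, txn and xid), and the
error-context bookkeeping is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified filter signature: only the GID is passed. */
typedef bool (*SketchFilterPrepareCB) (const char *gid);

/*
 * Returns true when the prepared transaction should NOT be decoded at
 * PREPARE time; it is then sent as a regular transaction at COMMIT PREPARED.
 */
static bool
sketch_filter_prepare(bool twophase, SketchFilterPrepareCB filter_cb,
					  const char *gid)
{
	/* Two-phase decoding disabled: everything counts as filtered out. */
	if (!twophase)
		return true;

	/* No filter supplied: every prepared transaction goes through. */
	if (filter_cb == NULL)
		return false;

	/* Otherwise the plugin decides. */
	return filter_cb(gid);
}

/* Example filter for the sketch: skip every prepared transaction. */
static bool
sketch_filter_everything(const char *gid)
{
	(void) gid;					/* unused in this trivial filter */
	return true;
}
```

Note the asymmetry: with two-phase decoding disabled the answer is "filtered",
while with it enabled and no filter registered the answer is "not filtered".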
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..a191e82 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -84,6 +84,11 @@ typedef struct LogicalDecodingContext
 	 */
 	bool		streaming;
 
+	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..dfcb577 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until
+ * COMMIT PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,30 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr prepare_lsn);
+
+/*
+ * Called to commit prepared changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called to abort/rollback prepared changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr rollback_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +228,19 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
+	LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+	LogicalDecodeStreamRollbackPreparedCB stream_rollback_prepared_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1c77819..6cb4cb4 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -174,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -244,6 +266,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of 2PC we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -405,6 +430,36 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+                                     ReorderBuffer *rb,
+                                     ReorderBufferTXN *txn,
+                                     XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             TransactionId xid,
+                                             const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+                                       ReorderBuffer *rb,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+                                              ReorderBuffer *rb,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -431,6 +486,24 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr prepare_lsn);
+
+/* commit prepared streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitPreparedCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* rollback prepared streamed transaction callback signature */
+typedef void (*ReorderBufferStreamRollbackPreparedCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr rollback_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -497,6 +570,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -505,6 +583,9 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
+	ReorderBufferStreamCommitPreparedCB stream_commit_prepared;
+	ReorderBufferStreamRollbackPreparedCB stream_rollback_prepared;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
@@ -574,6 +655,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -597,6 +683,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool 		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool 		ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void 		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
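
For readers following the header changes above, here is a minimal standalone sketch of how the three new txn_flags bits can be read back as a two-phase state, and of the kind of decision a filter_prepare callback makes. The flag values are copied from the reorderbuffer.h hunk. Everything prefixed demo_ or Demo is hypothetical and is not part of the patch. The "_nodecode" GID convention and the reading that a true return means "skip PREPARE-time decoding" are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Flag values as added to reorderbuffer.h by the patch above. */
#define RBTXN_PREPARE             0x0080
#define RBTXN_COMMIT_PREPARED     0x0100
#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Hypothetical stand-in for ReorderBufferTXN; only the fields used here. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
	const char *gid;
} DemoTXN;

/* Same tests as rbtxn_prepared() and friends, applied to the stand-in. */
#define demo_prepared(txn)			(((txn)->txn_flags & RBTXN_PREPARE) != 0)
#define demo_commit_prepared(txn)	(((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0)
#define demo_rollback_prepared(txn)	(((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0)

/* Describe where a transaction is in its two-phase life cycle. */
static const char *
demo_two_phase_state(const DemoTXN *txn)
{
	if (!demo_prepared(txn))
		return "not two-phase";
	if (demo_commit_prepared(txn))
		return "COMMIT PREPARED";
	if (demo_rollback_prepared(txn))
		return "ROLLBACK PREPARED";
	return "PREPARE";
}

/*
 * Hypothetical filter: return true to skip PREPARE-time decoding for GIDs
 * that contain "_nodecode" (assumed convention, see the text above).
 */
static bool
demo_filter_prepare(const char *gid)
{
	return gid != NULL && strstr(gid, "_nodecode") != NULL;
}
```

A caller would set RBTXN_PREPARE when the PREPARE record is seen and OR in one of the other two bits when the transaction is finished, so the PREPARE bit stays set in all three two-phase states.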

Attachment: v9-0003-pgoutput-plugin-support-for-logical-decoding-of-t.patch (application/octet-stream)
From baf959a60c76ea3ba5c29c8b2b7d597e31b5fdd9 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 15 Oct 2020 06:20:56 -0400
Subject: [PATCH v9] pgoutput plugin support for logical decoding of two-phase
 commits

Support decoding of two-phase commits in pgoutput and on the subscriber side.
---
 src/backend/access/transam/twophase.c       |  27 +++++
 src/backend/replication/logical/proto.c     |  74 ++++++++++++-
 src/backend/replication/logical/worker.c    | 153 +++++++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c |  51 +++++++++
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  27 +++++
 src/test/subscription/t/020_twophase.pl     | 163 ++++++++++++++++++++++++++++
 7 files changed, 494 insertions(+), 2 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..2e0a408 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check whether a prepared transaction with the given GID exists
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..1bd3987 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -99,6 +99,7 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	if (flags != 0)
 		elog(ERROR, "unrecognized flags %u in commit message", flags);
 
+
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
 	commit_data->end_lsn = pq_getmsgint64(in);
@@ -106,6 +107,77 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a non-empty GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(strlen(txn->gid) > 0);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags |= LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags |= LOGICALREP_IS_PREPARE;
+
+	/* Make sure exactly one of the expected flags is set. */
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 640409b..a02424b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -722,6 +722,7 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_timestamp = commit_data.committime;
 
 		CommitTransactionCommand();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -742,6 +743,152 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * A prepared transaction may be aborted on the publisher while it is
+	 * still being decoded. In that case decoding stops and the transaction
+	 * is aborted immediately, so the subscriber may never have received the
+	 * PREPARE. The ROLLBACK PREPARED still reaches the subscriber, so it is
+	 * OK for the GID to be missing here.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1908,10 +2055,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..4078cab 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -143,6 +149,9 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -378,6 +387,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0c2cda2..2ca6d74 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -87,6 +87,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -94,6 +95,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	((flags == LOGICALREP_IS_PREPARE) || \
+	 (flags == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 (flags == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -101,6 +124,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 			 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData * prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..3feb2c3
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+        'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+   is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+   is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1

Attachment: v9-0002-Backend-support-for-logical-decoding-of-two-phase.patch (application/octet-stream)
From f644b9bcdc29fb77e3869b873c2905f0d8c795a9 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 15 Oct 2020 06:09:30 -0400
Subject: [PATCH v9] Backend support for logical decoding of two-phase commits

Until now, two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.
---
 contrib/test_decoding/Makefile                  |   4 +-
 contrib/test_decoding/expected/two_phase.out    | 219 +++++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql         | 117 ++++++++++
 contrib/test_decoding/t/001_twophase.pl         | 119 ++++++++++
 src/backend/replication/logical/decode.c        | 250 ++++++++++++++++++++-
 src/backend/replication/logical/reorderbuffer.c | 278 ++++++++++++++++++++----
 6 files changed, 937 insertions(+), 50 deletions(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..2abf3ce 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..3ac01a4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,219 @@
+-- Test two-phased transactions, when two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time. 
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- 
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- test abort of a prepared xact
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- test prepared xact containing ddl
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints and sub-xacts as a result
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..e3e2690
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,117 @@
+-- Test two-phased transactions, when two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time. 
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- 
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test abort of a prepared xact
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+
+-- test prepared xact containing ddl
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints and sub-xacts as a result
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..51c6a35
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..e011fd9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -68,8 +68,15 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
+static void DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						xl_xact_parsed_prepare * parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -239,7 +246,6 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	switch (info)
 	{
 		case XLOG_XACT_COMMIT:
-		case XLOG_XACT_COMMIT_PREPARED:
 			{
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
@@ -256,8 +262,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeCommit(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_COMMIT_PREPARED:
+			{
+				xl_xact_commit *xlrec;
+				xl_xact_parsed_commit parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_commit *) XLogRecGetData(r);
+				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeCommitPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ABORT:
-		case XLOG_XACT_ABORT_PREPARED:
 			{
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
@@ -274,6 +296,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_ABORT_PREPARED:
+			{
+				xl_xact_abort *xlrec;
+				xl_xact_parsed_abort parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_abort *) XLogRecGetData(r);
+				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeAbortPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ASSIGNMENT:
 
 			/*
@@ -312,17 +351,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -659,6 +716,131 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Consolidated commit record handling between the different forms of commit
+ * records.
+ */
+static void
+DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+					 xl_xact_parsed_commit *parsed, TransactionId xid)
+{
+	XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
+					   parsed->nsubxacts, parsed->subxacts);
+
+	/* ----
+	 * Check whether we are interested in this specific transaction, and tell
+	 * the reorderbuffer to forget the content of the (sub-)transactions
+	 * if not.
+	 *
+	 * There can be several reasons we might not be interested in this
+	 * transaction:
+	 * 1) We might not be interested in decoding transactions up to this
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
+	 * 2) The transaction happened in another database.
+	 * 3) The output plugin is not interested in the origin.
+	 * 4) We are doing fast-forwarding
+	 *
+	 * We can't just use ReorderBufferAbort() here, because we need to execute
+	 * the transaction's invalidations.  This currently won't be needed if
+	 * we're just skipping over the transaction because currently we only do
+	 * so during startup, to get to the first transaction the client needs. As
+	 * we have reset the catalog caches before starting to read WAL, and we
+	 * haven't yet touched any catalogs, there can't be anything to invalidate.
+	 * But if we're "forgetting" this commit because it happened in
+	 * another database, the invalidations might be important, because they
+	 * could be for shared catalogs and we might have loaded data into the
+	 * relevant syscaches.
+	 * ---
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+		}
+		ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+		return;
+	}
+
+	/* tell the reorderbuffer about the surviving subtransactions */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/*
+	 * For COMMIT PREPARED, the changes have already been replayed at
+	 * PREPARE time, so we only need to notify the subscriber that the GID
+	 * finally committed. If the filter callback is present and it skipped
+	 * this transaction at PREPARE time, do a regular commit instead.
+	 */
+	if (ctx->callbacks.filter_prepare_cb &&
+			ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+	else
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to that of COMMIT.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr  origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		 ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/* replay actions of all transaction + subtransactions in order */
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+}
+
+/*
  * Get the data from the various forms of abort records and pass it on to
  * snapbuild.c and reorderbuffer.c
  */
@@ -681,6 +863,50 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Decode a ROLLBACK PREPARED record. If the transaction passes the filters,
+ * report the rollback via callbacks; otherwise abort it in reorderbuffer.c.
+ */
+static void
+DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			xl_xact_parsed_abort *parsed, TransactionId xid)
+{
+	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it passes through the filters, handle the ROLLBACK via callbacks
+	 */
+	if (!FilterByOrigin(ctx, origin_id) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		!ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		Assert(TransactionIdIsValid(xid));
+		Assert(parsed->dbId == ctx->slot->data.database);
+
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
+
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+						   buf->record->EndRecPtr);
+	}
+
+	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+}
+
+/*
  * Parse XLOG_HEAP_INSERT (not MULTI_INSERT!) records into tuplebufs.
  *
  * Deletes can contain the new tuple.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7a8bf76..9df7ff7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -418,6 +419,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1511,12 +1518,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1535,7 +1544,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1569,9 +1578,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for decoding
+		 * catalog snapshot access.
+		 * They are always stored in the toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1765,9 +1798,18 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
-
-	ReorderBufferCleanupTXN(rb, txn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1896,7 +1938,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2003,7 +2045,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2294,7 +2336,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2335,11 +2386,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, false);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (rbtxn_prepared(txn))
+		{
+			ReorderBufferTruncateTXN(rb, txn, true);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2369,17 +2426,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur only when we are sending the data in
+			 * streaming mode and the streaming is not finished yet, or when we
+			 * are sending the data out on a PREPARE during a two-phase commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2387,10 +2445,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2412,23 +2479,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2470,6 +2530,145 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+   ReorderBufferTXN *txn;
+
+   txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+   return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+                               false);
+
+	/*
+	* Always call the prepare filter. It's the job of the prepare filter to
+	* give us the *same* response for a given xid across multiple calls
+	* (including ones on restart)
+	*/
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	* The transaction may or may not exist (during restarts for example).
+	* Anyways, two-phase transactions do not contain any reorderbuffers. So allow
+	* it to be created below.
+	*/
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_is_streamed(txn) && rbtxn_commit_prepared(txn))
+		rb->stream_commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_is_streamed(txn) && rbtxn_rollback_prepared(txn))
+		rb->stream_rollback_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2512,7 +2711,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
-- 
1.8.3.1

#58Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#57)
3 attachment(s)

Hello Ajin,

The v9 patches provided support for two-phase transactions in the non-streaming case only.

Now I have added STREAM support for two-phase transactions, and bumped
all patches to version v10.

(The 0001 and 0002 patches are unchanged. Only 0003 is changed).
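
For anyone skimming the thread, the finish-prepared step has to pick one of four plugin callbacks, depending on whether the transaction was streamed and whether it ends in COMMIT PREPARED or ROLLBACK PREPARED. Below is a rough standalone sketch of that choice, together with the GID-substring filter that test_decoding uses. It is not code from the patches: the SK_* flag values and both function names are made up for illustration, and only the callback names and the "_nodecode" substring come from the patch text.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-ins for the RBTXN_* flag bits. The numeric values
 * are illustrative only and do not match reorderbuffer.h.
 */
enum
{
	SK_IS_STREAMED = 1 << 0,
	SK_PREPARE = 1 << 1,
	SK_COMMIT_PREPARED = 1 << 2,
	SK_ROLLBACK_PREPARED = 1 << 3
};

/*
 * Return the name of the callback that would be invoked when a prepared
 * transaction is finished: the streamed variants take precedence, then
 * the regular ones. Returns "none" if neither outcome flag is set.
 */
static const char *
finish_prepared_callback(unsigned flags)
{
	bool		streamed = (flags & SK_IS_STREAMED) != 0;

	if (flags & SK_COMMIT_PREPARED)
		return streamed ? "stream_commit_prepared" : "commit_prepared";
	if (flags & SK_ROLLBACK_PREPARED)
		return streamed ? "stream_rollback_prepared" : "rollback_prepared";
	return "none";
}

/*
 * Sketch of a prepare filter: returning true means "skip decoding at
 * PREPARE time and send the transaction as a regular commit later".
 * The filter must give the same answer for a given GID on every call.
 */
static bool
filter_prepare_by_gid(const char *gid)
{
	assert(gid != NULL);
	return strstr(gid, "_nodecode") != NULL;
}
```

With these definitions, `finish_prepared_callback(SK_PREPARE | SK_IS_STREAMED | SK_ROLLBACK_PREPARED)` selects the streamed rollback callback, and `filter_prepare_by_gid("test_prepared_nodecode")` reports that the PREPARE should be skipped.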

--

There are a few TODO/FIXME comments in the code highlighting parts
needing some attention.

There is a #define DEBUG_STREAM_2PC useful for debugging, which I can
remove later.

All the patches have some whitespace issues when applied. We can
resolve them as we go.

Please let me know any comments/feedback.

Kind Regards
Peter Smith.
Fujitsu Australia.

Attachments:

v10-0001-Support-2PC-txn-base.patch (application/octet-stream)
From 5df1c97cddf3539ecdbdcb18f55a341ee2cd410a Mon Sep 17 00:00:00 2001
From: postgres <postgres@CentOS7-x64.fritz.box>
Date: Fri, 16 Oct 2020 14:19:59 +1100
Subject: [PATCH v10] =?UTF-8?q?Support=202PC=20txn=20=E2=80=93=20base.?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Until now, two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, together with the corresponding GID.

Includes logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 243 +++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 173 +++++++++++++++-
 src/backend/replication/logical/logical.c | 314 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  64 ++++++
 src/include/replication/reorderbuffer.h   |  95 +++++++++
 6 files changed, 887 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 8e33614..cf7d674 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid_aborted; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -73,6 +78,15 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr prepare_lsn);
+static void pg_decode_stream_commit_prepared(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn);
+static void pg_decode_stream_rollback_prepared(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr rollback_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -88,6 +102,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -112,10 +138,17 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
+	cb->stream_commit_prepared_cb = pg_decode_stream_commit_prepared;
+	cb->stream_rollback_prepared_cb = pg_decode_stream_rollback_prepared;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +160,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_2pc = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +170,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +262,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_2pc))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)
+					strtoul(strVal(elem->arg), NULL, 0);
+
+				if (!TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+								strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +302,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_2pc;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +362,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +607,25 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* if check_xid_aborted is a valid xid, then it was passed in
+	 * as an option to check if the transaction having this xid would be aborted.
+	 * This is to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			   !TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -646,6 +817,78 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_commit_prepared(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "commit prepared streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "commit prepared streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_rollback_prepared(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr rollback_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "abort prepared streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "abort prepared streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..824c55a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,18 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
+    LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+    LogicalDecodeStreamRollbackPreparedCB stream_rollback_prepared_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -414,9 +421,18 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
      <function>stream_commit_cb</function> and <function>stream_change_cb</function>
-     are required, while <function>stream_message_cb</function> and
+     are required, while <function>stream_message_cb</function>,
+     <function>stream_prepare_cb</function>, <function>stream_commit_prepared_cb</function>,
+     <function>stream_rollback_prepared_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +493,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. It is possible that the transaction currently
+     being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +600,55 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a transaction <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The optional <function>rollback_prepared_cb</function> callback is called whenever
+      a transaction's <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +658,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or an in-progress transaction, this
+      change callback might also error out due to a concurrent rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +741,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded at prepare time,
+       or later as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time. To signal that
+       decoding at prepare time should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +795,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or an in-progress transaction, this message
+      callback might also error out due to a concurrent rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +851,45 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-commit-prepared">
+     <title>Stream Commit Prepared Callback</title>
+     <para>
+      The <function>stream_commit_prepared_cb</function> callback is called to commit
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                     ReorderBufferTXN *txn,
+                                                     XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-stream-rollback-prepared">
+     <title>Stream Rollback Prepared Callback</title>
+     <para>
+      The <function>stream_rollback_prepared_cb</function> callback is called to abort
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                       ReorderBufferTXN *txn,
+                                                       XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1068,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, committed using the
+    <function>stream_commit_prepared_cb</function> callback, or rolled back using the
+    <function>stream_rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8675832..84a4751 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,12 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
+static void stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+											  XLogRecPtr commit_lsn);
+static void stream_rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+												XLogRecPtr rollback_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +221,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -221,12 +239,28 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
 		(ctx->callbacks.stream_stop_cb != NULL) ||
 		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.stream_commit_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_rollback_prepared_cb != NULL) ||
 		(ctx->callbacks.stream_commit_cb != NULL) ||
 		(ctx->callbacks.stream_change_cb != NULL) ||
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require the prepare,
+	 * commit_prepared and rollback_prepared callbacks. The filter_prepare callback
+	 * is optional. We however enable two-phase logical decoding when at least one
+	 * of these callbacks is set, so that we can easily identify missing ones.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +271,9 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
+	ctx->reorder->stream_commit_prepared = stream_commit_prepared_cb_wrapper;
+	ctx->reorder->stream_rollback_prepared = stream_rollback_prepared_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +820,120 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then prepare callback is mandatory */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then commit prepared callback is mandatory */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then rollback prepared callback is mandatory */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1010,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase at PREPARE time is not enabled. In that
+	 * case all two-phase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1253,124 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming and two-phase commits are supported. */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+								  XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_commit_prepared";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_commit_prepared_cb is required */
+	if (ctx->callbacks.stream_commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_commit_prepared_cb callback")));
+
+	ctx->callbacks.stream_commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+stream_rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									XLogRecPtr rollback_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming is supported. */
+	Assert(ctx->streaming);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_rollback_prepared";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_rollback_prepared_cb is required */
+	if (ctx->callbacks.stream_rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming requires a stream_rollback_prepared_cb callback")));
+
+	ctx->callbacks.stream_rollback_prepared_cb(ctx, txn, rollback_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..a191e82 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -84,6 +84,11 @@ typedef struct LogicalDecodingContext
 	 */
 	bool		streaming;
 
+	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..dfcb577 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/rollback_prepared callbacks, or whether it should wait
+ * until COMMIT PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,30 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr prepare_lsn);
+
+/*
+ * Called to commit prepared changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/*
+ * Called to abort/rollback prepared changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr rollback_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +228,19 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
+	LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb;
+	LogicalDecodeStreamRollbackPreparedCB stream_rollback_prepared_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1c77819..6cb4cb4 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -174,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -244,6 +266,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of 2PC we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -405,6 +430,36 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (
+                                     ReorderBuffer *rb,
+                                     ReorderBufferTXN *txn,
+                                     XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             TransactionId xid,
+                                             const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (
+                                       ReorderBuffer *rb,
+                                       ReorderBufferTXN *txn,
+                                       XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (
+                                              ReorderBuffer *rb,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (
+                                             ReorderBuffer *rb,
+                                             ReorderBufferTXN *txn,
+                                             XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -431,6 +486,24 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr prepare_lsn);
+
+/* commit prepared streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitPreparedCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+
+/* rollback prepared streamed transaction callback signature */
+typedef void (*ReorderBufferStreamRollbackPreparedCB) (
+											 ReorderBuffer *rb,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr rollback_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -497,6 +570,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -505,6 +583,9 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
+	ReorderBufferStreamCommitPreparedCB stream_commit_prepared;
+	ReorderBufferStreamRollbackPreparedCB stream_rollback_prepared;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
@@ -574,6 +655,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -597,6 +683,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool		ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1

v10-0002-Support-2PC-txn-backend-and-tests.patch (application/octet-stream)
From 6ab90d1c917070a9ac6cc6c055a5d9ff755027a8 Mon Sep 17 00:00:00 2001
From: postgres <postgres@CentOS7-x64.fritz.box>
Date: Fri, 16 Oct 2020 14:35:00 +1100
Subject: [PATCH v10] =?UTF-8?q?Support=202PC=20txn=20=E2=80=93=20backend?=
 =?UTF-8?q?=20and=20tests.?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Until now, two-phase transactions were decoded at COMMIT PREPARED, just
like regular transactions at COMMIT. During replay, they were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.

Includes two-phase commit test code (for test_decoding).
---
 contrib/test_decoding/Makefile                  |   4 +-
 contrib/test_decoding/expected/two_phase.out    | 219 +++++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql         | 117 ++++++++++
 contrib/test_decoding/t/001_twophase.pl         | 119 ++++++++++
 src/backend/replication/logical/decode.c        | 250 ++++++++++++++++++++-
 src/backend/replication/logical/reorderbuffer.c | 278 ++++++++++++++++++++----
 6 files changed, 937 insertions(+), 50 deletions(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
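
For readers skimming the patch, here is a rough, standalone sketch of the
routing described in the commit message above: changes are emitted at PREPARE
only when the plugin supports two-phase decoding and its filter does not skip
the GID; otherwise they are emitted at COMMIT PREPARED as before. Every name
below is hypothetical and not part of the patch, and the "_nodecode" rule is
assumed from the comment in two_phase.sql, not taken from the plugin source.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical names throughout; none of these exist in PostgreSQL. */
typedef enum
{
	SKETCH_DECODE_AT_PREPARE,
	SKETCH_DECODE_AT_COMMIT_PREPARED
} SketchDecodePoint;

/*
 * Stand-in for a plugin's filter-prepare callback. Returns true when the
 * GID should NOT be decoded at PREPARE time. The "_nodecode" rule is an
 * assumption based on the comment in two_phase.sql.
 */
static bool
sketch_filter_prepare(const char *gid)
{
	assert(gid != NULL);
	return strstr(gid, "_nodecode") != NULL;
}

/*
 * Decide where the changes of a two-phase transaction are emitted.
 * plugin_twophase: the plugin advertises two-phase support.
 * has_filter:      the plugin registered a filter-prepare callback.
 */
static SketchDecodePoint
sketch_decode_point(bool plugin_twophase, bool has_filter, const char *gid)
{
	/* Without two-phase support, PREPARE is ignored as before. */
	if (!plugin_twophase)
		return SKETCH_DECODE_AT_COMMIT_PREPARED;

	/* A filtered GID falls back to a regular commit at COMMIT PREPARED. */
	if (has_filter && sketch_filter_prepare(gid))
		return SKETCH_DECODE_AT_COMMIT_PREPARED;

	return SKETCH_DECODE_AT_PREPARE;
}
```

With these assumptions, a GID such as 'test_prepared#1' would be decoded at
PREPARE, while 'test_prepared_nodecode' would only show up once COMMIT PREPARED
is decoded, which is what the regression test expects.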

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..2abf3ce 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..3ac01a4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,219 @@
+-- Test two-phased transactions, when two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time. 
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- 
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- test abort of a prepared xact
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- test prepared xact containing ddl
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- test savepoints and sub-xacts as a result
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..e3e2690
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,117 @@
+-- Test two-phased transactions, when two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time. 
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- 
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test abort of a prepared xact
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+
+-- test prepared xact containing ddl
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+
+COMMIT PREPARED 'test_prepared_lock';
+
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test savepoints and sub-xacts as a result
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- cleanup
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..51c6a35
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,119 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+ok(looks_like_number($xid2pc), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..e011fd9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -68,8 +68,15 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
+static void DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -239,7 +246,6 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	switch (info)
 	{
 		case XLOG_XACT_COMMIT:
-		case XLOG_XACT_COMMIT_PREPARED:
 			{
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
@@ -256,8 +262,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeCommit(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_COMMIT_PREPARED:
+			{
+				xl_xact_commit *xlrec;
+				xl_xact_parsed_commit parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_commit *) XLogRecGetData(r);
+				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeCommitPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ABORT:
-		case XLOG_XACT_ABORT_PREPARED:
 			{
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
@@ -274,6 +296,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_ABORT_PREPARED:
+			{
+				xl_xact_abort *xlrec;
+				xl_xact_parsed_abort parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_abort *) XLogRecGetData(r);
+				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeAbortPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ASSIGNMENT:
 
 			/*
@@ -312,17 +351,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *)XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -659,6 +716,131 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Decode a COMMIT PREPARED record. Mirrors DecodeCommit(), except that a
+ * transaction already decoded at PREPARE time only has its commit forwarded.
+ */
+static void
+DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+					 xl_xact_parsed_commit *parsed, TransactionId xid)
+{
+	XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
+					   parsed->nsubxacts, parsed->subxacts);
+
+	/* ----
+	 * Check whether we are interested in this specific transaction, and tell
+	 * the reorderbuffer to forget the content of the (sub-)transactions
+	 * if not.
+	 *
+	 * There can be several reasons we might not be interested in this
+	 * transaction:
+	 * 1) We might not be interested in decoding transactions up to this
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
+	 * 2) The transaction happened in another database.
+	 * 3) The output plugin is not interested in the origin.
+	 * 4) We are doing fast-forwarding
+	 *
+	 * We can't just use ReorderBufferAbort() here, because we need to execute
+	 * the transaction's invalidations.  This currently won't be needed if
+	 * we're just skipping over the transaction because currently we only do
+	 * so during startup, to get to the first transaction the client needs. As
+	 * we have reset the catalog caches before starting to read WAL, and we
+	 * haven't yet touched any catalogs, there can't be anything to invalidate.
+	 * But if we're "forgetting" this commit because it happened in
+	 * another database, the invalidations might be important, because they
+	 * could be for shared catalogs and we might have loaded data into the
+	 * relevant syscaches.
+	 * ---
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+		}
+		ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+		return;
+	}
+
+	/* tell the reorderbuffer about the surviving subtransactions */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/*
+	 * For COMMIT PREPARED, the changes have already been replayed at
+	 * PREPARE time, so we only need to notify the subscriber that the GID
+	 * finally committed. If a filter callback is present and it skipped this
+	 * GID at PREPARE time, nothing was replayed, so do a regular commit.
+	 */
+	if (ctx->callbacks.filter_prepare_cb &&
+			ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+	else
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr  origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		 ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/* replay actions of all transaction + subtransactions in order */
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+}
+
+/*
  * Get the data from the various forms of abort records and pass it on to
  * snapbuild.c and reorderbuffer.c
  */
@@ -681,6 +863,50 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Get the data from a ROLLBACK PREPARED record and pass it on to
+ * snapbuild.c and reorderbuffer.c
+ */
+static void
+DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			xl_xact_parsed_abort *parsed, TransactionId xid)
+{
+	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it passes through the filters, handle the ROLLBACK via callbacks.
+	 */
+	if (!FilterByOrigin(ctx, origin_id) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		!ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		Assert(TransactionIdIsValid(xid));
+		Assert(parsed->dbId == ctx->slot->data.database);
+
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
+
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+						   buf->record->EndRecPtr);
+	}
+
+	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+}
+
+/*
  * Parse XLOG_HEAP_INSERT (not MULTI_INSERT!) records into tuplebufs.
  *
  * Deletes can contain the new tuple.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7a8bf76..9df7ff7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -418,6 +419,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1511,12 +1518,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1535,7 +1544,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1569,9 +1578,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, clean up the tuplecids we stored for decoding
+		 * catalog snapshot access.
+		 * They are always stored in the toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1765,9 +1798,18 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
-
-	ReorderBufferCleanupTXN(rb, txn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1896,7 +1938,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2003,7 +2045,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2294,7 +2336,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2335,11 +2386,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, false);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (rbtxn_prepared(txn))
+		{
+			ReorderBufferTruncateTXN(rb, txn, true);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2369,17 +2426,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can only occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we are
+			 * sending the data out on a PREPARE during a two-phase commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2387,10 +2445,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+						txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2412,23 +2479,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2470,6 +2530,145 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/*
+	 * Always call the prepare filter. It's the job of the prepare filter to
+	 * give us the *same* response for a given xid across multiple calls
+	 * (including ones on restart).
+	 */
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, two-phase transactions do not contain any reorderbuffers. So allow
+	 * it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_is_streamed(txn) && rbtxn_commit_prepared(txn))
+		rb->stream_commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_is_streamed(txn) && rbtxn_rollback_prepared(txn))
+		rb->stream_rollback_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2512,7 +2711,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
-- 
1.8.3.1
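
For readers following the decode.c part of the patch above, the routing of COMMIT PREPARED and ROLLBACK PREPARED records can be restated as two small pure functions. This is only a sketch under stated assumptions: the names DecodeAction, decide_commit_prepared and decide_rollback_prepared are made up for illustration and do not appear in the patch. The boolean arguments stand in for the results of SnapBuildXactNeedsSkip(), the database and fast_forward checks, FilterByOrigin(), and ReorderBufferPrepareNeedSkip().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the outcomes seen in the decoding code above. */
typedef enum DecodeAction
{
	DECODE_FORGET,			/* ReorderBufferForget() on subxacts and xid */
	DECODE_REGULAR_COMMIT,	/* ReorderBufferCommit() */
	DECODE_FINISH_PREPARED, /* ReorderBufferFinishPrepared() */
	DECODE_PLAIN_ABORT		/* ReorderBufferAbort() on subxacts and xid */
} DecodeAction;

/*
 * COMMIT PREPARED: skip checks win first; otherwise a filter callback that
 * skipped the PREPARE forces a regular commit, so the changes are sent now.
 */
DecodeAction
decide_commit_prepared(bool needs_skip, bool has_filter_cb,
					   bool filter_skips_gid)
{
	if (needs_skip)
		return DECODE_FORGET;
	if (has_filter_cb && filter_skips_gid)
		return DECODE_REGULAR_COMMIT;
	return DECODE_FINISH_PREPARED;
}

/*
 * ROLLBACK PREPARED: the callback path is taken only if every filter lets
 * the record through; otherwise the buffered transaction is just dropped.
 */
DecodeAction
decide_rollback_prepared(bool filtered_by_origin, bool snapshot_skip,
						 bool filter_skips_gid)
{
	if (!filtered_by_origin && !snapshot_skip && !filter_skips_gid)
		return DECODE_FINISH_PREPARED;
	return DECODE_PLAIN_ABORT;
}
```

Writing it this way makes one asymmetry easy to see: the commit path only consults the prepare filter when a filter callback is registered, while the rollback path calls ReorderBufferPrepareNeedSkip() unconditionally.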

v10-0003-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From a5e0b59a6d7d9a3cf1e091d41814597cce36e5f1 Mon Sep 17 00:00:00 2001
From: postgres <postgres@CentOS7-x64.fritz.box>
Date: Fri, 16 Oct 2020 16:53:24 +1100
Subject: [PATCH v10] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.

Includes two-phase commit test code (streaming and not streaming).
---
 src/backend/access/transam/twophase.c             |  27 ++
 src/backend/replication/logical/proto.c           | 173 +++++++-
 src/backend/replication/logical/worker.c          | 466 +++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c       | 130 ++++++
 src/include/access/twophase.h                     |   1 +
 src/include/replication/logicalproto.h            |  39 ++
 src/test/subscription/t/020_twophase.pl           | 163 ++++++++
 src/test/subscription/t/021_twophase_streaming.pl | 366 +++++++++++++++++
 8 files changed, 1363 insertions(+), 2 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_streaming.pl
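
As a reading aid for the proto.c changes below, here is a standalone sketch of the non-streamed PREPARE message layout as logicalrep_write_prepare() emits it: the type byte 'P', one flags byte, three 64-bit fields (prepare LSN, end LSN, timestamp) in network byte order, then the NUL-terminated GID. Everything named Sketch* or sketch_* is hypothetical, including the flag values and the buffer size; the real constants live in logicalproto.h, which is not shown in this part of the thread. Unlike the patch, where apply_dispatch consumes the type byte before the read function runs, the reader here checks it itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical values; the patch defines its own in logicalproto.h. */
#define SKETCH_IS_PREPARE			0x01
#define SKETCH_IS_COMMIT_PREPARED	0x02
#define SKETCH_IS_ROLLBACK_PREPARED 0x04
#define SKETCH_GIDSIZE				200

typedef struct SketchPrepare
{
	uint8_t		flags;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Exactly one of the three flags must be set. */
int
sketch_flags_valid(uint8_t flags)
{
	return flags == SKETCH_IS_PREPARE ||
		flags == SKETCH_IS_COMMIT_PREPARED ||
		flags == SKETCH_IS_ROLLBACK_PREPARED;
}

/* Store a 64-bit value most significant byte first. */
static void
put_u64(uint8_t *dst, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		dst[i] = (uint8_t) (v >> (56 - 8 * i));
}

static uint64_t
get_u64(const uint8_t *src)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | src[i];
	return v;
}

/* Returns the number of bytes written, or 0 on invalid input. */
size_t
sketch_write_prepare(uint8_t *buf, size_t cap, const SketchPrepare *p)
{
	size_t		gidlen = strlen(p->gid);
	size_t		need = 1 + 1 + 3 * 8 + gidlen + 1;

	if (!sketch_flags_valid(p->flags) || gidlen == 0 || need > cap)
		return 0;

	buf[0] = 'P';
	buf[1] = p->flags;
	put_u64(buf + 2, p->prepare_lsn);
	put_u64(buf + 10, p->end_lsn);
	put_u64(buf + 18, (uint64_t) p->prepare_time);
	memcpy(buf + 26, p->gid, gidlen + 1);	/* includes the trailing NUL */
	return need;
}

/* Returns 1 on success, 0 if the message is malformed. */
int
sketch_read_prepare(const uint8_t *buf, size_t len, SketchPrepare *out)
{
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	if (len < 27 || buf[0] != 'P' || !sketch_flags_valid(buf[1]))
		return 0;

	gid = buf + 26;
	nul = memchr(gid, '\0', len - 26);
	if (nul == NULL)
		return 0;				/* GID is not terminated inside the message */
	gidlen = (size_t) (nul - gid);
	if (gidlen == 0 || gidlen >= SKETCH_GIDSIZE)
		return 0;				/* empty, or would overflow out->gid */

	out->flags = buf[1];
	out->prepare_lsn = get_u64(buf + 2);
	out->end_lsn = get_u64(buf + 10);
	out->prepare_time = (int64_t) get_u64(buf + 18);
	memcpy(out->gid, gid, gidlen + 1);
	return 1;
}
```

The reader rejects a GID that would not fit its destination. The patch instead copies with strcpy() into a pre-allocated buffer, so a bounded copy along these lines might be worth considering there. The STREAM PREPARE variant ('p') has the same shape with a 32-bit xid placed before the flags byte.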

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..2e0a408 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID is around.
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..c0b83da 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -99,6 +99,7 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	if (flags != 0)
 		elog(ERROR, "unrecognized flags %u in commit message", flags);
 
+
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
 	commit_data->end_lsn = pq_getmsgint64(in);
@@ -106,6 +107,176 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in which
+	 * case we expect to have a non-empty GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(strlen(txn->gid) > 0);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags |= LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags |= LOGICALREP_IS_PREPARE;
+
+	/* Make sure exactly one of the expected flags is set. */
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ *
+ * (For stream PREPARE, stream COMMIT PREPARED, stream ROLLBACK PREPARED)
+ *
+ * TODO - This is mostly cut/paste from logicalrep_write_prepare. Consider refactoring for commonality.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8	flags = 0;
+
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "proto: logicalrep_write_stream_prepare");
+#endif
+
+	pq_sendbyte(out, 'p');		/* sending STREAM PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case we
+	 * expect to have a non-empty GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(strlen(txn->gid) > 0);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags |= LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags |= LOGICALREP_IS_PREPARE;
+
+	/* Make sure exactly one of the expected flags is set. */
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ *
+ * (For stream PREPARE, stream COMMIT PREPARED, stream ROLLBACK PREPARED)
+ *
+ * TODO - This is mostly cut/paste from logicalrep_read_prepare. Consider refactoring for commonality.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "proto: logicalrep_read_stream_prepare");
+#endif
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b8e297c..fbb7acf 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -722,6 +722,7 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_timestamp = commit_data.committime;
 
 		CommitTransactionCommand();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -742,6 +743,461 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * On the publisher, a prepared transaction can be aborted concurrently
+	 * while it is being decoded. In that case decoding of it stops, so the
+	 * transaction may never get prepared on the subscriber, yet the ROLLBACK
+	 * PREPARED still reaches us. It is therefore OK for the GID to be missing
+	 * here, and we only finish the prepared transaction if it exists.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/* Called from apply_handle_stream_prepare to handle STREAM PREPARE. */
+static void
+apply_handle_stream_prepare_txn(TransactionId xid, LogicalRepPrepareData *prepare_data)
+{
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "worker: apply_handle_stream_prepare_txn");
+#endif
+
+	/*
+	 * FIXME - The following condition was in apply_handle_prepare_txn, but here I found that IsTransactionState() was ALWAYS false.
+	 * The synchronization worker runs in single transaction. *
+	if (IsTransactionState() && !am_tablesync_worker())
+	*/
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * ==================================================================================================
+		 * The following chunk of code is largely cut/paste from the existing apply_handle_prepare_commit_txn
+		 * which was handling the non-two-phase streaming commit by applying the operations of the spooled file.
+		 *
+		 * Differences are:
+		 * - Here the xid is known already because apply_handle_stream_prepare already called
+		 *   logicalrep_read_stream_prepare
+		 *
+		 * TODO - This is a possible candidate for refactoring since so much of it is the same.
+		 * ==================================================================================================
+		 */
+
+		Assert(!in_streamed_transaction);
+
+		elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+		ensure_transaction();
+
+		/*
+		 * Allocate file handle and memory required to process all the messages in
+		 * TopTransactionContext to avoid them getting reset after each message is
+		 * processed.
+		 */
+		oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+		/* open the spool file for the committed transaction */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		elog(DEBUG1, "replaying changes from file \"%s\"", path);
+#ifdef DEBUG_STREAM_2PC
+		elog(LOG, "worker: replaying changes from file \"%s\"", path);
+#endif
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+		buffer = palloc(BLCKSZ);
+		initStringInfo(&s2);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		remote_final_lsn = prepare_data->prepare_lsn;
+
+		/*
+		 * Make sure the handle apply_dispatch methods are aware we're in a remote
+		 * transaction.
+		 */
+		in_remote_transaction = true;
+		pgstat_report_activity(STATE_RUNNING, NULL);
+
+		/*
+		 * Read the entries one by one and pass them through the same logic as in
+		 * apply_dispatch.
+		 */
+		nchanges = 0;
+		while (true)
+		{
+			int			nbytes;
+			int			len;
+
+			CHECK_FOR_INTERRUPTS();
+
+			/* read length of the on-disk record */
+			nbytes = BufFileRead(fd, &len, sizeof(len));
+
+			/* have we reached end of the file? */
+			if (nbytes == 0)
+				break;
+
+			/* do we have a correct length? */
+			if (nbytes != sizeof(len))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+								path)));
+
+			Assert(len > 0);
+
+			/* make sure we have a sufficiently large buffer */
+			buffer = repalloc(buffer, len);
+
+			/* and finally read the data into the buffer */
+			if (BufFileRead(fd, buffer, len) != len)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+								path)));
+
+			/* copy the buffer to the stringinfo and call apply_dispatch */
+			resetStringInfo(&s2);
+			appendBinaryStringInfo(&s2, buffer, len);
+
+			/* Ensure we are reading the data into our memory context. */
+			oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+			apply_dispatch(&s2);
+
+			MemoryContextReset(ApplyMessageContext);
+
+			MemoryContextSwitchTo(oldcxt);
+
+			nchanges++;
+
+			if (nchanges % 1000 == 0)
+				elog(DEBUG1, "replayed %d changes from file \"%s\"",
+					 nchanges, path);
+		}
+
+		BufFileClose(fd);
+
+		pfree(buffer);
+		pfree(s2.data);
+
+		/*
+		 * ==================================================================================================
+		 * The following chunk of code is cut/paste from the existing apply_handle_prepare_txn
+		 * which was handling the two-phase prepare of the non-streamed tx
+		 * ==================================================================================================
+		 */
+#ifdef DEBUG_STREAM_2PC
+		elog(LOG, "worker: call PrepareTransactionBlock()");
+#endif
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		/* End of copied prepare code */
+
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+
+		elog(DEBUG1, "replayed %d (all) changes from file \"%s\"", nchanges, path);
+#ifdef DEBUG_STREAM_2PC
+		elog(LOG, "worker: replayed %d (all) changes from file \"%s\"", nchanges, path);
+#endif
+
+		/* ============================================================================================== */
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	/* FIXME - Is it OK to do this here (outside of the if/else)? */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_stream_prepare to handle STREAM COMMIT PREPARED.
+ * NOTE: The following code is exactly the same as apply_handle_commit_prepared_txn.
+ */
+static void
+apply_handle_stream_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "worker: apply_handle_stream_commit_prepared_txn");
+#endif
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_stream_prepare to handle STREAM ROLLBACK PREPARED.
+ * NOTE: The following code is exactly the same as apply_handle_rollback_prepared_txn.
+ */
+static void
+apply_handle_stream_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "worker: apply_handle_stream_rollback_prepared_txn");
+#endif
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * During logical decoding, it's possible that a prepared transaction got
+	 * aborted while it was being decoded. In that case, we stop the decoding
+	 * and abort the transaction immediately. However, the ROLLBACK PREPARED
+	 * message still reaches the subscriber, so on the apply side it's ok for
+	 * the gid to be missing.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE message.
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "worker: apply_handle_stream_prepare (xid=%u, prepare_type=%u)", xid, prepare_data.prepare_type);
+#endif
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_stream_prepare_txn(xid, &prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_stream_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_stream_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of stream prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1908,10 +2364,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
@@ -1956,6 +2416,10 @@ apply_dispatch(StringInfo s)
 		case 'c':
 			apply_handle_stream_commit(s);
 			break;
+			/* STREAM PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'p':
+			apply_handle_stream_prepare(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
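
A reading aid for the replay loop in apply_handle_stream_prepare_txn above: the spool file is treated as a sequence of records, each an int length followed by that many bytes, read until a clean EOF. Below is a minimal standalone sketch of that loop, assuming plain stdio in place of BufFileRead and a return code in place of ereport(ERROR). The names replay_records, record_cb, write_record and sum_lengths_cb are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical callback type standing in for apply_dispatch(). */
typedef void (*record_cb) (const char *data, int len, void *arg);

/* Hypothetical writer: an int length header followed by the payload. */
static void
write_record(FILE *fp, const char *data, int len)
{
	fwrite(&len, sizeof(len), 1, fp);
	fwrite(data, 1, (size_t) len, fp);
}

/* Example callback: accumulate the payload sizes into *(int *) arg. */
static void
sum_lengths_cb(const char *data, int len, void *arg)
{
	(void) data;
	*(int *) arg += len;
}

/*
 * Read length-prefixed records from fp until EOF and pass each one to cb.
 * Returns the number of records replayed, or -1 on a short read, an
 * invalid length, or allocation failure (the patch raises ERROR instead).
 */
static int
replay_records(FILE *fp, record_cb cb, void *arg)
{
	char	   *buffer = NULL;
	int			nchanges = 0;

	for (;;)
	{
		int			len;
		size_t		nbytes;
		char	   *tmp;

		/* read length of the on-disk record */
		nbytes = fread(&len, 1, sizeof(len), fp);

		/* zero bytes read: treated as end of file, as in the patch */
		if (nbytes == 0)
			break;

		/* a partial header or a non-positive length is a corrupt file */
		if (nbytes != sizeof(len) || len <= 0)
		{
			free(buffer);
			return -1;
		}

		/* make sure the buffer is large enough for this record */
		tmp = realloc(buffer, (size_t) len);
		if (tmp == NULL)
		{
			free(buffer);
			return -1;
		}
		buffer = tmp;

		/* read the payload; a short read is also an error */
		if (fread(buffer, 1, (size_t) len, fp) != (size_t) len)
		{
			free(buffer);
			return -1;
		}

		cb(buffer, len, arg);
		nchanges++;
	}

	free(buffer);
	return nchanges;
}
```

The sketch rejects a non-positive length at run time, where the patch only has Assert(len > 0); whether the real code should do the same is a separate question for the patch authors.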
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..240fca6 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,12 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_stream_commit_prepared_txn(LogicalDecodingContext *ctx,
+												ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_stream_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn, XLogRecPtr abort_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +155,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +169,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
+	cb->stream_commit_prepared_cb = pgoutput_stream_commit_prepared_txn;
+	cb->stream_rollback_prepared_cb = pgoutput_stream_rollback_prepared_txn;
 }
 
 static void
@@ -378,6 +398,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +919,74 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "pgoutput: pgoutput_stream_prepare_txn");
+#endif
+
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to commit the previously prepared transaction.
+ */
+static void
+pgoutput_stream_commit_prepared_txn(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr commit_lsn)
+{
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "pgoutput: pgoutput_stream_commit_prepared_txn");
+#endif
+
+	Assert(rbtxn_is_streamed(txn));
+	Assert(rbtxn_prepared(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to abort the previously prepared transaction.
+ */
+static void
+pgoutput_stream_rollback_prepared_txn(LogicalDecodingContext *ctx,
+									  ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn)
+{
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "pgoutput: pgoutput_stream_rollback_prepared_txn");
+#endif
+
+	Assert(rbtxn_is_streamed(txn));
+	Assert(rbtxn_prepared(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, abort_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0c2cda2..162e41a 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -87,6 +87,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -94,6 +95,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -101,6 +124,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData * prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -144,4 +171,16 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+/*
+ * FIXME - Uncomment this to see more logging for streamed two-phase transactions.
+ *
+ * #define DEBUG_STREAM_2PC
+ */
+
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
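
To illustrate the prepare_type check added to logicalproto.h above: the three LOGICALREP_IS_* values are distinct bits, but a message is valid only when the field equals exactly one of them, so combined values such as 0x03 must be rejected. A minimal standalone sketch follows, assuming the flag values from the hunk above. The helpers prepare_type_is_valid and prepare_type_name are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Values copied from the logicalproto.h hunk in the patch. */
#define LOGICALREP_IS_PREPARE			0x01
#define LOGICALREP_IS_COMMIT_PREPARED	0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04

/* Hypothetical helper: true only if flags is exactly one known type. */
static bool
prepare_type_is_valid(uint8_t flags)
{
	return flags == LOGICALREP_IS_PREPARE ||
		flags == LOGICALREP_IS_COMMIT_PREPARED ||
		flags == LOGICALREP_IS_ROLLBACK_PREPARED;
}

/* Hypothetical helper: human-readable name, mirroring the apply-side switch. */
static const char *
prepare_type_name(uint8_t flags)
{
	switch (flags)
	{
		case LOGICALREP_IS_PREPARE:
			return "PREPARE";
		case LOGICALREP_IS_COMMIT_PREPARED:
			return "COMMIT PREPARED";
		case LOGICALREP_IS_ROLLBACK_PREPARED:
			return "ROLLBACK PREPARED";
		default:
			return "invalid";
	}
}
```

Writing the check as a function sidesteps the argument-evaluation pitfalls of a macro; the patch keeps a macro, which is the usual style in the PostgreSQL headers.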
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..3feb2c3
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+        'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+   is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+   is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_streaming.pl b/src/test/subscription/t/021_twophase_streaming.pl
new file mode 100644
index 0000000..7dfc965
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_streaming.pl
@@ -0,0 +1,366 @@
+# logical replication of 2PC test (with streaming = on)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 23;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# --------------------------------------------------------------
+# 2PC PREPARE / COMMIT PREPARED test.
+#
+# Mass data is streamed as a 2PC transaction.
+# Then there is a commit prepared.
+# Expect all data is replicated on subscriber side after the commit.
+# --------------------------------------------------------------
+#
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+  $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+# --------------------------------------------------------------
+# 2PC PREPARE / ROLLBACK PREPARED test.
+#
+# Table is deleted back to 2 rows which are replicated on subscriber.
+# Mass data is streamed using 2PC but then there is a rollback prepared.
+# Expect data rolls back leaving only the 2 rows on the subscriber.
+# --------------------------------------------------------------
+#
+# First, delete the data (delete will be replicated)
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+# --------------------------------------------------------------
+# Check that 2PC commit prepared is decoded properly on crash restart.
+#
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then both servers crash before the 2PC transaction is committed.
+# After the servers are restarted the pending transaction is committed.
+# Expect all data is replicated on subscriber side after the commit.
+# --------------------------------------------------------------
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result =
+  $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+# --------------------------------------------------------------
+# Do INSERT outside of 2PC but before ROLLBACK PREPARED.
+#
+# Table is deleted back to 2 rows which are replicated on subscriber.
+# Mass data is streamed using 2PC.
+# A single row INSERT is done which is outside of the 2PC transaction
+# Then there is a rollback prepared.
+# Expect 2PC data rolls back leaving only 3 rows on the subscriber.
+# --------------------------------------------------------------
+#
+# First, delete the data (delete will be replicated)
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# FIXME - this works OK but if INSERT overlaps pk of a row participating in the 2PC then it hangs
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+# but the extra INSERT outside of the 2PC still happened
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+# --------------------------------------------------------------
+# Do INSERT outside of 2PC but before COMMIT PREPARED.
+#
+# Table is deleted back to 2 rows which are replicated on subscriber.
+# Mass data is streamed using 2PC.
+# A single row INSERT is done which is outside of the 2PC transaction
+# Then there is a commit prepared.
+# Expect 2PC data + the extra row are on the subscriber.
+# --------------------------------------------------------------
+#
+# First, delete the data (delete will be replicated)
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# FIXME - this works OK but if INSERT overlaps pk of a row participating in the 2PC then it hangs
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+  $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+# --------------------------------------------------------------
+# Do DELETE outside of 2PC but before COMMIT PREPARED.
+#
+# Table is deleted back to 2 rows which are replicated on subscriber.
+# Mass data is streamed using 2PC.
+# A single row DELETE is done for one of the records of the 2PC transaction
+# Then there is a commit prepared.
+# Expect all the 2PC data rows on the subscriber (because delete would have failed).
+# --------------------------------------------------------------
+#
+# First, delete the data (delete will be replicated)
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+  $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+$result =
+  $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+# --------------------------------------------------------------
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1

#59Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#58)

Hello Ajin.

I have gone through the v10 patches to verify if and how my previous
v6 review comments got addressed.

Some issues remain, and there are a few newly introduced ones.

Mostly it is all very minor stuff.

Please find my revised review comments below.

Kind Regards.
Peter Smith
Fujitsu Australia

---

V10 REVIEW COMMENTS FOLLOW

==========
Patch v10-0001, File: contrib/test_decoding/test_decoding.c
==========

COMMENT
Line 285
+ {
+ errno = 0;
+ data->check_xid_aborted = (TransactionId)
+ strtoul(strVal(elem->arg), NULL, 0);
+
+ if (!TransactionIdIsValid(data->check_xid_aborted))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+ strVal(elem->arg))));
+ }

I think it is risky to assign strtoul directly to the
check_xid_aborted member because it makes some internal assumption
that the invalid transaction is the same as the error return from
strtoul.

Maybe better to do in 2 steps like below:

BEFORE
errno = 0;
data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);

AFTER
unsigned long xid;
errno = 0;
xid = strtoul(strVal(elem->arg), NULL, 0);
if (xid == 0 || errno != 0)
data->check_xid_aborted = InvalidTransactionId;
else
data->check_xid_aborted = (TransactionId) xid;
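
As a rough standalone sketch of the two-step idea, taken a little further: it also rejects trailing garbage and values that do not fit in 32 bits. The helper name parse_xid_option is made up for illustration and is not in the patch, and TransactionId / InvalidTransactionId are re-declared locally only so the snippet compiles outside the PostgreSQL tree.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Local stand-ins for the PostgreSQL definitions (assumption for this sketch). */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/*
 * Hypothetical helper, not part of the patch: parse a transaction id from
 * text. On success, store the value in *xid and return true. On any parse
 * failure, store InvalidTransactionId in *xid and return false.
 */
static bool
parse_xid_option(const char *str, TransactionId *xid)
{
	unsigned long val;
	char	   *end;

	*xid = InvalidTransactionId;

	errno = 0;
	val = strtoul(str, &end, 0);

	/* strtoul reported an error, e.g. the value overflowed unsigned long */
	if (errno != 0)
		return false;

	/* no digits were consumed, or something follows the number */
	if (end == str || *end != '\0')
		return false;

	/* zero means "invalid xid"; larger than 32 bits cannot be an xid */
	if (val == 0 || val > UINT32_MAX)
		return false;

	*xid = (TransactionId) val;
	return true;
}
```

The caller could then raise the existing "check-xid-aborted is not a valid xid" error whenever the helper returns false, so the validity test no longer depends on what strtoul happens to return on failure.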

---

COMMENT
Line 430
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)

Fix comment "ABORT PREPARED" --> "ROLLBACK PREPARED"

==========
Patch v10-0001, File: doc/src/sgml/logicaldecoding.sgml
==========

COMMENT
Section 48.6.1
Says:
An output plugin may also define functions to support streaming of
large, in-progress transactions. The stream_start_cb, stream_stop_cb,
stream_abort_cb, stream_commit_cb and stream_change_cb are required,
while stream_message_cb, stream_prepare_cb, stream_commit_prepared_cb,
stream_rollback_prepared_cb and stream_truncate_cb are optional.

An output plugin may also define functions to support two-phase
commits, which are decoded on PREPARE TRANSACTION. The prepare_cb,
commit_prepared_cb and rollback_prepared_cb callbacks are required,
while filter_prepare_cb is optional.

-

But is that correct? It seems strange/inconsistent to say that the 2PC
callbacks are mandatory for the non-streaming, but that they are
optional for streaming.

---

COMMENT
48.6.4.5 "Transaction Prepare Callback"
48.6.4.6 "Transaction Commit Prepared Callback"
48.6.4.7 "Transaction Rollback Prepared Callback"

There seems to be some confusion about what is optional and what is
mandatory. e.g. Why are the non-stream 2PC callbacks mandatory but the
stream 2PC callbacks are not? And also there is some inconsistency
with what is said in the paragraph at the top of the page versus what
each of the callback sections says wrt optional/mandatory.

The sub-sections 49.6.4.5, 49.6.4.6, 49.6.4.7 say those callbacks are
optional which IIUC Amit said is incorrect. This is similar to the
previous review comment.

---

COMMENT
Section 48.6.4.7 "Transaction Rollback Prepared Callback"

parameter "abort_lsn" probably should be "rollback_lsn"

---

COMMENT
Section 49.6.4.18. "Stream Rollback Prepared Callback"
Says:
The stream_rollback_prepared_cb callback is called to abort a
previously streamed transaction as part of a two-phase commit.

maybe should say "is called to rollback"

==========
Patch v10-0001, File: src/backend/replication/logical/logical.c
==========

COMMENT
Line 252
Says: We however enable two phase logical...

"two phase" --> "two-phase"

---

COMMENT
Line 885
Line 923
Says: If the plugin support 2 phase commits...

"support 2 phase" --> "supports two-phase" in the comment. Same issue
occurs twice.

---

COMMENT
Line 830
Line 868
Line 906
Says:
/* We're only supposed to call this when two-phase commits are supported */

There is an extra space between the "are" and "supported" in the comment.
Same issue occurs 3 times.

---

COMMENT
Line 1023
+ /*
+ * Skip if decoding of two-phase at PREPARE time is not enabled. In that
+ * case all two-phase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
+ */

Comment still is missing the word "transactions"
"Skip if decoding of two-phase at PREPARE time is not enabled."
-> "Skip if decoding of two-phase transactions at PREPARE time is not enabled."

==========
Patch v10-0001, File: src/include/replication/reorderbuffer.h
==========

COMMENT
Line 459
/* abort prepared callback signature */
typedef void (*ReorderBufferRollbackPreparedCB) (
ReorderBuffer *rb,
ReorderBufferTXN *txn,
XLogRecPtr abort_lsn);

There is no alignment consistency here for
ReorderBufferRollbackPreparedCB. Some function args are directly under
the "(" and some are on the same line. This function code is neither.

---

COMMENT
Line 638
@@ -431,6 +486,24 @@ typedef void (*ReorderBufferStreamAbortCB) (
ReorderBufferTXN *txn,
XLogRecPtr abort_lsn);

+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamRollbackPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr rollback_lsn);

There is inconsistent alignment of the arguments (compare how
other functions are aligned)

See:
- for ReorderBufferStreamCommitPreparedCB
- for ReorderBufferStreamRollbackPreparedCB
- for ReorderBufferPrepareNeedSkip
- for ReorderBufferTxnIsPrepared
- for ReorderBufferPrepare

---

COMMENT
Line 489
Line 495
Line 501
/* prepare streamed transaction callback signature */

Same comment cut/paste 3 times?
- for ReorderBufferStreamPrepareCB
- for ReorderBufferStreamCommitPreparedCB
- for ReorderBufferStreamRollbackPreparedCB

---

COMMENT
Line 457
/* abort prepared callback signature */
typedef void (*ReorderBufferRollbackPreparedCB) (
ReorderBuffer *rb,
ReorderBufferTXN *txn,
XLogRecPtr abort_lsn);

"abort" --> "rollback" in the function comment.

---

COMMENT
Line 269
/* In case of 2PC we need to pass GID to output plugin */

"2PC" --> "two-phase commit"

==========
Patch v10-0002, File: contrib/test_decoding/expected/two_phase.out (and .sql)
==========

COMMENT
General

It is a bit hard to see which are the main tests here and which are just
sub-parts of some test case.

e.g. It seems like the main tests are:

1. Test that decoding happens at PREPARE time
2. Test decoding of an aborted tx
3. Test a prepared tx which contains some DDL
4. Test decoding works while an uncommitted prepared tx with DDL exists
5. Test operations holding exclusive locks won't block decoding
6. Test savepoints and sub-transactions
7. Test "_nodecode" will defer the decoding until the commit time

Can the comments be made more obvious so it is easy to distinguish the
main tests from the steps of those tests?

---

COMMENT
Line 1
-- Test two-phased transactions, when two-phase-commit is enabled, transactions are
-- decoded at PREPARE time rather than at COMMIT PREPARED time.

Some commas to be removed and this comment to be split into several sentences.

---

COMMENT
Line 19
-- should show nothing

Comment could be more informative. E.g. "Should show nothing because
the PREPARE has not happened yet"

---

COMMENT
Line 77

Looks like there is a missing comment about here that should say
something like "Show that the DDL does not appear in the decoding"

---

COMMENT
Line 160
-- test savepoints and sub-xacts as a result

The subsequent test is testing savepoints. But is it testing sub
transactions like the comment says?

==========
Patch v10-0002, File: contrib/test_decoding/t/001_twophase.pl
==========

COMMENT
General

I think basically there are only 2 tests in this file.
1. to check that the concurrent abort works.
2. to check that the prepared tx can span a server shutdown/restart

But the test comments do not make this clear at all.
e.g. All the "#" comments look equally important although most of them
are just steps of each test case.
Can the comments be better to distinguish the tests versus the steps
of each test?

==========
Patch v10-0002, File: src/backend/replication/logical/decode.c
==========

COMMENT
Line 71
static void DecodeCommitPrepared(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
static void DecodeAbortPrepared(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_prepare * parsed);

The 2nd line of args is not aligned properly.
- for DecodeCommitPrepared
- for DecodeAbortPrepared
- for DecodePrepare

==========
Patch v10-0002, File: src/backend/replication/logical/reorderbuffer.c
==========

COMMENT
There are some parts of the code where in my v6 review I had a doubt
about the mutually exclusive treatment of the "streaming" flag and the
"rbtxn_prepared(txn)" state.

Basically I did not see how some parts of the code are treating NOT
streaming as implying 2PC etc because it defies my understanding that
2PC can also work in streaming mode. Perhaps the "streaming" flag has
a different meaning to how I interpret it? Or perhaps some functions
are guarding higher up and can only be called under certain
conditions?

Anyway, this confusion manifests in several parts of the code, none of
which was changed after my v6 review.

Affected code includes the following:

CASE 1
Wherever the ReorderBufferTruncateTXN(...) "prepared" flag (third
parameter) is hardwired true/false, I think there must be some
preceding Assert to guarantee the prepared state condition holds true.
There can't be any room for doubts like "but what will it do for
streamed 2PC..."
Line 1805 - ReorderBufferTruncateTXN(rb, txn, true); // if rbtxn_prepared(txn)
Line 1941 - ReorderBufferTruncateTXN(rb, txn, false); // state ??
Line 2389 - ReorderBufferTruncateTXN(rb, txn, false); // if streaming
Line 2396 - ReorderBufferTruncateTXN(rb, txn, true); // if not
streaming and if rbtxn_prepared(txn)
Line 2459 - ReorderBufferTruncateTXN(rb, txn, true); // if not streaming

~

CASE 2
Wherever the "streaming" flag is tested I don't really understand how
NOT streaming can automatically imply 2PC.
Line 2330 - if (streaming) // what about if it is streaming AND 2PC at
the same time?
Line 2387 - if (streaming) // what about if it is streaming AND 2PC at
the same time?
Line 2449 - if (streaming) // what about if it is streaming AND 2PC at
the same time?

~

Case 1 and Case 2 above overlap a fair bit. I just listed them so they
all get checked again.

Even if the code is thought to be currently OK I do still think
something should be done like:
a) add some more substantial comments to explain WHY the combination
of streaming and 2PC is not valid in the context
b) the Asserts to be strengthened to 100% guarantee that the streaming
and prepared states really are exclusive (if indeed they are). For
this point I thought the following Assert condition could be better:
Assert(streaming || rbtxn_prepared(txn));
Assert(stream_started || rbtxn_prepared(txn));
because as it is you still are left wondering if both streaming AND
rbtxn_prepared(txn) can be possible at the same time...
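
To make the distinction concrete, here is a minimal standalone sketch of the two kinds of check. Both function names are hypothetical stand-ins; the real code would test the "streaming" flag and rbtxn_prepared(txn) directly. An OR-style condition only rules out the case where neither state holds and still passes when both hold, whereas an inequality (boolean XOR) passes only when exactly one holds. Which of the two is appropriate depends on whether the states really are exclusive, which is the open question here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: true only when exactly one of the two states holds. */
static bool
exactly_one_state(bool streaming, bool prepared)
{
	return streaming != prepared;
}

/* Hypothetical: true when at least one state holds, including both. */
static bool
at_least_one_state(bool streaming, bool prepared)
{
	return streaming || prepared;
}
```

An Assert built on the first form would fire for a transaction that is both streamed and prepared; one built on the second form would not.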

---

COMMENT
Line 2634
* Anyways, two-phase transactions do not contain any reorderbuffers.

"Anyways" --> "Anyway"

==========
Patch v10-0003, File: src/backend/access/transam/twophase.c
==========

COMMENT
Line 557
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
}

 /*
+ * LookupGXact
+ * Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+ int i;
+ bool found = false;

The variable declarations (i and found) are not aligned.

==========
Patch v10-0003, File: src/backend/replication/logical/proto.c
==========

COMMENT
Line 125
Line 205
Assert(strlen(txn->gid) > 0);

I suggested that the assertion should also check txn->gid is not NULL.
You replied "In this case txn->gid has to be non NULL".

But that is exactly what I said :-)
If it HAS to be non-NULL then why not just Assert that in code instead
of leaving the reader wondering?

"Assert(strlen(txn->gid) > 0);" --> "Assert(txn->gid && strlen(txn->gid) > 0);"
Same occurs several times.
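
If the same condition is repeated in several places, it could also be wrapped once. A small sketch, assuming a helper named gid_is_set (hypothetical, not in the patch):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true when gid is a non-NULL, non-empty string. */
static bool
gid_is_set(const char *gid)
{
	/* check the pointer first so a NULL gid is never dereferenced */
	return gid != NULL && gid[0] != '\0';
}
```

Each call site would then read Assert(gid_is_set(txn->gid)), which states the non-NULL requirement explicitly.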

---

COMMENT
Line 133
Line 213
if (rbtxn_commit_prepared(txn))
flags |= LOGICALREP_IS_COMMIT_PREPARED;
else if (rbtxn_rollback_prepared(txn))
flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
else
flags |= LOGICALREP_IS_PREPARE;

Previously I wrote that the use of the bit flags on assignment in the
logicalrep_write_prepare was inconsistent with the way they are
treated when they are read. Really it should be using a direct
assignment instead of bit flags.

You said this is skipped anticipating a possible refactor. But IMO
this leaves the code in a half/half state. I think it is better to fix
it properly and if refactoring happens then deal with that at the
time.

The last comment I saw from Amit said to use my 1st proposal of direct
assignment instead of bit flag assignment.

(applies to both non-stream and stream functions)
- see logicalrep_write_prepare
- see logicalrep_write_stream_prepare
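
A minimal sketch of what direct assignment could look like, so that the writer produces exactly one of the values the reader later compares against with "==". The numeric flag values, the uint8_t type and both function names are assumptions made for this standalone snippet; only the three LOGICALREP_IS_* names come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values (assumption); the real ones are in logicalproto.h. */
#define LOGICALREP_IS_PREPARE			0x01
#define LOGICALREP_IS_COMMIT_PREPARED	0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04

/* Hypothetical writer side: pick exactly one value by direct assignment. */
static uint8_t
choose_prepare_flags(bool commit_prepared, bool rollback_prepared)
{
	uint8_t		flags;

	if (commit_prepared)
		flags = LOGICALREP_IS_COMMIT_PREPARED;
	else if (rollback_prepared)
		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
	else
		flags = LOGICALREP_IS_PREPARE;

	return flags;
}

/* Hypothetical reader side: the value must equal one of the three. */
static bool
prepare_flags_are_valid(uint8_t flags)
{
	return flags == LOGICALREP_IS_PREPARE ||
		flags == LOGICALREP_IS_COMMIT_PREPARED ||
		flags == LOGICALREP_IS_ROLLBACK_PREPARED;
}
```

With "|=" the result is the same as long as flags starts at zero, but direct assignment makes it obvious that combinations such as 0x03 are never produced.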

==========
Patch v10-0003, File: src/backend/replication/pgoutput/pgoutput.c
==========

COMMENT
Line 429
/*
* PREPARE callback
*/
static void
pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr prepare_lsn)
The function comment looks wrong.
Shouldn't this comment say "ROLLBACK PREPARED callback"?

==========
Patch v10-0003, File: src/include/replication/logicalproto.h
==========

Line 115
#define PrepareFlagsAreValid(flags) \
((flags == LOGICALREP_IS_PREPARE) || \
(flags == LOGICALREP_IS_COMMIT_PREPARED) || \
(flags == LOGICALREP_IS_ROLLBACK_PREPARED))

Would be safer if all the references to flags are in parentheses
e.g. "flags" --> "(flags)"
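
To illustrate why the parentheses matter, here is a standalone sketch with both forms side by side. The numeric flag values and the _bare / _paren suffixes are assumptions for this snippet only. With the bare form, an argument such as "a | b" expands to "a | b == LOGICALREP_IS_PREPARE", which C parses as "a | (b == LOGICALREP_IS_PREPARE)" because "==" binds tighter than "|". For a = 0x01 and b = 0x02 that makes the bare macro report the combined value 0x03 as valid.

```c
/* Placeholder values (assumption); the real ones are in logicalproto.h. */
#define LOGICALREP_IS_PREPARE			0x01
#define LOGICALREP_IS_COMMIT_PREPARED	0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04

/* As in the patch: the macro argument is used bare. */
#define PrepareFlagsAreValid_bare(flags) \
	((flags == LOGICALREP_IS_PREPARE) || \
	 (flags == LOGICALREP_IS_COMMIT_PREPARED) || \
	 (flags == LOGICALREP_IS_ROLLBACK_PREPARED))

/* Suggested form: every use of the argument is parenthesized. */
#define PrepareFlagsAreValid_paren(flags) \
	(((flags) == LOGICALREP_IS_PREPARE) || \
	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
```

Both forms behave the same when the argument is a plain variable; they differ only when the caller passes an expression.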

[END]

#60Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#58)

On Fri, Oct 16, 2020 at 5:21 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hello Ajin,

The v9 patches provided support for two-phase transactions for NON-streaming.

Now I have added STREAM support for two-phase transactions, and bumped
all patches to version v10.

(The 0001 and 0002 patches are unchanged. Only 0003 is changed).

--

There are a few TODO/FIXME comments in the code highlighting parts
needing some attention.

There is a #define DEBUG_STREAM_2PC useful for debugging, which I can
remove later.

All the patches have some whitespaces issues when applied. We can
resolve them as we go.

Please let me know any comments/feedback.

Hi Peter,

Thanks for your patch. Some comments for your patch:

Comments:

src/backend/replication/logical/worker.c
@@ -888,6 +888,319 @@ apply_handle_prepare(StringInfo s)
+ /*
+ * FIXME - Following condition was in apply_handle_prepare_txn except
I found  it was ALWAYS IsTransactionState() == false
+ * The synchronization worker runs in single transaction. *
+ if (IsTransactionState() && !am_tablesync_worker())
+ */
+ if (!am_tablesync_worker())

Comment: I don't think a tablesync worker will use streaming. None of
the other stream APIs check this, so this might not be relevant for
stream_prepare either.

+ /*
+ * ==================================================================================================
+ * The following chunk of code is largely cut/paste from the existing
apply_handle_prepare_commit_txn

Comment: Here, I think you meant apply_handle_stream_commit. Also
rather than duplicating this chunk of code, you could put it in a new
function.

+ /* open the spool file for the committed transaction */
+ changes_filename(path, MyLogicalRepWorker->subid, xid);

Comment: Here the comment should read "committed/prepared" rather than
"committed"

+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }

Comment: This else block might not be necessary as a tablesync worker
will not initiate the streaming APIs.

+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();

Comment: After rereading the code and the transaction state description in
src/backend/access/transam/README, I am not entirely sure if the
BeginTransactionBlock followed by CommitTransactionCommand is really
needed here.
I understand this code was copied over from apply_handle_prepare_txn,
but now looking back I'm not so sure if it is correct. The transaction
would have already begun as part of applying the changes, why begin it
again?
Maybe Amit could confirm this.

END

regards,
Ajin Cherian
Fujitsu Australia

#61Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#60)

The PG docs for PREPARE TRANSACTION [1] don't say anything about an
empty (zero length) transaction-id.
e.g. PREPARE TRANSACTION '';
[1]: https://www.postgresql.org/docs/current/sql-prepare-transaction.html

~

Meanwhile, during testing I found the 2PC prepare hangs when an empty
id is used.

Now I am not sure: does this represent some bug within the 2PC code, or
should the PREPARE never have allowed an empty transaction-id
to be specified in the first place?

Thoughts?

Kind Regards
Peter Smith.
Fujitsu Australia.

#62Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#60)

On Tue, Oct 20, 2020 at 4:32 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Fri, Oct 16, 2020 at 5:21 PM Peter Smith <smithpb2250@gmail.com> wrote:

Comments:

src/backend/replication/logical/worker.c
@@ -888,6 +888,319 @@ apply_handle_prepare(StringInfo s)
+ /*
+ * FIXME - Following condition was in apply_handle_prepare_txn except
I found  it was ALWAYS IsTransactionState() == false
+ * The synchronization worker runs in single transaction. *
+ if (IsTransactionState() && !am_tablesync_worker())
+ */
+ if (!am_tablesync_worker())

Comment: I don't think a tablesync worker will use streaming. None of
the other stream APIs check this, so this might not be relevant for
stream_prepare either.

Yes, I think this is right. See pgoutput_startup where we are
disabling the streaming for init phase. But it is always good to
test this once to confirm.

+ /*
+ * ==================================================================================================
+ * The following chunk of code is largely cut/paste from the existing
apply_handle_prepare_commit_txn

Comment: Here, I think you meant apply_handle_stream_commit. Also
rather than duplicating this chunk of code, you could put it in a new
function.

+ /* open the spool file for the committed transaction */
+ changes_filename(path, MyLogicalRepWorker->subid, xid);

Comment: Here the comment should read "committed/prepared" rather than
"committed"

+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }

Comment: This else block might not be necessary as a tablesync worker
will not initiate the streaming APIs.

I think it is better to have an Assert here for streaming-mode?

+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();

Comment: After rereading the code and the transaction state description in
src/backend/access/transam/README, I am not entirely sure if the
BeginTransactionBlock followed by CommitTransactionCommand is really
needed here.

Yeah, I also find this strange. I guess the patch is doing so because
it needs to call PrepareTransactionBlock later but I am not sure. How
can we call CommitTransactionCommand()? Won't it commit the ongoing
transaction and make it visible even before it is visible on the
publisher? I think you can verify by having a breakpoint after
CommitTransactionCommand() and see if the changes for which we are
doing prepare become visible.

I understand this code was copied over from apply_handle_prepare_txn,
but now looking back I'm not so sure if it is correct. The transaction
would have already begun as part of applying the changes, why begin it
again?
Maybe Amit could confirm this.

I hope the above suggestions will help to proceed here.

--
With Regards,
Amit Kapila.

#63Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#61)

On Wed, Oct 21, 2020 at 1:38 PM Peter Smith <smithpb2250@gmail.com> wrote:

The PG docs for PREPARE TRANSACTION [1] don't say anything about an
empty (zero length) transaction-id.
e.g. PREPARE TRANSACTION '';
[1] https://www.postgresql.org/docs/current/sql-prepare-transaction.html

~

Meanwhile, during testing I found the 2PC prepare hangs when an empty
id is used.

Can you please take an example to explain what you are trying to say?
I have tried the below and didn't face any problem:

postgres=# Begin;
BEGIN
postgres=*# select txid_current();
txid_current
--------------
534
(1 row)
postgres=*# Prepare Transaction 'foo';
PREPARE TRANSACTION
postgres=# Commit Prepared 'foo';
COMMIT PREPARED
postgres=# Begin;
BEGIN
postgres=*# Prepare Transaction 'foo';
PREPARE TRANSACTION
postgres=# Commit Prepared 'foo';
COMMIT PREPARED

--
With Regards,
Amit Kapila.

#64Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#59)

On Tue, Oct 20, 2020 at 9:46 AM Peter Smith <smithpb2250@gmail.com> wrote:

==========
Patch v10-0002, File: src/backend/replication/logical/reorderbuffer.c
==========

COMMENT
There are some parts of the code where in my v6 review I had a doubt
about the mutually exclusive treatment of the "streaming" flag and the
"rbtxn_prepared(txn)" state.

I am not sure about the exact specifics here but we can always prepare
a transaction that is streamed. I have to raise one more point in this
regard. Why do we need stream_commit_prepared_cb,
stream_rollback_prepared_cb callbacks? Do we need to do something
separate in pgoutput or otherwise for these APIs? If not, can't we use
a non-stream version of these APIs instead? There appears to be a
use-case for stream_prepare_cb which is to apply the existing changes
on subscriber and call prepare but I can't see a use case for the other
two APIs.

One minor comment:
v10-0001-Support-2PC-txn-base

1.
@@ -574,6 +655,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *,
TransactionId, Snapshot snapsho
 void ReorderBufferCommit(ReorderBuffer *, TransactionId,
  XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
  TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+                           XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+                           TimestampTz commit_time,
+                           RepOriginId origin_id, XLogRecPtr origin_lsn,
+                           char *gid, bool is_commit);
 void ReorderBufferAssignChild(ReorderBuffer *, TransactionId,
TransactionId, XLogRecPtr commit_lsn);
 void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
  XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -597,6 +683,15 @@ void
ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid,
XLog
 bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+    const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuf

I don't think these changes belong to this patch as the definition of
these functions is not part of this patch.

--
With Regards,
Amit Kapila.

#65Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#63)

On Wed, Oct 21, 2020 at 7:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 21, 2020 at 1:38 PM Peter Smith <smithpb2250@gmail.com> wrote:

The PG docs for PREPARE TRANSACTION [1] don't say anything about an
empty (zero length) transaction-id.
e.g. PREPARE TRANSACTION '';
[1] https://www.postgresql.org/docs/current/sql-prepare-transaction.html

~

Meanwhile, during testing I found the 2PC prepare hangs when an empty
id is used.

Can you please take an example to explain what you are trying to say?

I was referring to an empty (zero length) transaction ID, not an empty
transaction.

The example was already given as PREPARE TRANSACTION '';

A longer example from my regress test is shown below. Using 2PC
pub/sub this will currently hang:

# --------------------
# Test using empty GID
# --------------------
# check that 2PC gets replicated to subscriber
$node_publisher->safe_psql('postgres',
"BEGIN;INSERT INTO tab_full VALUES (51);PREPARE TRANSACTION '';");
$node_publisher->poll_query_until('postgres', $caughtup_query)
or die "Timed out while waiting for subscriber to catch up";
# check that transaction is in prepared state on subscriber
$result =
$node_subscriber->safe_psql('postgres', "SELECT count(*) FROM
pg_prepared_xacts where gid = '';");
is($result, qq(1), 'transaction is prepared on subscriber');
# ROLLBACK
$node_publisher->safe_psql('postgres',
"ROLLBACK PREPARED '';");
# check that 2PC gets aborted on subscriber
$node_publisher->poll_query_until('postgres', $caughtup_query)
or die "Timed out while waiting for subscriber to catch up";
$result =
$node_subscriber->safe_psql('postgres', "SELECT count(*) FROM
pg_prepared_xacts where gid = '';");
is($result, qq(0), 'transaction is aborted on subscriber');

~

Is that something that should be made to work for 2PC pub/sub, or was
Postgres PREPARE TRANSACTION statement wrong to allow the user to
specify an empty transaction ID in the first place?

Kind Regards
Peter Smith.
Fujitsu Australia.

#66Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#65)

On Thu, Oct 22, 2020 at 4:58 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Wed, Oct 21, 2020 at 7:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 21, 2020 at 1:38 PM Peter Smith <smithpb2250@gmail.com> wrote:

The PG docs for PREPARE TRANSACTION [1] don't say anything about an
empty (zero length) transaction-id.
e.g. PREPARE TRANSACTION '';
[1] https://www.postgresql.org/docs/current/sql-prepare-transaction.html

~

Meanwhile, during testing I found the 2PC prepare hangs when an empty
id is used.

Can you please take an example to explain what you are trying to say?

I was referring to an empty (zero length) transaction ID, not an empty
transaction.

Oh, I confused it with the system-generated 32-bit TransactionId.
Now I see what you were referring to.

The example was already given as PREPARE TRANSACTION '';

Is that something that should be made to work for 2PC pub/sub, or was
Postgres PREPARE TRANSACTION statement wrong to allow the user to
specify an empty transaction ID in the first place?

I don't see any problem with the empty transaction identifier used in
Prepare Transaction. This is just used as an identifier to uniquely
identify the transaction. If you try to use an empty string ('') more
than once for Prepare Transaction, it will give an error like below:
postgres=*# prepare transaction '';
ERROR: transaction identifier "" is already in use

So, I think this should work for pub/sub as well. Did you find out the
reason for the hang?

--
With Regards,
Amit Kapila.

#67Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#59)
3 attachment(s)

On Tue, Oct 20, 2020 at 3:15 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hello Ajin.

I have gone through the v10 patches to verify if and how my previous
v6 review comments got addressed.

Some issues remain, and there are a few newly introduced ones.

Mostly it is all very minor stuff.

Please find my revised review comments below.

Kind Regards.
Peter Smith
Fujitsu Australia

---

V10 REVIEW COMMENTS FOLLOW

==========
Patch v10-0001, File: contrib/test_decoding/test_decoding.c
==========

COMMENT
Line 285
+ {
+ errno = 0;
+ data->check_xid_aborted = (TransactionId)
+ strtoul(strVal(elem->arg), NULL, 0);
+
+ if (!TransactionIdIsValid(data->check_xid_aborted))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+ strVal(elem->arg))));
+ }

I think it is risky to assign strtoul directly to the
check_xid_aborted member because it makes some internal assumption
that the invalid transaction is the same as the error return from
strtoul.

Maybe better to do in 2 steps like below:

BEFORE
errno = 0;
data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);

AFTER
unsigned long xid;
errno = 0;
xid = strtoul(strVal(elem->arg), NULL, 0);
if (xid == 0 || errno != 0)
data->check_xid_aborted = InvalidTransactionId;
else
data->check_xid_aborted =(TransactionId)xid;
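
For illustration, the two-step parse suggested above could be sketched as a
standalone helper. This is only a sketch under assumptions:
parse_xid_option is a hypothetical name, the TransactionId typedef and
InvalidTransactionId macro are local stand-ins for the PostgreSQL
definitions, and the checks for trailing characters and out-of-range values
go beyond what the review comment asked for.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Local stand-ins for the PostgreSQL definitions (assumption). */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/*
 * Hypothetical helper: parse an xid option string.
 *
 * Returns true and stores the value in *result on success. Returns false
 * (leaving *result set to InvalidTransactionId) on empty input, trailing
 * characters, a strtoul error, a value of 0, or a value above 32 bits.
 */
static bool
parse_xid_option(const char *str, TransactionId *result)
{
    char           *endptr;
    unsigned long   val;

    assert(str != NULL && result != NULL);
    *result = InvalidTransactionId;

    errno = 0;
    val = strtoul(str, &endptr, 0);

    /* strtoul reports range errors via errno, not via its return value */
    if (errno != 0 || endptr == str || *endptr != '\0')
        return false;

    /* 0 is the invalid xid; anything wider than 32 bits cannot be an xid */
    if (val == 0 || val > UINT32_MAX)
        return false;

    *result = (TransactionId) val;
    return true;
}
```

Keeping the strtoul result in its own variable means the error check does
not depend on the invalid xid happening to equal strtoul's failure value.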

---

Updated accordingly.

COMMENT
Line 430
+
+/* ABORT PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn)

Fix comment "ABORT PREPARED" --> "ROLLBACK PREPARED"

Updated accordingly.

==========
Patch v10-0001, File: doc/src/sgml/logicaldecoding.sgml
==========

COMMENT
Section 48.6.1
Says:
An output plugin may also define functions to support streaming of
large, in-progress transactions. The stream_start_cb, stream_stop_cb,
stream_abort_cb, stream_commit_cb and stream_change_cb are required,
while stream_message_cb, stream_prepare_cb, stream_commit_prepared_cb,
stream_rollback_prepared_cb and stream_truncate_cb are optional.

An output plugin may also define functions to support two-phase
commits, which are decoded on PREPARE TRANSACTION. The prepare_cb,
commit_prepared_cb and rollback_prepared_cb callbacks are required,
while filter_prepare_cb is optional.

-

But is that correct? It seems strange/inconsistent to say that the 2PC
callbacks are mandatory for the non-streaming, but that they are
optional for streaming.

Updated making all the 2PC callbacks mandatory.

---

COMMENT
48.6.4.5 "Transaction Prepare Callback"
48.6.4.6 "Transaction Commit Prepared Callback"
48.6.4.7 "Transaction Rollback Prepared Callback"

There seems to be some confusion about what is optional and what is
mandatory. e.g. Why are the non-stream 2PC callbacks mandatory but the
stream 2PC callbacks are not? And also there is some inconsistency
with what is said in the paragraph at the top of the page versus what
each of the callback sections says wrt optional/mandatory.

The sub-sections 49.6.4.5, 49.6.4.6, 49.6.4.7 say those callbacks are
optional which IIUC Amit said is incorrect. This is similar to the
previous review comment

---

Updated making all the 2PC callbacks mandatory.

COMMENT
Section 48.6.4.7 "Transaction Rollback Prepared Callback"

parameter "abort_lsn" probably should be "rollback_lsn"

---

COMMENT
Section 49.6.4.18. "Stream Rollback Prepared Callback"
Says:
The stream_rollback_prepared_cb callback is called to abort a
previously streamed transaction as part of a two-phase commit.

maybe should say "is called to rollback"

==========
Patch v10-0001, File: src/backend/replication/logical/logical.c
==========

COMMENT
Line 252
Says: We however enable two phase logical...

"two phase" --> "two-phase"

--

COMMENT
Line 885
Line 923
Says: If the plugin support 2 phase commits...

"support 2 phase" --> "supports two-phase" in the comment. Same issue
occurs twice.

---

COMMENT
Line 830
Line 868
Line 906
Says:
/* We're only supposed to call this when two-phase commits are supported */

There is an extra space between the "are" and "supported" in the comment.
Same issue occurs 3 times.

---

COMMENT
Line 1023
+ /*
+ * Skip if decoding of two-phase at PREPARE time is not enabled. In that
+ * case all two-phase transactions are considered filtered out and will be
+ * applied as regular transactions at COMMIT PREPARED.
+ */

Comment still is missing the word "transactions"
"Skip if decoding of two-phase at PREPARE time is not enabled."
-> "Skip if decoding of two-phase transactions at PREPARE time is not enabled."

Updated accordingly.

==========
Patch v10-0001, File: src/include/replication/reorderbuffer.h
==========

COMMENT
Line 459
/* abort prepared callback signature */
typedef void (*ReorderBufferRollbackPreparedCB) (
ReorderBuffer *rb,
ReorderBufferTXN *txn,
XLogRecPtr abort_lsn);

There is no alignment consistency here for
ReorderBufferRollbackPreparedCB. Some function args are directly under
the "(" and some are on the same line. This function code is neither.

---

COMMENT
Line 638
@@ -431,6 +486,24 @@ typedef void (*ReorderBufferStreamAbortCB) (
ReorderBufferTXN *txn,
XLogRecPtr abort_lsn);

+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamCommitPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamRollbackPreparedCB) (
+ ReorderBuffer *rb,
+ ReorderBufferTXN *txn,
+ XLogRecPtr rollback_lsn);

The alignment of the arguments is inconsistent (compare how
other functions are aligned)

See:
- for ReorderBufferStreamCommitPreparedCB
- for ReorderBufferStreamRollbackPreparedCB
- for ReorderBufferPrepareNeedSkip
- for ReorderBufferTxnIsPrepared
- for ReorderBufferPrepare

---

COMMENT
Line 489
Line 495
Line 501
/* prepare streamed transaction callback signature */

Same comment cut/paste 3 times?
- for ReorderBufferStreamPrepareCB
- for ReorderBufferStreamCommitPreparedCB
- for ReorderBufferStreamRollbackPreparedCB

---

COMMENT
Line 457
/* abort prepared callback signature */
typedef void (*ReorderBufferRollbackPreparedCB) (
ReorderBuffer *rb,
ReorderBufferTXN *txn,
XLogRecPtr abort_lsn);

"abort" --> "rollback" in the function comment.

---

COMMENT
Line 269
/* In case of 2PC we need to pass GID to output plugin */

"2PC" --> "two-phase commit"

Updated accordingly.

==========
Patch v10-0002, File: contrib/test_decoding/expected/two_phase.out (and .sql)
==========

COMMENT
General

It is a bit hard to see which are the main tests here and which are just
sub-parts of some test case.

e.g. It seems like the main tests are.

1. Test that decoding happens at PREPARE time
2. Test decoding of an aborted tx
3. Test a prepared tx which contains some DDL
4. Test decoding works while an uncommitted prepared tx with DDL exists
5. Test operations holding exclusive locks won't block decoding
6. Test savepoints and sub-transactions
7. Test "_nodecode" will defer the decoding until the commit time

Can the comments be made more obvious so it is easy to distinguish the
main tests from the steps of those tests?

---

COMMENT
Line 1
-- Test two-phased transactions, when two-phase-commit is enabled,
transactions are
-- decoded at PREPARE time rather than at COMMIT PREPARED time.

Some commas to be removed and this comment to be split into several sentences.

---

COMMENT
Line 19
-- should show nothing

Comment could be more informative. E.g. "Should show nothing because
the PREPARE has not happened yet"

---

COMMENT
Line 77

Looks like there is a missing comment about here that should say
something like "Show that the DDL does not appear in the decoding"

---

COMMENT
Line 160
-- test savepoints and sub-xacts as a result

The subsequent test is testing savepoints. But is it testing sub
transactions like the comment says?

Updated accordingly.

==========
Patch v10-0002, File: contrib/test_decoding/t/001_twophase.pl
==========

COMMENT
General

I think basically there are only 2 tests in this file.
1. to check that the concurrent abort works.
2. to check that the prepared tx can span a server shutdown/restart

But the test comments do not make this clear at all.
e.g. All the "#" comments look equally important although most of them
are just steps of each test case.
Can the comments be improved to distinguish the tests from the steps
of each test?

Updated accordingly.

==========
Patch v10-0002, File: src/backend/replication/logical/decode.c
==========

COMMENT
Line 71
static void DecodeCommitPrepared(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf,
xl_xact_parsed_commit *parsed, TransactionId xid);
static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
static void DecodeAbortPrepared(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf,
xl_xact_parsed_abort *parsed, TransactionId xid);
static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
xl_xact_parsed_prepare * parsed);

The 2nd line of args is not aligned properly.
- for DecodeCommitPrepared
- for DecodeAbortPrepared
- for DecodePrepare

Updated accordingly.

==========
Patch v10-0002, File: src/backend/replication/logical/reorderbuffer.c
==========

COMMENT
There are some parts of the code where in my v6 review I had a doubt
about the mutually exclusive treatment of the "streaming" flag and the
"rbtxn_prepared(txn)" state.

Basically I did not see how some parts of the code are treating NOT
streaming as implying 2PC etc because it defies my understanding that
2PC can also work in streaming mode. Perhaps the "streaming" flag has
a different meaning to how I interpret it? Or perhaps some functions
are guarding higher up and can only be called under certain
conditions?

Anyway, this confusion manifests in several parts of the code, none of
which was changed after my v6 review.

Affected code includes the following:

CASE 1
Wherever the ReorderBufferTruncateTXN(...) "prepared" flag (third
parameter) is hardwired true/false, I think there must be some
preceding Assert to guarantee the prepared state condition holds true.
There can't be any room for doubts like "but what will it do for
streamed 2PC..."
Line 1805 - ReorderBufferTruncateTXN(rb, txn, true); // if rbtxn_prepared(txn)
Line 1941 - ReorderBufferTruncateTXN(rb, txn, false); // state ??
Line 2389 - ReorderBufferTruncateTXN(rb, txn, false); // if streaming
Line 2396 - ReorderBufferTruncateTXN(rb, txn, true); // if not
streaming and if rbtxn_prepared(txn)
Line 2459 - ReorderBufferTruncateTXN(rb, txn, true); // if not streaming

~

CASE 2
Wherever the "streaming" flag is tested I don't really understand how
NOT streaming can automatically imply 2PC.
Line 2330 - if (streaming) // what about if it is streaming AND 2PC at
the same time?
Line 2387 - if (streaming) // what about if it is streaming AND 2PC at
the same time?
Line 2449 - if (streaming) // what about if it is streaming AND 2PC at
the same time?

~

Case 1 and Case 2 above overlap a fair bit. I just listed them so they
all get checked again.

Even if the code is thought to be currently OK I do still think
something should be done like:
a) add some more substantial comments to explain WHY the combination
of streaming and 2PC is not valid in the context
b) the Asserts to be strengthened to 100% guarantee that the streaming
and prepared states really are exclusive (if indeed they are). For
this point I thought the following Assert condition could be better:
Assert(streaming || rbtxn_prepared(txn));
Assert(stream_started || rbtxn_prepared(txn));
because as it is you still are left wondering if both streaming AND
rbtxn_prepared(txn) can be possible at the same time...
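
A note on the Assert condition suggested above: "a || b" only rules out
the case where neither state holds, so it still passes when both hold. If
the two states really are exclusive (which this thread has not settled),
the condition would need to be an exclusive-or instead. A minimal sketch
with plain booleans, where at_least_one and exactly_one are hypothetical
names and not reorderbuffer.c code:

```c
#include <assert.h>
#include <stdbool.h>

/* What "streaming || prepared" checks: at least one state holds. */
static bool
at_least_one(bool streaming, bool prepared)
{
    return streaming || prepared;
}

/* Mutual exclusion: exactly one of the two states holds. */
static bool
exactly_one(bool streaming, bool prepared)
{
    return streaming != prepared;
}
```

With both states true, at_least_one() is satisfied while exactly_one() is
not, so only the second form would catch a streamed transaction that is
also marked prepared.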

---

Updated with more comments and a new Assert.

COMMENT
Line 2634
* Anyways, two-phase transactions do not contain any reorderbuffers.

"Anyways" --> "Anyway"

Updated.

==========
Patch v10-0003, File: src/backend/access/transam/twophase.c
==========

COMMENT
Line 557
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
}

/*
+ * LookupGXact
+ * Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+ int i;
+ bool found = false;

The variable declarations (i and found) are not aligned.

Updated.

==========
Patch v10-0003, File: src/backend/replication/logical/proto.c
==========

COMMENT
Line 125
Line 205
Assert(strlen(txn->gid) > 0);

I suggested that the assertion should also check txn->gid is not NULL.
You replied "In this case txn->gid has to be non NULL".

But that is exactly what I said :-)
If it HAS to be non-NULL then why not just Assert that in code instead
of leaving the reader wondering?

"Assert(strlen(txn->gid) > 0);" --> "Assert(tdx->gid && strlen(txn->gid) > 0);"
Same occurs several times.

---

Updated checking that gid is non-NULL as zero strlen is actually a valid case.
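
The distinction between the two checks can be sketched as follows. This is
an illustration only: gid_is_present and gid_is_nonempty are hypothetical
helpers, not functions from the patch. It assumes, as discussed earlier in
the thread, that PREPARE TRANSACTION '' is accepted and so an empty string
is a legitimate GID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: the GID pointer is set, even if it points at "". */
static bool
gid_is_present(const char *gid)
{
    return gid != NULL;
}

/* Hypothetical: the stricter check, which would reject an empty GID. */
static bool
gid_is_nonempty(const char *gid)
{
    return gid != NULL && strlen(gid) > 0;
}
```

An assertion built on the stricter check would fail for the empty GID,
while the pointer-only check accepts it and still rejects NULL.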

COMMENT
Line 133
Line 213
if (rbtxn_commit_prepared(txn))
flags |= LOGICALREP_IS_COMMIT_PREPARED;
else if (rbtxn_rollback_prepared(txn))
flags |= LOGICALREP_IS_ROLLBACK_PREPARED;
else
flags |= LOGICALREP_IS_PREPARE;

Previously I wrote that the use of the bit flags on assignment in the
logicalrep_write_prepare was inconsistent with the way they are
treated when they are read. Really it should be using a direct
assignment instead of bit flags.

You said this is skipped anticipating a possible refactor. But IMO
this leaves the code in a half/half state. I think it is better to fix
it properly and if refactoring happens then deal with that at the
time.

The last comment I saw from Amit said to use my 1st proposal of direct
assignment instead of bit flag assignment.

(applies to both non-stream and stream functions)
- see logicalrep_write_prepare
- see logicalrep_write_stream_prepare
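
To make the point about direct assignment concrete, here is a small sketch
of a writer and reader that agree on "exactly one value". The names and the
numeric values are assumptions for illustration; the real constants are
defined in logicalproto.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values, chosen only for this sketch. */
#define DEMO_IS_PREPARE            0x01
#define DEMO_IS_COMMIT_PREPARED    0x02
#define DEMO_IS_ROLLBACK_PREPARED  0x04

/* Writer side: pick exactly one value by direct assignment. */
static uint8_t
prepare_flag_for(bool commit_prepared, bool rollback_prepared)
{
    if (commit_prepared)
        return DEMO_IS_COMMIT_PREPARED;
    if (rollback_prepared)
        return DEMO_IS_ROLLBACK_PREPARED;
    return DEMO_IS_PREPARE;
}

/* Reader side: accept only the three exact values, never a combination. */
static bool
prepare_flag_is_valid(uint8_t flag)
{
    return flag == DEMO_IS_PREPARE ||
           flag == DEMO_IS_COMMIT_PREPARED ||
           flag == DEMO_IS_ROLLBACK_PREPARED;
}
```

Because the reader compares with "==", any OR-ed combination is rejected,
so writing the field with "|=" suggests a bit mask that the protocol does
not actually support.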

Updated accordingly.

==========
Patch v10-0003, File: src/backend/replication/pgoutput/pgoutput.c
==========

COMMENT
Line 429
/*
* PREPARE callback
*/
static void
pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr prepare_lsn)
The function comment looks wrong.
Shouldn't this comment say "ROLLBACK PREPARED callback"?

==========
Patch v10-0003, File: src/include/replication/logicalproto.h
==========

Line 115
#define PrepareFlagsAreValid(flags) \
((flags == LOGICALREP_IS_PREPARE) || \
(flags == LOGICALREP_IS_COMMIT_PREPARED) || \
(flags == LOGICALREP_IS_ROLLBACK_PREPARED))

Would be safer if all the references to flags are in parentheses
e.g. "flags" --> "(flags)"
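
Why the parentheses matter: when the macro argument is an expression, "=="
binds tighter than operators such as "&", so an unparenthesized argument
can be regrouped after expansion. A reduced sketch with hypothetical macro
names and a hypothetical constant value (not the real logicalproto.h
definitions):

```c
#include <assert.h>

#define DEMO_IS_PREPARE 0x01    /* hypothetical value for this sketch */

/* Argument used bare, in the style of the reviewed macro. */
#define FLAG_IS_PREPARE_UNSAFE(flags) (flags == DEMO_IS_PREPARE)

/* Argument parenthesized, as the review comment suggests. */
#define FLAG_IS_PREPARE_SAFE(flags) ((flags) == DEMO_IS_PREPARE)
```

Called with an argument like "raw & 0x0F", the unsafe form expands to
"raw & (0x0F == 0x01)", which masks raw with 0, whereas the safe form
compares the masked value as intended. Compilers typically warn about the
unsafe expansion.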

Updated accordingly.

Amit,
I have also modified the stream callback APIs to not include
stream_commit_prepared and stream_rollback_prepared, and instead use the
non-stream APIs for the same functionality.
I have also updated the test_decoding and pgoutput plugins accordingly.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v11-0003-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From e7b21894ccd62ecb6c447ffb370dc9c2184fd55f Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 23 Oct 2020 05:37:43 -0400
Subject: [PATCH v11] Support 2PC txn - pgoutput

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.

Includes two-phase commit test code (streaming and not streaming).
---
 src/backend/access/transam/twophase.c             |  27 ++
 src/backend/replication/logical/proto.c           | 162 +++++++++-
 src/backend/replication/logical/worker.c          | 364 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c       |  79 ++++-
 src/include/access/twophase.h                     |   1 +
 src/include/replication/logicalproto.h            |  39 +++
 src/test/subscription/t/020_twophase.pl           | 163 ++++++++++
 src/test/subscription/t/021_twophase_streaming.pl | 366 ++++++++++++++++++++++
 8 files changed, 1198 insertions(+), 3 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_streaming.pl

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..2e0a408 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID is	around
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..691ef0c 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -99,6 +99,7 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	if (flags != 0)
 		elog(ERROR, "unrecognized flags %u in commit message", flags);
 
+
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
 	commit_data->end_lsn = pq_getmsgint64(in);
@@ -106,6 +107,165 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In which case we
+	 * expect to have a non-empty GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* Make sure exactly one of the expected flags is set. */
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ *
+ * (For stream PREPARE, stream COMMIT PREPARED, stream ROLLBACK PREPARED)
+ *
+ * TODO- This is mostly cut/paste from logicalrep_write_prepare. Consider refactoring for commonality.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8	flags = 0;
+
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "proto: logicalrep_write_stream_prepare");
+#endif
+
+	pq_sendbyte(out, 'p');		/* sending STREAM PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case we
+	 * expect to have a non-empty GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK] PREPARED
+	 * uses non-streaming APIs
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the output stream.
+ *
+ * (For stream PREPARE, stream COMMIT PREPARED, stream ROLLBACK PREPARED)
+ *
+ * TODO - This is mostly cut/paste from logicalrep_read_prepare. Consider refactoring for commonality.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "proto: logicalrep_read_stream_prepare");
+#endif
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3a5b733..26ad790 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -722,6 +722,7 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_timestamp = commit_data.committime;
 
 		CommitTransactionCommand();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -742,6 +743,359 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * During logical decoding, on the apply side, it's possible that a
+	 * prepared transaction got aborted while decoding. In that case, we stop
+	 * the decoding and abort the transaction immediately. However the
+	 * ROLLBACK prepared processing still reaches the subscriber. In that case
+	 * it's ok to have a missing gid
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE message.
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+
+{
+	StringInfoData s2;
+	int			nchanges;
+	char		path[MAXPGPATH];
+	char	   *buffer = NULL;
+	bool		found;
+	StreamXidHash *ent;
+	MemoryContext oldcxt;
+	BufFile    *fd;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+
+	/* This should be a PREPARE. COMMIT PREPARED and ROLLBACK PREPARED should
+	 * result in non-streaming APIs being called.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "worker: apply_handle_stream_prepare");
+#endif
+
+	/*
+	 * FIXME - Following condition was in apply_handle_prepare_txn except I found  it was ALWAYS IsTransactionState() == false
+	 * The synchronization worker runs in single transaction. *
+	if (IsTransactionState() && !am_tablesync_worker())
+	*/
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * ==================================================================================================
+		 * The following chunk of code is largely cut/paste from the existing apply_handle_stream_commit
+		 * which handles the non-two-phase streaming commit by applying the operations of the spooled file.
+		 *
+		 * Differences are:
+		 * - Here the xid is already known because apply_handle_stream_prepare has already called
+		 *   logicalrep_read_stream_prepare
+		 *
+		 * TODO - This is a possible candidate for refactoring since so much of it is the same.
+		 * ==================================================================================================
+		 */
+
+		Assert(!in_streamed_transaction);
+
+		elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+		ensure_transaction();
+
+		/*
+		 * Allocate file handle and memory required to process all the messages in
+		 * TopTransactionContext to avoid them getting reset after each message is
+		 * processed.
+		 */
+		oldcxt = MemoryContextSwitchTo(TopTransactionContext);
+
+		/* open the spool file for the prepared transaction */
+		changes_filename(path, MyLogicalRepWorker->subid, xid);
+		elog(DEBUG1, "replaying changes from file \"%s\"", path);
+#ifdef DEBUG_STREAM_2PC
+		elog(LOG, "worker: replaying changes from file \"%s\"", path);
+#endif
+		ent = (StreamXidHash *) hash_search(xidhash,
+											(void *) &xid,
+											HASH_FIND,
+											&found);
+		Assert(found);
+		fd = BufFileOpenShared(ent->stream_fileset, path, O_RDONLY);
+
+		buffer = palloc(BLCKSZ);
+		initStringInfo(&s2);
+
+		MemoryContextSwitchTo(oldcxt);
+
+		remote_final_lsn = prepare_data.prepare_lsn;
+
+		/*
+		 * Make sure the handle apply_dispatch methods are aware we're in a remote
+		 * transaction.
+		 */
+		in_remote_transaction = true;
+		pgstat_report_activity(STATE_RUNNING, NULL);
+
+		/*
+		 * Read the entries one by one and pass them through the same logic as in
+		 * apply_dispatch.
+		 */
+		nchanges = 0;
+		while (true)
+		{
+			int			nbytes;
+			int			len;
+
+			CHECK_FOR_INTERRUPTS();
+
+			/* read length of the on-disk record */
+			nbytes = BufFileRead(fd, &len, sizeof(len));
+
+			/* have we reached end of the file? */
+			if (nbytes == 0)
+				break;
+
+			/* do we have a correct length? */
+			if (nbytes != sizeof(len))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+								path)));
+
+			Assert(len > 0);
+
+			/* make sure we have sufficiently large buffer */
+			buffer = repalloc(buffer, len);
+
+			/* and finally read the data into the buffer */
+			if (BufFileRead(fd, buffer, len) != len)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read from streaming transaction's changes file \"%s\": %m",
+								path)));
+
+			/* copy the buffer to the stringinfo and call apply_dispatch */
+			resetStringInfo(&s2);
+			appendBinaryStringInfo(&s2, buffer, len);
+
+			/* Ensure we are reading the data into our memory context. */
+			oldcxt = MemoryContextSwitchTo(ApplyMessageContext);
+
+			apply_dispatch(&s2);
+
+			MemoryContextReset(ApplyMessageContext);
+
+			MemoryContextSwitchTo(oldcxt);
+
+			nchanges++;
+
+			if (nchanges % 1000 == 0)
+				elog(DEBUG1, "replayed %d changes from file '%s'",
+					 nchanges, path);
+		}
+
+		BufFileClose(fd);
+
+		pfree(buffer);
+		pfree(s2.data);
+
+		/*
+		 * ==================================================================================================
+		 * The following chunk of code is cut/paste from the existing apply_handle_prepare_txn
+		 * which was handling the two-phase prepare of the non-streamed tx
+		 * ==================================================================================================
+		 */
+#ifdef DEBUG_STREAM_2PC
+		elog(LOG, "worker: call PrepareTransactionBlock()");
+#endif
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		/* End of copied prepare code */
+
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+
+		elog(DEBUG1, "replayed %d (all) changes from file \"%s\"", nchanges, path);
+#ifdef DEBUG_STREAM_2PC
+		elog(LOG, "worker: replayed %d (all) changes from file \"%s\"", nchanges, path);
+#endif
+
+		/* ============================================================================================== */
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	/* FIXME - Is it OK to do this here (outside of the if/else)? */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1905,10 +2259,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
@@ -1953,6 +2311,10 @@ apply_dispatch(StringInfo s)
 		case 'c':
 			apply_handle_stream_commit(s);
 			break;
+			/* STREAM PREPARE */
+		case 'p':
+			apply_handle_stream_prepare(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..32fe5c5 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,7 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
-
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static bool publications_valid;
 static bool in_streaming;
 
@@ -143,6 +150,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +164,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +391,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +912,28 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+#ifdef DEBUG_STREAM_2PC
+	elog(LOG, "pgoutput: pgoutput_stream_prepare_txn");
+#endif
+
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0c2cda2..ae59c04 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -87,6 +87,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -94,6 +95,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -101,6 +124,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -144,4 +171,16 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+/*
+ * FIXME - Uncomment this to see more logging for streamed two-phase transactions.
+ *
+ * #define DEBUG_STREAM_2PC
+ */
+
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..3feb2c3
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,163 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 12;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        ));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf(
+        'postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (11);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+   is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is committed on subscriber');
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres',
+	"BEGIN;INSERT INTO tab_full VALUES (12);PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+   is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result =
+   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+   is($result, qq(0), 'transaction is aborted on subscriber');
+
+# Check that commit prepared is decoded properly on crash restart
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_streaming.pl b/src/test/subscription/t/021_twophase_streaming.pl
new file mode 100644
index 0000000..391eb17
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_streaming.pl
@@ -0,0 +1,366 @@
+# logical replication of 2PC test with streaming
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 23;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+my $result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+# --------------------------------------------------------------
+# 2PC PREPARE / COMMIT PREPARED test.
+#
+# Mass data is streamed as a 2PC transaction.
+# Then there is a commit prepared.
+# Expect all data is replicated on subscriber side after the commit.
+# --------------------------------------------------------------
+#
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+  $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+# --------------------------------------------------------------
+# 2PC PREPARE / ROLLBACK PREPARED test.
+#
+# Table is deleted back to 2 rows which are replicated on subscriber.
+# Mass data is streamed using 2PC but then there is a rollback prepared.
+# Expect data rolls back leaving only the 2 rows on the subscriber.
+# --------------------------------------------------------------
+#
+# First, delete the data (delete will be replicated)
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+# --------------------------------------------------------------
+# Check that 2PC commit prepared is decoded properly on crash restart.
+#
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then server crashes before the 2PC transaction is committed.
+# After servers are restarted the pending transaction is committed.
+# Expect all data is replicated on subscriber side after the commit.
+# --------------------------------------------------------------
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result =
+  $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+# --------------------------------------------------------------
+# Do INSERT outside of 2PC but before ROLLBACK PREPARED.
+#
+# Table is deleted back to 2 rows which are replicated on subscriber.
+# Mass data is streamed using 2PC.
+# A single row INSERT is done which is outside of the 2PC transaction
+# Then there is a rollback prepared.
+# Expect 2PC data rolls back leaving only 3 rows on the subscriber.
+# --------------------------------------------------------------
+#
+# First, delete the data (delete will be replicated)
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# FIXME - this works OK but if INSERT overlaps pk of a row participating in the 2PC then it hangs
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+# but the extra INSERT outside of the 2PC still happened
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+# --------------------------------------------------------------
+# Do INSERT outside of 2PC but before COMMIT PREPARED.
+#
+# Table is deleted back to 2 rows which are replicated on subscriber.
+# Mass data is streamed using 2PC.
+# A single row INSERT is done which is outside of the 2PC transaction
+# Then there is a commit prepared.
+# Expect 2PC data + the extra row are on the subscriber.
+# --------------------------------------------------------------
+#
+# First, delete the data (delete will be replicated)
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# FIXME - this works OK but if INSERT overlaps pk of a row participating in the 2PC then it hangs
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+  $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+# --------------------------------------------------------------
+# Do DELETE outside of 2PC but before COMMIT PREPARED.
+#
+# Table is deleted back to 2 rows which are replicated on subscriber.
+# Mass data is streamed using 2PC.
+# A single row DELETE is done for one of the records of the 2PC transaction
+# Then there is a commit prepared.
+# Expect all the 2PC data rows on the subscriber (because delete would have failed).
+# --------------------------------------------------------------
+#
+# First, delete the data (delete will be replicated)
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result =
+  $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+$result =
+  $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+$result =
+	$node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+# --------------------------------------------------------------
+
+# TODO add test cases involving DDL. This can be added after we add functionality
+# to replicate DDL changes to subscriber.
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1
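
For anyone checking the expected counts in the TAP test above: the 3334 and 3335 figures follow from the statements inside the prepared transaction. Below is a minimal sketch of that arithmetic in C. It assumes the two rows left behind by `DELETE FROM test_tab WHERE a > 2` have a = 1 and a = 2 (the initial table setup is not shown in this part of the patch). `expected_rows_after_2pc` is a hypothetical helper written only for this illustration; it is not part of the patch.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of the patch): count the rows test_tab
 * should hold once the prepared transaction has been committed.
 *
 * Assumption: rows a = 1 and a = 2 already exist, and the transaction
 * inserts a = 3 .. last_a, so after the INSERT the table holds 1 .. last_a.
 * The UPDATE does not change the row count.
 * DELETE ... WHERE mod(a,3) = 0 removes every multiple of 3.
 * extra_insert models the single-row INSERT of a = 99999 done outside the
 * 2PC transaction, after the DELETE inside it has already run.
 */
static int
expected_rows_after_2pc(int last_a, bool extra_insert)
{
	int			count = 0;

	for (int a = 1; a <= last_a; a++)
	{
		if (a % 3 != 0)
			count++;
	}

	if (extra_insert)
		count++;

	return count;
}
```

With last_a = 5000 this gives 5000 - 1666 = 3334 rows, and 3335 once the outside insert is counted. Row a = 5 is not a multiple of 3, which is why the last scenario expects it to be present on the subscriber.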

Attachment: v11-0001-Support-2PC-txn-base.patch (application/octet-stream)
From 79c222f842c3faa23c6d729fc6cb66fcf50af34d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 23 Oct 2020 04:29:05 -0400
Subject: [PATCH v11] Support-2PC-txn-base

Until now, two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, together with the corresponding GID.

Includes logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
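
Reviewer note: as a rough illustration of how the two-phase callbacks fit together, here is a standalone sketch in C. It is not the patch code. `SketchCallbacks` and the `sketch_*` functions are hypothetical stand-ins with simplified signatures, and the single "missing callback" lookup condenses checks that the patch performs separately inside each wrapper in logical.c. The GID filter mirrors the `_nodecode` substring rule used by test_decoding in this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two-phase members of OutputPluginCallbacks. */
typedef struct SketchCallbacks
{
	bool		(*filter_prepare_cb) (const char *gid);	/* optional */
	void		(*prepare_cb) (const char *gid);		/* required */
	void		(*commit_prepared_cb) (const char *gid);	/* required */
	void		(*rollback_prepared_cb) (const char *gid);	/* required */
} SketchCallbacks;

/* Two-phase decoding is switched on as soon as any of the callbacks is set. */
static bool
sketch_twophase_enabled(const SketchCallbacks *cb)
{
	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/*
 * Return the name of the first required callback that is missing, or NULL
 * if all required callbacks are present. The patch reports this lazily,
 * with an ERROR raised from the corresponding wrapper.
 */
static const char *
sketch_missing_required(const SketchCallbacks *cb)
{
	if (cb->prepare_cb == NULL)
		return "prepare_cb";
	if (cb->commit_prepared_cb == NULL)
		return "commit_prepared_cb";
	if (cb->rollback_prepared_cb == NULL)
		return "rollback_prepared_cb";
	return NULL;
}

/* Returning true means: skip decoding at PREPARE time for this GID. */
static bool
sketch_filter_prepare(const char *gid)
{
	return strstr(gid, "_nodecode") != NULL;
}
```

The point of enabling two-phase mode when any callback is present, rather than only when all required ones are, is that a plugin which registers just some of them gets a clear error naming the missing callback instead of silently falling back to one-phase decoding.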
 contrib/test_decoding/test_decoding.c     | 192 +++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 144 ++++++++++++++++++-
 src/backend/replication/logical/logical.c | 228 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  76 ++++++++++
 6 files changed, 684 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 8e33614..7eb13ce 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid_aborted; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -73,6 +78,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -88,6 +96,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -112,10 +132,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool 		enable_2pc = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +162,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +254,40 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_2pc))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				unsigned long xid;
+
+				errno = 0;
+				xid = strtoul(strVal(elem->arg), NULL, 0);
+				if (xid == 0 || errno != 0)
+					data->check_xid_aborted = InvalidTransactionId;
+				else
+					data->check_xid_aborted = (TransactionId)xid;
+
+				if (!TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+								strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_2pc;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +359,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +604,25 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* If check_xid_aborted is a valid xid, then it was passed in as an
+	 * option so that we wait here until the transaction with that xid
+	 * aborts. This is used to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			   !TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -646,6 +814,30 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..cfb1b32 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,18 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +490,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     them are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +597,55 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +655,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or any other uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +738,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decoding
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +792,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or any other uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +848,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1039,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using
+    the <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, then committed using the
+    <function>commit_prepared_cb</function> callback or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8675832..0f605ef 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -221,12 +235,26 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
 		(ctx->callbacks.stream_stop_cb != NULL) ||
 		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
 		(ctx->callbacks.stream_commit_cb != NULL) ||
 		(ctx->callbacks.stream_change_cb != NULL) ||
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require the prepare, commit-prepared
+	 * and rollback-prepared callbacks. The filter-prepare callback is optional. We however
+	 * enable two-phase logical decoding when at least one of the callbacks is set so that
+	 * we can easily identify missing callbacks.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +265,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +812,120 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then the prepare callback is mandatory */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then commit prepared callback is mandatory */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then rollback prepared callback is mandatory */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1002,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not enabled. In that
+	 * case all two-phase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1245,46 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming and two-phase commits are supported. */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..4c1341f 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/rollback_prepared callbacks, or wait until COMMIT PREPARED
+ * and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1c77819..a323dfb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -174,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -244,6 +266,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -405,6 +430,31 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (ReorderBuffer *rb,
+									  ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -431,6 +481,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -497,6 +553,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -505,6 +566,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
@@ -574,6 +636,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -597,6 +664,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool 		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool 		ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void 		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
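
For anyone reading the reorderbuffer.h hunk above without a tree at hand, here is a small standalone sketch of how the three new txn_flags bits combine. The flag values are copied from the patch; MockTXN and the mock_* helpers are hypothetical names invented for this illustration and are not part of the patch, which tests the bits through the rbtxn_* macros on ReorderBufferTXN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values copied from the reorderbuffer.h hunk. */
#define RBTXN_PREPARE             0x0080
#define RBTXN_COMMIT_PREPARED     0x0100
#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Hypothetical stand-in for ReorderBufferTXN; only txn_flags is modelled. */
typedef struct MockTXN
{
	uint32_t	txn_flags;
} MockTXN;

/* Same bit tests as the rbtxn_* macros, written as functions. */
static bool
mock_prepared(const MockTXN *txn)
{
	return (txn->txn_flags & RBTXN_PREPARE) != 0;
}

static bool
mock_commit_prepared(const MockTXN *txn)
{
	return (txn->txn_flags & RBTXN_COMMIT_PREPARED) != 0;
}

static bool
mock_rollback_prepared(const MockTXN *txn)
{
	return (txn->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0;
}

/* Hypothetical helper: remember that PREPARE was decoded for this xact. */
static void
mock_mark_prepared(MockTXN *txn)
{
	assert(txn != NULL);
	txn->txn_flags |= RBTXN_PREPARE;
}

/* Hypothetical helper: remember how the prepared xact was finished. */
static void
mock_finish_prepared(MockTXN *txn, bool is_commit)
{
	assert(txn != NULL);
	txn->txn_flags |= is_commit ? RBTXN_COMMIT_PREPARED
		: RBTXN_ROLLBACK_PREPARED;
}
```

The bits are independent, so in this model a transaction that was prepared and later committed carries both RBTXN_PREPARE and RBTXN_COMMIT_PREPARED. Whether the real code keeps RBTXN_PREPARE set after the second phase is not visible in the header alone.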

v11-0002-Support-2PC-txn-backend-and-tests.patch (application/octet-stream)
From 2e50cbdab0e8c06f76741444fd65f32f8875c3da Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 23 Oct 2020 04:44:11 -0400
Subject: [PATCH v11] Support 2PC txn backend and tests

Until now two-phase transactions were decoded at COMMIT PREPARED, just
like regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.

Includes two-phase commit test code (for test_decoding).
---
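
Note for reviewers: the decision of whether a transaction is decoded at PREPARE or deferred to COMMIT PREPARED can be summarised with the standalone sketch below. It mirrors the order of checks in filter_prepare_cb_wrapper() from the 0001 patch. MockContext, MockFilterPrepareCB and the mock_* functions are hypothetical names for this illustration only, and the "_nodecode" substring check is an assumption based on the comment of Test 7 in two_phase.sql, not a copy of the test_decoding code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type; the real one also takes ctx, txn and xid. */
typedef bool (*MockFilterPrepareCB) (const char *gid);

/* Hypothetical stand-in for the relevant LogicalDecodingContext fields. */
typedef struct MockContext
{
	bool		twophase;	/* is PREPARE-time decoding enabled? */
	MockFilterPrepareCB filter_prepare_cb;	/* optional, may be NULL */
} MockContext;

/*
 * Assumed example filter: ask to skip PREPARE-time decoding when the GID
 * contains "_nodecode".
 */
static bool
mock_filter_nodecode(const char *gid)
{
	return strstr(gid, "_nodecode") != NULL;
}

/*
 * Returns true when the transaction is filtered out, i.e. it is NOT decoded
 * at PREPARE and is instead sent as a regular transaction at COMMIT PREPARED.
 */
static bool
mock_filter_prepare(const MockContext *ctx, const char *gid)
{
	assert(ctx != NULL && gid != NULL);

	/* Two-phase decoding disabled: everything counts as filtered. */
	if (!ctx->twophase)
		return true;

	/* The callback is optional: without it, every PREPARE goes through. */
	if (ctx->filter_prepare_cb == NULL)
		return false;

	/* Otherwise the plugin decides. */
	return ctx->filter_prepare_cb(gid);
}
```

With two-phase decoding enabled and this example filter installed, 'test_prepared#1' would be decoded at PREPARE while 'test_prepared_nodecode' would be held back until COMMIT PREPARED, which matches what Tests 1 and 7 expect.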
 contrib/test_decoding/Makefile                  |   4 +-
 contrib/test_decoding/expected/two_phase.out    | 228 ++++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql         | 119 ++++++++++
 contrib/test_decoding/t/001_twophase.pl         | 121 ++++++++++
 src/backend/replication/logical/decode.c        | 250 +++++++++++++++++++-
 src/backend/replication/logical/reorderbuffer.c | 302 +++++++++++++++++++++---
 6 files changed, 972 insertions(+), 52 deletions(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..2abf3ce 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..da6e6e6
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+--
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..a7b23e6
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+--
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..fd961d4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -68,8 +68,15 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
+static void DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+								 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+								xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare * parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -239,7 +246,6 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	switch (info)
 	{
 		case XLOG_XACT_COMMIT:
-		case XLOG_XACT_COMMIT_PREPARED:
 			{
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
@@ -256,8 +262,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeCommit(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_COMMIT_PREPARED:
+			{
+				xl_xact_commit *xlrec;
+				xl_xact_parsed_commit parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_commit *) XLogRecGetData(r);
+				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeCommitPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ABORT:
-		case XLOG_XACT_ABORT_PREPARED:
 			{
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
@@ -274,6 +296,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_ABORT_PREPARED:
+			{
+				xl_xact_abort *xlrec;
+				xl_xact_parsed_abort parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_abort *) XLogRecGetData(r);
+				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeAbortPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ASSIGNMENT:
 
 			/*
@@ -312,17 +351,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *)XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -659,6 +716,131 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Consolidated commit record handling between the different forms of commit
+ * records.
+ */
+static void
+DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+					 xl_xact_parsed_commit *parsed, TransactionId xid)
+{
+	XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
+					   parsed->nsubxacts, parsed->subxacts);
+
+	/* ----
+	 * Check whether we are interested in this specific transaction, and tell
+	 * the reorderbuffer to forget the content of the (sub-)transactions
+	 * if not.
+	 *
+	 * There can be several reasons we might not be interested in this
+	 * transaction:
+	 * 1) We might not be interested in decoding transactions up to this
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
+	 * 2) The transaction happened in another database.
+	 * 3) The output plugin is not interested in the origin.
+	 * 4) We are doing fast-forwarding
+	 *
+	 * We can't just use ReorderBufferAbort() here, because we need to execute
+	 * the transaction's invalidations.  This currently won't be needed if
+	 * we're just skipping over the transaction because currently we only do
+	 * so during startup, to get to the first transaction the client needs. As
+	 * we have reset the catalog caches before starting to read WAL, and we
+	 * haven't yet touched any catalogs, there can't be anything to invalidate.
+	 * But if we're "forgetting" this commit because it happened in
+	 * another database, the invalidations might be important, because they
+	 * could be for shared catalogs and we might have loaded data into the
+	 * relevant syscaches.
+	 * ---
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+		}
+		ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+		return;
+	}
+
+	/* tell the reorderbuffer about the surviving subtransactions */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/*
+	 * For COMMIT PREPARED, the changes have already been replayed at
+	 * PREPARE time, so we only need to notify the subscriber that the GID
+	 * finally committed.
+	 * If filter check present and this needs to be skipped, do a regular commit.
+	 */
+	if (ctx->callbacks.filter_prepare_cb &&
+			ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+	else
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr  origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		 ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is setup correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/* replay actions of all transaction + subtransactions in order */
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+}
+
+/*
  * Get the data from the various forms of abort records and pass it on to
  * snapbuild.c and reorderbuffer.c
  */
@@ -681,6 +863,50 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Get the data from the various forms of abort records and pass it on to
+ * snapbuild.c and reorderbuffer.c
+ */
+static void
+DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			xl_xact_parsed_abort *parsed, TransactionId xid)
+{
+	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it passes through the filters handle the ROLLBACK via callbacks
+	 */
+	if (!FilterByOrigin(ctx, origin_id) &&
+	   !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+	   !ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		Assert(TransactionIdIsValid(xid));
+		Assert(parsed->dbId == ctx->slot->data.database);
+
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
+
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+						   buf->record->EndRecPtr);
+	}
+
+	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+}
+
+/*
  * Parse XLOG_HEAP_INSERT (not MULTI_INSERT!) records into tuplebufs.
  *
  * Deletes can contain the new tuple.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7a8bf76..0a5c158 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -418,6 +419,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1511,12 +1518,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1535,7 +1544,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1569,9 +1578,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for decoding
+		 * catalog snapshot access.
+		 * They are always stored in the toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1765,9 +1798,22 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
-
-	ReorderBufferCleanupTXN(rb, txn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
+		/* Here we are streaming and part of the PREPARE of a two-phase commit
+		 * The full cleanup will happen as part of the COMMIT PREPAREDs, so now
+		 * just truncate txn by removing changes and tuple_cids
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1895,8 +1941,11 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/* Discard the changes that we just streamed.
+	 * This can only be called if streaming and not part of a PREPARE in
+	 * a two-phase commit, so set prepared flag as false.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1918,6 +1967,11 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 /*
  * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
  *
+ * We are here due to one of the 3 scenarios:
+ * 1. As part of streaming in-progress transactions
+ * 2. Prepare of a two-phase commit
+ * 3. Commit of a transaction.
+ *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
  * merge) and replay the changes in lsn order.
@@ -2003,7 +2057,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2294,7 +2348,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2328,18 +2391,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
+		 * We are here due to one of the 3 scenarios:
+		 * 1. As part of streaming in-progress transactions
+		 * 2. Prepare of a two-phase commit
+		 * 3. Commit of a transaction.
+		 *
 		 * If we are streaming the in-progress transaction then discard the
 		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
+		 * streamed (if they contained changes), set prepared flag as false.
+		 * If part of a prepare of a two-phase commit set the prepared flag
+		 * as true so that we can discard changes and cleanup tuplecids.
+		 * Otherwise, remove all the
 		 * changes and deallocate the ReorderBufferTXN.
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, false);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (rbtxn_prepared(txn))
+		{
+			ReorderBufferTruncateTXN(rb, txn, true);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2369,17 +2446,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can only occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we are
+			 * sending the data out on a PREPARE during a two-phase commit.
+			 * The two conditions are mutually exclusive; exactly one of them holds.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started  || rbtxn_prepared(txn));
+			Assert(!(streaming && rbtxn_prepared(txn)));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2387,10 +2467,21 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream remaining data.
+			 */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+						txn->gid[0] != '\0'? txn->gid:"", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2412,23 +2503,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2470,6 +2554,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/*
+	* Always call the prepare filter. It's the job of the prepare filter to
+	* give us the *same* response for a given xid across multiple calls
+	* (including ones on restart)
+	*/
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	* The transaction may or may not exist (during restarts for example).
+	* Anyway, two-phase transactions do not contain any reorderbuffers. So allow
+	* it to be created below.
+	*/
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2512,7 +2731,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
-- 
1.8.3.1
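
For readers following the patch, the routing that DecodeCommitPrepared()
applies once the skip checks have passed can be summarized in a small
standalone sketch. This is not PostgreSQL code: PreparedFilter, CommitAction,
skip_nodecode_gids and route_commit_prepared are made-up names for
illustration, and the example filter only imitates the "_nodecode" GID
convention exercised by Test 7 of the regression script. Treat it as a sketch
under those assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the output plugin's filter_prepare callback. */
typedef bool (*PreparedFilter)(const char *gid);

typedef enum CommitAction
{
	REPLAY_AS_REGULAR_COMMIT,	/* nothing was sent at PREPARE; replay changes now */
	NOTIFY_COMMIT_PREPARED		/* changes were sent at PREPARE; only signal the commit */
} CommitAction;

/* Example filter (assumption): skip two-phase decoding for "_nodecode" GIDs. */
bool
skip_nodecode_gids(const char *gid)
{
	return gid != NULL && strstr(gid, "_nodecode") != NULL;
}

/*
 * Same shape as the final if/else in DecodeCommitPrepared(): only when a
 * filter exists and asks to skip this GID is the transaction replayed as a
 * regular commit; otherwise the commit of the prepared GID is just reported.
 */
CommitAction
route_commit_prepared(PreparedFilter filter, const char *gid)
{
	if (filter != NULL && filter(gid))
		return REPLAY_AS_REGULAR_COMMIT;
	return NOTIFY_COMMIT_PREPARED;
}
```

With this filter, route_commit_prepared(skip_nodecode_gids,
"test_prepared_nodecode") picks the regular-commit path, while a GID such as
"test_prepared_lock" (or any GID when no filter is installed) only produces
the commit-prepared notification. The answer has to stay the same for a given
GID at PREPARE and at COMMIT PREPARED time, which is why the patch asks the
filter again instead of caching the earlier result.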

#68Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#67)

On Fri, Oct 23, 2020 at 3:41 PM Ajin Cherian <itsajin@gmail.com> wrote:

Amit,
I have also modified the stream callback APIs so that they no longer include
stream_commit_prepared and stream_rollback_prepared; the non-stream APIs are
used instead for the same functionality.
I have also updated the test_decoding and pgoutput plugins accordingly.

Thanks, I think you forgot to address one of my comments in the
previous email [1] (see "One minor comment .."). You have not even
responded to it.

[1]: /messages/by-id/CAA4eK1JzRvUX2XLEKo2f74Vjecnt6wq-kkk1OiyMJ5XjJN+GvQ@mail.gmail.com

--
With Regards,
Amit Kapila.

#69Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#60)
3 attachment(s)

Hi Ajin.

I've addressed your review comments (details below) and bumped the
patch set to v12 attached.

I also added more test cases.

On Tue, Oct 20, 2020 at 10:02 PM Ajin Cherian <itsajin@gmail.com> wrote:

Thanks for your patch. Some comments for your patch:

Comments:

src/backend/replication/logical/worker.c
@@ -888,6 +888,319 @@ apply_handle_prepare(StringInfo s)
+ /*
+ * FIXME - Following condition was in apply_handle_prepare_txn except I found it was ALWAYS IsTransactionState() == false
+ * The synchronization worker runs in single transaction. *
+ if (IsTransactionState() && !am_tablesync_worker())
+ */
+ if (!am_tablesync_worker())

Comment: I don't think a tablesync worker will use streaming. None of
the other stream APIs check this, so it might not be relevant for
stream_prepare either.

Updated

+ /*
+ * ==================================================================================================
+ * The following chunk of code is largely cut/paste from the existing apply_handle_prepare_commit_txn

Comment: Here, I think you meant apply_handle_stream_commit.

Updated.

Also, rather than duplicating this chunk of code, you could put it in a
new function.

Code is refactored to share a common function for the spool file processing.

+ else
+ {
+ /* Process any invalidation messages that might have accumulated. */
+ AcceptInvalidationMessages();
+ maybe_reread_subscription();
+ }

Comment: This else block might not be necessary as a tablesync worker
will not initiate the streaming APIs.

Updated

~

Kind Regards,
Peter Smith
Fujitsu Australia
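
For anyone skimming the attached v12-0001 patch: the prepare filter it adds to test_decoding comes down to a substring check on the GID. Below is a minimal standalone sketch of that check. The helper name gid_is_filtered is hypothetical and used only for illustration; the callback in the patch is pg_decode_filter_prepare, which also receives the decoding context, the transaction and the xid.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for the GID test inside pg_decode_filter_prepare().
 *
 * Returns true when the prepared transaction should NOT be decoded at
 * PREPARE time, i.e. when its GID contains the "_nodecode" substring.
 * Returning false lets the transaction be decoded at PREPARE.
 */
bool
gid_is_filtered(const char *gid)
{
	return strstr(gid, "_nodecode") != NULL;
}
```

With this rule a GID such as 'test_prepared_nodecode' would be skipped at PREPARE time, while 'test_prepared#1' would be decoded. Because the answer depends only on the GID, it stays the same every time the filter is called for a given transaction.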

Attachments:

v12-0001-Support-2PC-txn-base.patch (application/octet-stream)
From 5fa5544624609d841558ef37e4dc4b2995519330 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 26 Oct 2020 11:20:49 +1100
Subject: [PATCH v12] Support-2PC-txn-base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, together with the corresponding GID.

Includes logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
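
As a rough illustration of how two-phase decoding gets switched on in this patch: it is provisionally enabled when any of the four two-phase callbacks is registered, then masked by the plugin's "two-phase-commit" option, and a missing required callback is only reported later when its wrapper is invoked. The sketch below models that logic standalone. The struct and function names are hypothetical simplifications, not the real OutputPluginCallbacks or LogicalDecodingContext.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the two-phase slots of the callback struct. */
typedef void (*callback_fn) (void);

typedef struct TwoPhaseCallbacks
{
	callback_fn prepare_cb;				/* required */
	callback_fn commit_prepared_cb;		/* required */
	callback_fn rollback_prepared_cb;	/* required */
	callback_fn filter_prepare_cb;		/* optional */
} TwoPhaseCallbacks;

/* Placeholder callback so that a slot can be marked as registered. */
void
noop_callback(void)
{
}

/* Provisionally on if at least one two-phase callback is registered. */
bool
twophase_supported(const TwoPhaseCallbacks *cb)
{
	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/* Mirrors "ctx->twophase &= enable_2pc" in the plugin startup callback. */
bool
twophase_enabled(const TwoPhaseCallbacks *cb, bool enable_2pc)
{
	return twophase_supported(cb) && enable_2pc;
}

/*
 * Returns the name of the first missing required callback, or NULL when
 * all three required callbacks are present.
 */
const char *
first_missing_required(const TwoPhaseCallbacks *cb)
{
	if (cb->prepare_cb == NULL)
		return "prepare_cb";
	if (cb->commit_prepared_cb == NULL)
		return "commit_prepared_cb";
	if (cb->rollback_prepared_cb == NULL)
		return "rollback_prepared_cb";
	return NULL;
}
```

Enabling on "any callback present" rather than "all callbacks present" is what lets the wrappers raise a specific error naming the missing callback, instead of silently falling back to one-phase decoding.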
---
 contrib/test_decoding/test_decoding.c     | 192 +++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 144 ++++++++++++++++++-
 src/backend/replication/logical/logical.c | 228 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  76 ++++++++++
 6 files changed, 684 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 8e33614..7eb13ce 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid_aborted; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -73,6 +78,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -88,6 +96,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -112,10 +132,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_2pc = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +162,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +254,40 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_2pc))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				unsigned long xid;
+
+				errno = 0;
+				xid = strtoul(strVal(elem->arg), NULL, 0);
+				if (xid == 0 || errno != 0)
+					data->check_xid_aborted = InvalidTransactionId;
+				else
+					data->check_xid_aborted = (TransactionId)xid;
+
+				if (!TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+								strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_2pc;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +359,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +604,25 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* if check_xid_aborted is a valid xid, then it was passed in
+	 * as an option to check if the transaction having this xid would be aborted.
+	 * This is to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			   !TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -646,6 +814,30 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..cfb1b32 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,18 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +490,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +597,55 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> command has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> command has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +655,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but not yet
+      committed) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +738,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decoding
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +792,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, in case of decoding a prepared (but not yet committed)
+      transaction or decoding of an uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +848,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1039,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, commit prepared using the
+    <function>commit_prepared_cb</function> callback or aborted using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8675832..0f605ef 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -221,12 +235,26 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
 		(ctx->callbacks.stream_stop_cb != NULL) ||
 		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
 		(ctx->callbacks.stream_commit_cb != NULL) ||
 		(ctx->callbacks.stream_change_cb != NULL) ||
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require prepare/commit-prepare/abort-prepare
+	 * callbacks. The filter-prepare callback is optional. We however enable two-phase logical
+	 * decoding when at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +265,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +812,120 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then prepare callback is mandatory */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then commit prepared callback is mandatory */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then rollback prepared callback is mandatory */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1002,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not enabled. In that
+	 * case all two-phase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1245,46 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming and two-phase commits are supported. */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..4c1341f 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or wait till COMMIT PREPARED
+ * and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1c77819..a323dfb 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -174,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -244,6 +266,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -405,6 +430,31 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/* abort callback signature */
+typedef void (*ReorderBufferAbortCB) (ReorderBuffer *rb,
+									  ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -431,6 +481,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -497,6 +553,11 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferAbortCB abort;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -505,6 +566,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
@@ -574,6 +636,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -597,6 +664,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool 		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool 		ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void 		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
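
For readers following the thread, the decision made by
filter_prepare_cb_wrapper() in the patch above can be sketched as a small
standalone C model. This is only an illustration under stated assumptions,
not code from the patchset: DemoContext, filter_prepare_fn,
demo_filter_prepare() and demo_prepare_is_filtered() are hypothetical names,
the real callback also receives ctx, txn and xid, and the "_nodecode" policy
is assumed from the comment on Test 7 in two_phase.sql rather than taken from
the plugin source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for LogicalDecodeFilterPrepareCB (GID only). */
typedef bool (*filter_prepare_fn) (const char *gid);

/*
 * Hypothetical stand-in for the two LogicalDecodingContext members that the
 * wrapper consults: the twophase flag and the optional filter callback.
 */
typedef struct DemoContext
{
	bool		twophase;
	filter_prepare_fn filter_prepare_cb;
} DemoContext;

/*
 * Assumed plugin policy: filter out any GID containing "_nodecode", so that
 * the transaction is decoded at COMMIT PREPARED instead of at PREPARE.
 */
bool
demo_filter_prepare(const char *gid)
{
	assert(gid != NULL);
	return strstr(gid, "_nodecode") != NULL;
}

/*
 * Mirrors the order of checks in filter_prepare_cb_wrapper().
 * Returns true when the PREPARE is filtered out, i.e. the transaction is
 * sent as a regular transaction at COMMIT PREPARED.
 */
bool
demo_prepare_is_filtered(const DemoContext *ctx, const char *gid)
{
	/* Two-phase decoding disabled: every prepared transaction is filtered. */
	if (!ctx->twophase)
		return true;

	/* The filter callback is optional: without it, nothing is filtered. */
	if (ctx->filter_prepare_cb == NULL)
		return false;

	/* Otherwise the plugin decides. */
	return ctx->filter_prepare_cb(gid);
}
```

With twophase set and the sketch filter installed, a GID such as
'test_prepared#1' is not filtered (decoded at PREPARE), while
'test_prepared_nodecode' is filtered (decoded at COMMIT PREPARED). With
twophase unset, every GID is filtered regardless of the callback.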

Attachment: v12-0002-Support-2PC-txn-backend-and-tests.patch (application/octet-stream)
From 585e672158366f5a3b2a7d9330f5e0a0404d6f54 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 26 Oct 2020 11:25:57 +1100
Subject: [PATCH v12] Support 2PC txn backend and tests.

Until now two-phase transactions were decoded at COMMIT PREPARED, just
like regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.

Includes two-phase commit test code (for test_decoding).
---
 contrib/test_decoding/Makefile                  |   4 +-
 contrib/test_decoding/expected/two_phase.out    | 228 ++++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql         | 119 ++++++++++
 contrib/test_decoding/t/001_twophase.pl         | 121 ++++++++++
 src/backend/replication/logical/decode.c        | 250 +++++++++++++++++++-
 src/backend/replication/logical/reorderbuffer.c | 302 +++++++++++++++++++++---
 6 files changed, 972 insertions(+), 52 deletions(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..2abf3ce 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..da6e6e6
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+--
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..a7b23e6
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+--
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+ok(looks_like_number($xid2pc), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now; the output should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output didn't appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..fd961d4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -68,8 +68,15 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
+static void DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+								 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+								xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare * parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -239,7 +246,6 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	switch (info)
 	{
 		case XLOG_XACT_COMMIT:
-		case XLOG_XACT_COMMIT_PREPARED:
 			{
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
@@ -256,8 +262,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeCommit(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_COMMIT_PREPARED:
+			{
+				xl_xact_commit *xlrec;
+				xl_xact_parsed_commit parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_commit *) XLogRecGetData(r);
+				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeCommitPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ABORT:
-		case XLOG_XACT_ABORT_PREPARED:
 			{
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
@@ -274,6 +296,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_ABORT_PREPARED:
+			{
+				xl_xact_abort *xlrec;
+				xl_xact_parsed_abort parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_abort *) XLogRecGetData(r);
+				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeAbortPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ASSIGNMENT:
 
 			/*
@@ -312,17 +351,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *)XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -659,6 +716,131 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Handle a COMMIT PREPARED record: either notify the output plugin that an
+ * already-decoded prepared transaction committed, or replay it as a commit.
+ */
+static void
+DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+					 xl_xact_parsed_commit *parsed, TransactionId xid)
+{
+	XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
+					   parsed->nsubxacts, parsed->subxacts);
+
+	/* ----
+	 * Check whether we are interested in this specific transaction, and tell
+	 * the reorderbuffer to forget the content of the (sub-)transactions
+	 * if not.
+	 *
+	 * There can be several reasons we might not be interested in this
+	 * transaction:
+	 * 1) We might not be interested in decoding transactions up to this
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
+	 * 2) The transaction happened in another database.
+	 * 3) The output plugin is not interested in the origin.
+	 * 4) We are doing fast-forwarding
+	 *
+	 * We can't just use ReorderBufferAbort() here, because we need to execute
+	 * the transaction's invalidations.  This currently won't be needed if
+	 * we're just skipping over the transaction because currently we only do
+	 * so during startup, to get to the first transaction the client needs. As
+	 * we have reset the catalog caches before starting to read WAL, and we
+	 * haven't yet touched any catalogs, there can't be anything to invalidate.
+	 * But if we're "forgetting" this commit because it happened in
+	 * another database, the invalidations might be important, because they
+	 * could be for shared catalogs and we might have loaded data into the
+	 * relevant syscaches.
+	 * ---
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+		}
+		ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+		return;
+	}
+
+	/* tell the reorderbuffer about the surviving subtransactions */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/*
+	 * For COMMIT PREPARED, the changes have already been replayed at
+	 * PREPARE time, so we only need to notify the subscriber that the GID
+	 * finally committed. However, if the output plugin's prepare filter asked
+	 * to skip this transaction at PREPARE, replay it as a regular commit.
+	 */
+	if (ctx->callbacks.filter_prepare_cb &&
+			ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+	else
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr  origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		 ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/* replay actions of all transaction + subtransactions in order */
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+}
+
+/*
  * Get the data from the various forms of abort records and pass it on to
  * snapbuild.c and reorderbuffer.c
  */
@@ -681,6 +863,50 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Handle a ROLLBACK PREPARED record: notify the output plugin if the
+ * transaction passes the filters, otherwise just discard its changes.
+ */
+static void
+DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			xl_xact_parsed_abort *parsed, TransactionId xid)
+{
+	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it passes through the filters, handle the ROLLBACK via callbacks.
+	 */
+	if (!FilterByOrigin(ctx, origin_id) &&
+	   !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+	   !ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		Assert(TransactionIdIsValid(xid));
+		Assert(parsed->dbId == ctx->slot->data.database);
+
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
+
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+						   buf->record->EndRecPtr);
+	}
+
+	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+}
+
+/*
  * Parse XLOG_HEAP_INSERT (not MULTI_INSERT!) records into tuplebufs.
  *
  * Deletes can contain the new tuple.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7a8bf76..0a5c158 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -418,6 +419,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1511,12 +1518,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1535,7 +1544,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1569,9 +1578,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for decoding
+		 * catalog snapshot access.
+		 * They are always stored in the toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1765,9 +1798,22 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
-
-	ReorderBufferCleanupTXN(rb, txn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
+		/* We are streaming the PREPARE of a two-phase commit here. The full
+		 * cleanup happens when COMMIT PREPARED is processed, so for now just
+		 * truncate the txn by removing its changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1895,8 +1941,11 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/* Discard the changes that we just streamed.
+	 * This is only called when streaming, never as part of a PREPARE in
+	 * a two-phase commit, so pass the prepared flag as false.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1918,6 +1967,11 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 /*
  * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
  *
+ * We get here in one of three scenarios:
+ * 1. As part of streaming an in-progress transaction
+ * 2. Prepare of a two-phase commit
+ * 3. Commit of a transaction.
+ *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
  * merge) and replay the changes in lsn order.
@@ -2003,7 +2057,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2294,7 +2348,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2328,18 +2391,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
+		 * We are here due to one of the 3 scenarios:
+		 * 1. As part of streaming in-progress transactions
+		 * 2. Prepare of a two-phase commit
+		 * 3. Commit of a transaction.
+		 *
 		 * If we are streaming the in-progress transaction then discard the
 		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
+		 * streamed (if they contained changes), passing the prepared flag as
+		 * false. If this is the PREPARE of a two-phase commit, pass the
+		 * prepared flag as true so that we discard the changes and clean up
+		 * the tuplecids. Otherwise, remove all the
 		 * changes and deallocate the ReorderBufferTXN.
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, false);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (rbtxn_prepared(txn))
+		{
+			ReorderBufferTruncateTXN(rb, txn, true);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2369,17 +2446,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can only occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet, or when we
+			 * are sending the data out on a PREPARE during a two-phase commit.
+			 * The two cases are mutually exclusive: exactly one of them holds.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
+			Assert(!(streaming && rbtxn_prepared(txn)));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2387,10 +2467,21 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream remaining data.
+			 */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2412,23 +2503,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2470,6 +2554,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/*
+	 * Always call the prepare filter. It's the job of the prepare filter to
+	 * give us the *same* response for a given xid across multiple calls
+	 * (including ones on restart).
+	 */
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Either way, a prepared transaction holds no buffered changes at this
+	 * point, so allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2512,7 +2731,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
-- 
1.8.3.1
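
For readers following along, the way the decode.c changes above route the three two-phase record types can be summarised in a small standalone sketch. This is only an illustration of the dispatch as I read it from the patch, not code from the patch itself: the enum and function names (RecordKind, Action, route_2pc_record) are hypothetical, and the generic skip checks (snapshot not yet consistent, other database, origin filter, fast-forward) are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names, for illustration only; none of these exist in PostgreSQL. */
typedef enum
{
	REC_PREPARE,
	REC_COMMIT_PREPARED,
	REC_ABORT_PREPARED
} RecordKind;

typedef enum
{
	ACT_TRACK_XID_ONLY,		/* ReorderBufferProcessXid(): keep buffering */
	ACT_REPLAY_AT_PREPARE,	/* DecodePrepare() -> ReorderBufferPrepare() */
	ACT_REPLAY_AS_COMMIT,	/* ReorderBufferCommit() */
	ACT_FINISH_COMMIT,		/* ReorderBufferFinishPrepared(..., true) */
	ACT_FINISH_ROLLBACK,	/* ReorderBufferFinishPrepared(..., false) */
	ACT_ABORT				/* ReorderBufferAbort() */
} Action;

/*
 * twophase:      stands in for ctx->twophase
 * has_filter_cb: stands in for ctx->callbacks.filter_prepare_cb != NULL
 * filter_skips:  what the plugin's prepare filter answers for this GID
 */
Action
route_2pc_record(RecordKind kind, bool twophase,
				 bool has_filter_cb, bool filter_skips)
{
	switch (kind)
	{
		case REC_PREPARE:
			/* plugin cannot do two-phase, or asked to skip this one */
			if (!twophase)
				return ACT_TRACK_XID_ONLY;
			if (has_filter_cb && filter_skips)
				return ACT_TRACK_XID_ONLY;
			return ACT_REPLAY_AT_PREPARE;

		case REC_COMMIT_PREPARED:
			/* skipped at PREPARE time: replay everything now */
			if (has_filter_cb && filter_skips)
				return ACT_REPLAY_AS_COMMIT;
			return ACT_FINISH_COMMIT;

		case REC_ABORT_PREPARED:
			/* the patch consults the filter here without testing the pointer */
			if (!filter_skips)
				return ACT_FINISH_ROLLBACK;
			return ACT_ABORT;
	}

	assert(false);				/* not reached for valid input */
	return ACT_ABORT;
}
```

Read this way, the prepare filter has to give the same answer at PREPARE and at COMMIT/ROLLBACK PREPARED time: a transaction skipped at PREPARE must be replayed as a regular commit later, and one replayed at PREPARE must only get the finish notification.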

v12-0003-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 73142a5bb0ead534cddbdbdb9fb4d4b2bdf88b28 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 26 Oct 2020 12:59:59 +1100
Subject: [PATCH v12] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.

Includes two-phase commit test code (streaming and non-streaming).
---
 src/backend/access/transam/twophase.c             |  27 ++
 src/backend/replication/logical/proto.c           | 148 +++++-
 src/backend/replication/logical/worker.c          | 286 +++++++++++-
 src/backend/replication/pgoutput/pgoutput.c       |  75 +++-
 src/include/access/twophase.h                     |   1 +
 src/include/replication/logicalproto.h            |  33 ++
 src/test/subscription/t/020_twophase.pl           | 345 ++++++++++++++
 src/test/subscription/t/021_twophase_streaming.pl | 521 ++++++++++++++++++++++
 8 files changed, 1414 insertions(+), 22 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_streaming.pl

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..2e0a408 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..ecd734b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -99,6 +99,7 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	if (flags != 0)
 		elog(ERROR, "unrecognized flags %u in commit message", flags);
 
+
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
 	commit_data->end_lsn = pq_getmsgint64(in);
@@ -106,6 +107,151 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* Make sure exactly one of the expected flags is set. */
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ * (PREPARE only; [COMMIT|ROLLBACK] PREPARED use the non-streaming message.)
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8	flags = 0;
+
+	pq_sendbyte(out, 'p');		/* sending STREAM PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported; [COMMIT|ROLLBACK] PREPARED
+	 * use the non-streaming APIs.
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ * (PREPARE only; [COMMIT|ROLLBACK] PREPARED use the non-streaming message.)
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
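As a reading aid, the PREPARE message layout written by logicalrep_write_prepare above can be sketched outside the backend. The field order (tag byte 'P', flags byte, three 64-bit fields, NUL-terminated GID) is read off the patch; the big-endian encoding reflects my understanding of pq_sendint64. Every sk_/SK_ name is hypothetical, and the bounded GID copy is a defensive choice of this sketch rather than something the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_GIDSIZE 200				/* assumed to mirror GIDSIZE */
#define SK_IS_PREPARE 0x01
#define SK_IS_COMMIT_PREPARED 0x02
#define SK_IS_ROLLBACK_PREPARED 0x04
#define SK_FIXED_LEN (2 + 3 * 8)	/* tag + flags + three 64-bit fields */

typedef struct SkPrepare
{
	uint8_t		type;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[SK_GIDSIZE];
} SkPrepare;

/* Exactly one of the three flag values must be set. */
static bool
sk_flags_valid(uint8_t f)
{
	return f == SK_IS_PREPARE || f == SK_IS_COMMIT_PREPARED ||
		f == SK_IS_ROLLBACK_PREPARED;
}

/* Store a 64-bit value in network (big-endian) byte order. */
static void
sk_put64(unsigned char *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (unsigned char) (v >> (56 - 8 * i));
}

static uint64_t
sk_get64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Encode m into buf; returns bytes written, or 0 on bad flags or overflow. */
static size_t
sk_write_prepare(unsigned char *buf, size_t cap, const SkPrepare *m)
{
	size_t		gidlen;

	assert(buf != NULL && m != NULL);
	gidlen = strlen(m->gid) + 1;	/* include the terminating NUL */
	if (!sk_flags_valid(m->type) || SK_FIXED_LEN + gidlen > cap)
		return 0;

	buf[0] = 'P';
	buf[1] = m->type;
	sk_put64(buf + 2, m->prepare_lsn);
	sk_put64(buf + 10, m->end_lsn);
	sk_put64(buf + 18, (uint64_t) m->prepare_time);
	memcpy(buf + SK_FIXED_LEN, m->gid, gidlen);
	return SK_FIXED_LEN + gidlen;
}

/* Decode buf into m; returns false on any malformed input. */
static bool
sk_read_prepare(const unsigned char *buf, size_t len, SkPrepare *m)
{
	const unsigned char *gid;
	const unsigned char *nul;
	size_t		gidlen;

	assert(buf != NULL && m != NULL);
	if (len < SK_FIXED_LEN + 1 || buf[0] != 'P' || !sk_flags_valid(buf[1]))
		return false;

	gid = buf + SK_FIXED_LEN;
	nul = memchr(gid, '\0', len - SK_FIXED_LEN);
	if (nul == NULL)
		return false;			/* GID is not terminated */
	gidlen = (size_t) (nul - gid) + 1;
	if (gidlen > SK_GIDSIZE)
		return false;			/* GID would overflow the buffer */

	m->type = buf[1];
	m->prepare_lsn = sk_get64(buf + 2);
	m->end_lsn = sk_get64(buf + 10);
	m->prepare_time = (int64_t) sk_get64(buf + 18);
	memcpy(m->gid, gid, gidlen);
	return true;
}
```

The STREAM PREPARE variant in the patch uses tag 'p' and puts a 32-bit xid between the tag and the flags byte; the sketch leaves that out.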
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3a5b733..2f6514e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -722,6 +724,7 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_timestamp = commit_data.committime;
 
 		CommitTransactionCommand();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -742,6 +745,225 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * A prepared transaction can be aborted on the publisher while it is
+	 * still being decoded. Decoding then stops and the transaction is
+	 * aborted immediately, so the subscriber may never have prepared it,
+	 * but the ROLLBACK PREPARED still reaches the subscriber. It is
+	 * therefore OK for the GID to be missing here.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK PREPARED
+	 * for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 * ========================================
+	 * 1. Replay all the spooled operations
+	 * - Same as what apply_handle_stream_commit does for a non-two-phase commit.
+	 * ========================================
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * ========================================
+	 * 2. Mark the transaction as prepared.
+	 * - Same as what apply_handle_prepare_txn does for a non-streamed transaction.
+	 * ========================================
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -935,30 +1157,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -981,7 +1194,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1050,6 +1263,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1057,16 +1299,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
@@ -1905,10 +2143,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
@@ -1953,6 +2195,10 @@ apply_dispatch(StringInfo s)
 		case 'c':
 			apply_handle_stream_commit(s);
 			break;
+			/* STREAM PREPARE */
+		case 'p':
+			apply_handle_stream_prepare(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..b4f2c9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,7 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
-
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static bool publications_valid;
 static bool in_streaming;
 
@@ -143,6 +150,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +164,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +391,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +912,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0c2cda2..ee38f89 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -87,6 +87,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -94,6 +95,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -101,6 +124,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -144,4 +171,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..d0b0a25
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,345 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 21;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub 
+	CONNECTION '$publisher_connstr application_name=$appname' 
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', 
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', 
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', 
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', 
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', 
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', 
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', 
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', 
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', 
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', 
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres', 
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', 
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres', 
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_streaming.pl b/src/test/subscription/t/021_twophase_streaming.pl
new file mode 100644
index 0000000..3cfe3dc
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_streaming.pl
@@ -0,0 +1,521 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 28;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', 
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub 
+	CONNECTION '$publisher_connstr application_name=$appname' 
+	PUBLICATION tap_pub 
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', 
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# 
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# 
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+# 
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+# 
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+# 
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+# 
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE 
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Try 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1

#70Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#67)

Hi Ajin.

I checked to see how my previous review comments (of v10) were
addressed by the latest patches (currently v12).

There are a couple of remaining items.

---

====================
v12-0001. File: doc/src/sgml/logicaldecoding.sgml
====================

COMMENT
Section 49.6.1
Says:
An output plugin may also define functions to support streaming of
large, in-progress transactions. The stream_start_cb, stream_stop_cb,
stream_abort_cb, stream_commit_cb, stream_change_cb, and
stream_prepare_cb are required, while stream_message_cb and
stream_truncate_cb are optional.

An output plugin may also define functions to support two-phase
commits, which are decoded on PREPARE TRANSACTION. The prepare_cb,
commit_prepared_cb and rollback_prepared_cb callbacks are required,
while filter_prepare_cb is optional.
~
I was not sure how these paragraphs are organised, e.g. the 1st seems to
be about streaming and the 2nd about two-phase commit. But the two are
not mutually exclusive, so I thought it was odd that stream_prepare_cb
was not mentioned in the 2nd paragraph.

Or maybe it is OK as-is?

====================
v12-0002. File: contrib/test_decoding/expected/two_phase.out
====================

COMMENT
Line 26
PREPARE TRANSACTION 'test_prepared#1';
--
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1');
~
This seems to be missing a comment explaining what that SELECT is
expected to return.

---

COMMENT
Line 80
-- The insert should show the newly altered column.
~
Do you also need to mention something about the DDL not being present
in the decoding?

====================
v12-0002. File: src/backend/replication/logical/reorderbuffer.c
====================

COMMENT
Line 1807
/* Here we are streaming and part of the PREPARE of a two-phase commit
* The full cleanup will happen as part of the COMMIT PREPAREDs, so now
* just truncate txn by removing changes and tuple_cids
*/
~
Something seems strange about the first sentence of that comment

---

COMMENT
Line 1944
/* Discard the changes that we just streamed.
* This can only be called if streaming and not part of a PREPARE in
* a two-phase commit, so set prepared flag as false.
*/
~
Since this comment asserts various things, I think those conditions
should also be written as an actual Assert in the code.

---

COMMENT
Line 2401
/*
* We are here due to one of the 3 scenarios:
* 1. As part of streaming in-progress transactions
* 2. Prepare of a two-phase commit
* 3. Commit of a transaction.
*
* If we are streaming the in-progress transaction then discard the
* changes that we just streamed, and mark the transactions as
* streamed (if they contained changes), set prepared flag as false.
* If part of a prepare of a two-phase commit set the prepared flag
* as true so that we can discard changes and cleanup tuplecids.
* Otherwise, remove all the
* changes and deallocate the ReorderBufferTXN.
*/
~
The above comment is beyond my understanding. Anything you could do to
simplify it would be good.

For example, when viewing this function in isolation I have never
understood why the streaming flag and the rbtxn_prepared(txn) flag
cannot both be set at the same time.

Perhaps the code is relying on just internal knowledge of how this
helper function gets called? And if it is just that, then IMO there
really should be some Asserts in the code to give more assurance about
that. (Or maybe use completely different flags to represent those 3
scenarios instead of bending the meanings of the existing flags)

====================
v12-0003. File: src/backend/access/transam/twophase.c
====================

COMMENT
Line 557
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
}

 /*
+ * LookupGXact
+ * Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+ int i;
+ bool found = false;
~
The variable declarations in the LookupGXact function are misaligned.

---

Kind Regards,
Peter Smith.
Fujitsu Australia

#71Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#70)
3 attachment(s)

On Mon, Oct 26, 2020 at 6:49 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Ajin.

I checked to see how my previous review comments (of v10) were
addressed by the latest patches (currently v12).

There are a couple of remaining items.

---

====================
v12-0001. File: doc/src/sgml/logicaldecoding.sgml
====================

COMMENT
Section 49.6.1
Says:
An output plugin may also define functions to support streaming of
large, in-progress transactions. The stream_start_cb, stream_stop_cb,
stream_abort_cb, stream_commit_cb, stream_change_cb, and
stream_prepare_cb are required, while stream_message_cb and
stream_truncate_cb are optional.

An output plugin may also define functions to support two-phase
commits, which are decoded on PREPARE TRANSACTION. The prepare_cb,
commit_prepared_cb and rollback_prepared_cb callbacks are required,
while filter_prepare_cb is optional.
~
I was not sure how these paragraphs are organised, e.g. the 1st seems to
be about streaming and the 2nd about two-phase commit. But the two are
not mutually exclusive, so I thought it was odd that stream_prepare_cb
was not mentioned in the 2nd paragraph.

Or maybe it is OK as-is?

I've added stream_prepare_cb to the 2nd paragraph as well.
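
As a rough illustration of the required/optional split described in the
two quoted documentation paragraphs, here is a minimal standalone sketch
in C. It is not PostgreSQL code: the SketchCallbacks struct, the
sketch_cb type and the sketch_* function names are hypothetical, and
only the callback names themselves are taken from the quoted text.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type and table, for illustration only. */
typedef void (*sketch_cb) (void);

typedef struct SketchCallbacks
{
	/* two-phase commit callbacks */
	sketch_cb	prepare_cb;
	sketch_cb	commit_prepared_cb;
	sketch_cb	rollback_prepared_cb;
	sketch_cb	filter_prepare_cb;	/* optional */

	/* streaming callbacks */
	sketch_cb	stream_start_cb;
	sketch_cb	stream_stop_cb;
	sketch_cb	stream_abort_cb;
	sketch_cb	stream_commit_cb;
	sketch_cb	stream_change_cb;
	sketch_cb	stream_prepare_cb;
	sketch_cb	stream_message_cb;	/* optional */
	sketch_cb	stream_truncate_cb;	/* optional */
} SketchCallbacks;

/* Placeholder callback body so a table can be filled in. */
void
sketch_noop(void)
{
}

/* True if every callback the text lists as required for 2PC is set. */
bool
sketch_twophase_ok(const SketchCallbacks *cb)
{
	return cb->prepare_cb != NULL &&
		cb->commit_prepared_cb != NULL &&
		cb->rollback_prepared_cb != NULL;
}

/* True if every callback the text lists as required for streaming is set. */
bool
sketch_streaming_ok(const SketchCallbacks *cb)
{
	return cb->stream_start_cb != NULL &&
		cb->stream_stop_cb != NULL &&
		cb->stream_abort_cb != NULL &&
		cb->stream_commit_cb != NULL &&
		cb->stream_change_cb != NULL &&
		cb->stream_prepare_cb != NULL;
}
```

The optional callbacks are deliberately not checked, so a table that
leaves filter_prepare_cb, stream_message_cb and stream_truncate_cb unset
still passes both checks.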

====================
v12-0002. File: contrib/test_decoding/expected/two_phase.out
====================

COMMENT
Line 26
PREPARE TRANSACTION 'test_prepared#1';
--
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1');
~
This seems to be missing a comment explaining what that SELECT is
expected to return.

---

Updated.

COMMENT
Line 80
-- The insert should show the newly altered column.
~
Do you also need to mention something about the DDL not being present
in the decoding?

Updated.

====================
v12-0002. File: src/backend/replication/logical/reorderbuffer.c
====================

COMMENT
Line 1807
/* Here we are streaming and part of the PREPARE of a two-phase commit
* The full cleanup will happen as part of the COMMIT PREPAREDs, so now
* just truncate txn by removing changes and tuple_cids
*/
~
Something seems strange about the first sentence of that comment

---

COMMENT
Line 1944
/* Discard the changes that we just streamed.
* This can only be called if streaming and not part of a PREPARE in
* a two-phase commit, so set prepared flag as false.
*/
~
Since this comment asserts various things, I think those conditions
should also be written as an actual Assert in the code.

---

Added an assert.

COMMENT
Line 2401
/*
* We are here due to one of the 3 scenarios:
* 1. As part of streaming in-progress transactions
* 2. Prepare of a two-phase commit
* 3. Commit of a transaction.
*
* If we are streaming the in-progress transaction then discard the
* changes that we just streamed, and mark the transactions as
* streamed (if they contained changes), set prepared flag as false.
* If part of a prepare of a two-phase commit set the prepared flag
* as true so that we can discard changes and cleanup tuplecids.
* Otherwise, remove all the
* changes and deallocate the ReorderBufferTXN.
*/
~
The above comment is beyond my understanding. Anything you could do to
simplify it would be good.

For example, when viewing this function in isolation I have never
understood why the streaming flag and the rbtxn_prepared(txn) flag
cannot both be set at the same time.

Perhaps the code is relying on just internal knowledge of how this
helper function gets called? And if it is just that, then IMO there
really should be some Asserts in the code to give more assurance about
that. (Or maybe use completely different flags to represent those 3
scenarios instead of bending the meanings of the existing flags)

I have left this as is for now; I will probably revisit it in a later review.
But just to explain: this function is what does the main decoding of
the changes of a transaction.
The point at which this decoding happens is what this feature and the
streaming of in-progress transactions feature are about. As of PG13, this
decoding only happens at commit time. With the streaming of in-progress
transactions feature, it also began to happen (if streaming is enabled)
when the memory limit for decoding transactions is crossed. This 2PC
feature adds support for decoding at PREPARE TRANSACTION time.
Now, if streaming is enabled and streaming has already started as a result
of crossing the memory threshold, then there is no need to begin streaming
again at PREPARE time, as the transaction that is being prepared has already
been streamed. That is why this function will not be called when a streaming
transaction is prepared as part of a two-phase commit.
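
To make the explanation above a bit more concrete, here is a small standalone model of the three points at which the changes of a transaction may get decoded. It is only a sketch of the reasoning, not code from the patch: the enum, the struct and sketch_should_decode_changes() are hypothetical names, and the commit case is simplified to the PG13 baseline (it ignores what commit does for a transaction that was already decoded earlier).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: the event that could trigger decoding of a transaction. */
typedef enum SketchDecodeTrigger
{
	SKETCH_TRIGGER_MEMORY_LIMIT,	/* decoding memory limit was crossed */
	SKETCH_TRIGGER_PREPARE,			/* PREPARE TRANSACTION was seen */
	SKETCH_TRIGGER_COMMIT			/* COMMIT was seen */
} SketchDecodeTrigger;

/* Hypothetical: the parts of the state that matter for the decision. */
typedef struct SketchTxnState
{
	bool		streaming_enabled;	/* plugin and slot allow streaming */
	bool		twophase_enabled;	/* plugin and slot allow 2PC decoding */
	bool		already_streamed;	/* streaming has started for this txn */
} SketchTxnState;

/*
 * Would the main "decode the changes" function be entered for this event?
 */
static bool
sketch_should_decode_changes(const SketchTxnState *txn,
							 SketchDecodeTrigger trigger)
{
	switch (trigger)
	{
		case SKETCH_TRIGGER_MEMORY_LIMIT:
			/* Only streaming decodes before the end of the transaction. */
			return txn->streaming_enabled;

		case SKETCH_TRIGGER_PREPARE:
			/*
			 * A transaction that is already being streamed does not need
			 * to be decoded again at PREPARE time.
			 */
			return txn->twophase_enabled && !txn->already_streamed;

		case SKETCH_TRIGGER_COMMIT:
			/* Simplification: the PG13 behaviour, decode at commit. */
			return true;
	}

	return false;
}
```

In this model "streaming" and "prepared" are never both the reason for entering the function, which mirrors why the two flags are not expected to be set together inside it.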

====================
v12-0003. File: src/backend/access/transam/twophase.c
====================

COMMENT
Line 557
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
}

/*
+ * LookupGXact
+ * Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+ int i;
+ bool found = false;
~
Alignment of the variable declarations in LookupGXact function

---

Updated.

Amit, I have also addressed your comment about the function
declaration: it is removed from commit 1 and added to commit 2. I have
also removed the whitespace errors.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v13-0001-Support-2PC-txn-base.patch (application/octet-stream)
From 69c8c31ce3fec377957bee7411136d7eb0d61628 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 27 Oct 2020 05:05:22 -0400
Subject: [PATCH v13] Support-2PC-txn-base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

Include logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 192 +++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 145 ++++++++++++++++++-
 src/backend/replication/logical/logical.c | 228 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  56 ++++++++
 6 files changed, 665 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 8e33614..7eb13ce 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid_aborted; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -73,6 +78,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -88,6 +96,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -112,10 +132,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool 		enable_2pc = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +162,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +254,40 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_2pc))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				unsigned long xid;
+
+				errno = 0;
+				xid = strtoul(strVal(elem->arg), NULL, 0);
+				if (xid == 0 || errno != 0)
+					data->check_xid_aborted = InvalidTransactionId;
+				else
+					data->check_xid_aborted = (TransactionId)xid;
+
+				if (!TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+								strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_2pc;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +359,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +604,25 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* if check_xid_aborted is a valid xid, then it was passed in
+	 * as an option to check if the transaction having this xid would be aborted.
+	 * This is to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			   !TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -646,6 +814,30 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..ad8991d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     them are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,55 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a transaction commit prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a transaction rollback prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +656,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but yet
+      uncommitted) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +739,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decode
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +793,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, in case of decoding a prepared (but yet uncommitted)
+      transaction or decoding of an uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +849,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1040,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, commit prepared using the
+    <function>commit_prepared_cb</function> callback or aborted using the
+    <function>rollback_prepared_cb</function>.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 8675832..0f605ef 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -221,12 +235,26 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
 		(ctx->callbacks.stream_stop_cb != NULL) ||
 		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
 		(ctx->callbacks.stream_commit_cb != NULL) ||
 		(ctx->callbacks.stream_change_cb != NULL) ||
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require prepare/commit-prepare/abort-prepare
+	 * callbacks. The filter-prepare callback is optional. We however enable two-phase logical
+	 * decoding when at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +265,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +812,120 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then prepare callback is mandatory */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then commit prepared callback is mandatory */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then rollback prepared callback is mandatory */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1002,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not enabled. In that
+	 * case all two-phase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1245,46 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming and two-phase commits are supported. */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..4c1341f 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until
+ * COMMIT PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for a PREPARE record unless it was filtered out by the
+ * filter_prepare callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1c77819..1d9bfb0 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -174,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -244,6 +266,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -405,6 +430,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -431,6 +476,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -497,6 +548,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -505,6 +560,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
-- 
1.8.3.1
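
As a reading aid for the reorderbuffer.h hunk above, here is a minimal
standalone sketch of how the three new txn_flags bits and their predicate
macros behave. Only the flag values and the macro bodies come from the diff.
SketchTXN, SketchState and sketch_state() are hypothetical stand-ins that do
not exist in the patch, and the precedence used in sketch_state() is my
assumption rather than the backend's actual dispatch logic.

```c
#include <stdint.h>

/* Flag values copied from the reorderbuffer.h hunk. */
#define RBTXN_PREPARE             0x0080
#define RBTXN_COMMIT_PREPARED     0x0100
#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Hypothetical stand-in for ReorderBufferTXN: only the field the macros read. */
typedef struct SketchTXN
{
	uint32_t	txn_flags;
} SketchTXN;

/* Predicate macros, same bodies as in the hunk. */
#define rbtxn_prepared(txn) \
	(((txn)->txn_flags & RBTXN_PREPARE) != 0)
#define rbtxn_commit_prepared(txn) \
	(((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0)
#define rbtxn_rollback_prepared(txn) \
	(((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0)

/* Hypothetical summary of the two-phase state encoded in the flags. */
typedef enum SketchState
{
	SKETCH_PLAIN,				/* no two-phase flag set */
	SKETCH_PREPARED,			/* PREPARE seen, outcome not yet known */
	SKETCH_COMMIT_PREPARED,		/* COMMIT PREPARED seen */
	SKETCH_ROLLBACK_PREPARED	/* ROLLBACK PREPARED seen */
} SketchState;

/*
 * Assumption: a final outcome (commit/rollback prepared) takes precedence
 * over the plain "prepared" bit when several bits are set.
 */
SketchState
sketch_state(const SketchTXN *txn)
{
	if (rbtxn_commit_prepared(txn))
		return SKETCH_COMMIT_PREPARED;
	if (rbtxn_rollback_prepared(txn))
		return SKETCH_ROLLBACK_PREPARED;
	if (rbtxn_prepared(txn))
		return SKETCH_PREPARED;
	return SKETCH_PLAIN;
}
```

The bits are independent, so a transaction can carry RBTXN_PREPARE together
with one of the outcome bits; the macros only test their own bit.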

Attachment: v13-0002-Support-2PC-txn-backend-and-tests.patch (application/octet-stream)
From b6bda03755fa60ae52ce8c77c2f4970dff97b316 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 27 Oct 2020 05:19:18 -0400
Subject: [PATCH v13] Support 2PC txn backend and tests.

Until now, two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.

Includes two-phase commit test code (for test_decoding).
---
 contrib/test_decoding/Makefile                  |   4 +-
 contrib/test_decoding/expected/two_phase.out    | 228 ++++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql         | 119 ++++++++++
 contrib/test_decoding/t/001_twophase.pl         | 121 ++++++++++
 src/backend/replication/logical/decode.c        | 250 ++++++++++++++++++-
 src/backend/replication/logical/reorderbuffer.c | 303 +++++++++++++++++++++---
 src/include/replication/reorderbuffer.h         |  14 ++
 7 files changed, 987 insertions(+), 52 deletions(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
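
Note for reviewers: the tests in two_phase.sql depend on the filter_prepare
decision (skip when two-phase is disabled, pass everything when no filter is
registered, otherwise ask the plugin). The standalone sketch below models
that decision so the expected outputs are easier to follow. SketchContext,
SketchFilterPrepareCB, sketch_filter_nodecode() and sketch_filter_prepare()
are hypothetical names that are not in the patch. The strstr()-based filter
is only my assumption of how a plugin might implement the "_nodecode" rule
that Test 7 exercises; the real callback also receives the decoding context,
the txn and the xid.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified filter callback: only looks at the GID. */
typedef bool (*SketchFilterPrepareCB) (const char *gid);

/* Hypothetical stand-in for the two context members the decision reads. */
typedef struct SketchContext
{
	bool		twophase;			/* two-phase decoding enabled? */
	SketchFilterPrepareCB filter_prepare_cb;	/* optional, may be NULL */
} SketchContext;

/* Assumed plugin filter: GIDs containing "_nodecode" are filtered out. */
bool
sketch_filter_nodecode(const char *gid)
{
	return strstr(gid, "_nodecode") != NULL;
}

/*
 * Returns true when the transaction should NOT be decoded at PREPARE time,
 * i.e. it is decoded as a regular transaction at COMMIT PREPARED.
 */
bool
sketch_filter_prepare(const SketchContext *ctx, const char *gid)
{
	/* Two-phase decoding disabled: everything counts as filtered out. */
	if (!ctx->twophase)
		return true;

	/* The filter is optional; without it all prepared xacts go through. */
	if (ctx->filter_prepare_cb == NULL)
		return false;

	return ctx->filter_prepare_cb(gid);
}
```

With two-phase enabled and this filter registered, 'test_prepared#1' would be
decoded at PREPARE, while 'test_prepared_nodecode' would only show up as
BEGIN/INSERT/COMMIT after COMMIT PREPARED, matching Tests 1 and 7.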

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..2abf3ce 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..fd961d4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -68,8 +68,15 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
+static void DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+								 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+								xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare * parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -239,7 +246,6 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	switch (info)
 	{
 		case XLOG_XACT_COMMIT:
-		case XLOG_XACT_COMMIT_PREPARED:
 			{
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
@@ -256,8 +262,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeCommit(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_COMMIT_PREPARED:
+			{
+				xl_xact_commit *xlrec;
+				xl_xact_parsed_commit parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_commit *) XLogRecGetData(r);
+				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeCommitPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ABORT:
-		case XLOG_XACT_ABORT_PREPARED:
 			{
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
@@ -274,6 +296,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_ABORT_PREPARED:
+			{
+				xl_xact_abort *xlrec;
+				xl_xact_parsed_abort parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_abort *) XLogRecGetData(r);
+				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeAbortPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ASSIGNMENT:
 
 			/*
@@ -312,17 +351,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *)XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -659,6 +716,131 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Consolidated commit record handling between the different forms of commit
+ * records.
+ */
+static void
+DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+					 xl_xact_parsed_commit *parsed, TransactionId xid)
+{
+	XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
+					   parsed->nsubxacts, parsed->subxacts);
+
+	/* ----
+	 * Check whether we are interested in this specific transaction, and tell
+	 * the reorderbuffer to forget the content of the (sub-)transactions
+	 * if not.
+	 *
+	 * There can be several reasons we might not be interested in this
+	 * transaction:
+	 * 1) We might not be interested in decoding transactions up to this
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
+	 * 2) The transaction happened in another database.
+	 * 3) The output plugin is not interested in the origin.
+	 * 4) We are doing fast-forwarding
+	 *
+	 * We can't just use ReorderBufferAbort() here, because we need to execute
+	 * the transaction's invalidations.  This currently won't be needed if
+	 * we're just skipping over the transaction because currently we only do
+	 * so during startup, to get to the first transaction the client needs. As
+	 * we have reset the catalog caches before starting to read WAL, and we
+	 * haven't yet touched any catalogs, there can't be anything to invalidate.
+	 * But if we're "forgetting" this commit because it happened in
+	 * another database, the invalidations might be important, because they
+	 * could be for shared catalogs and we might have loaded data into the
+	 * relevant syscaches.
+	 * ---
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+		}
+		ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+		return;
+	}
+
+	/* tell the reorderbuffer about the surviving subtransactions */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/*
+	 * For COMMIT PREPARED, the changes have already been replayed at
+	 * PREPARE time, so we only need to notify the subscriber that the GID
+	 * finally committed.
+	 * If a prepare filter is present and says to skip this GID, do a regular commit.
+	 */
+	if (ctx->callbacks.filter_prepare_cb &&
+			ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+	else
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr  origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		 ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is setup correctly for the main transaction in case all changes
+	 * happened in subtransactions
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/* replay actions of all transaction + subtransactions in order */
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+}
+
+/*
  * Get the data from the various forms of abort records and pass it on to
  * snapbuild.c and reorderbuffer.c
  */
@@ -681,6 +863,50 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Get the data from the various forms of abort records and pass it on to
+ * snapbuild.c and reorderbuffer.c
+ */
+static void
+DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			xl_xact_parsed_abort *parsed, TransactionId xid)
+{
+	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it passes through the filters handle the ROLLBACK via callbacks
+	 */
+	if (!FilterByOrigin(ctx, origin_id) &&
+	   !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+	   !ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		Assert(TransactionIdIsValid(xid));
+		Assert(parsed->dbId == ctx->slot->data.database);
+
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
+
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+						   buf->record->EndRecPtr);
+	}
+
+	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+}
+
+/*
  * Parse XLOG_HEAP_INSERT (not MULTI_INSERT!) records into tuplebufs.
  *
  * Deletes can contain the new tuple.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7a8bf76..c5620e1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -418,6 +419,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1511,12 +1518,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1535,7 +1544,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1569,9 +1578,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for decoding
+		 * catalog snapshot access.
+		 * They are always stored in the toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1765,9 +1798,22 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
-
-	ReorderBufferCleanupTXN(rb, txn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
+		/* This is a PREPARED transaction, part of a two-phase commit.
+		 * The full cleanup will happen as part of the COMMIT PREPAREDs, so now
+		 * just truncate txn by removing changes and tuple_cids
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1895,8 +1941,12 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/* Discard the changes that we just streamed.
+	 * This can only be called if streaming and not part of a PREPARE in
+	 * a two-phase commit, so set prepared flag as false.
+	 */
+	Assert(!rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1918,6 +1968,11 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 /*
  * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
  *
+ * We are here due to one of the 3 scenarios:
+ * 1. As part of streaming an in-progress transaction
+ * 2. Prepare of a two-phase commit
+ * 3. Commit of a transaction.
+ *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
  * merge) and replay the changes in lsn order.
@@ -2003,7 +2058,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2294,7 +2349,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2328,18 +2392,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
+		 * We are here due to one of the 3 scenarios:
+		 * 1. As part of streaming in-progress transactions
+		 * 2. Prepare of a two-phase commit
+		 * 3. Commit of a transaction.
+		 *
 		 * If we are streaming the in-progress transaction then discard the
 		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
+		 * streamed (if they contained changes), set prepared flag as false.
+		 * If part of a prepare of a two-phase commit set the prepared flag
+		 * as true so that we can discard changes and cleanup tuplecids.
+		 * Otherwise, remove all the
 		 * changes and deallocate the ReorderBufferTXN.
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, false);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (rbtxn_prepared(txn))
+		{
+			ReorderBufferTruncateTXN(rb, txn, true);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2369,17 +2447,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can only occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we are
+			 * sending the data out on a PREPARE during a two-phase commit.
+			 * The two conditions are mutually exclusive: exactly one of them holds.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
+			Assert(!(streaming && rbtxn_prepared(txn)));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2387,10 +2468,21 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream remaining data.
+			 */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+						txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2412,23 +2504,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2470,6 +2555,141 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Check whether this transaction was sent as prepared to subscribers.
+ * Called while handling commit|abort prepared.
+ */
+bool
+ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/*
+	 * Always call the prepare filter. It's the job of the prepare filter to
+	 * give us the *same* response for a given xid across multiple calls
+	 * (including ones on restart)
+	 */
+	return !(rb->filter_prepare(rb, txn, xid, gid));
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, two-phase transactions do not contain any reorderbuffers. So allow
+	 * it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2512,7 +2732,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1d9bfb0..28c7337 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -630,6 +630,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -653,6 +658,15 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool 		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+							 const char *gid);
+bool 		ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+						   const char *gid);
+void 		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
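
For readers following the decode.c hunks above, the routing decisions made at PREPARE and at COMMIT PREPARED can be summarised in a small standalone model. This is only a sketch under stated assumptions: the enum, the function names and the example filter are hypothetical and do not exist in PostgreSQL; they merely mirror the `ctx->twophase` and `filter_prepare_cb` checks visible in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL identifiers. */
typedef enum
{
	ACT_DEFER_TO_COMMIT,	/* PREPARE is only noted; decode at COMMIT PREPARED */
	ACT_DECODE_AT_PREPARE,	/* changes are replayed when PREPARE is seen */
	ACT_REGULAR_COMMIT,		/* replay changes as an ordinary commit */
	ACT_FINISH_PREPARED		/* only announce the fate of the GID */
} TwoPhaseAction;

/* Returns true when the plugin wants this GID skipped at PREPARE time. */
typedef bool (*PrepareFilter) (const char *gid);

/* Models the XLOG_XACT_PREPARE branch of DecodeXactOp() in the patch. */
TwoPhaseAction
action_on_prepare(bool twophase, PrepareFilter filter, const char *gid)
{
	assert(gid != NULL);

	if (!twophase)
		return ACT_DEFER_TO_COMMIT;
	if (filter != NULL && filter(gid))
		return ACT_DEFER_TO_COMMIT;
	return ACT_DECODE_AT_PREPARE;
}

/* Models the final branch of DecodeCommitPrepared() in the patch. */
TwoPhaseAction
action_on_commit_prepared(PrepareFilter filter, const char *gid)
{
	assert(gid != NULL);

	if (filter != NULL && filter(gid))
		return ACT_REGULAR_COMMIT;
	return ACT_FINISH_PREPARED;
}

/* Example filter (an assumption): skip every GID starting with "skip_". */
bool
skip_prefix_filter(const char *gid)
{
	return strncmp(gid, "skip_", 5) == 0;
}
```

The model relies on the filter giving the same answer for a GID at PREPARE and at COMMIT PREPARED, which is the requirement stated in the comment in ReorderBufferTxnIsPrepared(). Note also that, in the hunk as posted, DecodeCommitPrepared() consults only filter_prepare_cb and not ctx->twophase; whether that is intended is not clear from this patch alone.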

v13-0003-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 4f8354c70dc8b685d26c214fce239870d0695b4d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 27 Oct 2020 05:35:28 -0400
Subject: [PATCH v13] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.

Includes two-phase commit test code (streaming and not streaming).
---
 src/backend/access/transam/twophase.c             |  27 ++
 src/backend/replication/logical/proto.c           | 148 +++++-
 src/backend/replication/logical/worker.c          | 286 +++++++++++-
 src/backend/replication/pgoutput/pgoutput.c       |  75 +++-
 src/include/access/twophase.h                     |   1 +
 src/include/replication/logicalproto.h            |  33 ++
 src/test/subscription/t/020_twophase.pl           | 345 ++++++++++++++
 src/test/subscription/t/021_twophase_streaming.pl | 521 ++++++++++++++++++++++
 8 files changed, 1414 insertions(+), 22 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_streaming.pl
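
As a reading aid for the proto.c changes in this patch, here is a rough standalone sketch of the PREPARE message layout that logicalrep_write_prepare() produces: a 'P' byte, a flags byte, three 64-bit fields (prepare LSN, end LSN, timestamp) and the NUL-terminated GID. The encoder and its helper are hypothetical and use a plain byte buffer instead of StringInfo and the pq_send* routines; the flag value is passed in by the caller because the LOGICALREP_IS_* constants are defined elsewhere in the patch. It assumes the 64-bit fields go out in network byte order, as pq_sendint64() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 64-bit value in network (big-endian) byte order. */
static size_t
put_u64_be(uint8_t *dst, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		dst[i] = (uint8_t) (v >> (56 - 8 * i));
	return 8;
}

/*
 * Hypothetical encoder that follows the field order of
 * logicalrep_write_prepare(). Returns the number of bytes written,
 * or 0 if the buffer is too small.
 */
size_t
encode_prepare_msg(uint8_t *buf, size_t buflen, uint8_t flags,
				   uint64_t prepare_lsn, uint64_t end_lsn,
				   int64_t prepare_time, const char *gid)
{
	size_t		gidlen;
	size_t		need;
	size_t		off = 0;

	assert(buf != NULL && gid != NULL);

	gidlen = strlen(gid) + 1;		/* include the trailing '\0' */
	need = 1 + 1 + 3 * 8 + gidlen;	/* type + flags + three int64 + gid */
	if (buflen < need)
		return 0;

	buf[off++] = 'P';				/* message type: PREPARE */
	buf[off++] = flags;				/* prepare / commit / rollback prepared */
	off += put_u64_be(buf + off, prepare_lsn);
	off += put_u64_be(buf + off, end_lsn);
	off += put_u64_be(buf + off, (uint64_t) prepare_time);
	memcpy(buf + off, gid, gidlen);	/* GID as a C string */
	off += gidlen;

	return off;
}
```

A reader such as logicalrep_read_prepare() consumes the fields in the same order after the message type byte has been dispatched on, so the flags byte has to be validated before the remaining fields are interpreted.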

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..129afe9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..ecd734b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -99,6 +99,7 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	if (flags != 0)
 		elog(ERROR, "unrecognized flags %u in commit message", flags);
 
+
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
 	commit_data->end_lsn = pq_getmsgint64(in);
@@ -106,6 +107,151 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* Make sure exactly one of the expected flags is set. */
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (bounded copy into the fixed-size buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ * (Used for PREPARE only; [COMMIT|ROLLBACK] PREPARED use the regular message.)
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8	flags = 0;
+
+	pq_sendbyte(out, 'p');		/* sending STREAM PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which
+	 * case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For the streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK]
+	 * PREPARED use the non-streaming APIs.
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ * (Only PREPARE is expected here; see logicalrep_write_stream_prepare.)
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (bounded copy into the fixed-size buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
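
To make the wire layout of the new 'P' message easier to follow, here is a
rough standalone sketch of the field order used by logicalrep_write_prepare
and logicalrep_read_prepare above: type byte, flags byte, three 64-bit
fields in network byte order, then the NUL-terminated GID. Everything
prefixed with Sketch/sketch_ is hypothetical, the buffer size of 200 is an
assumption standing in for GIDSIZE, and plain byte arrays replace
StringInfo and the pq_* helpers. It is not the patch's code, only an
illustration of the format and of a bounded GID copy on the read side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE              200     /* assumed, stands in for GIDSIZE */
#define SKETCH_IS_PREPARE           0x01
#define SKETCH_IS_COMMIT_PREPARED   0x02
#define SKETCH_IS_ROLLBACK_PREPARED 0x04

typedef struct SketchPrepare
{
    uint8_t  prepare_type;
    uint64_t prepare_lsn;
    uint64_t end_lsn;
    int64_t  preparetime;
    char     gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Store a 64-bit value most significant byte first. */
static void
put_u64(uint8_t *buf, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        buf[i] = (uint8_t) (v >> (56 - 8 * i));
}

static uint64_t
get_u64(const uint8_t *buf)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | buf[i];
    return v;
}

/* Encode one message; returns bytes written, or 0 if buf is too small. */
size_t
sketch_write_prepare(uint8_t *buf, size_t cap, const SketchPrepare *p)
{
    size_t gidlen = strlen(p->gid) + 1;     /* include the trailing NUL */

    if (cap < 2 + 24 + gidlen)
        return 0;
    buf[0] = 'P';
    buf[1] = p->prepare_type;
    put_u64(buf + 2, p->prepare_lsn);
    put_u64(buf + 10, p->end_lsn);
    put_u64(buf + 18, (uint64_t) p->preparetime);
    memcpy(buf + 26, p->gid, gidlen);
    return 26 + gidlen;
}

/*
 * Decode the payload that follows the 'P' type byte.
 * Returns 0 on success, -1 if the message is malformed.
 */
int
sketch_read_prepare(const uint8_t *buf, size_t len, SketchPrepare *p)
{
    const uint8_t *nul;
    size_t         gidlen;

    if (len < 1 + 24 + 1)
        return -1;
    /* exactly one of the three types is allowed, never a combination */
    if (buf[0] != SKETCH_IS_PREPARE &&
        buf[0] != SKETCH_IS_COMMIT_PREPARED &&
        buf[0] != SKETCH_IS_ROLLBACK_PREPARED)
        return -1;
    p->prepare_type = buf[0];
    p->prepare_lsn = get_u64(buf + 1);
    p->end_lsn = get_u64(buf + 9);
    p->preparetime = (int64_t) get_u64(buf + 17);

    /* the GID must be NUL-terminated within the message and fit the buffer */
    nul = memchr(buf + 25, '\0', len - 25);
    if (nul == NULL)
        return -1;
    gidlen = (size_t) (nul - (buf + 25));
    if (gidlen >= SKETCH_GIDSIZE)
        return -1;
    memcpy(p->gid, buf + 25, gidlen + 1);
    return 0;
}
```

The reader takes the payload after the type byte because, as in
apply_dispatch, the type byte is consumed before the per-message reader
runs.
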
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3a5b733..2f6514e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -722,6 +724,7 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_timestamp = commit_data.committime;
 
 		CommitTransactionCommand();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -742,6 +745,225 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * The publisher may have stopped decoding a prepared transaction because
+	 * it was rolled back concurrently, so the PREPARE may never have been
+	 * applied here. The ROLLBACK PREPARED message still reaches the
+	 * subscriber, so it is OK for the GID to be missing; only finish the
+	 * prepared transaction if it actually exists.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK PREPARED
+	 * for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 * ========================================
+	 * 1. Replay all the spooled operations
+	 * - Same as what apply_handle_stream_commit does for a non-two-phase stream commit.
+	 * ========================================
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * ========================================
+	 * 2. Mark the transaction as prepared.
+	 * - Same as what apply_handle_prepare_txn does for a two-phase prepare of a non-streamed transaction.
+	 * ========================================
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -935,30 +1157,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -981,7 +1194,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1050,6 +1263,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1057,16 +1299,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
@@ -1905,10 +2143,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
@@ -1953,6 +2195,10 @@ apply_dispatch(StringInfo s)
 		case 'c':
 			apply_handle_stream_commit(s);
 			break;
+			/* STREAM PREPARE */
+		case 'p':
+			apply_handle_stream_prepare(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
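
The apply-side behaviour in the worker.c changes above can be summarised
with a toy model: PREPARE registers a GID, COMMIT PREPARED requires it to
exist, and ROLLBACK PREPARED tolerates a missing GID (the LookupGXact guard
in apply_handle_rollback_prepared_txn). The sketch below is only an
illustration under those assumptions. SkSubscriber and sk_apply_prepare_msg
are hypothetical names, the table sizes are arbitrary, and none of the real
transaction, origin, or flush-position handling is modelled.

```c
#include <assert.h>
#include <string.h>

#define SK_IS_PREPARE           0x01
#define SK_IS_COMMIT_PREPARED   0x02
#define SK_IS_ROLLBACK_PREPARED 0x04
#define SK_MAX_XACTS            4       /* arbitrary toy limits */
#define SK_GIDSIZE              64

/* Toy subscriber state: the set of GIDs currently prepared. */
typedef struct SkSubscriber
{
    char gids[SK_MAX_XACTS][SK_GIDSIZE];
    int  used[SK_MAX_XACTS];
} SkSubscriber;

/* Return the slot holding gid, or -1 if it is not prepared. */
static int
sk_find(const SkSubscriber *s, const char *gid)
{
    for (int i = 0; i < SK_MAX_XACTS; i++)
        if (s->used[i] && strcmp(s->gids[i], gid) == 0)
            return i;
    return -1;
}

/* Dispatch one prepare-family message; returns 0 on success, -1 on error. */
int
sk_apply_prepare_msg(SkSubscriber *s, int type, const char *gid)
{
    int slot = sk_find(s, gid);

    switch (type)
    {
        case SK_IS_PREPARE:
            if (slot >= 0 || strlen(gid) >= SK_GIDSIZE)
                return -1;      /* duplicate or oversized GID */
            for (int i = 0; i < SK_MAX_XACTS; i++)
            {
                if (!s->used[i])
                {
                    strcpy(s->gids[i], gid);    /* length checked above */
                    s->used[i] = 1;
                    return 0;
                }
            }
            return -1;          /* table full */

        case SK_IS_COMMIT_PREPARED:
            if (slot < 0)
                return -1;      /* nothing to commit */
            s->used[slot] = 0;
            return 0;

        case SK_IS_ROLLBACK_PREPARED:
            /* a missing GID is fine: the PREPARE may never have arrived */
            if (slot >= 0)
                s->used[slot] = 0;
            return 0;

        default:
            return -1;          /* unexpected type, like the ERROR branch */
    }
}
```

The asymmetry between the commit and rollback cases is the point of the
sketch: only the rollback path is treated as a no-op when the GID is
unknown.
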
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..b4f2c9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,7 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
-
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static bool publications_valid;
 static bool in_streaming;
 
@@ -143,6 +150,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +164,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +391,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +912,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0c2cda2..ee38f89 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -87,6 +87,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -94,6 +95,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* A prepare message is exactly one of PREPARE, COMMIT PREPARED, or ROLLBACK PREPARED. */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -101,6 +124,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData * prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -144,4 +171,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..f489f47
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,345 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 21;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_streaming.pl b/src/test/subscription/t/021_twophase_streaming.pl
new file mode 100644
index 0000000..9a03b83
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_streaming.pl
@@ -0,0 +1,521 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 28;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then one server crashes before the 2PC transaction is committed.
+# 3. After that server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then one server crashes before the 2PC transaction is committed.
+# 3. After that server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1
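
A note on the expected row counts in the streaming tests above: the
value 3334 follows from the test's own SQL. Rows 1 and 2 pre-exist,
generate_series(3, 5000) adds rows 3..5000, the UPDATE changes no row
count, and the DELETE removes every row whose key is a multiple of 3.
A minimal C sketch of that arithmetic (the function name is
hypothetical and exists only for illustration):

```c
/*
 * Hypothetical helper: count the rows left in test_tab after the test's
 * INSERT / UPDATE / DELETE sequence, assuming keys 1..max_a all exist
 * before the DELETE (2 pre-existing rows plus generate_series(3, max_a)).
 */
static int
sketch_rows_after_2pc(int max_a)
{
	int			rows = 0;
	int			a;

	for (a = 1; a <= max_a; a++)
	{
		/* DELETE FROM test_tab WHERE mod(a,3) = 0 removes these keys */
		if (a % 3 != 0)
			rows++;
	}
	return rows;
}
```

Calling sketch_rows_after_2pc(5000) gives the count the tests compare
against; the test that also inserts key 99999 outside the prepared
transaction expects one row more.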

#72Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#71)
2 attachment(s)

FYI - Please find attached code coverage reports which I generated
(based on the v12 patches) after running the following tests:

1. cd contrib/test_decoding; make check

2. cd src/test/subscription; make check

Kind Regards,
Peter Smith.
Fujitsu Australia

On Tue, Oct 27, 2020 at 8:55 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Oct 26, 2020 at 6:49 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Ajin.

I checked to see how my previous review comments (of v10) were
addressed by the latest patches (currently v12).

There are a couple of remaining items.

---

====================
v12-0001. File: doc/src/sgml/logicaldecoding.sgml
====================

COMMENT
Section 49.6.1
Says:
An output plugin may also define functions to support streaming of
large, in-progress transactions. The stream_start_cb, stream_stop_cb,
stream_abort_cb, stream_commit_cb, stream_change_cb, and
stream_prepare_cb are required, while stream_message_cb and
stream_truncate_cb are optional.

An output plugin may also define functions to support two-phase
commits, which are decoded on PREPARE TRANSACTION. The prepare_cb,
commit_prepared_cb and rollback_prepared_cb callbacks are required,
while filter_prepare_cb is optional.
~
I was not sure how the paragraphs are organised. e.g. 1st seems to be
about streams and 2nd seems to be about two-phase commit. But they are
not mutually exclusive, so I guess I thought it was odd that
stream_prepare_cb was not mentioned in the 2nd paragraph.

Or maybe it is OK as-is?

I've added stream_prepare_cb to the 2nd paragraph as well.
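
To make the required/optional split in those two paragraphs concrete,
here is a minimal sketch in C. It is not the real callback struct from
the PostgreSQL headers: SketchCallbacks, plugin_cb and the sketch_*
functions are hypothetical names, and the rule that streaming a
prepared transaction also needs stream_prepare_cb is my reading of the
doc text quoted above, not something taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an output plugin's callback table. */
typedef void (*plugin_cb) (void);

typedef struct SketchCallbacks
{
	plugin_cb	prepare_cb;				/* required for two-phase */
	plugin_cb	commit_prepared_cb;		/* required for two-phase */
	plugin_cb	rollback_prepared_cb;	/* required for two-phase */
	plugin_cb	stream_prepare_cb;		/* required when streaming */
	plugin_cb	filter_prepare_cb;		/* optional */
} SketchCallbacks;

/* A callback that does nothing, used only to fill in the table. */
static void
sketch_noop(void)
{
}

/* Two-phase decoding needs all three required callbacks to be set. */
static bool
sketch_twophase_supported(const SketchCallbacks *cb)
{
	return cb->prepare_cb != NULL &&
		cb->commit_prepared_cb != NULL &&
		cb->rollback_prepared_cb != NULL;
}

/* Assumption: streaming a prepared transaction also needs stream_prepare_cb. */
static bool
sketch_stream_twophase_supported(const SketchCallbacks *cb)
{
	return sketch_twophase_supported(cb) && cb->stream_prepare_cb != NULL;
}
```

Note that filter_prepare_cb never enters either check, which is what
"optional" means here.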

====================
v12-0002. File: contrib/test_decoding/expected/two_phase.out
====================

COMMENT
Line 26
PREPARE TRANSACTION 'test_prepared#1';
--
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1');
~
Seems like a missing comment to explain the expectation of that select.

---

Updated.

COMMENT
Line 80
-- The insert should show the newly altered column.
~
Do you also need to mention something about the DDL not being present
in the decoding?

Updated.

====================
v12-0002. File: src/backend/replication/logical/reorderbuffer.c
====================

COMMENT
Line 1807
/* Here we are streaming and part of the PREPARE of a two-phase commit
* The full cleanup will happen as part of the COMMIT PREPAREDs, so now
* just truncate txn by removing changes and tuple_cids
*/
~
Something seems strange about the first sentence of that comment

---

COMMENT
Line 1944
/* Discard the changes that we just streamed.
* This can only be called if streaming and not part of a PREPARE in
* a two-phase commit, so set prepared flag as false.
*/
~
Since this comment is asserting various things, I thought those
conditions should also be written as an actual Assert in the code.

---

Added an assert.

COMMENT
Line 2401
/*
* We are here due to one of the 3 scenarios:
* 1. As part of streaming in-progress transactions
* 2. Prepare of a two-phase commit
* 3. Commit of a transaction.
*
* If we are streaming the in-progress transaction then discard the
* changes that we just streamed, and mark the transactions as
* streamed (if they contained changes), set prepared flag as false.
* If part of a prepare of a two-phase commit set the prepared flag
* as true so that we can discard changes and cleanup tuplecids.
* Otherwise, remove all the
* changes and deallocate the ReorderBufferTXN.
*/
~
The above comment is beyond my understanding. Anything you could do to
simplify it would be good.

For example, when viewing this function in isolation I have never
understood why the streaming flag and the rbtxn_prepared(txn) flag
cannot both be set at the same time.

Perhaps the code is relying on just internal knowledge of how this
helper function gets called? And if it is just that, then IMO there
really should be some Asserts in the code to give more assurance about
that. (Or maybe use completely different flags to represent those 3
scenarios instead of bending the meanings of the existing flags)

Left this for now; I will probably revisit it in a later review.
But just to explain: this function does the main decoding of a
transaction's changes.
When that decoding happens is what both this feature and the
streaming of in-progress transactions feature are about. As of PG13,
decoding only happens at commit time. With the streaming of
in-progress transactions feature, it also happens (if streaming is
enabled) when the memory limit for decoding transactions is crossed.
This 2PC feature adds decoding at the time of a PREPARE TRANSACTION.
Now, if streaming is enabled and streaming has already started because
the memory threshold was crossed, there is no need to begin streaming
again at PREPARE TRANSACTION, as the transaction being prepared has
already been streamed. That is why this function is not called when a
streaming transaction is prepared as part of a two-phase commit.
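
The three trigger points described in this explanation can be
summarised in a small C sketch. All names here (SketchEvent, SketchTxn,
sketch_should_decode_now) are hypothetical and do not appear in
reorderbuffer.c; the sketch only models when the main decoding is
started and ignores what COMMIT PREPARED does for a transaction that
was already decoded at PREPARE.

```c
#include <stdbool.h>

/* Hypothetical events at which decoding of a transaction may start. */
typedef enum SketchEvent
{
	SKETCH_MEMORY_LIMIT_CROSSED,	/* decoding memory limit exceeded */
	SKETCH_PREPARE,					/* PREPARE TRANSACTION seen */
	SKETCH_COMMIT					/* plain COMMIT seen */
} SketchEvent;

/* Hypothetical per-transaction state. */
typedef struct SketchTxn
{
	bool		already_streamed;	/* streaming started earlier */
} SketchTxn;

static bool
sketch_should_decode_now(SketchEvent ev, bool streaming_enabled,
						 bool twophase_enabled, const SketchTxn *txn)
{
	switch (ev)
	{
		case SKETCH_MEMORY_LIMIT_CROSSED:
			/* in-progress streaming starts only if it is enabled */
			return streaming_enabled;
		case SKETCH_PREPARE:
			/* a transaction that was already streamed is not decoded again */
			return twophase_enabled && !txn->already_streamed;
		case SKETCH_COMMIT:
			/* baseline behaviour as of PG13: decode at commit time */
			return true;
	}
	return false;
}
```

The SKETCH_PREPARE branch is the case the review comment asks about: a
transaction that is both streamed and being prepared never reaches the
main decoding function from the PREPARE path.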

====================
v12-0003. File: src/backend/access/transam/twophase.c
====================

COMMENT
Line 557
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
}

/*
+ * LookupGXact
+ * Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+ int i;
+ bool found = false;
~
Alignment of the variable declarations in LookupGXact function

---

Updated.
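
For readers without the patch at hand, the kind of lookup that the
LookupGXact header comment describes can be sketched as below. This is
not the patch's implementation (only the declarations are quoted
above): SketchGXact, SKETCH_GIDSIZE and sketch_lookup_gxact are
hypothetical, and the real function presumably scans shared two-phase
state under a lock rather than a plain array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer size for a GID, chosen only for this sketch. */
#define SKETCH_GIDSIZE 200

/* Hypothetical, simplified stand-in for a prepared-transaction entry. */
typedef struct SketchGXact
{
	bool		valid;					/* entry is a live prepared xact */
	char		gid[SKETCH_GIDSIZE];	/* global transaction identifier */
} SketchGXact;

/* Return true if a valid prepared transaction with the given GID exists. */
static bool
sketch_lookup_gxact(const SketchGXact *xacts, size_t nxacts, const char *gid)
{
	size_t		i;
	bool		found = false;

	for (i = 0; i < nxacts; i++)
	{
		/* skip stale entries, then compare the GID strings exactly */
		if (xacts[i].valid && strcmp(xacts[i].gid, gid) == 0)
		{
			found = true;
			break;
		}
	}
	return found;
}
```

Because the comparison is an exact string match, an empty GID (as used
in the TAP tests) is looked up like any other identifier.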

Amit, I have also addressed your comment about removing the function
declaration from commit 1; I've moved it to commit 2. I also removed
the whitespace errors.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

coverage_test_decoding.tar.gz (application/gzip)
coverage_replication.tar.gz (application/gzip)
Q�3���]����
?�#a�^(
��	F_���@����q�y��pg[!r��� 50���D�M��������,�D�z�4#�B��^a�I�	qu�V��F��xU�(^}�-.[��bK�-
�Na�Kl��m���1;A�e8X��:Hti(�_������(�;����h�����7���(�7h A�Q,(����=XH����58���
$4r��G�����L�;��m>���c���Y��1|��G��(�h HmCjl���~wh�,���l�Q��`�Q�M��b�BL'/\Xbj�V�l]���t
)OHJw�,9��T$ R/���2�?y�,fx��I	�B��&��f���bA{3��6g��>�^�zO�q�rcN�q���!��B -K��H�h��� RH��1�T�0����>������K:��D�BbMSu�=(�����mufZ�E����7�z��������e�U��Rm��D����} �:8C4F8�{�A����X�R��cH�C	)��C
���m=Q���w'{�o!�������SZ����h��7�z���b�-�}��������V���BbKK�����O�Wjy�,/v�,"����2f	5-uu$W[�.�O����*Dh!q���^;>��YEaDR-?������-?�)���Z~����Z~���w�iR��G38�������/��Z~4���l�5�����k���=?�����?�}��CEM��`���,����G8?�1cjt
P���*v��T;����r7�U�@9������5z���~T��T��>��?5���m���d����D."�=#�V�e����*�_�E������z]�����S��������� ��!(���)���08M]��Y������IQ�0HU�A��z�j�x".�ER�HH�	)�$�z�[1���KQ�_H���xB�)
SaR�P�A�^���v���@�nHUtC�������������,(�l�S�R�0����S!+RY1����S�=R�1�P��"�H�$E�A�t!7����e�EH�TIz$��Ex����.3,RQ�'I��$)B���)K�T�������Q�g�S�:�LU�[R�R��9|�+)U!�LZ��,��)U!�OZ-D�S�V/�*:U���?i5�FI��f'U1;��B�����"�NZ+x�[��l�w'_�2Q-Rp.{�'U�?%g
�`'���HHt���^�h�D�� �[F�!����[��yE3�EJU,��{#1�Vl��BiM����ahRC��#���x��	�E�����?��B�+'�����O*����I>�~�|�=5���T��0:����fmS��I�4��Qo����PP�+�!|LZ/F����$Q^��E�$-����00;"-�-����3�q����r7��9�r
�;����������k�����G���6v���A2��dB��T��e��W�9v����i�^�u��P�����D��I����[�	*f(���*B��
9� �L��-n���G�Dcj��T{��I''n���^(�B��y�M9�!dM�G���
9�����H�h�;��������'��B�JI����eJ3$�4.����Fa��&(*Y���'m�5�P7����=y�VE�=��I�o!>����,A��TE��b��I�����j��dR�^r��a+��+z(�J |O���LyK�O��H8iNY
�8H�b��*�"q�y����n��I�8��~�t��)��!�N�bw�� 1��SR�,B��*B�h�M��z!/��;5�#���p�*'s���&�C�P/ik*�<��FE�1!4M:-M�	o,�9��s������� R��IZ&m�lqd'�����M)^	b������:�|���3��Xx�=��-LY��������8��hE�<��6��DGx��V��Q��1�z�}�����}ZGh�zM�}��|��*�d��^��u]����������|Y�u�g ^F"�������_s;]w8's���eq���kH���@�Y]�n����T�U>��zdn���
�������p���#���c�]����|�y=��tM��N��n�N��8J�������T��MK��Y��RwW���p�O��93�Ow9e`�Z������f6@�=~�GG��|��
O�x�|]vGb�osU�����QX�����E��_�>T��=��IK�<��.;l��TzO��?�,�0�[��>�f���I)z��m8�e�������\m��wRa+�%������7�|<��u2y)���M&Wi��6�d���F)����'���,�\I
S@��<^�����X�N�%����1��R/E�
������s������$��1��e����������S��c
�
�z>��Qqi��+�K��DLR���51�*`A}���*�A}�M���R���K������l|/�h$>���i���f��3�fr���~c�G���j��[Y�z�:��4������]�2[)	���'\C����]g&'�D��qn*i+;���(M!��F���O������\uP���{9���VW��7�t[��#��R�}z��^T�b�������1�R{�R�sHi�������5'.�@��Az�����:���N������'��xFn�.oQ�����K�U?���w��V��[Z�-����e���dq��9�h�<��Z�`�8�<��:�*�?�qq>@H��@���E��8=P^lTz�<�X�����k����"H�Y"@H}�u�~M�7i��w�}�-G3e��$J��j@��������KF��i�1yFs��,�dg��@ZU&6����-��B���4���D�eW�s�M4�����6�U����|v]�T��(&��������*��s��#9���c(p�&���\�2���O�W[�fI��ke�9�CS�p�=@���L�2�AH����U�6��F��A������c����PJU#���5e� U�f�C}�������>���P���}�R�������T���e����TZ��AZyOs�%_Q����
�7�+����	����0P��U��W+	�P�n���^}����
�E�L��z��P2
j��� 0^�U�X0N�76��t�Z��	�X�����{G������8et������w���7�y�_v~���g��<�^
�fnS}����J��H����W����\�����99��v_d���\�zU�\O�,^.���"�����!��3�o*b�a&�r���35�G�[�]������E���Y����]Z�����MLe��yuw�U���L��iryT����~���i�N����fiF�;��
�����5�b�w�N�X��K��}�������wZi�	�^��z)��h������fx�_�������������!g��;��v3�+�^r�_����`���0�
�]T��?��w���t8X���<�d���������}P�>�v������`Mg���4�������<�,v��O>�|}���+2��"9��!qq�VGV���J��(H�(��jlI�2u(^������=�i��qB��}�(_�E�Z�p���N�W��P��$�l��I�U�
�T��)+����2������N��^���s�Pr>�cX9��}�|�w3�|R�Hyt�p�4M��5�����k�7��w`Nk]��~Ei�w�������`����q��p�.���	l��P���^��~S�2J�_��xU�eU��(�(9����X��0Q.�����a�!>a��^X��&J�(�����d3`��1�Fe\�����j���^��g����Cx8���`��l�!q�J�^�\��v`�`��&�4���x���7�?��8�=�����{����O��
�<��+MC�������E��l�2S��E��k��O�����G*�<�\��t8��
�	��e�	�$7zR)�v�����
��\qUg������q3��D��Bn����QA�*/Gg�I:����h=�*�U�����4e����.Kf�xR���J�V����:U�)���1A�{�Y������>��f�����V�������*Q|�\��8�]V��XG����� nv"E�KZ3�* ��|^�af�@��5�w?�3MC����EnT���fKl�
��|Y��1�u����0��������a������R��	�j���p���
�������@����y��9I���xDn*	����q�3KI����S�r�����m`���]w�,K^������`��v���J�����������C�.��<tu�y�5��<�lf,G,[�^�,�7��N{���9c�r���-n*��������F=E+����O�+���lI,����7�8�`cv������=o������7���@&�l���<����� @!��������F��A4Ko�� ���`�-{[tp�.pV���f`��	��-���Zf��A������*�M��h���\����mN\��g���si��q��r�RZ:����s������r$o�9��m@p���^3'����J,����)v���Q��q���i�8����$��BK�X��X{l/�]�����0�-SWP�V#N$lk�b[���"��0�D��r��J-X|2��(j3r5�j�"Wig#��a���z��D5�)6%`k���f6@[84}����A�A�r�Y�.����}���@SM5�
�&ua]d%�QA����:.��B
8
������m�no�/�o�-
���K��l ���O���A���:��{��<���o���o� ����5x������,{�-$j�P��.�S�4
8������YV��`�.���3r-P[�������<�L�~�l���@�.�.�a�&+[��j�b��
�p^h�l��x�Z����������B���/��z�J��:`P�A5��?
8���p�A�����
8o��"x�����:������I� R����Q��3S`Xw���h��T����Ju9�����Iq�9#W*[tU�.&����{:1���9
X��Z1�4��S{�h�`]'��+��,���"t���
8L�\6@4x ������~����:�������?�� �-�����sB�i����A�����A�����p(3x�i-�]��wy���sk��2hV�V>j�
0��!�^
�2��M�����`���J�������{���C�B���OZ3�
�w����.P?��:���$�y����+����h.|3������
T���M�[��M&�\�t[��c=�l�6*�y�2E�Z��(���b	�j���W���_�F���_�K��s#��K�	�T���d4�y�DV��O�e���������,k('��oS��G��|�<��f�q�I���k�g�7O^�|��wM_x9�\?)�
�ML���?�����W$n)��5����ur�>���xg�%�j�����7Onf���^�g��y:���U�]�/����0�D���������������X�<�F��~�����������B���������ttO�~3���_��|ZT�l�W����4��*���*���|2�__���T
� ��vH?�Nk�����{��_����������-�������y��+�/���:�s�{����������]���g������=�d����{�^�;{��/���V��F�_��>!��� =i$���I)C���X.�HK�c9e����^:!�!�g���	�>���<�S������M�\��9?{�-M��X�L����mq�����9^8�?�c������/���i.�[�LeH������y����q~�3i2���t�<R����#�B�AP/����k[Z%������Rs�� �fns��,�"�r���V���|���W���_�7+Y� k�x��8���r�,6O��$��a9�����J�|��l�d2��(#��/VrL]�Y��oW�����O���}���<����M��~���_��cQH�V�(V
|��W�����q����<�
�I�4���G���������g�,C]9�Zbf��G���lgr��t���e����ej���9=��PVC2����%���/���9[v^���_����y����;�����������<����J��Zo=��wR����;_���ki������������Y]���K��M7d&��j�7�b����_6k��*Q��k���r�P���(�c�Q����D;�����H[��q�:Y��+3���
�f�aO��>[Z�R��!�w[i���N�D�YEx;eMx��
�[��V�j�-�p4sb�.^�f����_�Jrx�����{8_�1����\~m{X�5/?����foS�d�y�I��#��k���(�|?q��b��T����Xz�������?Hh:2t�������[w ��
Lw,�.�wXE{����U`r~5&{sQL���k��	K��J�������E=���o�'�lMa[���������r�����Ng�na�uZ�^��������C��-^���8}�-���n��}�����I+*���-c�������/H�����������_[������������j����������6���N�C�	jv���Y���>�������m������6��`��m������vL�S|������-�����:�n���^���v�I3Z����e������E��z���w�m��������v;���d�jYL���t`��*���]�o�������7��<�]�Cv������]~U�������]�o���.�=��{;9�/w��`����y�����A�,�,��'�G�y���Tm4})���y<�\�w;��u\/v��$���)M�-q+�,mz8</�/w^�/k�Y�2�����������8����n��]D���8��D^����:��������W��tz�Q'{���u�-�oo���v��/�v������8���jv����'#��[b24<t��8���������\0`��4�"��W$z^����O�bQ<�Q����r>�E2����d������p���� UG�+k*����Y���q�'����������|J���<&�gS����J
X���MF����7c����b�((�,��fh+�K�!����f��HU$�Mr:���-�7z�%��^F.�P����>��:�2p�M:-��F�K�cq�����p�w	6|��D���������������������E9o���AJ�x����v/��.��L�'{?�rw����]�+���g���S���D��c���w~�{"��O��N�/)@!U��o�*t�4Yv�Q<��qn�'`���_F.�Fy����F(��5���s\z��H��{l$I��r�����6�=�d��es�_���-��f����u�}ww��z2�f���Z��9���A�0�,b&|������1�4&��/�x��
�o���=��'���Gy�TJL�ZQ���Q�!�y?�S�{%�e$-C���	��?��xF�*&��})��H�I�1�ycV�H�h�,��&�m);����fY�4x�gd��x|%"C`�|��/~�<d��s�������A��d�e��DCUl�3L��=���� ���XV�D��c���\�~�f$Yu
a����[5��������@��V����K��a�k���&���m��m������+�|<+NQ+�s���I+���r��J����)���UfY��4����me9�,�M�����k@h1��3�'�r��D1w������G#5�t�
���A.:�L��/}2=�]k�
�Lpa��\�i����qcO��lx��*�a�d��$t�1�3i�o#��dl�1�}j�vu_Z�T.���l�<�R����������y,��%C�[!o]����,3��\�����J��o/���?�W��d����]9���������[���;6�u1x[�L�2Y�T��9�������
-����H��I�'�*�o:���wa 
oO>GYB?���l�{q��)
pw�45R�����BO����0K�)-y&��g��X����>OeE�'������\��]�I�(|a�MO�n)B�L�.!&r&��Q&��eF?�>�ZW�I�������d��C�T�3�G����"w?�����^�P-�Yk4DO�����i��&�G��rNR'������h������G������	.h���,\ ��\�uP���Zp������r�/�_����i,B��k9�&�h�>N���r��M�-]���O������$��������B@�3�E?3rj�E8�r<���u�T�m��������1r��E:3r���4So����(�����?Zo
t��of�T��\��Z�3o
����U��L"��v#�� ��h7��Kp���p�/�Y2���b~�Vv+w)�j�m�E����O*H��xf��,p�<Y-P>3���f4qY\�erY_�zf�
��~_����3�@��\t�e�r���L���Q P�v����\��J?�,V\?�l��AP0����=�����Q0�Ff��\(/Z
��4��$�g�N��^��b�]���3
 ���pF��'�bj��������J�7~�`�v��4p���\��s��^=��Z�X.��0bK�P^��-��W��{������*�u�}y����A�P�Kp�������%_�8����r>���^{����*��a�^JC����tJ���C����i10\0�4��8!�f�8@>�$j��N��^4Nd�Oc{|1#`=� `F>����������������}s�"y	.���V&L�@0]6L���g.Z����%��a00��%�E�����i�K�G���yS�:a����(��U�iq���2N��l��(c�QF��I:��W���X_��/(l����"�R�b�Z��2�2��1�
1%E�u5)q0����,fJ��k� ��\��h�
�g���>Qg}��D�4�XO�"dN<js�4 (��1�}H�(�W'�j]8�G�F{ ���,��"]y;�j����9����sR�A�����HGp��u�����2�qk���2�q+��E�2��z ��������(������p��=�E\kDa�#u<�"��F����N�r���<.��(O�]���8�����w�������"���8�N��O^.�1:G���8n�v���m|��II}�	O���=:������I)���*%I'RtB;���)��x\4+)��j0������,���N�u�E��~������9�Ed=�|T���76M���B>g%��"��<�k�E�#�r:C�6J�w/n�����Wn�o��w[Bv,��'�3�9>z,�����g�^��1Q����zh��=�w�,�<z������{pp�Wtn3w`j8�q����~_~4%H�� �O4����l�����b#��X[�f[�=34�Z�1�O�����l����������3�N��{*H;K`0N����� ��G*Tp��7�/�l/�J�B�Eo,_PL���tf����8�O��-f|G��K�c�T���H���2z��8��i}n@'9<���$�I��O��1�`96]�y9n�{p]{��~�������V�����	"�=:���������|��
��������y��<�����5z�2���XaM�t���eO��2��<23����bf6@�8�Kj�;uV4��a}�����_O������f�@�8����*[���BqXV}#���*���{P��@��!�*�x�q�����!P��0'�����spNg�'��r?��Z<@7y,����gG?�����3�*p�tR6k�U�D���V����N�J���bq���0s���L�N��O��\���2N8QRg*���-��<n�t���c�t���(���h�F�zV�@�8�����oG'�T���/�3j������i_�g���c�}`e����t��Y��������QZE���']���GS���|��q�S��������|P�D�B�V!@�8J�w��=�WK������
��R��w'G��Y�)�c�tAg���������gG'�K}I�q$�.�`�m��2�X��*���t��'GN�~#�m�0@�<��y���6�c�0@�<����mw_�k)/ �<����-r�t@RI��?��<8�_�m�-��<��*kCwKS��q�S��w��%.�W��c)(��<s*�[.d�V������������v��0��=
�Ty[Z	�'�C�t��?^�J����k)&��<o�y}Xdm�%`�<�y�Y���k�%��<�j��J;�?�k���]K=��q�R�����G'���v-
P��AM:���2�������`u�X�
0%�����^W����a�X�Ir.�7+�[J�$�����d������w-U���K+y��n�(pH�!���H@yET��(���Df6@9�H�#�'$v	X"�c�h���b;�BNl�!ce+�z8�?��������B-�#3��`4��W�i�	�c��u%@�<�'*�}�R���%BM���"#�
y7T�Ux�����T@@y]T��4��D���Vq@E9��L�d2|�8���8��cdf����D^����U��x8�i��:T&�����HY�=�h�2$�����|`�����.���Q�U�4�M�����#3��R������������3�9��f}�hdf��^���h@����_��du���QF�=��xv����]�R�$�C���(�Y����T7��Z�sh���[}	2�qk���<�!C���0Vgp����9F6���9��>
�7�j1;�5~3gb> k|��1�z�!4f6@U8X�z�'���d����5~3��t���k�ftM�[�@���5>�k|���p@~����f	���j4P�'�d|���3v��9��(���\�2����g]3g�[S�����M~<�)l�4)���Y:��|���������j}��(���w�Gcf���`V:C]_��!0F_�v��: �����@�R��~P+2��up�0~3�8���0�M�@R�Y���� G��~ym�f����L]g���2�YP���2�e��?�~���#��4c�SP��@�:,�Dm���� ��.H�#i���	�Ud����H��� P3>G�,Q�������l���~���`���������G�<���:ww1@�8����������# k|�m��
��%��B����������6���������l$zyC�\����2�s���X���jZ�@]9.�i
�����a��42f)@�YGDul���oj�@k|�1�Z���Z��������bc���M����|@���|����9^ficM�(��d�V��\\�	��V������c��)|�@�X�Cb����K���@���["Q�F��1�NF�bt�(z����'���v6����X�%�wM9����j/�����Y_D��=����'3K������jW?�s��n�jkP6>G���E���Wcf�u�l��������g}3k��B�l���N������@��~5�*�BnD��m��b,���w������W��k/n��o|�?<:677kD��BvH�e����
����|i���R��+�@�9�gE�b�t7ei��p�>���b�m�@�� {|��q�������6����9���F��_����7�||����Sg#����
���k/U�|\�L����<�I����a�(��
*��>f6@9���������YP8���r]�]���<�Gu�T�u*TW���9��������wa{S�u�������<Ou���b�������Hd}��*����f#��g�
�\�x�����)�C4�c�h�������^V�`@�\AlL��WJ���_�s P�����oP�4�	�9&�~����!?�*��V�	�p�r�;�fq�ms�~s��l�L�sr��O*�Mq4������c�V��1���L���,�m��Xbi|7����(�B�o���U=`I!���
A����*@P� (h
8 �����b�P������5�@<W~�^�S�YTU�H�H�,��(��'iN���K�^T8����I�\jQ2K�q��xT�5npRuJ��A�)m�]��Y�����������
��?���jp��Q����6*�Y>��U�`:��q�v��@���`��5
#��~����j/>-�����[q]�n���x�!�)�?���f}���n#`��Q�H��[���XuV�XKcUkLW������
u�bS_�0���j�-���U.^�0K�����[�Hq�����,���Q]�+p�^�&��-3{����#�w��u�*��/k���������;R������l�~:�U&������}�i��[Q��c��l�fr�9Y�`L&��Oq^x&��d������9�ZU����4,��03���{���9=���7 �k�]���,���p�@A9BL��E�b0c�������^�<�S�,�t��=�H2*p�������J�3�=� 
���Y�f�%���F�*<m�����w������=��9����e`�9Y/Q��N-`95��
��l+����J}f����0��s��-�x6����v��T�T���
~�����y�lpz���?=5e���(ev�|6'��"�����(����x�=u(��d6�%>�)�p���n5!tU3�(����V�o�,���T7��]�������:&?����6;��<W���fQ�@�<��:�R��(�������������\<�{����e:�Vn8�����rq�m%����lS�R���N�k�O�>�����AP��C���%�S���3K��rQUL��N�(���(�&�8���)0}D7��crNS��t6Ko��,~2�}�\.�d��{�.��O�}�2&C����������~�<w��>���y]�c3*5@��+,!�%Q95�8��Q�P�
&O�$��F�H�c�JY{JV��<���3S `Q\��d��cZZ�x����NMe?����#1��4�����h��v�x����NC�f�#�N�(�V��uB4L2Y���N�F`��a�e�`���5s5@���v�V��7�F����V�o�,&�
\�0\0p���N����0���. ��������z��9[��>\N��X-pc!�"�A��T/��#�-.�_�F����VD�����*��n����5��wr���G$VW���;�
nX����a}v�`8H��Y����xf�b�
e���
�0`1Ba��u�k/�yp�!�I��b���b�B�L��b�C����e��J��H�p�J�R��,1`�D�SQ��w)������*qC���&`	X`��M�Y8f��M�]Xv�>7��llC�Un�C'���4sC�0��B3��|hB9���sX���Tl�
�
�c4��6���,�Q+�����
0�A3�c��>�uW���A�Tx�tgr���MI���������5����!G.j?_��5cx0����a��V��#���t�K��3��fn�
9�P,�[����vkr����c� �!����6�`�bcK�U����v~+�����gwq<Q\e��}�������Ym�s�t�b��������v+�)��X���b[�W]C�!��hf������z�eYt/�Qv��T<I��7���t<�M*]������MM��@��*<�8���u�Z!�C������Y��v�~5K��b�J\��|��U��/D�y���5�K�q~:����f\_�,$aX��e��M�wt|18����
C����gf���TT�"R��Q9NoEB����'�����-wdyT����7���D��EI���6��hEC���)f,���S>$q��TiS`'8�O��fR��AA!��!A�E�,*I�kC�(�*W����s4�E���n�?B�D�LQ�U����(���`�F4,�I����%���-�~�|2
@
C)�4J5L�<�
q��N����s���eZ����}_c�~�<��q��R��bC�%����l�6H�N�E�R9.m��k��)R�I)�^�F�f�l,G[�i�hR6Yj~C3���1�y�r�c�k["`f9ps)����MU�WKa���&��!n����?�4�3Y��������@�!}��S���5[A!:�f11C�i�,���u�:�Z�i�<0v�C6K��{G��g"��?�PN�$��d��w�gz�$�t��a���8�%=b�RS`|8nS&�hiM�Fk@m�<���K�*4s6��h�+o��
�\�n�.tser��.��a"�c�^:����
�3C��i�&���C�[*n%V,��|`���������]�&���dU$���S��U.0,�Y��31�\`W��&���+�B�\`C���e���\�'r����
��X�W����uE#��AZ&�U���"~��+}P����U.0C�����f��
]AL�A������hmn�z����gu��F*��������!��J	/.�����m�����N�Z_�����sz^7�z4d�%.�X����9��Q�b�R��A��O�Q���:..T|���*�����PE
�
����r��
�6C�T�q�>�R�.�X�o���@�!�oqQe�"�1������U�������Uf���P��o�
��H�'�}�M���6�}`���%�������5P`��/!��g�[([Q�@�Y~r�E��ku*��7if��#Uu����H����7eW����*��4�C���VN�kLk?7��r1�d�szyO.+�/[�PG���vJT�|2���&~���C�T�,S��hr���^��T3o��*���y�|���L��9��(53.�\��
������d�����76�^�7�G�WYV����vM���Z�o���9id�q�f6�p���fj��><��D�_6e��Z��������5�X��$S���N��
��F�+�<]�ti���9��H;[�����8��p�e�.#�Hw�G�,�m�Q���a�:�m��)+����n&0l.�Uj'0�V�0
9����SS;�Fr��%��������������~uq�U�{��8�*s\�%�� L�WXeG����2�4���V@�_]`�9h��Xe>�c�i�����8n����?J�Un�(�#��*�xp���m�[k�G����-�����q[�G~�-���[���8n�����"�t��2A�+D���+	O/T=-�����8�-��*�8\'O��iq<-�����8�����Z_������r��������Z_��kq|-N��	�8�'���!
�[Pn^bh���,nvP�-�������=�`�,~������C�?>�[Z��,��M���������wQ^&5;K-���8F��YR�_2���Vj������0���C��������h��{Zu<�:�VO���a��[=oq��J�V�/J��Ju<�7���GK���r���N`���C��tY]]-]��^O��S��;;���MW����/m�)�/�����e�P��R���}�CI�w<�C�����,q��r����j�@��@�Y���TA{G�������u���-���
�l����.D7V��j�C�X�<�cS�uLi���Xv�������a-Z� oq��
�>�]1���b��h�,	�'���t������o��G��N)(���&}����KS
`���$#��By�s��d�,6#a�������������F�t(��a�-��6�@R�x�Z;=����eB�3��4[��O�I1<�~~�wtxz���������y��N�{�g'�g�;{���:�����N��������i_�g���c���y����������||����I�-�o���l��lp�?>���=�?:4�;��������1�����Ky��s�-�t���OS�����gG'�����������'��SFr`?92})��]���4�������Q��w����s8`��c��+�K9���[}��Ho��{?��~:=�x���U�?^�����v�C��r}�d���5E�w����J-.��<>:�]tSL00�t}Y@�g�����p�X*pM7_ w��3SF0\p��������}����s����3�������[���CwQ`1�D�a<���vE����@��l����0���G���62��dOf��� ��g�m����A��pP����w��~��e���e����h�i>e�
�I>cD#a��$��8���,��}�t@k�D��`��n6���V�e�H��t��0b�!��2!V�����qMq��(x3�<���f�v������#&����7k���>����O���u2�6�o�3������a�Q�a�C���>��k�K��xP�)����9�4�cw��Z2��-��v����hq70�l����e�~��O���\�u�w]B/�~�k�_���F��\v�E������	K�Y��[�~E��_!�F2)K�W��C��rw�o��\.�.�A�oD�Y�8w3Z���.�,NJU�>w7Z��
s}nh��E��f-�_�����
.��������F>����nz�����r3���ikq�����Y�h��t�8[`~�)�Vu7����hq:Z�EY��u�8]-N�����G����bsQ�����m7��L��Z�C�Pz��1�N����XV�iq�Gjo5��%�w���X|9D���������onR��!-�z��
������4��)�� ���h>�m\���
������Hs��w�|���V���-p���1���:�^���.��3sz���0�
���^���<�,�Om��J�|\��@�����"w7Z��
3�i�u��0���uef���o\�+���lX����a�\�Y�)
PK����u�ps�������;�n��\�hq�;������/eP��4��ti���'�[y�������vS������V��?Lq�	�.whi�A��>�*jP���q,0%��K9��f���yf�lw�C'
�/r/��]�POMJ94����
����Tk���������a�[P$�u�n���7J��[)�]�(i9o���S���$�RbS`(�%��\iqWA�l���p���]��=>���,����P�H��3)n�W�S`�W8���*�n�q6��q����9�Oc�(OocwU����(����Yj����
�mM��b�I>yeVjS7��-����
];���-���9���-pqb��0+�8�00���W���a���8Q�\|�X`�u�WaK��}������D� ak�ZZ;S�"a�W��-�>�.N��*��"��������
'v*S�E�tW�������m�^��[��8Q��+De1	,��W�l,>����}q����"5
%Z���L���v��\�h�/N�0iQrw��pzM��6�Z�l��)
����	k7Z����l���B����Wy���~9<��|T�h���������:�����7��������M�y%�1�Q7l�+m����r�Y�<@��i;�0��W?4Kt��6G��Z��~����l�Z��9�r��Do����y���Y�9�osp��9��osl|�9�������@�+7!����)0NO���N���Nu:S{Kburc]�h@������l���>���B��Y"0�oi��`of4����%��~�v���ot�Eu6!��LY�1p���4Of��xK�7��L�E�cHU�����~�������Aqw��#��o;��r��m�T�k1�;�R���,������os@���z�X@��9Z��8so;1���C�����`�Z��mS����8w��U�>#������S]+���r�`�,��e������[k�bB��N��#����;`���X����0��r�`����a{�,*f��j@�S����Z���>ms�
����_�>f�"w-@�3�F�C��P�m��7�&�#�]&G�;�19��os��ir��v�g&�n5�fi)���<����T'[��p�
f��a�V�$��m�fWi���Uq��!4�&�G�UZ�*�����J%��3��dZ:�6�W�
j����*��<�6����\y�������^��o�L�[&��m��6�&�'�������89 �pm���� �:��S�6��<�,S���>/���m�+�9^�ec�Y���X��9��4��n�<�J;+�~]�b�#��v{V/�(a�"����ZN"Z�_�2"��m���"�2U*��b��j�-#h����{1s�_W��0�l�}�n������3�&�U�����-
�)�Z�e{ke<Y���+6�����`�����{{�D@�9�>I!q1K���8�Xn���s�q��u{����4A��XJ\��*��9����A�'.=:xKw+05���o��%���(�(�Z� ��6�$����Bl��`cs�p�I	�n%l����?Lq�UrR���V�������
ns�p!����!?Lq�U�`�B�����[��;~X�h�6G+������Nu�(,poS`�8�Y�r�x�j��~��,le��eJlGI[Fp�m��6��������F�����Hu��,�����.�'����O6��V\����N[���JT�9�����ns�t�j�){�Ic���'�.���C~S`�x6��v�K�1����p�_)du�o�X�8�8��x��k���B;���8�:9aq�"�:���4��q��e+?���q3`�8|����^���3�O�'�2s��ilj���q�����8����<�����?�]X�����**��n�V��p4v�����'�=�6���^mf�����b�������' P,{�m`]������Jqo�N�-��L����zrXs�
�c�Y�:`�)
�p3��
��6�>����������������*YE�����6�$����X��������o����X0�j�������`�mcN��Vpi,�.~�1��-9��p���QAwWld$0,nb��~�9��>B`r��������
TM}e�F�Xr��f���UY!��?��#�=7�\�w���1����Jn��������������|=9�=u9p����x���s{1�S���F����\� ���n�P����M+�{0����bV�Ve��mN;.W�_��v����l&l���6�q[�GB�z.����8V1��:{�p�s������3�w�GC��~�����'~�m��6/�6�q[��B����A�f6@99��Z}B4�����k��'gaY���8��:��������`�;�lf4��+���+F����Nq�1����Q�
��Q�������k������N����Cq~�?�W(?�5�����V��o'��Y2�]�%����Z�*�w8|�����>]��;����u�.&L�P�����Zc�U.Pd��'���/;����e��*v�V�@�9���8
x�N3����v8W*��M���h���O��s2�G�J�4��������h�\&�dv��Vw��v8�����
�
��5^�:���p�mE��6�����pSll|N����i��n�#t���u���8]#��v8��h.G�VkT�7/��,�Q���O�bSH��r!Y���n��A���0>Ym�9����k�Bq�N%Y���q��xCN7���p�}.J���c5�6�dV������]:����R����c�wg3U��f�8�����*��`��N��`yx�q+b^"�w8���/��xa3`9���1~�|�azi��G��.����hN���Y.��7I..�������:�w�D`�8��7}Ek������������B33G��.�4x��^a����i�8�������/��-��_������QL-�����k��k�'^�"����^����]�^\�u1
����)�d�f�S���tk��[����<�X��� U�W� R;��[=����6%�Qm�����U�U��|��X��u���:*����[�+��\�������@�8@��F>���4�C�Df�#2�l�bq(%��1
�������Xp��c����f�i��� �Q�c�);NiM�������T��`!;<Y��/_h�^����
b��c��:�a��eX:�������b����2"XaN-���@��s�����geDP�����HEg�n�+.�0`#8����g�4�;g�p8c�%FsC
P��:����,~*{{a�����z,���V$�\�=:;;�(�Sr�B��P����]���Xu3��k'K`x8\��i�����fhe����@Ov8z��:K�Q^����t6KoK+��|�Z
@*;Ri4��p��c�Tu�I'��J���Ne��a��(+J�3o�,|��������q�����
���bb!������`8�R'�Y����8+K���>[R2GLvx`����Y�o��dZ�A�f6@39&q���`����ma���dnX&��1��=*����3��; ��V��$���h��-BX����nK`A8&��Hta���l��7�u�l_�c�����4Kv�.���.Nb��Qx��Q���\���uZ����s#{1�_����5�s�.@���gf�6]��c�H]�u9hO���gn����������w=��e�MS��w9��f>n��r��I�\O�����)I����Q.�h�{m���.��T2�nr]�����.���E$f���r��c�?��@Z��U���R��#�A������~��x�a��)�H���6�BTmg�A� ��P\/nB_���KKc)<��sS5{�MF7�h���}�O������.�R25 �]9����
�'�����6/�g���ST��O��B��pz;�S�JY�_���P��o�Y���tpj��n��7���e��VY��.�7��s����%�e����F��^qt��u9C�����h��G����P�������|�_,_NhL^~�W�<`8&��*�*��g���U��H�f*�&?��G����t���xDv}�Pb���Pb���y�N�gi4G��4�(<O
�f?g_��|+�$�$'���<U�zS�������,���Rd9�6�4����� �]Y�,��L���.�>U��d}~W\Z���?rp��=Pw�|Qjt�C�����E��>�5��|�<��<��j��Qsf6@]9TNy�?�?{�?>���j
� ri�1�"?����-[@v]�[��)���.s�]��,_���1���Avutz�s]��3�z�Aq�5?���h���Kr;���Y���T�:K�h�f#����#@����)
�S��Dw��y2�m�Pru��r��NE�KG4*�s:�3K��,�}�r]���w��{w}�5w��Y��q�]���O�8�^.�,�#��s����p����$���Q�=88�����x��d���g�S��[��i
���y<�kG3���n���"oO�x���$���q&�@~A6k2�3���6�E��^�������#3c`	xhoah����X��{]���4��r���
Po�3����v�z��<�k����i���/�������v�k��������/P+6�2���>�����������]��0{]���E��;*�R���������S37�]�'����*c+���6(�R��������|���?83s���y�*��\�r��\}j�h@�u9Z��F�]��ZX��PNV�E@��<rWl�R5[���Q��Zf�������d]���<6_�q]�G����+�r]�3w�����7R��!�m]n���q�6C��q�����x:���T��uY�|0��?�N��O���k]���X��������^��j�4`��S&��^*E�6�d���r>=�*4GK�Fr���+K�[��4����_w����)PYG8t-A���.����q�7���g���(��Q����1^dC�@2���i�xp�.��-V��$�iY���2��U	����7���HY�G��=�
��/
�`r4��Q=�j5��.���L��`23�j�����8�3`K<�m7f�����X����.�������r��
P������@��E�����+�~�?Uf�:b<��������)�uD2��1���{1�Lb
:e����.��M��c���N�5r��7q�U�c�dF��D|TS����n��0]]����x�%9�*��,z��e��$�r��Wk@�{L#�����@L=���>��=�dB�"�i*�O�7������m�%
7$���F*e�H��)������o/�E���TA�S��E���������m3z�o��x�E���y7�}���U��8��c)yWRmz+���muz� �w�|<7��X\�*"�9�������j�e=�-+��r2��n��k@d��Y@d="��g-��}��s��b��c��hX�G�(U��zUl�8���QRMU��������,�
���_f6@Y�������fS���R-D~�����=�B���;�;�%j���znzK5Z��r����f�@��`�2�-;g�oMh,��v����	�@�Vg���;\ T=�P���U���1)��R]N�~�l��z�H���:��z�@�:�9��hLA��t~}S7���<�����j���P��6Q#��@g�������E�����oj��_w@�����V�@}]�T�.�8�det�YT��z�LZW������q�T5�}P=wt�e�U���5
���;j�N����5� TA���S��<��\� ,��N�|�Y����&3�f��1�wF��/�b.�b�/�����J�~��`�z|����z��XOhq���C��z�d�>`�z,�T���R����l��p�Q��L���������-�|�S�D��(�G?����hG���q��Ms���.��^3�'��i<$<D{��P^-�����w���N�?`�z���r)uz����,.��-uD�-��w��vwmM�@�W���^�e�l�v�o{�o�u���2�^���d��9������j��{���D�d��+��8OE�;"e�9P�j�$�����L���"����LE-*�/X2S`'LW���*�.0��Y\7�4S�������f���p\���*��`���X�3���z '��p�����m�E���Of���X�
R�l�r�jb�`�����C�sGm�,��..���g'+�w�U,�;=�R�����z�������wj�~.�a�V�������hY��z����q��=9/�(���l �[i^�����8��vu��(�J�Mh�@�z|0[�f�����<)��{`%8H������W���V{��q��z���Q����rl��F:�#�^�-'��B��4�L���������^-h����E������0�Ofu�����q���
0|�[}
6�g|M��4�s�L>��z&,�2�(���2��z�X��tR�]�����G�K����T��Q��#�!)�u��	�d=��:����XW��*M����g�[Y^0��"����%�r�����Q�t��lSL^����~l
,�`[��e<�������2�
iN���s���V7�+-@%���K*L���StG#1��;���j��[��X+��[��3�y�<7'C��q�\�F��nS����|���HU����_]%��d���X�P�Pw=��[T�>��T������*��X�I9��_v`S`9z�HZ��\�i� Y)yrKw�(e����q~E�}�4Q����>���p�F+}�XU�++�yG%Qo����y�-
0�	X$����tN�l�U��B\C@�8��H��!��l�L�C��GK'�@�m�Q2����m`������st%�|�Q�%}64f,kD6�4��rrG�j&Wkk�H�$��r����d�,�jE&z����x�dE���=Z�������E�o(gxT<]��"���$��*�
o	T�X��B��=���J9�3��G��R��b�������}X.��m�=�TiC�����	��������T�JUQ������e
L��7��~*_���.�g�+�����~EY��$Fc!�'7�����O-�/p[���]���X���`�Mm���4\}C��z
��n7�BG���<Rw��I� QY�Lw��c�������==YW�w\�;{���f^���#~���TV�/z��Z������z�X��x��rWg�}cL�;@q^���p|���R�����$����40������t�9����B*����!��$����`��c��$���/�ij.�l�s�^�����{?���F��lD��S2U6�*��B��0�����;�`M��2
�,����ju�Vw=�0��+�p����|�������D���.����M���*��7��j0�_J��M�K�"j�E�Z��T��������^_�6Nn�������v�Dn��ww*z����|JU��p�{{����DN�F�Jd���^��������*�s$aq����1�[��������kA|��M�t�
��Mq��P�/�f/�h!���W5b����,NF���TJ��b��#DqbS|^��.dyN#F����+[�]���^��v�{��y�`���t���d�y=7�J��f-*����q��|�r��@23`F[Mu��������V���`J��%���9�����Z��.Es��1����(�C�/���,����5��Q�Sz�WRZ��������+FP����n���7���|���L/=oZ�����q���YQ�j�^z�$c;[`L�kgL���3v�Cs�|��I���Ba�����Cq)Y��
��s�[���������8e�<�j��t���(��a �B�F"|���r���B�n�a^5[Y�t�es��G��w��in���7)�����E}��beG���~C1��]X�8v	��;$r)T� �hO�%�C����h�#DCs��*k[��	%�N?�>�JA�l�(5=_c�rs���������9'�5��!iz���{��$��]���MV�s�l���P7fM�]=��r��iz�g'�nz��s���(|3?��{@#y�����a��(zu���p3=_c��&�����Wd~#|���j.7�L�������e[�%w�K^��VYm�X���2=wKb�pm�ve}��0S*��V97�L���02=o�tn���� YuZ�+o�W["*�d$�+���0����L�mrWn�t����(�E�,��^������rT)0�YM�qd3���L��	�Y������g
�P�w��Y�B(�3���7�L���n>����U��SQ��O�)dz��u��W��{K\q�TH^L(�M27��<T��C�>S����g��	]�/�:�x�Oh�w�L2=_�Y��<�\&z�+w(�'[�������}J7�L�q��b]�m����1=_s�M�sW���=�@��k7�����
���p.�h4 u+mU�9Pk���4�\���������o[17�J����e
���\�90�j7[J���(�T(=��g����$�M�����)��8�4�!n��4�'�8&�FA:DE������2r�����Vsc���a���M6�����K����^��IY�O>�~(�Of�����L��K�*J54��A_�cJT�C/-c��&�9g;..��6n,$�l��������d����
0K���VZ-S���~�,p���T7�����f�pC������9T������m���SS�n���7�-7QI�/������/��o���C�t��{���w�����Va� �`��+R%���<@�~���;�(`9zkY�r�h?GTZ���XH+�������m���[��a���������i
_�=�,i���x'���m�Z����`�2�iy�E��)�a7���y�W�{���B��qk���+���G19�d�����pk���~@.0����qk��1{V>nM�8�����i�U���,�����r�+�>���	�89�����c��(���8+�i�f���_����#��B5��!7hz�L�<�y�fN"=@�y.�Mk���-����Vt��LNSn��Yyk�Q.�	��u��eQ@�\��~E��-��q\��qif>'�\8Yi�u6����O2�83��JJ	��gt���"���2U_O�L
 �<@�=�p�FA+�[M�@�������������`���x��K�
R����4��fY��.������g��8�����J���4�J�R����;��[�1_��k��<�#�e�.�|��
�����8�~�R��h����<��I����=�<�V�#��h>��Y��v�h��T�y���8����3jY��y.,���S��@u]��J�5�	���J�=��.��J�(KW�x��88���GO-.�Z�p���)��v�og4�C��~,Vk�n�<�$���Xl����h���>���ayK���TT��G���Sly<�U��Q���NGq�#�:�0,�v��4��,
h��*�m��{&�����,���XU�x0���XV>@�9�
��
t��d)[�����J�y��|�t<��r�2R�H�k��R��j��r~���+�#�<���($N�blsUlo7�zR����I��������@�9j���(&wq����XqV�t�`C��b����+����L������WV>@]y���z���m�-X��O��<��C�C6[D�-��)�Oq<%K��8K�\L���D��U��KZ�L��B`]^����1���C
���f�|=�X��j&��q��-��������U��?e��|�����!T&��}\��Y��h.+`Axz����:�{/66��HE[���r]Ad����u�����<���&d	_�����@���xV�"�e�b��h�i,��{g��T��3f�H5�/��V����h����lvTZ�Z���u��=*��l�xVn�����Y��Qak�7N���Ad���s�e{[�s��.�+��-�_���LJ��;�����f�b9+��������2��c@�y<���6�d^]������>���x���j�a5�:=�X����������c*\MiNS�������C"��ye;�v�p9����y!m�/&
g�W^`��s �<�����X�0q^��|F��e���x^�L� �<�����d��5�_�1���+��m�-/��0�J�s�1������J5�k���y���N�?�^��}����"'�q����j
���.?�a�/��=-E����]i�Y~X�����J�����(��"=����y<�����M9S�h��['N�-�����N��xjm&��sy,�,K�PF�U�������p�}�f��XHGc����7�(��������^x9����8��%��(����>Xw�T���)
��fq����9@ew�Gr����"�2
��fpt�JW��u�\Z.K`
�E��L -	����M�2�q��>�\.D�$��4�58L'�D��Q���j�,�]Z����ki�/U&��L��-a��N��w�������o�a��)��@�������_�����x[F���Jw�';\&e��Nm����r�vCOxjp[%{���	��V�����A�������&�o�����V����>`}�_��IW����n����0�>� >%^���$��P�~������O�(<\���'�����3������v�Wl�,}]ZU�E����]�����}��p�ek����y&�;�IR;����{���1��%�������^�r��P�����H��F:��Z�^���;D�.���������=h�S��Z����u����_
�V
�C��v��s�vh.�|@a�>���1��� @i�>����d>�-��{}XL�g�x�	l��
9�b��cs�I�
7�~}{q�?��h�����0|���2���*N�1���rk�9@��������d��UF���M��� �W/���H�"�1�O���{��T����a7�����8����*��K�Z��O�E7���8����hM]��s�jm�e�u`�8�uy�M���s����-��l�
���\n��>Y�����
��H`QB�>�^}���A�9�z�d2�������~</�kQ
�X��������'���2&@r��)���(��f��=��~�����N|����'Q�o�N��.
�[�w|Xc�I�7�/�����}�O��tX��J�-zxyPW��+�s���0a#R�:����a����]��pMhFu��=�fb{0
�����F������^J��O9�E�s���h�]!MLp�D����/�����sd��va>�d}��-%5��}��
�W���~}��h:��i���qY:���L�u�9��8���	]�0�J|@��Nvl\��6�.����N4��O�1!]�$�8�uq��9/�V>@c�����XV��^��&�X���Q�k�<�s�xgqqtkTC��\�o��Vi�J����������x�e�������,GCx����EaN���
��!3�f���Y������	mE�����kErEu:�����o������y'��m����'t'*@\�w�so�������X�d����[�!c��t�V��������!�#Dk�0����!?�$k��;�m~j�X�o���'�rZ��t����`752��<*Zs{���7k��A��V��J����s�����������w�9a
�OWM����ml�;9���;��I�y��\�O��������p���|D��CM������V���6�����+M�l�������b��z��[��A�m�oE��_f�P�/������� �I�_�����m�]q���W��G��>���yQ������|�1���d�Q�K�TH��h���=����l�������,�kJV;�x�w����W]�)����}�db�����(�����>ZF���� �����W�E��Y���Y;��+�2Z��]vV���9��V����%�b�5�l�/3m_-������Q����A�ky��eu�\���F�Clk�-L��U����]��c�&��Z� \��N$}���<�J�T��������.�^��Y�EtE��W 
3y�H���CK`�p���Xg���;R��EIF��;8�����t���AC��E�������KvG*�h�����M�p��I[���O�|�Z��)m�&�0�
���/G5�mB��lvQ^�K�\�s����{��������5? �J
x*u}g�T�����>����1|����-�Q�Y�u�1b
8��V>�(4t����b�.o��5��s�i�����i���������9[����>_��~�H{V,?)��D����<�3p��c�g�f����sQ�A^�����V�z�3V�ET>�	,8`������~e�3x�i�8�f��{��l������j��JC�� )��F�2�8��/j��1���b;�2�8��n�`�>#�,���|�F|P�W���'���d�~*R��*;ntN\;]e�����'���k3&xe��^��?������������%������2L�x��c���@��a����O����=�0�ACf2�d�d&��$�.�G�.��R&�\I�����u��(���-�f%'(�n�.4����(���(��d$�&��1�d�;��n�����IJ���^��r����>���k��n��f���\���V����o�O�Vu����/9U��O��t�("�<�[lMx��wk*���=qe�ei_]��mt��&;����8�{�����ACw�k#������x����&��AC3,f���V>�\���N������XE{� �A3�rY��E�7�0�����#D�{�@&�7�N��H���e��}�,E��;JY<�������@
��p�r�����[����#�2p��^M�2���4P�e�}#K���8��~{9IG�5����;��2���@0�,Z*h�VS
�7���ZQA(]fq��3N��r��8��T�����$3����:��`�H�u��/�H:��V��C��IC4�{c$'&����\w��{�#[o�8�GW��7���P�e���G�
`*��L��FzP��!J�2�QJ�7-Kop,d�
�c����+��7�~p�e�k���&�;����sZ-.��|\�4�"����4�&��r^s;���gYM��x�p���+�A`��Sz�,@��!F�1&��A���F��Z��	���F�m	L�������F�����G��IYl���p��p8���u����p��|`����w�������p��H2� �������H��M�i�LfW�a�����xf�b�uo�2zg]�De��e��>f�0�����=kp���� \,��{v��	����h����'�.���4�1p��\Moee�U���:B<��T�L+O����!��d�j��Ni� _���$����1h�t��i���.���?U� "��,
E�1���C�I���d��/h�������Y����>��������d��������n<���x2��l��G�Qk_��Y�����#&�U�����c����U>�?��9�V>n���8�����C@J�)Y�����c���G'k�������B@B�	��f�������#�y�*w7s��a��@�!E������[�C�A�n��#/B@I�.����(d����S�Y����W�q|[���e<���*2t{�t�r�B����������5i
=r���,-8e�qQ�
-���us-Pe�?�N-�e���}����[G�! )C��,��1��'C��MK������Z<��k��(2(C�����?���q�1��1����A��E���!!��_���@��C�d:��aC`2�d������t2d�P�v�=f8e���n�/0�������^E���]�T�ng�����3�a��d�. -C�ce������\wm �������2�|�
��J����2�h�Rw3�3t����2\��$�����?�2t������2t���d�.'C�,U7|�ai����L>Lu&6�$C�I���J��Q`2��I�����`����N�R]�I�.Lr55P�Iz%���1�����;���_M���2���R{[v�i����[���^c��i���^��k�!*C6D|���x]rt%%K{b����h�m�e�E���_g�N��T�����y2�I��-,n�.p��lmf��8Kmn?�����k�����!GrZ����G�2���B�C����yV�F��r���h�r�#3���h��4�7C�	��c2 ?��������9�s����h10!�8��b@q��i����Rk������9��?fLPf�A��,-df��b����?�Yru?P>���-�%
���
�����X���s�,M\g���1\g���ks�!�:��\g��p��������\'${�&�3tr�Vf�,p$����'�U�nK03t�n��g������Rd�r��Y*r����\��:|b>L��r���0<��9�A"�k��Y*�c�d@k�����6E0�Rd8$~3D-��*
���y/�����9z�T�����*
������-@��
�a���y�S��
Kr,���B���!}C�u��e,H�r4�/�2t�B.=�x��5hU%��
7,B������,s��n��������
y�U1gxK��r!Gd)�FS���_v����L(��T���b=nS���g=���>t����i�2��,!���9�Y��p{8x-�c��y�����m�~B@�Ml���{"D�W2yb�IB�d-7�lZ�Y$�lQ�PRC���������
B@�V�X���a���3�#di�G�8��a'������O���_��B��VA+��!��u����DM_���������5�Dy����"�Q�1�8�a����,D>J�,Sn.L��*���'{����/dC�]��i�� t�����U���$����e�$��H��������R���i�-��0�J�����P��$�Lx�x�j�R���;�Q,�CD^=��o*Z�W����FW�Y�H���<��p�p�d,8�C��C����V����9���S�FR�e�����Yy�2xJ����)�0��?r�p=�z��H�6c+�|eC�C�!��V��z@#{�,���|_��<��tUW0�P��M��(H�Q\u�/����Kmv��Y"�����g@��6��JMs������nx������<��
mj���[���<�k���m��\���m��j��!����^��^�P����M�U%���!@hC�e^ iC����:~Z���Kf���(_!�dC&k>80Gu5���)��s$QC@��m�KW���Cf� �j��~i(���_�6^TX�EZ�<���D��Z$���1
JFD�0"��FH������|R�������oF�0����n�:5���6w��F��N�K\$RT��m��G���������Q �=X��!���� Q�����u�5�`T�&��R#Ke��S#�eZO��)��bo��;|Uy��
)�:����M�I���LEY����h�C#��F<��uB�Pl���p�z� E#������s�N&Ut(�(�����iRX�(YAQVt6�g�����tm4`g���F>�38
�F��8���>�~�2��_��Wy�a�8%�(�4���� <��:-i8�2q�d=��,��'�O��G�|�����'T*��]<Jo��x.������%�uW���g��lL4�G�e3�2��Je� �
5��L�������*AWcQ�t&��W���G�D����
q4�����=:?���y������#�����eM�"�AF�d0���A��JQ;���T1�2PZ�B!IH���0���X����w���� #@A�y��4�j`�<���
k�����M.X�xN�t; 8,c�Gu_�c+�p3k�7F�0�1�k��b��u���q���T��s�F�|����i8'�H�!������K��-#;`#�
|��v�����+]��1�h0�����V����g4���g�$��3�*��8-����+�]�l���#�41�5H���|��14�0�)��t�F���z�v�k#@F�V;�����k#*F<��1m\K�k����C��'>e{��J�x�l=�9`������z��gdS5�_�G����u���"$F����~C/41���1�N���@#N���
%*F��g����ix��W3�D��ln����QL�Ui����,��Lc����,[JK���8���Z��������nK���~q���p$�Q�#F�V���.��mp�`���A��D��,_����8��F@��s�z�z���e��&�}�Z��t�����V[@(Fm4L��=�=���#L�>�X��h����!�"%F<��mH��(�'��z>8�x�p#@"
a���sp����*wG�L�,������om�.q�"�"F��w��Z1�hE#`�����
#�+t��2��pq�!mg��Q2/���J^�v��\^Yl��4b���n�V��81�����T�@������Se�N�"Db4�z�c�X�X�8r��X8�(:ul�[�1�F����E�WK� {��c�A�.���K����'3{4�m���z4
�2���t��Ov8x����:7)�8�kzcQY��z3���_~�;��pxd�3��:���p��/iMV5M����+��s�}ug���<�?�0�����Wn>�nq@YvZD�6��n�s2WX;���l�_��@��jw>>�����!��n�-o�'vS����F6������g`���|���F����!��]A���\��*�o�w���v�*�Dv�9� :��lC���j[�=t���[��;�G�$�-�udw���xv�R��	����c�8�3�@7;�T�g�:��XU:K`���l����[{0g-�@D;���6����v��w9��R�*����ZOO:����3

��N��������N���ZUC�%�v��!DP����5=������
-5@C;*�:���z�c�L���y^�C�`G;m��������G�6�Lc��SZ��e'�F;<4�$��h�&��"��h�CG��F��$����Kl��a��5h��A�e���c���ASD� K;�di��G����G������_�V�H�G2j�����]�`�3��M))���r*?��NS�#+��d!f �tV�WQ��Iq��
�i�cH���@;���
���0c�vl��I0�J�bf������(�L������5�[[xy��U��#oyv8���%�II��Q6�/����#�m����\��Ld���lbq���.��ND�z�b�����I����i�SV1��rstg��q���xt�3U������c��4,
 ������'��f���.u���M���L
m�Q�����/_���>x!�u;�'y���g\g�x�����d�����p�z>h���Fat��rK�B�����.s�;?vl��r��o<�so.u�v�E��0�GF�����T��������4�|�p��ZI�x-���,���oh"`�6��Z��3z�����I-���t�U�Na����N�"�b�?�I2�a������{�$�l��.����������A*�����c�p��F>���x�������x�{���7k�L��������MO�V����D#`�6=�r)���S���B�Q�:�@v8b�w�l4Y�m%��,��v���b��������b��_-�+�/���_^�����'�}:.�D%������d����u����m��<��we�"�����h�~!2��z�b�������������eQ�w�P '�E<)^}!���[������_b^p2�������xp��A�s��-��W_�����/'�|?���]'�M�gI�R�k������j~7��+b��(y8����8=:��W����~DM�Q+��~��:?P������7O�dZu��_R��d6~)�p���t��f�����l�O]b�2D�v#���u��R���oH��K���7A��o��_��ki)�1:�f������s��>�4�������~|w�}w���{�����#��/_~�|����� �?��+Z���-�����g_�����Mb�:�Hf�'I,�����Vx0��^P������Bx�W�����W_��B����\)�����bt�~<-�g�>���N����<��h�|L&��n}W�"�^y�/�������1�N^}��$�K�B8QQ��f�<��t�k�O��>�.R>���^���=���~Q5���^YmD�>�|��k�w5
��'�~�z������o����
-X>�v�8)1�8*o�}L���������������x�X��l�|�N��?���P7�!S4���������O6{��'�����a/h����?��k1_�Y�9^�����@��r����k�V�`����*���W����}�����Q������Qiz&k�t��Nv�����:�����i���O�l:w�@���b�����1�n�G�b�z��}�������n���Ap��?x�=��*������o*�����?��O�[����\��+9�A7������*D���������o��ZyKz��n/��I��}��\Y@�r*��X�����U���H�������+��X\���G����h�0�0�~��k/\Y�wF[]�v�[�b����A�~�����t��P�F�y���OW��0X�)h~����~��
��0#8U�E.~����|9g�3�'���c<u�����S7������f�?�jz�;��h��c�tY([�`�������?Et�k��}�*�{)��6�����������O���������N�����i��i��i��i��i��{��y��y��y��y���{��X6L>�NI�=�����O��m�m7�6~�6~�6~�����*(o�6$�2��?Q�����?~�y���{���?������������Yw>���FR��g�`���<C�mi�x�z�z�~�?~C��P�o��-��	[@���;����������������ER=������}�z��
+���hS��1���~��v�H$������=��7�>�R'�6�c�'w/N���^w�����0)���z�M$���.���'1�xe�Rz�J�[d�l��7�� ���r"q����������JqZJ�(�?,�����s��/h|��/=��d��9_����I���v�-���������j�L��]�G,V2r�g��J)�'G�$�>K���Xo������UT�Wa�]rWa�|����w����gW'oN��N��P��}�����[,#7
V��n�m�������]p�.vr�%�us��������]�b��`��1���r��ZU'����X�!1��dL��MrZ��,��$���YJQ��b�����1X����1�����Noy;y ��<I��������������U�*q��K�Rw3�Vh�_v]J�'c��S�U�;7�$��n���G"e^�������m����Z��s��Z.���H����2-C`�������u%���P���F���sq��9��6j���I��=oGtd����jf��T��2�B��3r��vuIQ?�ch���B�"'o`W�L�|'��^��i����Fx��.Vax��N|��
\��Ix���:��|��������o��ys|Ar�W������x����F����k���" ;'}<��I��OcQ[����9IH-`����x���<<�������|g�N9�G-`�������uE>N�v�F�cM_fsI��SM���(�\�eU9��Za�B��=L����pr����2Z��-0N�q�m:�/�L��pR��L�
�3����><�/bm	f�V�2fS�z/fZ��Z8��*��BLH>��T<yxh� ��EQN3Z�����1����0)V?qW���4&Q(o�L�iJ@A:���&%}��q���$/�6?��`v�E�c�i�N���(2����p�ZV��8�D-`T��a.���'����g^������&�I j������mZ�~�����^A`[����/���P�t�������������^o����=T�����\�x�?��^�?N�pm�'�=�������4;��"���&�vQD��D����?�s�c�����4^�c�$�4ma�5�3�H�gs���;�4V>N��)�*'t�i��r�7�����S����k�#�t
�\��r2�_~�������t~�w��4�Mog�"�^_]��E	N�J������*(jw�X���L����{�'��H��i��q�����@�Vh_F����1��u	=��%�r����t�������~|����_+��������]�P���Oon~��$������?]�?�tU��Q�����+�8`����0FN^������g�J�EE�(�Q��YZ���;26�\����v5BQ�E���d|��Oc#`�|�c����S&\�4�Z9��m��S+�S<�'�������E�
�y���}���=0�������Yd��F[�a�h+{@c�7���J}��ZD�m,��D����gLy��h�`��h.���Td��\������Ct�Y�9s����.@�v���T/�5YO�m���w���~��r(8;|{��L��w\%���{�x���+�����--��e�o,
�����2"e����
�i:�L���.CSj�g�y,j��������Q/X9��{?���(�Z.�����]���%�1�E��v��%����r���A���E��&�[�=n�(�������t���=�0�W�QQ.*�<�6�s����|{�aQ�v�w���Q���|.�0��|{�`��������"����-#!�d�H��m�p�`���f�<�}{_Q�vv�������0C>�����Q���
��=3V|> �|�,R��XJ�i*
'�����M�C�H#=`�h��TQ9��5���C����!�C��l��qp�v|z��d�1�s���
06�5<K��<��Nd�v�]�#
�����D��:%�XQ������Z�#'�*+�|���M{�T��E2���m5�FE�XQ�])�#w�d���"�K�	9�L�n^�$�=��ce�M{�q'�j�Q���Dn)�^��M��� w���pb]�s|qq~���;6�@��L;�������e�_����������o����!��64�Q�<n,�;b�p��pG�5Z��pJ�=����1u���w�'G�W'�g����+o�+��D[��������N4�:��F;�N��q�rY��}y�b�~L�}y�{F7������bysS?��gr��4d�����i�s�8����R4�7�t2���eS'������� �;�l�\S��k��Zy���?6����
�Vx���������I-����v�HoC�^��������{�N&�H������"�YXy���J��=�z;�l�kR~���v{�S�_���v�����!)����������������W����:�4G��vq���-��xI�rqt����2k���8��vukOX�l�i�O�t�;�]����q�k���IDc�+�� ?�= �;��69\��*v���\�9��9�A�p�h�G�������b������CY�s����*���v��,���������r�^���3�0�����
M�Zw���]������(��A�F>�[9�k#!mOjv]6i�/'�P��9�wP��Z�������2�]9���������
{T�����Ti����*%�/ud��B���f���E����e7�*Pj&���C�X9ds
@�Nw�)����p���{��G�1�-t<��������;]��N}������q_�yr2��R����s��p��C �i�4p��nH����I����f��2�F��;�p��F>�����7�B���6ke�
@wz�W�,�y�L��q����s��p�#��DA��I	�+�(wz��v���Bl7\�����s������������
V�G����Vb�w����tW�_�����1^�t}���O&��o�Q:I�����7�&=��b.���K�w������g^rZ0I?l�6c0�p��]lt`@�v��"Z	������P�zEX���\�o�|�|�sp�%�?�������
%Z�;����������{��J����[���s;��X�}���
����i�2�n�0������>�HG�J��X��]G���i.����������^��[��$���5P���xa�j��Q�e�H�F4�5��H����p�E&�car�&~���:�����.��q���9%����+����<��f��p�N���x�l�i���X��]�����Iv�<���r|-�O��-�5I��Ch;��1�A����q��E���O=Q-&�7�<�����]��v-��������������;9�9<9=~=�	
uvyxD���6_���tj����Zy������/{�^y
a�����|���.�n;���&e��i��.���:"�]��v�$����Qb��]6����.K�z
��.K�R�U'o�v���%�SKZ����O7b�B�mF3S9S)�f�>v?�e�^U�R�K�E������BV��8�rI?%�=�H��.����.���V=d�w#���n�\�X[]���dv���������l�W�����Hu)\�����Q]��\tV���8j-w~���v7%3��p�`U�0�4��P��e}��p�.V�����'c1@�7�y�L�������G��ny^O�R��X������������?^^�U?d��K�����U��y>�&�f�Q�d�' ��n��M��|3�8�����rS
�g�!qT�O
X������$(���rH��p7X��f����d���&��s�-�,+�iT�^�9<���wv��s7���`��4)�����N��'3}��C���<T5��GD��}�u�����B�����2_;��G��h���
qg���r@��pfH��
�����$+� ���zS������]@����6�Y��@����T��G��������U���&=�T�v6	���uSu�	<��u�]�E]Z��E6w�Gpwnj-�=��8��5�������mg�	�:"�
�hxI;;�o��<��\��5me��A�F����X-��]V7�~
�L��xY����xMf|V�ah��w��=������X�f���������}q��q���`���R���9�������g�;5@�w;�:��u�����H�3uy�J����z��]W=����3)� rg�C:��^}���������~���c�����m��z�M�j��=�|�(zC�.����R���X=�ivK`��_��N��x�SW��>��D��U���8a�^���}���J��a������y0�V�w9�k#�����G;��dQ�|�P��������l����G�Z��c��x�k������]��w9����$m]��H��T��wi^��0��y�B�o��H��'g���N.��/~����A�v����h��`��w��}�N��Z��1{���
`�]��������E�q����:
bw��E�����S�H	����/�ia��w9(��� ���G�(Pz���Ejw,�NHm��?�Q�L�"�L�����/i���*��owF�A��x-��.����{���P[
~w��}@�|��_Gi�A��]�7���m�I�{��|Dz�$)E7�H��n��	�d�1]d�ic�.����G��R���HW9k�������N�6)�}[Y��L�F��=��_-���\wR���5�����F�Y�w9����)6����G��#7���.��z��d���M���#5����G����+Q��_/g5M������_y_.W���]+�_O&ne ����,����uT�������h��12]��]Gl/\���h��~w��w�����e��]�7�����V�L��(��M�ZE��
}t��	:��{h&C1@��CyI���E]G�R,�eyAA����~�������Ul��:���Y.CU)O@�T�W��G!/(4�������Q9���Y��]1����KEiY"���p%b�������
�
��B��{��Y1|�P{��8n������!m�1V��}��6�������>���k	���}���u��}�#��|�������_<{��������+M�Ul�`���Wq���f^i�/����p
�n����
E������l��)�ae�w�����p�'�M>3y�����`�~p�)_��b'�����A�`�<@�{�b����y��/gyzK2DR�&]�}���z�G��x��~xC��`]��K�D���������R=��%�{����O#~����_y�q���x�S�6��l��;I;_�����g�6���=�k?����������kQLsyfig`���R�B�nQ����~(�:�������kV�Wx�m��/T�tNP�H��M"��WaU�.�Xk-g�����WP�����o��\�,���������G�o�����d��j^k��B��������t���������=[����z��?[{�����g����������v�QF�����R<�������-^c�
���}��40������u����a������$�t��X�+'�,(�����H}���s����k0��	��/�]'3[A�=�M�6m�F@F�"7���{�0u�#\����r��$|)d�����s��;��)U�'������sc��
�!���,��J{��9�hr�&�M��y��G��8���X�!b��Z���c��p����l#%@�{�����=��mx'�8�:��	��=N��J�^��{x�DF�'4�<����w��(j�w����#�/���6�3�|�����Q��5M���rd��O�}L�M!�{��GD�����h�2���!���������-6A��Y5���*Sy�eT	x���|���nR�5��M�_�8��J�u�a]j�L<)edT	�4�"�J���������� �^���E�P����F�X5�����V�>��7�@�gQ�v��^�rc��{��uSj5F��A����������1��g�k_��\I��7�B�^he'�z�=�=�8�l[Zo�4��l�qB�F>��!JYL��[G��HR��t���^^�a�p�+��[Hg���a������u�g�l%��nY��kCx�1	t��8�2Gu�1�lR�i�{6~&�"��J���y�w���s���������g�����k��[��spQL���-=��,�F��C]>�Rys���m�6b����Y�a���Nl��%���is��������F>������L��q�`���z���O�"\��_�I.�"3O�����un�������d�$���#�r�����Ry+��f}�s��do��t��p����\��t�YP`�kC4o������B���{��Y��k����?����_��e�4�er�tG����Eh'�s�Y`a��5��0��;��4�
��s���P������j�r��J}����Ur����`���	�`o��������*�c@{{V}�z���j�1�cY���02���,k�Cd����� t{6B�Jj���8���
kt�3J.����9�i�����iW�!�{���Yh]��N���YU�R�-@��l�oc���HU�"�=����> ��Vr��l�}��.�>����?p\N��@�o�����>����=I����d8�>�o�+��wT<|�{E�w�Um��
o��Rj9Q��{�> y�����������l}�.������y�}�d�>�����D��MA"z��u��^��3�����BZ����b�
o�����Sp�y6�@B��g�`��'����w��2�?hp�C��|�?���M_��>����oI./��2H5���<,����g�`������_�����&:]mz5o�Y��>�}��k������-���$�4a0i@����>�w�6z��������]���
����6MY<�]�.
��U�����������K\!��{z�-x�>�u��[��>@o��e���s��!9��n������4pQ}�0��}��-��`�}��
^t��m?l}(b�2������o�PX����0x���SO�]�I�1EF�/�
�:�����W���Z#�Bp~�l8�����������@�zj����l�g�W��1T�Q��P�}�������G������m��U��[bd��!�F>�}p���$��}w�s�i2��B�b�8YTJK*r{6��T�?�K�d����O���3ff|����9+�j?b����O������2���j�R�j|��J�?�f�����57Qk�cX�I�X�M]���7(�2ev/P�����������������{��}�V
�G�>�U�]����&Y�oWa��}�u��R�������3�?���L���[��L�E����C��E�����k@��D^���e3,��lh�i\�Y�V�H�6*�yj�����*���e��x�o�W�a���n�Uu���v�����.�st��p95\�9�=����1��S��2�o!C��N��o������x�W[�w[J�i5����|I1`���^�Y���35�8-5�����sK������w>K����eR�h�./�\�o��
���o�������i@G�=���(.�%��1����Y����'��O:���RM3������r�VL��enT	���]��~��n7�������b6�{��^T�>y6����<���i<�<�#M��.��Di�S�-k5�C��GQ���B� #��t����"�[5l��?`I��-]��di�#K�|�3�h�����8i=w�������~.�����������SoR�R��%Bz��DV��}�4_+�=�}��X�u�����5���������_�!�����hkN�S�"_S���B�7�e~�X�\��S<m/
0v�u��[h�MI Q�E�|����	�����
]�Qo�4�5
z���o��`���&DFH�o�$7��4}!
�2Ff)��m��J���i^a�
���>���}�~`��C �|��r���J�)��f)��s�&��U����
l�p���o_�*q91�I���4;YNK�����NY�@��K��j�w�����ms����`+��R��W-�5NhT�Z�'L���l}T��������[�n�K����p��F>v68��n�dL���4�
W:��"��{}���of�5����j��� 5VR�� wvW������9�L#�'p��Uz�����W�b���"#y�R����h�X��P�
���98h�'ZoI���V��|�����7��&`�fN�U����i}X)���|��.��^~�������������5f�}�[5����H�U�Y
������G���O�xo���c �+�\��h���[�&:�0Q�����G��Y��8�t#i�����S��������W��
��X5�TZ�A�:��T#���7[�7�$��������-K8}x`��������*��$)����Hm�+
�l9�(�����R�U�X#�,O�f�~�0��A��I,��T��ul}��Z�����uA���������B��?<�~qru��v�T{���3[O���T�7��/��[�L{5��r��mX"8T�������.�^M��������Ue���U
W��mjW^M)�.���	��
��|I�^�0�����[�����@�8�t^�p�O�k$R3�[1kU����0����U����dX�LCu�b(@|�'$���@
������wEB9��^'@
�����>&�T�8���&��mR�7�n�T�n�R(��wOm'��(�h�	�J����H^JKO�/c�$���G(��t�2/���*�5�h���+�F����(��RJ=���d=�E|s#�b!��q�4�����c �.���mcK0S�,,�e}��vy%�C&�*<�M��XU�x�"�K���ge�DM{��#����K)
�H�����;�!BG����`�pm<���#3U�
lj�
������F������� �A��[?�����+��3��j@M��Z�e��i�
���E���M6����y��t��'����dO���#_";������|�� d����/����z�����.^(iE��Z���
���3��+�{��T`���Hj���H���\���w:4��*�k�i@lz��h�A��������b�����PV?�z*������yly�FX=&)X�`K�j����W��5[�:[��iQ���V�g�"���J��Q[����DsE�+�����0����+�������t�7R<�:���������_�&C�%�@�JX�d�@�
�-8���z\���mU_�
|�o
��
����'�:�AY{Pm���4�xdWU]�����u���6�[�i��a��w?=���~z��O?=�n��e<��|����ZHN�����A��������"��/��Xx�E�o��z�q�F>�s��hs
�������(��	��x����������}�c9�������XyDc����v��`��l�~�=�
�������9�?�	��q�Nz��3=�=�I.���i����{�s����IY��h�2x��	I�6m�({`Q���
��,|`�wlz��(0���/�#>��c��|�o�P����c�Yr/'e�Wo6\�0
7.\����=�Pn��x{`���^��,�,��A<���"�����t2�nJq�h�k��(.�IvK��y@�82�p|�Xtr?�������K{e{������1x������8Z���@�Vd���W�Fgk��#�>���F��=��8�\�]�M����o�����7�(�y=9����$�N{�Sr���Iu������������P����P��<��h<=��u�}N��bo����2Z�����+~@�v���Sd�(y�@��(�L���=/�<��r�w$3���	0��Vi_s����B�[�M�v�N�3�k9����<�~���P�;�N���'1u��V�ZY}��V�zG,�8�locXe���V�G�;��/[]}���]�D�VGG��+�l[�k��V��+��GVH�Z���zK��y�.~�����_�/���p����������X-��:V���s���k�is^�����[x
;O���^�N�����k���f���f�|����68�GT�^2o�;�N��_l/=#��
��|e0J��l�������w��tW�J�v��>�,5��g���
s%��B7������{���1�p3�f����
����	�����������3�����@�'kY�e��dR�G�G��i%�I�nDBj�`��v���;�@�3��=�����1NK�4�0�y��k��r�^�����$lO�E1��Q�W�t��B��y�[�����R��;!`�s5�����uZz$�^Z��j}�t{�d*wt����f�I`��}�T�^q-Oj��V���e�P
{t���~�M��a��ie�������j�v:g�Fdd��3���q��5��7}��.���,�����L�W���/�L�H�����yUW�}=����v��>��&^83�	(�r���W�(��(t�$�:l����������~�V�6W-���S[=�$\�kg��s��h�������[�f��S�������q' ���`���������L��/���������1;KM�o������^�6���O6r�����l�|��T@������O'n>4��gkh��y���O(��?��= ;�������,4���;�M�o������n}hz"������-��d'�gze�`[g�Q��F��m������t�l���k�������N���z����;����6��B���U����Z5�;��<�i)�hOviZ�a_�
�
\E:/��;O�U��{�'�:���*m�${���I���C�{@vqxn��6'i��������i���s�|�c������(���8�{1~n Ws�F��^�>����N
��[R6�O@��.���[c�{�
�a�F�bG|�s�Q��z��TE��rw�����`G|��mfs�\���6���gj��<��6���SQ��6��Jl�wM�������x�M����S���K7D#" 6J�����=�C@���|��v����c���s8i��<�V��A=������k����#y��}�L�g;oK���0��`pd�v���#�
�\Oh^�G����k����(���'	dn����O�?1J ��l��s����6��f@9m�F�����vf��fR$��s�6�����F�������6�I�;���H��.t0]=6���J��r�������������jt>q7�Q!���k�7�D��w��h���H�o�i#���R����G�d����f�(�����F:@�v���r6{|���G�:������	��Z���Cp���}w�/bQ"�j�������(f������h�|���6�U�2�w�w�]����D��^��#�:���A����������n[����2�J�~�f}qW�36�fC�j�s�F:@l4�)~����B���Wf�r�*���Z2i��j
0?��������Z�B ����&"����B�c��}Z��E�6K�<��=�q#��U����P%�:uc(&F}��X���M�+�@�v�N;������(������P���������)Q����8j{��!|��{���K6��^Yl��7�3�7�,�6�8ZL�R��eaZ�����q�F:@�l��f;�V8&�2���V�&��
{����{�C��������3�h.��;������~3�`�~��^}������hN�k��;*�]��sX���mY\�����Y����;��;� ���O)@H9�HH�
�5�����)�#�^���@���"A���>�:v�jwdUH�7��g�\��	z2��|6#I���d0?'����5A�/��b�.*1"x.���c�Pa���������c>�^�T{�[�}�mA�<R�Df��a�m��7M�W��UFT@�8��u���x�E�}����7tL���w'��r!�D?����������i���$^�W@#��H
H�
�����T����^���������K�2�HS��<F�����H������\8�k#"�C�qs���"��k�X�^4,b���<+$]�Q~�~0zU�m�"����E5K��m^��3�B�,E|����U�k���������^e��^���E��EV��.{y�\��wB6s��z���\e���HD�zP�F������L���Ru���RK=�� +�3ze������8_���f���nG-!��,��Q����ro��h]����Ef�q�j�%�h+g�y��@Gm�����p���%�U��S��Q�O��O+oIz����c<��jFL@Ec������5�J��C3�M�e[���R�B���!�I�ja�t��O-t?o
���������bT��Y#?������k��P����?��E��C�X���V�o9.����ye=�C�G�
U�����~�����K�@�p��.�������Y�C�PQ�f��R�z
�y��#��x��r\uOkn7����6��y{gt��#�0�_!OJ���(�Sy��:���nl����C^,+Q���:S��k���������>��T3%�(��l�����MPe^�_
�������O��{�T��C)�mn6�a��������r0x}H��!��3�^�R�_j�!�S7Sv����C�JA5Z_���,V�����:u��[��z�����7'�Y�K!�?�����e�#6���r�*���P]e�T�x3q?�Z�����P]���<�J�'��1���brH{U}_���d(45�
��h@���UU������E�?�����Y���7���[�w\���V���Q��-n�;��*�������������a���u�}���2����;
L|5�e�kUT,��Jh�{��	�[����:�,���;��*����Ay=�,��j����������\�������=}��6nc�	��`�E��f���*f1���>��~-���uuP��`�H��E=w�&������J#PE�S�'<��)�<���0��C�
z>��}�F�H���u���v�s�^O"���7+��v>}��A?)UkS��la4a�Y�9f��)j��,���i:���iibA���9�AS�n��4)z^��X.�`s~(��
��,����o�V��������n�i��(9f1��Sg��L�����g��}0Nn3�����1�==������T��$��[`��cz##��5���J2e�^w���`�}�I7i��(��������������G��F���#��{a�(��<���!���y4y�YC�
L��)��5D/_h�k�bcC���
�a��RH��\���	p@�>��7�J6�MM����`���w'�����IY��1���<���t�k�����W��w��M]��L��������I�E;J=�Q�M8L{d=�"�2i�]oX��7Ea�yt���������e�*��ruLz������@��b��2_���#|�1�F:v���f�T�4��.&�_��,�m9��j{Y}�����vZj�iS�U+�bk���l�����3q��_g��"-���c�v�t��Su�6�]�@`���riF+�T�`p���o-�^��s`d����Q��X��-�X���iNk*���Sy���4����r7?��1�/�*�l`��WOC��Lu�~��F:@��X��loU#�2��2>�"�-�u>k^�y2�H�����R$W/�yN[himwO�e�}�@m����,�7zN��=}�����w�TH>��6w3{'�C�����'����������,���������Q0�U�� �b7O�����9�z`���X���zglva��u3+��8��X7�=V,�
To��Z�Z�=�������I��~r��
��6�e�u�	��vY�V�>��wr}P6s�S�^X�,PMxkf����*��K���R����!t�I�O%�3�U^z�����d��p{`qx�����D�������r��e4"a1z^U-w�/C�;�~��{^,�f�
dj�������,�[H��+�(���h���.��j���>���]&	�2
��zo8��Ve%�V5��u�����H�p=�
�3}�On��+��^����(�]|#* j!?3�nQ�b���d����>�y�O9�G��f���9U� '�3:�l.);*#�`]��;*U���������S_�q�7�A��xX��>Az�Y7�]���Y�����fO ���������������p0����	����em3	�px�>�<=����Mf�/�y���F:@3l|�4`*�r����������v�����:Z�}���\�<��N���S�T��a������������]�nH	~p������i-������s
�FX1r��e�p������W�9��:N��C���}a\��w`��
��v`�����R@]V���}�B�l�� �ItR:�qq�a�Y{��s�
{X�k����[���5����A���$Y��,�U�G��2��d��U<�M���`��
9u�H����(���7�^����9�TP�����\{���U7{j��
��m���l��K��z������M u���F:@���@���h������,m�f�9�Xg3! '�l�����E����K2���,3���\Q�`����^!�?���������q�����8����m���5��+�R<�^S������!�����.���q���FS������@���6-���:_����7���Y/�����j���j�������EHrC��w�7���=�p���mM�)dt�v�������g�w���f�b������7����t�@�\�{�<�`�M2��Se
�P}�j����Q�^���&������l��n���%R}�2>
Q�S�SpL-]��y��o��R��^K�!���
8���h}r��%mmwp=Y�7��K�(�7�@n`������(J:��gUV�L������/[��8:I��
,N��9����O������^_�N������7�H\#P��w�(���4���(�U~3��5��T~1�E��S6@��hhYJc�9�}���t�V�Q�x��R
�������������t5�@E�]>����8_�����[��7���������<[PL��%��>���7����t�� K���"+5+�|!z���it����`�61�U��|��gYuv}<����pH��P����#@�����
�j�{��4|N��j��@���Z4!�zC�����eYe�Y��QO�TR�|�/r��6���c��y�
�`���7���^�C���G��M����c�[;�oYN��gu7~���od�r8t$�=2g���/?��J�F�����0Ppx��RaZW"W��OoW����
�nM����
�������N��j��07���W��n�0��L+}�h�����~���n�B
y��z�G�����+Y����������"������<cQ| ��h�<��!+�^�C�b6X<��v�/���!��C���J��-��.�/�V�Q�$x���7������ �D`U|�BP������n�
��q�a�?P+�B�n��!�xC��H�boE��R���a+1j�q���!�lC���F1����b*J��h�_���������d������
-�.i8����
�-�!�kC\��W��DevG�`��+������RO�7���bW
��VQ����H }�,/�����uz��F�@p8�w��8�vqB`��:�����^�U��
��I�hC@�6�N<�����ruBy����Z�UV����J�t�;h���2��&�8�za�C���
��]�YQU�{N1�,���E���B��y�M������{QkUv��3��i������?ze�J�iZM��^���a>�-�y�������"�!��>#���!o�#��U��'����/Q�:���z�425�_2Y�x#C%Zl�;����D9)�� �<{F+���_\_��O�/Ohb��?�,0��
�oh1�fV��WK|�3�|����/smb�����j���0I�V�{��@���o�H���tZ�@MY\�����1��7		�����jW��*�{��yqy~z��������f��^��k�����6��eu�n.���ph��n�}���{q����y ��e.g���-�������C�S)n**���Y0���g�i�IF:#+m&T��z{�avx����>x|��m#( [VO�uP��wMw�m�TC�i_�l'Mb��]� O�0b�gu�^�4��`^�-�Z�6�<ii�(��B����FL@������J���$?d��a���W�v���'��C|�o=o�z���}lr9M�5���iT��<����|���������B@*����-����=���}K�����d7����H&��@X�S:�A(��V������L�����@b����,����f%76�ja�j���c
������v�^�[���	�u1������D��0���c�Yf�Z}�UD�� �Vqc�����qU����E-:�x9}�,?���1!��,��h��N�	
"�%����Y4����CC\��]#$�IV�f��+���n%�r�l���������Y�����Z�	t��Rh�G[?���_7$3/�b
� &���f����oE����p��Kv:���i$uh5r�zF����.����~g�*Zm����14#���l�+�V��9��HH��Y[���m���g�/�/i�e|qz����=���r�������������O/�^��y'~�)��@u���xs�����\??������@�8�d#�I�!����9�H��k�{W �U�k~^\8��"w���L�����l:�a����f�T�����OMkK�U�������!�/V�`�C��1�4]��J���g���9[O����������G��p��pg����Z�x�����.O��(G���a=�;
u��4L�zoy{���+�7i�d�p�u�C7*�'����G�j���)/D�0�[����
�9k@��.�8_z/e,��(k�� b>X�V6P ��������hAO�|�Z��cQPO_??=�:>{m�^&=DL�1'����._����!P����RC�qw\y8r�O <���n�mK���'�=�rt!z��	��&�����+����G�0��C@w�#�}h��k����7�
��H@����a �C	����Fe�w��4��Jk�a6{�����v;rR�Z��+u��r�:qj���������������{��+�����w������c�6�~�9g{������f�c�W�;�2�F���uH��@�qUee��������;:r3N5Ch��i���]"���H�^��#�������GV��}��\��Rg�H��Dk]�d�/�(��!(����]"��nBj��LVK2>�����Z"Qm6������DV\����%:r�G~���M��}���w��K?����J��w�c�`��
������6������S�d�R��:�fW�^�4�g�_��qU��[�qt_���b���p���t�����J>���~de����(+���\�j�����Q��F�F���am�L�%�x5���o�k��yFgn����8�&��5)��s����#h��� �gm�^=T|��(KSF9t�a���l<!(Z^�����VT)7����AX�
��gP�yo��s��W��'��&��	�|�U@G�#�g��A�M������[6. *���X���������?�1����������UTlq��m)7��-%�������C[1��2	�U�8Yb�J�]�"*TG�c�A�[ET3�z�����)	<��H���[���#��Fa��)	�����j����Y�V���upQ�v%��Q�h�
tN��9>9G�F�;�9�[�E]�%*>��X�;����L��~��k���6��KlM:�F�f���Q��(��g��^����eqY������E;��m�z����Fp����6����F:@g\L�73W��CS����8"�H(�
x5��,�V-�/*�\�7���8�{��#7 ����|\��ewc�cS����fb��f"���Tt�v�9rtd��q��xM���.u����t`:��i~�*<t��C����������f���^�yi��bE���8"��$:��V��Y�V3�R��3X6�Ro�ihD�y$y��8,�H��x������p����9�rd�Df��f��� e!N���z�����N!�M�Y�W��f
�(hF�N���k:UFnwJ�=���(�z������j�!�?s�%���}[��8{rc�]	k�({���z��D�Z�\�e���;��r��)�����X�\����<��#�b6���1����)�T������L���e�z\,���*���f��r���������cy>_�W>��4{���r���BA�7�1�X��,�r�
,��n^�?>{����|��m#@,G�l�+���������Z���__�����������2��#G��p��������DA���rk1�$�KI=����R���y��,����su��F��	�0mV���wSmT��fF ?����T;}��������y� �G6�g�X��pd�z�KX4���L���.��7$�3��$�z�f��C�+S��X�f#&�V�����f��g}��1:��y����M�
�M�C�"�GV�x�;z��GG��px�����
!�-}������6�D���f�8�����9<4K$+�+rF]�����m��XI]�6�q�$C��4���6@����m"�6���8z�����8���47��&��V��z�l��W���WV�3�H�e�ou���E2�u�j���Q���`��T���r������AG�7��&*���g`.����R�h�����R#>�	����T?�*�pd�������:ml�\���C9T�<3^1\5�r�A���;`����"�1�(z.�]�K���
2�a.d9�!�"*Q����b�i��	����I*��4�kt�T��O�|B�grPj@k^���P{_,���(^��z���sd���!U@���i]���{���<(}>mG��7�s�>wo/��;w>��o���\j����a��{b1���^�q�G���/�AO��#��t��[����?�����.����3Q�uq�z�e��NL>���������'��Kb@@�G���5p����	f^����$�r�G�,�3��U������36�������e���1�C��:vC��5���m�6�o�h���������N�`����:�K6���o�*����<�C=�����7�_��������������C"�F^B��:��c�c\�.N���
��1��c�Q_��j�\����vW�����H���=]����i:>j����[$�2#& R��uS]�������~��������|n��4��<Lf�T"����������ul1�f��AV��8}����%W����l�#����I����������^�:}y���K��;����X��q����������9~�HH��������[�po����5k����t��{U��s��H��v:;�vnp������S�R��I>��Dmc��`c�`�NV�M�=k��*]���bU���6,v�dk��V����6�ul��6�����m��=^b�7�����%@���e+��U��r�b�<`�\N��k�H��V���������c�#@����1�sQ?��$���R�s{6�*�1�f���1k��XO��J;C���[^�b�:�=m�
��� wc��5��c����)��G������H������9�����y2z:�U��-��0!���zoL@��,��/&�
�������T��j{�Z���4��?5G���`a���>�KZ�:0�/
T`�7y��)��h�5�E��J����p('�U����|k�������9�>���r����@X��m>8�Q�i������m�Z+���6�lQb��l]s0]�=W�����Zm���j���{y�������6�{����:��t]��Os��f����ns:�NA��O3uPq9Z�.����eIDDY�����Z��R7�����=���X.�<_�:f^M����1 zc�k������o������
��8���-��MH�@�1gdl�� �a``��r���fCl�'�rc�����wGm2�v�~X�\k�Y\����EW��?`jc��Hh����)��������W"};�x��@�������")@oc��-���`n�VF6���^�88�
Mq���f$�o����|8�c�\�>b���K���&��g
����9mL�G?���y����peL�$7�.��A����W��e�����e)h�����>3�n����(�x���e�'�-����9Mplc�
M�nlq
f��f�� �w�IZM$n<^��c����5����������U�}-���V�@g��4���
���)1�����q]���+����n'Z�&�IQ��$����.�������
J9)�1�������ru@�
�,D�eM+�5���*��#/dL���7Zg��a3����'v�y�����w=;��
��yA��=%
h��UQ_6G����z��T�v2P���<h���-@��V�WE����'�ry���LM����*�C|��N�C�#*�VXXE�5:��l����kZc�#�6\�U�ns����Yqc������������q�\�f���Y����
I�4�����o6Y
E����
��Y�/����25���[Ye����ss�����[l�����
���0S"�[I�&����;%���F�����V�TY�k������VVA�/�)�t2V���i{�B^Kv�N�j��	PP+�,#1��/�G;BzWE1[}-�R���1����0P�k���B��P�G�c5���7��8����t���X@aH|~q��g��eqS�Y���Z8���t�}��	"��2��	@��	���'����d?g���e���)Uo#��$V�u$���������\W�����/y��$	�}�O�}��b@I�&e��i�������<�.}��j8�A��)I��p���e�d��&��Fm�&�6���7.�R7 �V�j}���r�j�;��
U��FH@\8�VI�L��������G�k���*�n������s(0���&^zU�F%�M���*Cm��X�a������f3������	�C�V���-����z�t����y��!�=�7AF���:7qqW6���I�&,����_��75m#�]�����F@@U7UY�,(���H}��Z�4e�H<��N���P�����{i�����_m-�2�:@�&N��)t��M\���"��4Y�3	`k�����lbq&���grkt+���?�i��"@q�I1�w)�dz�_�	��4U�������IR��6�Q���n'�%�M,�0K��a�F{*O��A�Z��a�3��c��O�l���
> ��k�T��,�U1\�e����G88pp~Bh�����	K7���p�����Qb�a�0����ktn����|�2;6�MX�����Hc�B�G.��.-���p�	���no�j3����o=RZ�6���r$����_�hL'�`���"�	������CO�s�sZ~�)��e��An�M�;lz'2#*�!���&���7�xNR��(m]%�r����Xdd>����eDD!��irQa9'zi��Vk��6�*��4-��m�*�#$�'v�	�)��E��{�Z^U�oJ^fR��mV����[�� �t`���D�:�M����d����|C�d��T�������vV��l�7'��V��	��ph�m�s���F:@�l����9����4��{9�P���
��`&|cN��,>�2�5�Q���v�	 �!-E���<+8!��ZITN��X/�EQ��=�T4`e:y�7��t[�m����p^���I����_�~+w����3���
��c9QYN3!^��.��22�YcD�����D~���H�V��d�����p�����}y�Z�}I�b63b�bsG1��\��z{�,���:ma0&��`�2�hje�Hg��<Vdf������������)��f5��P;�h_�P(�[�l5i^���_L�l.a&:� ]HS�UE����h�� ���Li'��UL�	�Y4�
'��(��Es����A�Te�Kg��z�[��1wb����L���9/���N�%)�������C#"��	�D$%&������5�����[;�Dm�:�JR����%)B}��NW�������]��OG3�p�*m=5�~'6�[D�mcP�2��$�����d���1������V�n���t�!�O���w0ybE��v����+K����,]h�g+X�'+h{���P+��Om�F@�'V��|�@��������W���5��yF����a�Q�wr<GniNf��n �<�B��Y����zGQ�d��
}"7L_�/�@��O����!�YA����4��X�����u����HPR*���T����Z��o�"A
V�����U."7o��������<x��:���A��a���H7��5\���p��U�f�tQej�Y5���M�����������uFF�e:Q�x��=����cvxh�����t�F��tsD��=���J�r�����>���PsN��v#" qv�v9l�}�;&���s������-�H
hp�&�����J���49���O6n�S����	(p��.�����{��(�
%?3��3�$�����	0�^�x6�.-�� �dh wv�Z��W���R=>���(���t�U�����"�7E!wI���m!�C�z<�U����"��.�N��u�8?���pU����eMyi��`���[��eEEcv�^��B�<�u"Zyx�C�U{cD��k�g����3~��o�c�����X`���+Y��lR�!6N-����b6+�E_��^=���o�( ��O���|<����o�U���?��8-?k��@=��_+Y�-O���^s�U��K������'���
�S��sY)�����t����.�v��d>u��:s����:s@������:)dr`��`�G;.,~/�u3`��+:=m�����<��r�������>�;@0d��#��t�25���z����Gn4�0��]�LT������{�x��n���#�v��C��W?>��G)���'���+�����K����[��J<��M��p������o�;��"~��x3�U�?&�f��@��E��F!@�l���\8�����t�2��x:��X�qD��b,*W��?�3����,kpM��Y�LZ������c7�����uF~`����N��vE��Re��vq���p����Q��>����t@��!������+�n���-'�+F��7��nW����q�]!�7S}�w����}��u�l":Pq�����ws��Y�'�GF(@]��{o(�J,Mex�p�����Q|�9���
�wXyw����b�jo�|5��t���>+�v����OR���*|3�E�YN�����X�w�r��>�4yI���������IYT�h�m	!^��~V� �����<�{k�NR��o\���2Q>W��5R�@�l&�D�m�� ����&�%�6Y�Oh3��e��Z�$U�-�L
\?��\7��m�^��u��`13nF@9.^Ea�.?��o�6�U+��>�}���+N�������@��:V���lS���q��F:@�8��.���� y��C!�{cK�8�����l ��?�r��~`#��� �����H����v��@�D���p+;�����v�td���`�����L0�� ���7[u��?�q�FG ����K@_l��Y��f T^��]���f��G�
?��k{U�T�~���k��������o_�������h���O2�u�N�{��w��l}�a�F:�����+�l���}��`������4��Fm��� 
���@��� ���rWU��$�ce�[����
�>���� �|��C7�i�EMNg��bO��s#����}�_�����iL�J��/����#> �
���@�v��nL����������Ut%�xA?�e�����$;�tc��"�Og��&�ae��2�y$s����
���g3�yz3�*�����F:@�l��)^�N�����-o�����in�����%��*�>�e��ew���b�@�i�Ge�&`�v6�PM�;�T6���Mf����	g�
 �7�r���eV�_�I� )�����&��M��0��W��:$�����Y6�$���^��S���X9������>���'86�����g������6����g�a�\���B�����.o����|,��b����z���{T��j_m�����,�����z� ���%�JsyNj�6�q�����4�2y��F��||���F:@m����dKZ�����"�i�O[�{^���z�q�}�0�������v��@�����'�<�N��G���A6��G����t�0�Hr��(����/��`�>�a
>{x"���;�h�r�uHr��������yh�����`�HR�+{yF�jKv=-'��������H�����|Xpe�}��.�:��V8��IJ?���m��9���iK�7@��sys�;��I�/K�;4]q���M���~0@�v��jF;���:�c�����6�d�=�x�j�U>����y���9�,�Y�w���lV�S�e���#��t�"�v$�.��e1y<�i)����V(%�nYgY�t�`W:[e�x��HH�V���P��]��� �����]���v�������
�=��n#�2
w��I�����|���{�,�u�>{x��~5�lk�@�!gDn�c���8��r�����Oa���%��DC[m������m�4a���� (du�kon�y��0Y�85#d��=b#���M7=�s�Fj��[}�	�l�����vo������N���{�oV����]U�_����! ��VOp��9��p/�����n!:���Dz�"m�;� ��2?��prZ���MF�:�{��h|_\�����s��C@\��7����lF��@����� ��`�����c�G}�f�P���@�8&�HZ��|S�+���@��g����F:@�l���}��<F��n�2�R�!5:{Y��@��6$����z|��
����u�����`��]M�?iH9��pWg�N�ro�����;��6�$�0��t][FC��d�4�s���)�s���d���h�����9��{{A���|����p�bN�Qux����y=���&���6:��@<C0$��l�N'u�!������G�g���jrT�������|!:���^�AV�����x��F���n�z�+��v��)�]i�M/�qS�6r�
����6T��C���L�7�5��fs���!K4���rC`�4I����N%if�	�<d�gSO�<�\��C�O<���$�<������������.-?��Re���f@ 9�e@m��j�T�Z�W{�P�0�T����geyg&*��m��G	�e��
m��<L�LR�������=������L�&`���"�x�c:E����yz5��EU+ho9U����%h���=��:�;�v�=n`C�&���m6[\/g-�c:,����%���;Y�e��|���8Z$4�4�	���W��0
��@���g�P3�m���k�/����I���3�b����v�Lr>�F:@l�f�x�XO?�G����\��2������^�[F�`��!p��-OT�Z�����(�7j�!�<)�5���^|*����q)����4�i�uSp���~Q�n��:5�X�����n�D	���|J/��{����a��.���ml�i��Y����a�:�H�����][�H�x��O�Y�#�9�m���R��i��G����(�
,w���=v�9��H��AZ�iM�}L�}�����xha�7s�v��Gt)��q�E
;��J�����7��@����z�w�����t��X�aK�6oe*�	lt��;D��"�v����9�s�����I��W����p�;�M���< �
g��	J)��<oOV_���SG7{I����f-�4#�!6��&�F��V��58�e��`z�V3hy7CS�;�p��
����v}�f��y���tx@�Y%8����6��2��K73B�Xw"��Z�'�<�OW����re�z��!�9��HH�
�5e0�C���]7������*oev�R�s���d���O�������I�g2�p������p��	��q�h4V��ThA;�������i���ZxZ��:��z4�3�C�G:$�%����:��6@���v�d@����Af�V��|���d�b���7~r�ir{���������Q�����O�r=Sq@MwY:��q]��W_%�faz-��(#���,��FY8��ea3��|G���#���l��F�2t�������9�B��6�'�����4���Pe���X����F��YI��Qj������:
�vg�YU�����Q����Y���'���;�������1u�[�4w���*}�b��;������u}S����7o���~�VA�z�,���*8���*(�|����y���]�F�������x[4i0��o�\��`yG��)t�&��;r�wG����>�G�$,i�����P�p�#�3�C���$�<<��$��� �G�������wT@��8_c�����#d��]����9W�����;����#����{��x�{���j��9�T���vZju|c�>`����<�ix��|����(��N�����8 ����q�*�@l��D���^�n�S������esl�������@fG,2+�l9W�1y�ro��''�w������.�f�
��G����@�#$���M����N�k�*�����������v�����cG�9��qp�����Y�����N�zV�GVq������J���z�������jv���B�c�]�l�ud1��)t�^5;�0P����w������H
�X+���]�������c�H��������3�H�o���
0bU���K�%�����BS[#�8Q���"�bV������i~DGI*��aG�}rwY)^ao9���
��0�����#��b�=n�6��VR7�\���"���Q1�:�0��pCQVK��6gU3����Ed�T>�c��L-��(r�����tmD�����W��:��W#�K��.uWt�Y��8�������=1��i����k�b��8����q3���k<�;��^x`����eV>�tP���&F�	 �����v��m�h��)�3�.){��6������6�(v`e����Z>���O���������7o�b9�m��Q�$`z`���ww"��=����j��$>��	H������*R��`�G���0���M����b2�>����m�����T��G����'�u�V����F��E�U�b^q�F���N��^F��T"��,[�&�C�G�|�)�8�d7+�������x_������J�V��U
T��� ����Dj:,����
�VV�y����^���:�(��<���9���7r~'b�����	��{9���A]5#�7����7*�"��[�������<�'�fO���lR�|\{���D�?���&+���M�M?�g8����g3�;����;bY^�M����5�z=a���u~A>�u�-[�F�qN�F7`�#��K�gI��X�;���r��AN��v,�/�� �������w�!�-z�Tr�9��ST�Ym&����Nvd�d����#+(K�����q�6���y:�.�v�gHw��YU�=�#���j;����
si[
��I3�+[�������U��\�S���1��n%m�������h7��	m����)��3+Skj2�fG4k���.����H+�=�|�J�:�[���5���v����JgU����������Cb�hl;�h��/��Ay����O4���vv4��<!����,��v4��������T��r�3w�)�r��*`�0��h���WY�i��C��!�F:@�v��m��W�q���k��@��Fn�� �#�u�l��:�T96���r���)I�z�bWm����W������S1�m�#;J�oW��T	�����s�M���������N�����G����kh��#ek���]���=k�����t��^.��/�g�yw���rH��U��w�cs�ZU�s�y~����J��&����l(���z����g�_���K���������Vf�|�'��N�����nQQ��	b�O��f���#;"J���P�re�@���������|��@sv��m��Dn��iU� ����� �'M��+8�+�{]�P������+8�����@�����_)����F$����p����$(}���������oN.�^�Eg���������������]�����S#)P�9"��fgB��mU��f�l��u���3k������]0;�I�o��y��hM�:�op�����n�Nw�������A�c�M����!��f��EM�c�l5�D������y8�^\�����_�X@���&�n y����F�0�w=Ko���T��\��]JGR�B�T�!��Q����h�iK���
X3�y�,\����E��� ��4kLF@@&8l�	�Y��������ex��L'���]������������[r%������sjbP�������U��z ���6���;�J���j��.~���7�Bjt���>��M���	��.��U��ME��)}��]����9�]h��}-�I���dN����aB'T�a���Z�^/7D�����u��t�H�����Mv�]Zo���O�Aun..�/O�E$5�����l4��@��*�QP����d8�s��w_���r����Z�e�����E�{���>~9>>q��������Nv��>�a��3�U�26��r^��=����������d���?������Nr�����"�_��~�fH���=v��>w�B;qI�o�mX�)�X���F�v��>w�-�
���<�T�hYm������,����&�T�M��n
z0�4�B>Ae��!I��Q���#}n)�g�v����(�?��������M�o���(O��u~��/�jn�'��O������K4�tv��'Y��oT�DD7��#�o�aI���0�.��2(Vc{�����$���J���zwv��i��v�*U��<�y���zr��LEq+�B����:��V^R��M���&�d�t���,�n����������4.q�b��o������X�|�b40q	�� �v�@��gc���*��m[�����&[�iOX��Q���)��������*v�k�"jQ
��T=;�J�[k�9���U�]�����RiG<��2_\o7��������s�������_>W5��w/�"iGZ��mb��Ov��>w|�@9xu},��uy���,�I&����kz(��c���w��c[/��i��������.c��Y�Wv��>w�_�W�|�G��booH��Hp�{���c�X��m�������Q\��0������`k��`��w(�N�������[�����:�G���;PD��#^�t�B�&��O�e� l�@N�n<��E�
nN����"p�������%�&;��U�'e6��
��V���1U��}i�M�s��%������<[%������g'M�s���=U��X��cb�_��nsg�3av��>w+�v��>����C!H����s���~���FA��(]��rR�)w�w���E\�����>n�v�^Y}������[��f��}~�h���bB�2�^"e�K�s�X�)���#�#�;bg���D��w)7�D��u�=��z
W(|���GN[c�����?r�&D����l�~�j|���'�������.���<?�q�>�]�#7/�N��������e��d ���&ey�0�����������������Y���v�5����g}��VK���x������5I	6|�����������P����������=�s���w[}���6�L�����QZf&�<����j|����j,��J�}~��
�D�Y���
@E�����5��C�c�$��v���W���53R)�J���B��$���e!-,���2�Y��}1���Wt,����(�9���@�*�9��g4������`0X#T���P�\��q�F�R����`��������������(�<�W|�������~�gMF�K"���}m�P�
Y@��1>`U}�U5�����z=�0Q������-��|-�x��B�e��5�����,��.�D��hq�RV�FH��[AQR-XV��A������'�^*�p��e�]=�Y��Pb���VvT�`(2�G}'~�(�����H�oAJ-1�����?����Q�q26esf�)>�R}7�S�{@���R���y&2��'������n�>���A�l�l��hU���j����G[uD��:�P}�B5�����Hf���������������"�P�K
S�cL�^<@K��tz�u���H��YR��&�}^x��RZ���q�;fD�9����n�yV������b~i������iXj��Y�����X �����������>`K}[���>X/�����Z�s&�F�`�>���e��p�����5������XQ�cE�2��uG�MQ�~��8�M}�E']�|:��_��T��q>���������gl��h���$��w{�*[{#���H�2��[�>g�i���)������]�����)R0��"�p�@���;�t���"����|J{�TO�����M��=��[:jY�~�*�xO����z:�����)��e��J�c+�t��pt��9|����4�M�W%
I}K�)�k;�r��P��+�L��>&�j7���{jD��3�T���J�}���h'4D'�=��<�h����5"��������Wt�`I&����*��kj�g.�s�d_����Q�����c��>g�i����s�#]�_��7��4AM��|���l���P����������RN���e�����G�s�4��q����]�?��;�}t��Q.���v�Q������o0(H�������g�����a`����F:@0����������1���(����-������w\����p:�������9"�{�U�g�#��SX����{(�jf��g��\I��a���
t����yd�6[ByF[0���\}3z��*�j�&VLlWb8���5�'�E�{��%�e��������9�g�T�������]b���Dwi�Cc.�/b,(aB�����U�<i�:]����������hK�K*��������A��I�����V��F:v�	\N���Fu^����,�{���^�����_\�G�]�M���`.7��3��2�@J���2�J#P��c����R&7V��e=oO������'�{��+����0a��6_����(�~�WG����O��W@g��w@a.�������W{�$��8��H�����y�cw#������g���w\d]?EogU����p����
�.{�����:�t@�7��c��AvH���mt��:��d#�������n�B��Bj�V
�G�������3��oN���JIK�CI��5�g���N��������W9Wuv�}�����1���y��������L���}�O��
Xn�l��p�gcc's���������TB5q/���B�@�&��w���h3p�67*�2G������l��4��i�l/	o���_�{��q#P���=}}������J���B}�)"�1���}��������B��q�S_������X��{������{�`�A�z��1�����]�C�a�C��}~z��������8-�E�i{
xs�xp�M,.�?��a5����N��'W+��������3��`���gz��O����<�++5����a����C������M�h�z0|SR�9����������q�)���4��R��|i�z|�z$�V�'�p���)���5�;���[�@����L������I�k�y}k1p
>��+0�w���{mp�6���4�h����.p.�F:@A��'�����G��wC��8"�@>�|ws\
��r*{{R�<� ��t��&�1^������EN[��h������{��nt7b5�r,{s��H�� �0��������1( 1��|���;#1PE]I�(���fL
��F�)�g����w�^�\��ym3R�p����d3�����f��7=0���/N/	�_��~~z~1>y���~��;�3������t�b~��g��B�cUN�>���{��H������@{,�)���b�=����Z!��+���Np���Nl��i�}q��y �S�_I[��rfk�D��By<�?���`;�}�>�l:Q_�7�`;��q��P���*��q�H�y7�������|yg�	�j,��j�$���>�W������vZ�+�j� T�m�xi�A��)�=�����]k�<g������?�/���������Sd���L��5+�Y���������@J�g�w���,'�4����y~:����������������o��_�>|`��BL;O[�U7�����>�e6���s���Y<n4B���tN�����&��Z���B�h�T9��Z)��)0�u��}6}%��R�w�����+!��2��K�]N�w/I@�8����$9��V{I�[g�w'����yI���/v^7�@^���|���M!��% �n'���"k�_��~�$��F==�n��B��1�e��^?,�tp��#'����R}�����{�v ��q�l:H�	�����S�����o.�.O�j��`�C�31j�H3���@e��-��V�R�,�������"h8Z���
�O;�h�?'�����r�|���Yy������]������)^���bo�I�7t���4�n��������.�U�g|�����m�.�h����v���|Oo�P����:�N�P��6�����tw{#m�y;u��d����
D��u�|}Q���
=WC@��n������_�*������#��7�����Y�����r�L��nME2��S\=7������vdw�)�c"q�DKZ0B��Ph�B���4����I-�3��"ZYuc���1%2�)�p���}j���7�^��}�����r�2�����x't`�T��S0m��Bj������f7<rS��?�y|>B���6�T}J0p�h�*^�/��08������eqs�M���'���z�>�p�Y������kc^.0p�Z����z����P�!���s�t��=�w2�z�6WgU������	����&2�zo��5i����������[��;��(�B���N��������7�X��2�:��~��]�w!R�[\��<���~`��p�������j�m��{����.�����X�W��������_���/g������v�����Bp���n����I1�e�Zu�����p�7��Zk�I�{8�0t��b�nCx�9���
���p�,�'1�!`�C'����&�@�atI�'B��N�����d�0�a�89�A�M��b�w����+��E���7��QMs��������(�^n@�,��%
������	�F����iV�Bv�6�h{��Q]sE�{���t�����Hb�����{1b����c2��<��#�N�? ����1�j��Z�����n}`@`-��uw_�K����y���m���~Zk
0����5������eNFb:������=�j�]��f�y��:���i��m*�����)�C1��z��B�^�{��hQ��y���V��x�M��.Q�g�\���bA��po��m�Y�6��B�/+Pef���hL�(;�������<��CW<{�:h�K��;�w�.9?�w5?�w5?�A5y�''��[�R;���p�
�gx�j��3o&j�����x�y+�W�����Uv��E7�0�r���,<������UPw�y(��}��W��zt����J5vc��Db���jg����q���MD����D�P��S�������R6=��^u��$����cC��hdQ����a�����p�%���j��t���f��JA�g����!k��P?8@|������'Qe_q��gU��{�����j��?g�7��$���M���.������C8�����d��X�q�G6!X	qU���`��N���?e?���f+EQY�������M�}
q�����Ud�����s��-�l��M�T9zH@�X�uHwiMw���\��~r����^��4X�U$Fu��8������7�.unim]���G_��y�+sNW����;=?s�?�<`��{��,���=���>.�	m�#[���rw�����w5<y��U[������1��,p����pGX��M��<f""�y���u{uz�^��2m����zU��>y�9\��>6�w�+�mfY��sy��3���pv��]w.��,u`c�W��	��U���z�/�Tk����1�asja���@�
'U������;1]e��~'{���^nS[�	����mBG4[��'�;@6���d��\(K1v�zM�|%��4Q6�l<��u]i� ��&�������oU�
�����8���0�����������;�v��B�4i��&��A��n��ri��XY������9�kc��
���s�X����{�Z<�\��d�x��m���=d���������[��lJ�z��r����'B�
F���_��cz��)�����K!����������.2�}���7�Y���F50�����k�X�0x��u�8tw��'>9X�N�W����nmu�L-�s1�OM���������?a��j����S�
�����j]N���/o��O�����}�n&@7�fP����;���n*���@��.���|���3��@m��aH�� �Ab�
�2@.�s����yD��T�U���u���n��L�`9X��h,�w+u����gi-��S������L����J1N�#�Ul��9�r0�����>�c���`��'����c�ec�a�/h�_�R�D�gF�
@A�\�Q��wxx�����d���f���!��e�v����$�����u���m�p����
��s�:e�{���1:$��n�`u�y���r!�� T:�Pi�V:�Xi��
���0�gR�t�d������N3@���K��^�E]���@����ka��y��1� ���z���3�s`#��������Bo��g�_����>�ap����kUz���O9>9bD���B�������t
4�v�{XAW�S������e����a���MK���k��ei�^���U�����?=(�2#��9}\��G-�}��@���I[?�1�r���M�?d�3�#W���<���t�Uu��!m���o�X�1��#�:e���o�z�H	B7+_t,u�/�2^&��cW���nK�~�e�2��� ��|H�|u>e�a��RJ���J��������y*��my��R�*���F�!�#�=\f�&X\�!�)���r���%i���P�^*�����+-$�,�����G�B_�6�o��>����!��C��~`8����O=A��0�dMX�p�z}��^�������:<��M����Zv�����u�]��8������1�e�>��M����M��G�}����.�9,���{�����?��?�z����Sm��q~=@�C+�N��GDC!��:��fn�>�E?H@z�{�yv����������"��Gp��,�e\4:�����V��hr5��LN�����,{�����{����<�.NO�4�8�d�6! ��`Gf�a^5�zh�3�^{YY��EwO:����D����wre5=�^S�Z���eS��;��i�U7`L���5�X�����5�N2�N����e}�U�b.��w"����,��C�]���I-�77U=�v���{�=��
=�����1M)�\����od�bDC��ld&�FTk�&�|^�I_x"�yZ���]L�EP"�����9 �����@����8�'����w*���lrC��Z9p#b� 6p5�0�-�!�l�E\7��l$�>�!h�>5�]��W��0���K��K�������.��7�j��Q-�$�"uvR��Nh���P�P����C�e�#oB�3T��Pe41�Nc������&�����G�����VY�pC��b+~K/w�o�W����0g:�2CK_����`p�x{ug���i�
�>��<D@yw�\�c]���7�� ���J��#�x�f��;�����1�|����S;��1��w�zk_��h��f�:�����0R9��t��+���w��`3{�� ����9
^,��WGj���-W�h�(�����D�7����Ka�=�h�U.�,���ly
�q�v�r�z/���]o����Q;2��2�)F����8_.��zwt~����a��
���aE��
k>�h@P��.`��+,���qz���Bp�+L5�h����p1�T���j��$SM��w�Z�R�Vxe.�+����+�s����Hd?�wf�0�\�1��peGm%��dz5@a9�O�8���/��^/�`�VU���OA�������y���<Kq�)'��?;�������
�rT��������t���2sl1��
Hc����^
�C��U������>(���:PT.-|U���=�y���V��rL.[�}H����^U%�sZ)����s6���p��{�?���;����8�q�y�x;�\@�9�W��$=4_P��P1�A���$��+6��u#��Fg�e�Iu���Y6��j�T��w�T�q�#����G��U�����U�\��N�/o����p�4��*xhn=��h.GD�)���9�DWq�?s9�YU�����y�����p�c�U�%��l�G;C#�en=r��e.�.#���������iy��s�H���Q������-��c�@s9ZUQ���������l	�#�Qc^� (���"K�m]�mU%��J��\���c�m�nz5@_9F[i���;�T�����J�(����?7�p�\������\���W�����u��#�lv4`���������TI�u��Xp���j������r���rV��F��2�*J��<n��\�WQ�eP�\�"�F��\��W��#�I�+��k�>���r��B�x�!z�z�����
�J�����*�}�T8���U�N�U��\�5WU���Ck%��&�o�LWU��b:��+"J��=�i���R�HwU%�O$��h@G����p�r�;��n
�5��\�oWU������r,<���q�\@�s9��O��]i������J��e]�.7��q=R�Xy.���*k�G��GA�ABM# ���X|U����r>��3P�\���WT�c�)�zOaDq2ur�Z�z����N��m4��2������hz_���o	��������t��(�L����:��������n����r�?�C�c������{��xz�������J��7�W��0
@A9��_@�s9:���u����vz5@�8>��
zqi���9�#������ol@��\������W
g��(��wd��]��V3�J4���?yK�����.��y+�Y�w���]<@3�V4�:�0m�����ekN�z�v��8������j�Co�k>}<��8^�-�[�Sl7c���;��l0k���q�/���z{����YG����,����m���@���]��Y8/���S5���a�G��"z�����Xu���D����hZua�
����WG�&�V�����&�c�a��d��uL��>��bgx����F��f@�;�^3�����q�*&�dy%���Ek
�4��Y�Z����&�
��<{k=f�^1�f�_�� QE��P�Z>�xu2������C)����j�`X�|0wU�.��w���`�#x�����[z5@s9�V�N�����r��HyC����������5��v�b��gcK]G�X:(�k'��`Cy~�Z��s���q��xk� �q)���������(�cE��u��P��f�.s(���e2�=c:�����x����hQDA��_�����N�o��m��@q}~�e�����8��,�hur0S0��6o�o`����*g`UyVV��a���	dD�"��c�=,��G�*zU8�����8���T���2�`Wy6vUD�T�����8NF-	`�w<���[��N�I8��Z�,{��9������]s�!o�Gj�o����E/���P���W���E����GW��\V��
%T��4Bs��BoW�W�J�P�k�)��� �y�����
@�82X�%�_����d��)�>T�~������0�&��hb�j�6�$�0�S�@����=��c@"�:���<V)���F���Y�Y7��Gy��l�?��}=���;�����(���_,eM���k.��b.�?2)��WEsr�������$�_�~M�5	��$���&�r�w%����e#���
����X��/Az���B��s�/]������R�#���'���
�[y�X�C�_a����>]�j�rYz%�����&����8��S;�o��/f������8�[����80��rl+�_j��)�rK��%�F��;\X������RQ�������X1#�KI�[���"
	�dI�II�g���m8�}�j����3Q�Pr�y����)�Is���Y�����"���+����K�\� �>�V��g���3���J����	S�
t������o�{��z��k����_����~=��z��|=��z�����������_���r`�eNJ��W��~�s����7=��~����y��/p��������_�:���������w5��x��H|m���M����v<��y:���|=�������ks�����#�_��w}��_�z:���1�d
��1�?}=��z��~�Xe��?�w�B�����t�I�.��{_�}����}��W�8������:����k����<,�D�Y�������P|8��Z�ld����f���~�������RE���pA��
 /����jv�Z� ��E6�di�.L��s�������l���4uZ%�w�(u����NrWX��3�?�wN�uY@�k��r�K�dB�B�h*;�����I�9��=��E������H	*&��( ��@�u~8�xC�
�K��y2q!W4	�KA���JQV$++�Z<���B�h�RZ;\��]��C���y��-�x��a�:6�����I�dI�}m��z*<{x�K-K-��z�,uhK>	R�d�l��1���/�]4�J�t������<�����[����1�c
�#?��N�.��JY49�pyh�c�r��9B������t5������&	�,.�-a�<�������q�~$�� P���	�
b1J��Q*��X�S�,_p@5�}r/��L�p{4�3�p��j��b/������
�7�~&���#�j�qK	I&�$�?��D��.�O�j����K�+�)�W@�C���szs��#��c�'��8x�	�nt���$�p��W�����C���BG��,~�,;o�;�`.���:�����@��D�G���� r����������:(��N@�0����a�)&���NLa[kP<:\�G�$�c��3���������
�e������B�,���Fra"��@
>����9bj;��`2�����J���b0s�*u��H�#��Z@-p���$�I*Ay��c�y�h@��N������u��]�H�t��$��p��L��c�D<���v(xq�g�}�B�����M�Z���t����~/�"�Q��tR.��d����4\
=����-��D��V]/�q!+���r�]A�
{�����������������@>���&R6{����4�#����2O��>��r�����"��xE�a��:��X����(�FDi��f�����O�=9���bI���td";���I�w��F��dj"n:l>c}o��
����A��'pV�������wB���C�`��-�h�h�Xc����e���}�d���t�T-��?����#�8}*��b�b�
�86Y����3�YN�4��0�o�0E�H����K(Q�d�j��z:lP�]���C����a�Hb�\�%�5�,a(�h��u(QA6*��45u�]���N��a���@�3���XvY&0G�^�����[dBe��]���������i����wU��h��H�o�dC��%��E����T���6�AtP���L+���B q,��@��"�9t����
+@Z/R2m`A�P���
,���Bw�K�r�;���i.#��F;��������x��������2%�M����<��C f�����
�
-T���_yW�!;}}z|tszq�U�����K���#_k��f��\�e��:�1861�`�t�@��MG�b��r�-�P�k�����D�t��+Qc�'������ [S�h �VY��`Dk����m��P�1��@S�t�����'U��������ET�`�4�K�\���8Ye	`��l��wBLs@���Zs����E���[�.�3������si�����D`{4cZX�%��Z!J0��6 p��6�s0�;��@
.4�%��&:H�p����&�������i@��Y_V�)���.�s����o!��� �.��� g�T�	p�K�loS��T�R?;\�g��4K$)7���vNp���vf����.U4�����y�����Kmi��r<���Zs�L��k��\����EgbG�4���7]������?�����%1N,9�.uuC�� q�u�.�te[��O�P����Z��)2<�{M����v�����.��V�n.)���d�v���b3LW���k��8���A����hww�h�OP��y�o�Avi��.-��
G�"���O��[;���%Q�h�a���7 �����u����a�L��3A-G�9��9��M3H2�pI�)/fQ���q�UZmQ>�>��Rs����H����d 7�����M����!��9iix�c:�����e��j����>�5<KG��H�,���:��I����c����8��V�/������.5����{��[��ih�\� �������qxG�������w'�K�������=<t(��3	>S��p�Z8�P��4"�k��M�s���s���Z��u�d�v�����(v����?�&Q��qz2�g���(�������������=��d���I����$��>qZ>Q�7g���F�jO_�����<!c���u2K�����nZ���)���? �����6s��$�.�D;�����B#_���
��������>_������r�����A�79�d;���V���2��Wc�g����k��g�\�l1�^��}����S�<2pTZ��~�'u&�z����Qq��d��K��4��$����T]���Ow@/u9z�2����b;�d���s/l���7o�_�sG���lR�c��i��l�@�9��^
�a��i�V��t-�M���s�����c:RQ1�x���oC1�?���`����B���"�wd��<�2]��i�V��13�6�k��	���V�����N��HQk�=w��������Avm�fjD��Eo;����	��H�eFjB�&~�M������#.���|]����z5@�9�c>�VI�_����>
�����8bcuS�t�7�m
(:GV��Z��
����'YBM�j9�G�p��EVL���p�^D����:�F��6���'2��V���{e[�f��I�|B���R}��u2������[%���,�DX%kb�>V�!  ����U���I���g������-�������U���H�]z�ZL��\ �LxK�g/�9.���$�y��u��p$�j�$8��v���^1��83������r,�ji$*��)��Ih���0��X��u8_��&���� ��sf�|jJ���&�R��~�A)Q��$����H��v�(��������5S�N�c�����������D39��^
@C��(��Z^7��4�&��1M�M�]����Z�0??:]��6��]�vX���"���Z����z5`8��^
@�(hl)S�����d�������fd�I��6���n���k!J�m��,�1���Tz�=�~�+�Z��bM��ta�L�k��0��zE@�y���������d/��t�����/����C>����H�l� 1�<�q)B��{�����Db���\`r;�)�sr�KP�g1:)
��5Ge��"�?>���1����$�������i��>�������<%�p�Hj�{h~��.�W��b�=�qL������>M�y	�.��$�M��4?F���:*%�Zb�E�B��/��K��!M*I�G�tv���q.�1��!FJ�i���(�[�g�I�
���N�M�Nn�2D��>'IV���T����.{H�����;�e���rdP'����>
/��>�������1A�r<Psi,O�cy���W@p��4�UN���E�,mQ��R���X���
�k�]����6���1e�u��L,x�k��vM3�p�\�3]��*|]��t9�f.�<�_�b�����&���9~{r���������X?��_�\�vHu&�@�t1'�.]$�0�3IY'��#f���!����]�$�0��u���������H��������1@U�����"y�Q<*�C�]������?�j|t7�*���*��3���.G���# p�+���d�-�x*y�������~@�t9*�^ZZ�D�%����������������7�.>��������r�N��v'�}K1�����S�x!�HRr�f��H��1�UW5���;@u9b(����>�h�u���������z5 86�^
Pt��y����[��b�9�`� �j� n�q��@�LwE����%�
���4LM�cj�h�.G��"O(�B�$�������/��O�__�MW��t9���n6�
��6�Jr,M��(
U������W�'������T���(f�x)j�h�,K�t�o�V��F@�Y��(�kg���h�:�����}T���gj�!�e����hh|C����}���.!����2��1�����E_G�t�D��O�%�Za��t9V���4y4�h�19����5d���<]��I�7��0��?�kv��T.6��$�V����X����Y_�2�7�����U)@�z�O@]�J��'����7��
X:&��bV�Z���<��I�=�$�>���"�D�60A]�	��Dy���G����'������|r����b��u�"���v��" ��,�T���S �z+Rh�x������h���f�����@�j�h������C����7a��/���z��q�P�;,x��J�<���8�g>m����n$�3_��voD�������?U�zevD�8�g����$~()�����F�����dP	wA��M�Ux�S��Px��#U�����^�\\^�����h�m�(�+B)�^u�;�DA1u�����TR%S�J���*.�b��,����F���1����S��M�w��x��U�Z�g�������h����&T2��d���9��T�m�1kX)E�Ec��X�Y)��
����R�9e1�e9���M����1N�<@r�X�k^FA*�9_�e(������6���R9�UJ�|���NWq'��$B_�	����_1������x�J�H����I'��9���n��Uwt�=�������8V�V
 �z6Rm��9y��@���HT����8�l�H�*�y
�(��s^�<�"6�h��
%��=@��8b�^
@6h��8���@@������r���/�E�F
eTZ��+�!�b��V��:V�b$>�V
 �zl�����:9zW�@6nfE#W'���T�����qx,����+�T�(7���=�Yh��z�S��
�5��^����W�P�0N�[�G���{���]S��X���)Q\�<���J�
�++�-���3���tS�������X�\�I�,������N�j�Bq�?R?,��0V=����nV;���Hr���8�7����X6�L���q�W�]l�L5��)"��������w-c�9����X*����8e�G)E�/9�Uv)@�v�=T6�G��Vq����N;eAWOKWm�����n�����
Y�����4VL����Z�S�YM�m�����I��c���/�r8q���nk�u�D���R(�!X�&��u*�*��c��T�b�9�rI�2��"�W1��:���s���(V�=]��,�V	s������N�B�����c��1l��4�������^��?�%��x�7:�z����w8��X������a���:��FY�d�g�I���A�1��L&�`R�V��}��_��@�K���X��V���DrXf�f�y&&����?U����x���z5^��[�������V�X1���G��"JY����^����z�
�^:r�LSe���������q�]�:�6�V�3@��8��^
@�n�i�"	�.�m�j�Y��=%�����;��B���T�X��b����^*�H�T������9%L�8d���}�L���������3���p$]��.\������By�8�������)��������=�H�;`@��z
����R�V��/w���C	��%�#�X�w��JcA�V���z�G�e�O�o��mx�����k�ceT6�J�Q�:�����I��|�����S.O���2{�$+�=TNq�}���r�Sf���P��%K�vtF����a����,����	`i���W�������"��?�t��I�����k5�<�9j��z5j9�s>��B����s��y�@��*Gi	l)f�L��+�z����[b����W�o_�f�O�Q���l���g#MP��`�J�b�,�������o���dCCZkZ�x�%�D}�������oB��sZ��4sf#
��~��y�pruuq��T��3���L��s�����^�~�h�G����e�r�	�~�MoQla���X>(r�j��+�p�"���^�����������?������$�0��� �1��j�����E��)d��nZ2h���D��c}��
��e�\�m���%���4�����?�>��p���
��������EI0��z71-�kBK����*�`�����Kk�mO���*�bu��?8�������z��{�(���{�P�=@Q�E����,D��IjYU��}G9���-����K��}$pQ� `���+����<����#r�NIT��@���H-O���W��K��NTW'skn��$���QO���>�X6w��f6�����V�vf�O[9>T��H�~�{I
5�%9r�,���B3���f�a0����R���4���/iU������;�?lrw*V��\�^n|���l{�7�������z��>���������4������p~:������}=;Z��
�'4J�0���}��V����Y�����x�z5L����7�]�J�qo��	��N�����j��6��mn�|��s�z5���Wm��oJ�����p]��q�~�T�/T���{
�Jua�}������������_v{��6 ��6�>�
Z�}��W	����S�I�h:������]�0}J%�s�!����n"�z$����P>���\$����Orx/f��1Q�M5����c�G�}}��~����3�b(Px��~�	����,��4����Q�Wi��:������;��a�:��N��G�~s�
������hMiV��hL����Y8Y�!��zp�	\|.��^
@���t*^a����4Y���?�D�tg�yz~�mSD�wG7�'j>`��6��Zyf���O�D��e�������
��4g��d�Z�{���C~����
��>��(�X�~�~Qj�����s����~`:���|���9�r��qtwGg�E���1�d����/x�t�����F"�����7+Q��e����v��N�U<���~u�brA�A�NbS�������.��z}���z5���y1Vg@�����E�3�������H�����������y�8�4��<cch�
��$Ms=����>{Y�� *��n�J������8�n����.`�����%*uP�}+�^+�(��?�j9��]T����!�8G������F@���$|��nO4km��"X���aO���@��mt{Yv���Ay'Q*���UGr�}
S>.������X3(R������.�+.NuI$��,��Y�����b��������fC�U�U�<��h�ha,oQ�\�i�T
 G�/	U����yL��sI��5e$CN�.�8�^��*�B�	`�������,(�4���e�g�Mz?M��Gb������>�/�V
��]�P�}k@�|&�jNeiELs��������`>$����T�9��8���5w>��#FAX[�L}�3=���X�Q	+Kb��8�Ry��(�r���O��[I4
�����5���U_���o�������!����������/��@��z@�����\P\�p",��������V�rT0��lLCOu����q�I2��L��I��XJUx���
�c������o��E�r�#�~=Q��|��P@�������eZ�dQ���mq�������-���a�:Zz@g@�;����>���(uv?,`������r4z�	��(-��E�G��N�9�G���qAa����}.j�^
�&����C	drb/O;W��nX��>���y�y��y���y�V��������R�*�l��4{������}�F���.Z��4��Hf�YN��*���(�e��^0�����������
��(���d�P�q��������iC�"K�-����,�cKH������@�'�.nuX���4d���}�t*m]$�K�H��H���G�"��!/��P'�K�R�TFA�d�x�F��e���;~E}��C#�/����I��e��D��}�#�Zy�'=�'����5L�a�d����k(��������@tlq�s�������6��~��uGGq��i��Pk��k�I{b�;����d�������~�������_�	@ ���R���}Wz��p�c��Z�-}@p�
9�9�`?N�e�Ekx%0��AC���b-�Zf�Jip0>��)���t��P�Q�P��\����%�������W�9��������Fp_ �����
(��&�����a)KHMa���n���B��| �X.����[w���m@�on��m:}�	����e��JvLl7!��Kq�tv��z��d��5�}EY�U��-m�b��pt�V��={�|����?Hp�Im��o[��[Jk!���v����C{
��#������!����p��.�������K��h���M��kM��	h�mk����>9s3ELO���)�����6��G���H@�o��pF�}�	��(-�b4���������"�S)�H�G��%���D�d?�.����_���C3b@0��M��E���0�d|U����"}P]Tj5r�P	�Z�S�6������O>���������iA&O��/Y���
�ZN�"�.�q���m��o����4VA���v���e�4s��{�h����m�Cm�<��*���
@p!hs.z5�ln�1>K�Y��&��_$w/V�#�L����Q8���T�C��gA�NJ��m�@P��{�q}-�m�y@/�$tq85� �mi���9��=�3�6pOh�[\eP����Dv��\����/Gg�����>����}*�e>�G�(�9��u�P��������e�9���k�
����-�sjS�Hp����!���������$wg4O���(�w�g��"����u	���30 ����EFD��;�tz9O��6'mqg*����$C�^j�Vo�����=*�7E�PF�X���p}���:�~(���}��u�.�zI����]��@E���D�/�RZ���O�jJ�0I�'s��v�������_���/fY��f�Y��9���k��\H���l�B�2�<����(�!
�Mb��e�r`�N�&���I$�#��.��|s\�='@��@:��n�:�^��1]d����
����\4�B�*����z5���P��0����M���K�Y�{`��A��% 	Sz�����������r#�I�l�n�#�\dH\�-�
��Q�(:��6�WX�	z����A��VAE�W<�h�=�X��mJ��������m�}����"-��c�	�7SAM�M8����Z��.�����l���
����OB@�ow�8E*�B��Q;�i�����}��L ��������&���O&a�@�E������������
h�m��D��A�xzcy���1]��@�I��l
���/`(���(�hw�<�*���/$��R�?�me���_#^8\*�6p:hC���pAh7uA�]
/1�%o{��v��\��Wm_��6�l�Hb�g�n����k�Uh_�����=pGhs�u�~��r����$�E���`�������k����i=�������SA�]�,y�������9>lX���t�RlPV���{h�Q����;�b����B/��>�X�S��q���
:��DyO�E_��7E�I����7��s��$�AE7U6
`�����x9J����l����4y,��l}�FN{�}�� ������������c���]��KH�s	��@	�8��\��c��?F�����r��������8�%)>b�'��G�G:F�E]2�E���l�2V�����_pk���SF:e��D���7�6t�XkBW�0�|�Y��dr���|+���"/��<,�\
�+���rPDETg��Pt����'����wD��wDQh�Q���x>��z>��������9@��+����z�vp� _�e�9�8~��k��0)j�|����p�]4Kn�����y@��'] ;�t�B����E�s�0%t�M��R�b�L��kE�������:�9�sX��B������(����et�q�i%sy@/SH�����S�Uaw��2���D��G�H�2PN0j�n��Dv���vG���s7^��6]Ln���#`g1A,���p'8�X�;��S���/&�pJ�\4����a��w�{�n����#8����PS�Dv(��qM�%*r����y8I>��=9�')���Nx5t�Zf����=��P"���,�|�PS�������Y�Q�FBM(���������^��d�r$����u��B�����v��A�I�r9R����7&��1a:�1���c��H�;
�-\�	��O�:�_���+�,��'?��/������N�n�:��OU�
�N�^}v��S� o�\7��N.��(q��n��R"�=]���'���Cj�
P�����q��.����'W���������,���8�����������'���;:=���:���$SA��i+tH��M�j;�|:u|�����q�����'|��x*t��T(����\���\�����v�+?d�����C��M/|u<�bL1fu\������N]��e�un�F1e�C��MU�P������a�O�N]O��@w9���&��*}|\�)B�y="���Q�c=�h�MtPN�*�y1�E�����c/z'�� ��I&�&p�����P�p�.VlN)�F�a��@	����K�S�Kb%�n+����n�����;���N��z�� ���m0��l�D&�/�T���U|6�a�ntj�n8�1axet�ze���>e�&�u�%xdtPb����
���l�aa,5���I�����������"�	��W7�@���s�$8Ut6����j^��%�s��3��.@���-�3��2neNgsoO��JB��{�q������@2��<����3���T>VC���_��2:\�F��*��(��<%�X�r����a�N]�\���"O��X�e�,���N:g���h����&�X�I���e8mt`f���%�^t���E�=5�k�j,�����9_h�_��kQUp�M�>8��-��\i���_b���<���8�6J����p$zd4��$��i
,:\6����i
|':�/�5���5����&��]�/:�/�<FeI<��m�6#%w�?Fg#M�|��.����\1�j�t�u��H/�;�et�_��V���x���(B(�{��*���a8�����
�*�4�?��G�����@��B�J�R6B���a@��t�&z�*5ru�#H��#H^�i�uX9��n��y��w��N$��D�@�������A/h��'�GA|�$�^xtt`�S���������;��&I�8kt6u�0�����R�2����Qv���A�#��[����|0�,W�/��o�����K�i(��&�����t��d��&���H�M�����'y�htz_�w�J��&�&n�)*p�����v��Ns���!���W@�Kx7T�������l��S����%��`�����y-�!���b�[���n.��^
�������B������)����UL�_�����/&�X�1i"��N-��]JL�����Z$�n�7�N-7�����W����`&h:L��^
�KA�`���T�C�R���R ]XJ�2����Q�����{-����;] ��/�&��+�H�������R��;�}�DY�-O������c����!yz�`u���R����K���w�.��P�����8�E~f4�.�X�n2���������Es���|A4M�DU��0�LK��PH�.�[���[(��|��o�^�}���@��.��B��]�q\_����juL�	�KX];�t�7���&�y�+C9cW2���N�0v��BG�����u��KI(�\m����.��w��'�>�:u��#;�����������)������tm,��[�b�:}��b����d�����5$M�s�Tk��2���0r�_����dO^s�J4�4������.@Z:1-���%�]���$�R,�G����H��MI����Nh(�kH��0�d�^}9��.��-�aq� ��GL�c�o�?8�N�Y@���%��e9�+Y@��z���L�j`���|�c���e���q"i�Y5��e������^����G��Rw��['��6�iq����2��']����b�+*��w�-A���H��;:� ttu���/'�w�N�Gg��z����w�-a�$�kb,�sb��n�}���b�x��m��^�<���'���%�E,y}���]����t{��p�w�T/O@���M����������q��D���r'��A���Si���\6_H��4�E��VSv����/Sh��{|��	�2��2h9��S@v��
�����.��wm���]���7d���3{DC�p�u��4Q�X�)����w���W�h�{����2h+���b���#��wY:�1X�T��H�D{�F9��5�ho�]S)�(9��[V�R���u���@�r�q}w���vV����J���{�N20��]���}��8�mf���r�n���M(���hz��>�����a+���t��,��3�����m7�Gd.��I�f��U��w�����LS%�0H?�]�ex}sur����
��%����u��n�����_�q��8 w�F�_u��Rg_\^Z���6LqZi�k�u���-D�0���
\�*�On;.���Y8����)G�n x����@��v`Z5t8��E������Z�xI�5���s:���x�(3R�x�	u��J�M?��$��V����zl�.`�w�����:���q�RF���/�@�	��%��e�nv<�Ww�A���g1�7#I
C^l����@0�,�����\GK@��Z������v���Tl2��c}�H�]+i��=�u�I���p1+r�q<]@������F��Q��m`�Bt�i�����tw���w�h��^C#e��0]�T���]�bq���B�j����L��"�Y�5�4'�N�o����p��OFD 5��CQr�F��.����c(��f������m�]�&�F��K^s��u \�n������~����dM�A��qH��)%���KkH��hz�8�\� Z��uV�P'��R�:�����q�9S6���V�����[A}0
$y�.bU����1wk�r7��T�n��t��~�R��x��nw���F^4���[�+1��\�1�2uw����V[&����������vE����;��4��R��+�I����@���!iK
 Zwk�?�y"q��-�O\���9�z��&66��4K��"����YB�z�?dN ��j���GY�[�O�*v�����m����A��{�[��<B�r��A���?���������]��1�����j�p�;l��l�e��*Z�%y-��
�0�s�{=@���_.P��\Z����������j�@0�{�[���@��&�i*��0G�aeqX�|�1�g#fkKWP�{(�z�����S�=M��;�:i���=�z�W��i��t*�=���Uz�=����g;�%�<���<�3�-��U�H��mC������k����m�_��$���k����#�I��M]_�E��)E�9���q,l��0�Z7b{��cY����^�)"�N�yJ�/EX�)����h�f����s�Z�\G�����(^��v��C\���|b���Y���G�f�L=�w�mC��xKrp��L4D�
��=�,��n����/I�dq��f������������6H�L����#�e��������=��@�m����b ���9���cY��K.�5���{M��d�Qw~�R�*w*n�7b?�f���������Q��bt6���t�����r}�r��q�\n��Y��[����:a����8�Ml�-�l�����|��'����oq��{����$\����t�g��u��v����t��GV��{����TocW��=���W��}��y����{�:yAC�_N���&������	��,h@��q��j����m���������,922���=yg���sB�5�X}Td]�E�X�R��"�)���<
�Oa������,2�Z���Tl{�Q������u�F�
x���c�����
e@������������\����u{.�k���TN
���]����&�V.��\6���E�:[F]�������"f;+�aj���/�.n.�/��9�8;�9�8���{�=��(t�zqZ�Th�ik�<T�PJ��W��!�X[��V������u��m|��U��{�Z�`a�`h��Z��w��(�8/=���b)��
0�{\h���th���6��)�yX��oI`��[�z���uke4�s�NTkc��������?�����f 3i��T�(��k���.s�[�)$ ?�7�xq�ZV���T��GX^oO������n�c[���C��[F���;�yb��W't������^w�h�����->q!��j����R���C0V�4��&�j>GSVV(��~���=�.]Y@�M9��5�D��Q�{���b-��1�z���k��@:��t���|(������0�{,3��k@1�qc�&p�{�r�s��(l���{�� [L�2w��<`�8f�~p
��~�s����{�B��(�z5@Cm�
f�gs2��<��(���{L�]���z)��8J?�s�{�&�T�Bzs0
��=�	�WT�R���T��a�P}y�\����?��M����r������{T��cI���8z����ySo������\����!�n��9���������z��Jpy{��sV��1	�����.�1�#�	�4=Lu�f'��A����1�Cr�{�~knb�=����c�e����`;vI�"��?;�$��h�
��}.��^���6��X$o�.>y��i(�}��y0
�q��o,�}@)�7�K�y�EG4���gP���
�|�����y7���h��>�Y���[}�`W�8;�Y�/���\|2�e�1�(f�=�q4��Z�&��zn�}@�~�^���\�g�;��9*����7o�_�A�����{�v�aj�g4?���i���d�O�4��5)1�1h�J�����&��}w���<�Jt�v1{+�+�-�
�Gg���7�qb��\����zd4���'�P
�5�Q0|��2|�Y
��}���Mo��V����l��r�TI;�L����P��
��
@��
-�\�8y��W@��s�^]��58e]��|
�b|��P�/�������z5@����-���
h�}.��^
�^����*5�Jq=��Z�4��u_����N����}@��7��3z5��wx�i8b����|�j���A?��Nm����:G�Q8S$����>O��������W%�([����}��k, �f��h����Mc���b�U�]z�ZL)D8V���i��wow{�u��PR���c��k���9~-���0����>����(�TJ�]�q�El���������V�d:�k,���[N
���1�n�T���+�m���'�����H�c������Wv���Jpa3^g�hz:�K�yL4%/m�C���k�c���eo�W%�sZ���k�Fc��P���*#�<|�����t�e /`����R�w���������r};��D3_PN��PN�(|$kD9�
���w���gf>�������!J��T.93���4�o��`��#��1#�i���:i�������RC�I��u.;z�A:
j�P��-�~��X����B��DX=�d��=�L���W�����#O�5�� �[�M����2�2���>��T����b
���m��T������6�o<�\
;m[1<7	-\.�����x�N	�R=��|	V:����7��$�p�8p~�r���|���-c1���@tv/���Q�[&�8{+�gG����^K��>`����x��Xj
_�f��*3��a8��������sL�r��\@����k�ob��n�.�M�\:��&/`����P.Ja�~��n�Fp��m$������^������}�nl-v��R9������o��(_���>G���xY3Zs-;�����������������G�������������OY������z5e�[��v&3���9��^
��ZA����wk�����]�N���$GA�&q���������������x���EY����s�H,���sl	���\��&p4�
x��~
�S7��������.X���jjm����h����eibfw�����6��4*���4��f���������}��Z�%�iuzw������6%f37�����b6����/7���p�_x���x:'(\��[��48N����y����[��,6�,�')�e����t�|/,��l^�d
�y�ED�I������LEK�9���������8����������q�?$��a��� 6%Sy| fnK������w���#S1F���[�acdCi��z���c�����:�|i�����]����:��p��x���ux�R�'������=��
�O�i6��"&&�p���(���������Y�4�@�f����Z;��8q����0��P������O�QkG��F��B����/����]2��� K��q���e���Bd����P���I(P�c$Dz>�����B�B��p5�(�E	�����X�����(`.6e.���6�S�Sqy�F".]��y��J9���8��!$��<��Z���	k�\tm��tX��z,�`1����wZ�[��B{����������b��P���z1 ��^��W�X�R��$�-����K��'I�"W-�+���t���Bk"��4I�{����d�<s���(U���
W���3
F:�>�<.KP�7QI2�dHG��<�b����p9�i-��8	�P��p�jS$��z���,��.1��p����j]��W�w{P'D�^����00�4��`D�����{P'rr�k������1M���T�o�?8�N*l��=�p�+K6m�>�nyx���\`|�A�H������,t��tH^DKE��s��?I��r�|�4�K|P;V�����
e�Kl��O�s|`c����`��	�Le��u�C4����2j�Jo��uB/S1��
I}���ny�<�~cSub1���	 ���M�r���y�r���#������c?(P�`!��i�(��U���`�:��I��0=h��mh�����P��P�K��07O�Q�*u��]/K��X"s
�}���J�0!0�>_�q�������
��^� �PdR�"�*0�����O{�.���#��&fX��NC�*���9u��B�����*����W��)[\g����N���x����6��^�w^��yQ�~����\a&������uZ�=T��'.D�^
�yY(����n\F��R@�4�k��Lt��b��N��z�m2��="��g�'��	9o��{NKL�a��������i�e��5�U�.�������g���{��l@��jl��#-l�+��x��S.�����{����Ve)��-&!�<��UE]�
�����b^��;y�D���?@������sr~q}y��" B��z5�z[�����r���+����$��6$���N�=��]����\$��R����O��0G����h��y
d] ��\k[Q
�����*v�=8�������8�\)�Ma"�a�$N��;Ks��$���,��4�yP/������T�b�^f/��7��H�c���AS���3�s�����VYY�p������_�z�P4e(���u$(�q��jv
�eO�u\��"����e>c�w�a��J
8��3�Wt}`9*)��9�"�9O����P���I�f��
�d��������-�h�g�0@*��B=y�ih�j2���yy���NQ�������V���,�ZRr�c�L-*��7=�0j�j8=gE=���G*��$��s�M�)����YL%��^$����J"K��'���{��GC*���sV*�7�jJ�_$�������TNKJb!9j��k�z^��.�j�z�!W]���yS��b�Ib���
�h�����]|����|sD�=�RJ�$/�������ym���2������s���zp|f��Mb'�1L�E8$�~�������^�YsD�:����
�
�Q�'������$��z����� ���6��`��p��M�*��lh��`%D���,�j4�s�C#;W��ot�,� R�h�EY������b���U�'���i�����YS��� �����YE���FM8���E����bS&�6SD������ "�6E����A\�u�<T�m?�se����=��|.��M�����yS��N�
Q?�MW��(���� ��j�o�b���C;���7�6�Au�v�6=�7��\mzg��</���}��p�s{�\^;E&&sb�������.�e0�'�0���5���lz���fg\����Me��({���?v��RQ �Gm�����k:�G���Uz��"DV������5�������g��C0��2z�,��x��17y��V
���a`�oAh]�MN��C;���o������H��6��8������oz^OM��lz��(N'���6��c�	����I�I��SA�\lz�A��������|C������v7�zH�xh��#��
K{]�����p|�>Z]"E��P4rS+"�(�Fev~6=�����
��F��&M����)���=5�����K!.sT�f�X��b%L��9�q���t�I����R�`����g���M����7Jy��3k;������3����B�9�6���(�D����%o�'S��)����p�,�T�:�2����&O�4%O�h:
��Yq(��R�3�$�&�Y���;*�� 6��>Gk��K,MQ���9��)�Z�%�!����UC�>pO��2����3y���7����y��e'&�s���C���CN0Q-�8Un����N�{�RH��,�"��BT;�9I_]f	a�	�/�eH��>"]<��<�
�L ��C8�(��D�<���<�8�3�(�-��d��B��e?�v���L�7���S��y�����!���:��*�i�YZv24=�y�e�(�s~��;�4J����
�5���C6����\Sv�2=��*�F�?]J���@9��,I�/ ��#S)7"6f�@�9��QPc[�f�l�3�����]Fx7���6�L���cB���$�_K�=z�.��!��A�%Z	?GY�g�����~ �wAl.�v�1=��i ��8L��]o#�K������
�����DS��<�s���������&�9���J�z#��e�Z*���\�P�����^��J�%z��v./=��m@l<\@�Zzn���E�m2�s�R���d�����%i$7Y|H���T<�a���s�DrwB�]�
�d��<h��������\���IC"�St'�E��i���s�H��r��C8�����X����
���O^)����@]a����H���s��Q2�*���n�;����������nY�2�Pd�A�+���=����<O� �3Rr�|���=����4&���!�;E]$;������|3 ���P����Y���	���|.z%�b9|J�p����{nl���bznI�=n�=r�gc�����F��=L��Y�����y�����.�n����pr����$��Ng(�h07�{\!�7���Utw�o�|��8m�-�����M�����X��]���+iM&��R��b����gz�y��M���T�����q��z� ��iG�P�@6_�2:�
����A.`J���v�3e��������{�K�M(K.`X��
`���+�V����Q��[����8���k�#]����=ld����#HZ"��c��$����CR�U
�5Q������<v�r�U.O)��&P�/�Tv<s���T���R���6���R:�K��	�JU�I�E��6�\4�����:���C^@�\n�����A~�e� ��M��a���^�|�������n1j�-9>��[���m��oS+���t��^1{����g�����![�{%�rYed(��2����W2,�SF���)�SE���I�|�_i�j'�����x�����.G\7�������Z��%�����,���D����'����y. ��^M�p�c��|.�~�K2PP���KBq�.��T�-f�"58�4�o��|q����*�x����)V &D6z�F\@wYZ�������E�b
��~,�a���^��,�cG��
qT�����]2���6��D%a%���CM9���vR��m�.
`v�l��]j�.���C:���gP0�@F��1��G2���=�cN�+HnC7l�n)L��;+��l�i-=f��H�G*=S���|"r���
��U��
~&��L�q2y&y��~���i��:��8����;��]����\@�v�
M'�#(��`�\����r�p�6�R������]@�vm�_/�xm���*����l	`�-���JU�=��d+�J}Bp0g���Xt#��[C@�����:�q����F�7j�(�Uob�Pp[��
�.�C�����-�4���FY��}��+� ~���3�\�2�X��l�$`@�m���}(�*7��A:��f9h�(�Z�����zoc7�F(�7����/�Vq�����PM�*.�M�oZ�i�2@�v�����Vr�C����`9/��)� 1�,�W���~�p��-����v^�Y��\��v9��Q@�j�l��q���48���I������z=�mZ�q[-K@�vm1�
���]�����ei�E���#EJ��9`,�{��X�.��\1���m�Y���.�Q�\����M�8�n���T�	H�n����n������3A��v9��Q���Tw@�vm�ie��M2�:?�]���`i4
g�dbut#�a�5M�8.Dk�
Z��;�9���c�2��b�����u��Y4WV��g!�i����o&w�D�l�m!�HX�z9�G���m�6�r�}��A.s��3�pJ�r���U����/�Y���'���l�N�vd�B��K����
��N��G�e������ggM��t��#����t���VF"���y�M�'Gfrq7&p����8)�x�;�Lm�cj������~�K�h
����C6~���n.7�*��nM�����F�mP�]@�~|T�m��B�E�K�"@��=��Slq30�����=
�Y`������!���8��a8Ul�plNo0vZ�9�84	�L��ro��4�6��9h�i<e�Y+�y�����]Q;�q����a���F�kcD&�@�6��:G|���b�����)�9?���a; GS2��B��H;l���y��c�DX�����b��Ni�e��:���H�~~��|$`� C�ad�%	��4m=�	�8Np.�\>�L��b:F��`{6��	��{M8���/1k���y��6u�)���&��bX_:����Qi�S����:^�����r���[�g�"����sB����jQ��5���_��j9����kB8^�/���C���\l���E�	bGI�	��zR����<��`_���
Z���k�����k��������0�����^��{���5)�*d2�� {5�� {��rg)����8U���p��&��� {n���������������=����^��{�"�vR`{Ma��h���r
��m����Ug��R������d���6	���I��:{Mh���fK����dK�'1*���:g`N��5��c�N�8`)`��N����������z2�~�9����������F='9��������P��2/8��(�Y)DY�4*��B>�>P�=��l�`��W~.��d:~�"�����"6���Er���I����B[dOW�����e��c����m�v!@6��i�&�������`${���}�b{����w��Z�er,�*I���������RP��&�7O��������z��k�<.�����F.�fY�l��T�r�����zz��������k7�=�JG����$����F�����\Ei*���[����3"���@
���5	X�u����XYj5�S�&�T���~��k7@?S�u�0?d���n���*����G,7� �i�p�M	��rs$������[@�l$qc=q�����R>�]����Z#��p����<�n 1;/P�(��UOf3�j��=@��l�l*��q��t�\���?�Y"vc���"���^#���X2��k���RL
�$3!�x1k�r.9��� w{��1-���.������A�5a��OJe���G�G�-(�G17����y��|��|
��<�+f'`�{�-r�;�����pP��n�"*�������k���t��X���������)9�:��������C}x���q���RH��&z����&����l *��(��u�����Pu�����\M������I�����3����w�_���|x����=�M�����eBf����r�xR�����z���4gw�0l3�]�	��5=�����/�%��8�duo��L2 0�=�������67���c#������b�"Q��c���L�h����$��r��w5�h�B�"�1��!uURJAB,i��"W��\��]�|�����c	������*
���B9��zT���n6���2�!}���|���`I��d;Fw��$�-���{���t�����d��9�x���F�7����w�o�0O��"���$����p�d�P��9�>��u�Rlu�^u�X�^CD����H�*s5������qa��#@%�lT�3
�>���gc��r�����hX[����M�>T�������
����0)>uo7zP��	ke�&��	=�d�$������x�3���&�����Rw��)�kY[u�����Z�g�����{ O��������E	QM�f��3��y��;��'�~���>`��[����V�a���>�����Qs%%���<w&�HF�Lb+N�?�7�z�5C;\���)#L����[LsGd�PB�tOq5�E<vvh�j��.ms
�30�}.���x>`����������T�����a���K���
1LE�(�i+����[�/��>`��������>`��M���0$��o(�>G	�b����.8���/?�]�ex}sqyy��F�r.��mX&Of�p�~����}��]��GW7��o�W�nqp��r ��w@�����w�:9zW��C�p���������9�����i�@8iB��|�|	�`
�\~��vv���G]��t�Pg~���m�b�o
`�	�����;�;S��|�D��0��������/>��W0����H��t6��1�M�P�}K����B�RD�1T��ZQ^�Q��~W�����D��<�[�7dha����=^eI#�Tph�/��:�G�I�0\]������
��>G6~d��QPqC����V��0��]�.�_�??�9�8]���l=�7�9��a�B���E�Y���U�����b6?�gR>`����
���� ^��Y=Ph.2�4�~��������#�	<33�����-0rIr����!�(�UP�|��mq����Q;`
����������>�S#�����-42Qd���O_�-eqd����.�e��i����G.�!3����6�n��#W�������sa��F��p��5���������&����LK@�m���#���Cz������?6_h�L�����G.��5|��������wG?�����������&KKX��A�2y����d}[������������������������a���[#g�hz��(���������EEk;8��Q��`���{�@��!�e(>G.���/J���9�o�x�k�h�N�r�%&�N�*E�R��4�H����{�
��Z2�VM(�j;�&����`%x��|kz'������l�&4[�R���9;e%H�m��je�S�~��P�4���"zAy`���e	�����w���������@�<f�}��T�h�m#	��	�w���/}���X��7	Vm�i��aG p}�n����eg\�/��X��7�U����P��(�$�u������UR{NK]�-���i��0�[~�ul�%�oBI^
�2vy�_(D^}�K
��~��:�n�R�V�:��E/�lu����W|`�<�B��[M|@��9B�Q�v[�n�P��}{�n�<Y;GvZ�\R���H��w��q^�'2:	��E��p������Q)@Q{�ncO����Tf*D���"�nns������R�.��Nn�_�i�X�,��}�'m���eI�i���,�O��P��E�9����%��h�I�m;:�'��P5?��[w#�Z��y���{����s���:�_����>�9�p��~CG2�Osruuq���.�z*{oB�K�Gg�<��V�u������z(���
h"F)���D��P�W52��A���k��70���)�o(�:��m�@�cd�����E�W�!@�M9�����o2Xs�{o�L�.+�_���s�m��_����t8	)��0���8l��z��s�����:y}���fxuqv"�������XE7�\�0��}[dq�ZP�}[Xq5Y�������:SJ&���s�����)n4C��l:[J��G�������tn�\M�L�:[���|EbZ��Yh�8)���C(�eD 
��2 d1�����]��y�)\�U[����vM�]8�3e�KL������Xn
Lm���p\~KH��-��N/i
v������
��m[`r]���o�n��J��.!��b�
,�	?�8]I�UDVnT������F=v�i7	-N�0|u�w��iZ�A2��m@�n7	*^�"����5m@�ns���T�X;����/M4�^g��KC��t�����nA�$�o)�>/���M��sS�t��NR�(o�[�9�$m�I
x�m�C����w���w�O��>I����+@=w��"US�t9�O�������)���+���m(����S������lV�$S�E���������|��o�^:��d�$~��d���(�m������n�[x�^%��r��{�����n�q_N���>���v�X��s���Dle+�9o{0�}�(?
l6����P�7��"��mo�S���!Y�mN�����|N�EJ�q�uQ��7�s'���l���:���C4��9��#�=����-P����x�6�$8�o���-4�����9;p�[o�]��6/vL���I�����}��1��&��WC�"R���om���@�m.��g�v{{�P�X����,�K�koa��Y,�|(�MB�k����h��`@Z�h����8�*u;?����D/�}� !��&���q��R�}�ms�`��Y������9�_���7�m.�9���g�9��C6����-]���tN��~���V�]�;
�f��yut��]�m��_�R-�'��d>J�Uq�������;�e
p�hsn�������M}���l�~?��'�/�I�?��O�_������������[������3
���`$�_��,��oy���?f�B��"�B�hr�����g����������Y���<��K����g��H�BR��O�<��\b�?���p��}GL��y��w/�=d��q6�O��|�����av0JD�}x���g�gN�����,��&��tv|�'o�q��6u��TF����6?Q�?d��O�����7P��-��vk�A)�A)���h��8���m���(�n��u{�����c����1A����{�;������s�K�t�,��q~7F��<L�����������8&j������s�����c�����������W�A{��U��"�A|ppr����f����S�IH0��$���y�}}z��8�f?^�<��g�KG��|���Z��w�i0P����a�H��2�#q^8�j���L�8����ZsW|o�L�������G�F�/���S�a�=���\�Q�>�Qs����@��|�����/�PHI��?J�����6��f_
��B���X2E�7,>_�~�����W��Z�
����z;��f�X��[2��VkxE
~G��/de�7����n���g����,��Mt����/1�/���I��9�^>����|�����N�-�e���+�[��\�e��	��QX�@t��.��1�����������F�V�dAe��p���^�J?,��b���i�L���d�S��2�\����i�:�p2��O���n�'���t&,��������n�57�n��;|��������k������_*���������������^ W�N_���R3t�/��2�Z�F�^��0B�U���w���S��z�\�%�{_b��7P�r�_������o�8�<
)�����+wm).[ ��y�x���*�
s���=[�p�F�9]u����[��v���a�n[u�-��/V@y( �62����W�X-K��U��X�������u�Wp��E*��b�/�������i�(�1^���������C[��R_O�w[Z'��R�Ur����S����,K�'�����?�+T��,W���/�\
i{���g��~��&�@@.5E���Q�%���Qn0~5�Ur=s�oQP��x���.;��a���
�c����%�r.lt�v1�����;T%�� �����R��#*���1�]��&������7)�!i��r( ^�oP�>�hVb��������.,�\y{�:�tb�[������[]�����<�~�V�
����b�7N�	�+�zy����k�]����6�V���P���������_�|���z�������������_������N�z���������f_����a����&���s���_����=���=�/+��{�_J���h���_�����}�G�z�����}A�:�����_K���n�4����jo������v�nW��;�����F�?���<w�^�}����������S��s�jg
qM���%v����~_���/���d�']o��+u�w����i������]������{?s���
��o�������!����q6�7QK��h�<'hF8�@}�\�����<��Y��r<[0Q��I�/�TY�/d
�$
�IKc
�$
�ofw��X���H.4�.�+���d�d�G1~N rTK2�L���`B�������x:�}Z���M.z��A��k�p!&w�^`.�9NfOsR�5�q�CZ��W�-���� v^����&dB��'���0�l�p�c�w��P�������:9�9}}z|tszq�^�5��(\F9���Hk��q��Brq��I@��L��f��$�X�6�����h:�B
U���ha�A���9��KS���5�5��K��:{��b=v=^k`1��l\Y=�L�����)�oe�@_]F_�Z�~��~j�!�q���8�Ft�=
�bfPh?�nu0��h$�a����G����<'!��h�NH�'�=��h��q��d2�t��p�'Lc@=F+�Z*������?{\~�Z=%��[1ZD1�`�LS@����fS����2�0� �����b8JP�j
��aAV:��Q06	�
@���t*L��&z,`�i	@_���G��8������@���A��MZ� �i��_ha��������7�����d��&.��pa�Ijn(P)��hN/�������������-g��S�1��=��h�zi���nw3�/����m����t�,a�0���z0R46N'�h2�������h�n\��=\����C���h6b����.E;�0�F����������"��T��(��)cf�c������jj�J���k@����y�1���v��aG�T�<N9���pM��TV�a�`�����k]��������j�L�({�Qv-��=(�������i���\g7n�X�<{�tQ�%�!��->)�v�z����������!Q|��}��P��%�t��i�DF�@?���z���������%��"��
��I��XJz�@;��1e��B�n�t6�3���%�X=k��0�I'��w�%��KR��(l��^�����"P���?��������T�[+�|(q�QbZ�����Y�-GX�p�mE�R�1���>�h����L�=���41��wA����`��"2��n������ ��*�����C<�#K���t
��.JI_.gb/��f�QJ����It�T Le�/�H��-8B1i����e )���������`L�D��"v���0��(A}����//�+69�M�4�������s���-��&@�.H����	�^���|�>�*���!�r����UWNF�����R`qe3l������������V�����k�!�{�n���&rsF������e�T�����JQ5��6�`5e�=������7t���s����#B�+���-:Ct���� 6�(�������D�����Sc�1Zp�c����,��LD4��}J�4��=23� h�A�����:4�_>�m����	�h���8��k��tg�T��,��9�_��+'���v�����J=i/9!��AL�N��i���5�a;�A[��n
Y���*�${�'Y&��aF��yHL�1��}��-fc��*��b��R*8�,0Z���#���(]�L&�xx��<w�I�b	1$@�g��<	J�@�����������|\��O������M�75V�>���&�i�T���_�<�"��%^!�ht�>�Z-��:�/.�N)�u�$��E1�:+��cR4���>��'g����4.���vd������iy.}���-������iT]�d��X��)tHG��s�p2K���X�X*�-u��x��7zm�j`G�����C��������i�45` J�������"4 1�m�B����`������#2�Y"�3��(F��N�d����t_����e��CTy��D��;�P�]
�O�_
�O����z�����A����������������s7��PH#�L-�{��2��I�f:G�]$��_��cm�C����u	^�ve��H�����9;=3|}u�_�O���J������Q<��@:���x�����\��� v����P���B��������!�6	��Q�H���a8���9�J��V[(#'�
t����q��uV<��qv����;��`?�k$��\����������FI�9Oa�������f�k������g2P87:����y�N���xn�maKL�������� ��n���	Y1�W����y��>�B���X�g��4�t�����q�����z5@�9j�w�o��Vz��N#\��!
W��C�>���)<$1��_��
�.GX�;H�Da�����3�c��Ub�Wu���wUi�,�;e����slV���o����JH_���M�a�;������q^��������2�iu+VH�w����U�(2GY��*�p.v!������p	PU]��Z�M���gRf�����QU�j�s4T�.t���x�N1?����oJ��=���bw��}h���x�r�S��������d���MF��Q0Mhl���#�V����t��w�R���94&P]�f��@m���k�Syt��
������{��T� ���T�h%G5�	0?]��)��+��,��D�L�X]�S)�L�~����{�0���� �R�M]"��3THT^s>q4�t���# ��f�s�HS�'�zN�y]4��u�DSoL��2����#D:�+h�A6z���w?���/h�:}r�,���`&8��L6G�����Y���P�86�	�M]�mZ�z48*i~c�(��C���	�	G!�I� G�R>���w,�
'�J��T�;F282��FE28��^
�s��i@�h�G�QV}~I��z/p�8G���J*0'�\�����:7e�p��I+K��cr
��u�����I�N�AN���+#����?��a�I�������
�>��L:�.�Nf�<\�Mv������S�D���1�H]�Gjx&��2Iw��$� �J`*!��>$�xLV�|M����t$LAk�db;C���
�����h:~��0���,uW�5��b��u``y��r���K�+�uZK���W�<'��?��y��
��c�������\j���6.)�|�f�0���|3��9�����:�r&�]��8�~
��H1l�S!��C8���Cl��	��N�nCu�z���zxs�������	=
������,l�T!N��B��1�p,TU�>������T�q������Q�j�
������P�c��s�����@�q�$�1-�1!y�c��E1*�-���Bw�*8�2�#�$�d�&b�%w�PF����$��M�?�;���)����P-�DQ�
8j{f�Pj����hI~�>���x���H`#rN�Cr�K�$Sh7�������*��v�Uce^M�����	�}�N��8�bK-�d8+dk��@*]�R`����_�}w��
K��X�z5l��:]uD6���$���yu'��x�k�J�at�T=�;����z5@w9���*�2��yP�]�B�K$$[<N�qq~���$������<�^dK�}y�$��4���BM�3>v��p�f�u8��8}�A���P5hL$��pm"�R���(~.G��������D�:�*��,<����������L�&2�I�R���8|�]�7�2���c���m�(vz��#�'�w�~�~������uuo$���V������Lg|���sA��ryg�bk-�,�5{(/Gx�B-�:����8l6�c���-��i�������JB��}[x:w�d�<m�[Z�����45����������	I�s�Ns9v��=�4�#��
 w������i��H#R����rT5*?	���������k�	>?zw���~6��@�.C���\���W��e���+��
��^���m�����6����7�w'7G���/���n���h����Xn�r��������TO�������l���>�T\������H�����PZ��1m�G��.�]�=���K�>�:Jy���qD6U�1�J�q|69�N������O�Oo�z�'����c�o���������(i�s���GL�G��N��=OepLyO,�82jf��4{\9�{4���dz/CM@N��@�J���'��S�?:�y�N��`p����P�)�+v7���t�t������DIY
j�s���u���SR��G"�F9[�	�Q�$���, \��+ �yNU�+`�y���O����������3�_0����d+7G���\_�����nGv��0�1���xm�k��A�����5]]���(H�!Kt��pD�\$uU(F�B�W���UB�4A����o� ZwTj�q��D��8�����$�H�qD�\�(�5��<�4�*�5��<�G�'eaj�f5;=���8��^
P}�;�{>o���X�\es�ly�����#������rT�n�9�#��.��|��Z��1�q�9}[������j��q47�����n`[_A�0�<��W��Bh����H�M)0P����}��������I@�x����O�q|�z� ��8���Q�~���������<p���y��
p�<������N���0�A{@�m��U{���D�,���>�g����]���������>����;���xH����+�f���t������&���m)���b���������8���r������0�<���W��#�K
 �yR��tr6�8�r_(���cZ�:��)`�y<�L��O�O���b�]M���IM�m �HP��X|Fw���5�?�H�������F�<6�������m8W������V��i��h8/��J���Hg�y����1��
��<��V�����uEJ3�����9-m}��X�(:�I��U@&�ld�R0�����<�F(+����n��y�`��fe�)��);��y���*�������p�`�y6�Y�����������*��1A3=�7�l|3�%/�4��y�-���+Y=e�"�#��#OQ(�S��Q���Z�q�2���+NY���p�|��+���MknV��hXQ�����,,#'���y���������������wG���my}��f�����n^��U*��Q��j�p�5*�]�k�52_qrZ�8����)�so�q��n��|+�������Vg���Q/&S�F�<������b>��	26�0�
�(i���`����g��G����Xp��gZ+���6�-�DwN�a<���G�20�`�y6��nM2��k(T4<{�r
 ��R�����j��n�*+4���\���U�
�t���U����������
'`�ykN�G��r��<����s�H4H��B���_�����i"��LC��r�B�����
���u��ZW�i8�Zy"�u^�`]��hv^���S-�z�K'��:i(�D�g�+3�z�=���~�h��/-��C�������X�h~z5��
��.�*��tzY���^0�z�A0�c��+.�O����X�|������C�dm ����0�2*c,�2:�]�#k�G0�}���.l��'����p|�������X������z5}
�)&N�eoH�^�x}`4z�������[�37q������
2�ptz3<~{r������}��I��������$F|�u�U1�%P�tn=��XFI5�s����1�����,Q4���A����a��
[W}hd�a@��Q�c�������Q>��������\LGa�ts���
C"��r3N��b�3!��I�kLu�������B��"Q�&�l�9$M�q2}^�8�a�1�XR�����������fGM�����	�5%��]�/���b*��U2{�=R�U���� �A$;D������*�*\������4���j|>���T�����������\���X�>�R����60�.�O���_�O��.���g'G��/[�3���{s��>I�7a&L���%/Ew8�tS�F7�I���]�8���bd���N���mJe��*�H�_n��i��
v�T���9���`�R�z�T�R})������[��Tv�� -babI-�o,<����R�D�|,�Ie��e8�4���\iGE�w������k�>���V���I�&@1���"g����R� i/���=�r�X�g��;�L��������{[�IBe����Z�='?���4v�8������X�k�
]�[�>?KgbS���r���o�����q~����q9l[�kp@G%��QHL���;o��8�X:���e��t��x�E?�UT���1�P����h���z5�m�a*sL�D\2��a�JA���d�P	����`�K%��C�)��e����nV�v�)�~C��"�)��H���f�o����8������%f�'~9��^
�?����nz��XtQt��9����dB�3y����C����^���nD,!zY��|D��������3��1���@��
����<{�o����k���������Iq�Ki[��H���9��^
��v�XnH[�h��n�L�k���.�1����0����L`wK�x����y����P�}���W0���1D��C��Bi�7��[k7��1�����"�qn�V����Q^_P�G��yl������_k��}����
c���NC�	H))wy��-�+s����j�YV�f�z8Dv��H*k����\�?-��g�����^'�J[�����q�V�j�Y��:�J�7S
���q���d7������� �<9���}k�T�\����4d�QQ�q��4�����\���w�Fy/���� $5������b#��vo�?��<V���,�q���M�b��mxhEe9{D%���:w�XC��.U��������|\�x�m@�(�2L�I��aN����m��Ev���P�X�> ��M��Eavd�E�E����S�������p�}[�W�wz}u��v�=�G��)�W��/3������ ���{
���L�Eeo����z�`}����,�r�.u�y�	�+:G�|��k�������������h��}-z/����1���M��1�,�Q��
�(�(�L���{=3����~S�|�(k|��2����b��������~�����%�R��[r�T��G������k;�	j��H�~�h�>����h����W
��V>������s2A�����G*���w@{�9��&Q)��~���-���]����oq��V���������	%�>�p>�����Tw���������t�_M3���~:|Q���~
 �����������u���������r_�\Y,����r��E��D�AE�i�����l��S��Y�3�L����}�b����y��`@g�m�{�T�0n�o�KE__I����j���r.
 `��V�R��K�6���m{+�
���z!y���������Z�`��d1_���,��O�}`���>l������
��mk�^���m��nnq;eR4�]L���met�e��D�2�vq����5�������meSks�8�=],��q�LSa��i�WMN;j��9@���6���y��K������sn����ns���!]���Z�p�;�s��1)�$P��J	5auk�}���EN�L�`��-6RB0!�>�� ���z5��-Ly��nJx�|�(:�bw�#v�%H���?��mw���"���'I��G����(7I�qr��`����^�*���{E���Qf�\.<��������~]L����!�V������i�3�����m@'o[#���q�_M�I���6��Xc�i���
���D����w��X�#��)��(��B���]�[�����c�B��-���T�t=���	';�A.�r�&<<��L�~_�v1���������u����u�KY�o���vJ%��e��n7�I�����%�Kczi��Ln@�n�!����K������q��jLA��j�i8��g��	���"�p��R��O��P��5��,�����������X�x2�l~3m��n�[�rT��d���)7��q��m����d�����C��������Q����O��V��)'���e`��V��M���|;����T��-R6*����10���-����\5���f����m��+K�����{� 
���\`�r��C@o������m�oo����Q�����yO�O�0���o�x���].��i0��\d��"#G��kj�H+�����<��
S��g��Zrg{4�]-�nB���h�tG�Q�G����BU�,��>��c��S�qi�w��{WJ��H��<D����BF��o��
�����Tw_jjx��@n�T5*�� ~�(c�4sv��kw_�X-��a�w{wyG�u<q����BX��:�n}�	X�����v�����s����v@��6���9�x������_�J�:���������Ek��������X]%U�=�*��*�`	]�����-���x��=3+%����r`S������'�����u��l��������x��Jy����W��I0�F�h��l�x���^��Hs��]�=�������v���TT��R��U���#og�y�<QiN�4��P��5�r*dk�O��^�Q��������
��z�6`��{
��M��,��-f�n�jy�j�����y�����!|�)t��N����.���u���
@q:�"��D�DGS{3�}P��=���g�}S�?�3;f0�����)����6"��D�����������~�|syt��o:Jby�6w��:�?O���O�@o#��Z��X�
�w[�t�1���`�ES/J�[r@�n�n��T�F�v�tt&?�dQivx�I�u]~�dh�H�e������T���N����P
�o#�x�A6��� uzU��;��tLm@%osq���O_:�������
�m�^�l��6 ��1��Q�L'�d��������&���|Q���u��l�-�������'Wd��:���� �z5$���q���(�t�|����.�r�9
������*L����oR�n�M�`�q�	��mD_%�u�g�2n�L>�`z.��>/���c��j��\�i�6����u��B�t��(���5�<}%z�0�(m��2�<�$�EXC�����P���<Ji��T��<��jU(Ny���aD�Vz�����2g�s1i�R�G�����o�3���$��9QI)���*]";�u`D��<\�zm']�T7�9�b~G��S?>��������qR�Iv� ��r����&�zv��p��W
�5���s��J�����,�I�a�{Ku� 1?N��<�x'�������Ty2��x;��
�jO^�����"���s��
Qt��<�D�0moi�o ��;�
��:57�>m ��;�_�����|}aX
��(V��W�$�Z�ue������@O�5e��@2����B6�����:���q������0�(������9q�|����s�j�O)ui������4qp�;������j��c�!5#Bw���
����Qd�5���;���n�~����/E
P^��Mc�cys��������(]�����W�uX?p("��>�sf�s��s�3�p#��8��ql�>�:>�r/-��|�[�:?��6~����������T�b33x
t�x
�3�t�	O�<z+f��d�<:u<��	:�.0�x�4���y��2�=����z5�w_?\�:~��r�,���!9����}��-����sdv+�-��!�W�����zu����*��#m���$�dD��j~�������
��;MC�S�D��MZ����A�M=@����\�:�Y�A�zc��|�*}]��*�F��t�q>��L3������<eR�M���D��C�X�I<S��4|��Z��</F�i�LGE.����<=��p��@�F����r��b�~�/b]&�_C?�l�������e��h�k��(���y�A}FK:��H��D�U���u7���;0|���c?Y��dL��8��m��%���Q����L&���w ?}���:���=��������F@F�@2�1��n�#sr����������m��5|W:����H�U��x��E~N������B�~�2{�#����A�uS>@��X��)������H���������P��Ofe^%5�`:�}����]�8t�{1T�}0��c~��� ��w ���A��t�t��X��{����H)���\V�F9������R%U���
�?8]� Ug��2�����d�
p�;(������N�x�� _��*��:�Y����z5b�a��/�^�z����-�O�8W���
����h������X�b13\!�$���U,t�j�ATs�T0�;�/Z���L�����E�!�;M��TJ�Qp�H�(��r�Q�:�s��&Xy�
8Z��
�_.����lk�5��!7�5y(#'�z��;�?��&��a-�����v��"y�����M�4�h����)���`�z����������A!��R^��=�������W����-���|3s�_.���{r��
\����4�zQ�,������w�l��/��b�L�b�S��{��� ��^��w�
ajm�QpE ��"�� ��D��R !>�@��������NS�;�5�)^�K0{5�����S�S+�z^����W+���;�^oh.@�O�W��v���2^�C�),�z�]��*I���=�����!m��4�G�/�E���i����%62����C�W�������;��yS�>���%8%S�H8����Bh7�_k4��&����f���]�6��g��w�����j��8�J�y���p�Y��(I�4������4���`k�X�����8�]���Wc����A��b�]���
e��@2��r/�W�.��������?=?��+��v��n�db
Y� ����l��d{���.��w�����U��T��Ca���/��v��2��Wc�n�p�T���. �w��F�u�%i&����#��4���x��RA�r��T!A,�6~�%��^�N��\��hD�u�7K"�.�IN�R����a��s]&;�u����dm��h��.��w����t3�H��mB�S1)Q
�h]�S�r<u�%����Z�Ry�}�lK$����rx�N4�r������1���^����0�e����u�������	 �^��.�vw�-l'�(��w��!]�����6f|�9
��M���������������q�jw�npy������m���cI5�Se���4�wf���K^�&7��m�&��w�-`j)#�:HM�����nD�����q1
?��QV��#���^���e�B.�c�2����b���P����r��bp/�+��0�3MO�Q��[h�,��GY��^n���R���s����z��������8r/�NGSr�
�0��B�(�<rqgYwu��myP�E��Tr�������/������=yM��$�|EM/
�=��hgg�ty]"��ub����w�[�z
<���B�}^�	�G��G�NT�W
�����f}�����=*D�2:����`:�f"]�������R�c�~�N�x�������N�^��Ta4e@���Gd�^w���y��(1���	�^��.`�w�[�t������K�{�lo����k�K���K�V
`�w;[��L�B�G�w7����lytG�V���>�:��u[0���_�PS"~y��oC!���T�U����_
�����
l�n���wc�/�
��n��6���_��J����m`�wQ�vYv�}e�p&�8�f���'���g��Ty�4M$�F� b���=�,�7�~g8��n�#��L��6L�#�����o�n�5��C!@��vk�,��
��na��I�_��������g�z{���^.�ZFz�#�.���Gi����"���9	��x��"��w��7��>��v��[����'���$��dJ�P����p��'���b(�2uA��"��y���Q����H����E��@��x��G!�,i����DrP<�\"����,,�`u�L�n-���;������~��"%=����Y�:�H��������{���T�`v)��P'�Y0������������Z5���E1�
���.b�;�AQU(�TN�����N,6�if|W	$�������hA�O*_�I&�
(��o_�A��Ju�fq���LY"�W��>�8z(W(W�RL!%�=X�����`����3+����9�-s���.>J8u��JpF|�f�o���B�S\�7O�'�����5��r/f�:�_%s����������.���:�/2�"~�X���?���H��E,��8�]������|&lJ���D�y�(W����B��/�F0C��Ep������2�]���!�}E���o�;�����s������x��#}~|w��	x3n�r/���[�(�]����8�^��^uYB�v���vO��� �wa|��?<�.so�ns������].��z5=���nq@��������q�"�r���>~�/�9�{�X���������W���,���G�����'U;pc=�&}o&=�	���n�z�X�k��<
������=���Y���u=���Wc�b����,��=�b��i��if�^���\��������s�`���P�z5D�����Xl�N�w�M���c��gmK�5-��t�*eC�o�z���sk�������HO����+���z5���1���.��U�.��Us��p$x� n��s�C�$��������.���<L������l!���R��{M����R/�y9@~�y[l�
����������$��&$���3��7����6�)'�����S8���Q ����������/�f���|�P�tKa&�!��:���$07=@������^S
|�fP�{u�����;�xu2|uts4<���zys�J�_���W�������eg��+R����c����g��������Q�{����Q��Q;�C����Jj��b>
��q�e��ydU�&@t�5!�����*��$[�Z��������Gd�"{�	�]�V)���(�l�b_������f�p�s��U\����w�ex������#�����H�e��<�#���4�w�������&����X���${Nt?M�y
��5H.���$���m�& r���4_����]���]�5J�b��L�P���n�0�{3[��m���6��T��	80|�^��6j]��
qnYqh������^��<��i����Y�/H�����q��u��{\(t���\|sM���R�w��`�7��k�dFx�:3Q�0��8�(2g��\���@2@y�u����S����w6�q�eO���Z2��n���c~C>d
�{\tr�{�������\�������k��6I���d�a, J�0J��N�c�u)
���&����d�4�C�
P���&�A��mp��.%�2j�|]�I�y.�FHa����rd�m���B��_���������I	�vM���H�so$@�&����7d��/��u�I��w��:�Zbe�s����Y.j�����&��E{29di�����>��^\��^��/8y\���.���0{�X�=9����w���^����}����(����P�k�K�����a�����8��E���^���W��IN���qx'��V�a)�)Mon�������u�)�ZE����T)U�5*�����O����*�%D�6F�So�`P�-��
`f���/J��v@��A2�V�s������0�2�������\{����x1���=�q���3�e�	<�s��[�!�.6B]��%u	����Jwo�������sdD�m�X��{��<�e�kF���^v��u<��^G9�G��q/�)I{95 	��{����� 	��{(z�+�R�h��m���T:��``���a�����d@�b��8x���<��8D�.50��{�{�1p�{�-\���q/p�p������@���������}.F:[~1���w�8���\)����h��m��/�Am��Q����?������!^���+��>�����E�I���i��-|]�W�^����_>�J����#��eC�c`�v,��+�R(��}w�]t�B����Zz��������X�E��0��M���������P0���f�W����w�t��H���kBa/�:�x��:���e3�t�>��-� ���&�|�<����w��>�����!_\{����������}��)Wz&��!����@��#r��?����Cf���}�s~_��^�X�s�u��X��pWh����>@�_!^{���M���CL@]�o���"&���9�^
@Sk�������}@���[��*�������puts2<��pr����K����������+Fm���M�x5���}�#�s�P����N�N��9:{2�x3�x-������D����7��YS�o�k&Ydi�b�8����S`�<��\BZt�rZ�����o�;��O}@����&*�S(�N1��u��\�����=}�U@��z>}����8(\Q)
(h-D_��Ni-�����m���b��>�@����vC��m����j�c���y!E7��B�8���YE��X���g}���o7L|V��~	}.�;[~9;�,���_����n�t6�F�~��m��WpAm������C���7
&��1��<�S�����]E�)�}��li�Vl�W���lK����<8��To�~� ��+�RP\>������d[�����	z��m��D��
����eD���gZ�\��WI����	8���8���[(�S�!���F�u�����K
6}����s���j�5q�(Jq���a:
@\��r��������"����F����_��;G��2���F���4P��&1����I���/o��G�������/W�n������C/5<Q�zr�`p���
���EH� ]����y���{H�N}�n"��n"E+m��R�!-�K�S�A�\�"���X���H�L5�����������)���H[5A*�~y�ax��� �\�l	(}������,Q���i��p�4�����z�KtS���(��X*�KJ��������5���:*�~S��w����>r,�m�X�o�XBe�=�$�s�$z5��w 1��>�l��+I���QW�>p%��v\I������+	��o���tHV�rK.�f��G��EnQ�~�;;�LJ��5�L�~�k�o�G�&������A|����aR�lN,�M�5��J�&~���.,}�����2@N)��=�#�m�G�8��5(7@��)Vn9�*�t�K\���x�����6Gc���5h�D"-�u���*S�uv�4M#��M����J2N&�&N&��Y��U�\W��k�e���^���
1i3Cp|IM3P��\�����{@/-��2�2T��er9�VB��N�^�����c��A�T6Es~q������u@EB�&2�4�Au�����e0� �VP���p�zp�p��n�yV�4T�!Z;���EL�e��d��
�`�o<$Z���-��}��(���J���OiN�*�,�N��GCG���lx���.��ib�'*O���������F)��)2V&��
�=��3 1��'��Q� ��`7OH��G���'
�Y!�]���sg(�G�OX�	=l�xN8�	3^�xG�
MM	V?O���h2fO�^1�B���p�x��"���s� ��^
�w����K�.W�e���H�9��A=�p`4q`PA�������Gg����o��������_������W���GY���X�����,�b�h
6�]a�C�9�N��@����I�o?v��k �8�^
Pq�!{j���������v��6��*��\����!(���7�P�}|��n�0MY�i�
g�p��8xZ�F�Y`#���- ��0�z5�
��p~��'�[-`�6e�3�:`���z5,8b</|���M���N���: ��u-��Tu��,pve�E��n�����Pj�����+���E/ck���{������k�j��	�<��4����M�M�wG7{E�3��N�a������Zr���S!Lw���&D�\���,L��;5d(�)��A	@&pdr��+�G	@�lJ�^%p|+���<�&������s{���Qp�6���hb������Zd--�H���_Z��x�r`x��S���b�9�t�Yh��v���I -�*��{4�A���B�o��hZ�����T��7�w�#����Wa:�G��XU3��M�������D��`E���]����`C�6:��k�n1_S�h%*��F+*F�E���y�V��������dzx��&�����P�������T)�&�if���l���RkS�����W�JM������z��r��V�<�O3a$�RDmB��3��f���Wj{�m(��R���/��^-�� Wl�j��|����hq2b�-��4�\CT+��C�W���������I\~�12$m��f�����	Z�$=�D�gz���	h��&��7��X=�o����Z���z5�lt�����%������&j�1B;�~������"pM8��"p�/g?����-����&�k�7���@�4�]o:+�z0�r� /��A3|"�{YV�n�z�X��&���Y�F{>����-��~�����Q�WoAg�n9/���F�����z5���@�d�d:� �����2s)L�AS������s�����NfDo���m8�T��$^L����{h�.���$�,R92������X������OF�Vt���Y���AK�N������e�Azu�X!���"�r�X���oD�`�4z�);^���������Q�_�_�m��8�~g�a�&z^N����B���:,zY\-�
�Dcv��������]!/��m�������Yu
 ~���E�����|�/�bS�����f������^���(�E1p�N^��[/�r�O?�n���j��]-��w)��h�k�=���H��j�����/f_��	;k����;���76]�oE3�l��
����y��0����Rlf�8�L����`�s[Q��&A��8�Y~���x~��8]�>9;9�����V�lZ��]"�S�9-�`��;{��v�3=g���'�L�"���.��YM��?�4[~�h���;E�p��_���l.}:�c
a���o�v��p���)l�?��
�=�O�6��44?�E]��i�:m�����h��N_9Yqk��M�k�}������C���s��\�q*e��X�|w�����w9�u���������Uv�2=��v�1=�_$�����7	5��#y�Kq0|D�%���F="��{�����A��
�s������(+0��vn4=�������l���L����,��&���ey�i��#8�$���������_���=�����/~�%���7����a]�������r�	���9�sz^s2`�X�2s3a���s���.yE��I���a8Q��b������t/�3\�G'{ ���E����*�s�6�8f
���<&�S����y��>�TzY�d��-��b�F��p�o���FK���\tz���������p�����a����_a��3�����~������I�&jq�w��Z�h�b��=$��Wh@������EK�� %��G����NO���F�N'����i~�S�3�t���1��w�]�ES��@��E���o(�6.s���K�Q��v�I��x��;���1��]�	�����U�P�,�8V�Q��&�s49�9�(&��/
�����
�s���c��W���� �����i����z�b����/{�c�{����k;}���k;����5>\\lsgfgo�s���t�?!����Ob�=8�E�����)]�P*����2hV0'U
���
����y��;��o|c�U���f	a�k�wD���jQ��L�;�R�R�{N���S9K�N�����U�<T�2������iM,��]7Bv�4=��!�~�a��������9_�X�������Q7PqS�\�n���iz��]��-���U���(�l6��1Z|����Pz�hej�,���2�����G���;����|eM(�k���q����5��5��Q����|e�>y-�A���7u�V�?�H�jM
��C���N�9a45D�#�k��G<������_�Z���Z�v�kc#�C4'���9}�9�'��\L�Q8�hS���r��� �c��Q��=��w���s�3����kc#���/�$��s�����.GD��������{��q�{��c�	����{a���Z,h^���Wl�$Iqo��G!�w�����nB$.��$�&2����%}?�����D\���_��H�7	�:���@S���KI������`J����{G�P���?��,�c �,q�<Fh��{�] t�yDl�dg��������n�h�={����V��`�mvJ!�p�m���^!����Tb�	�8�
Z��:�'�s![�*
����a��]6P���@k"Fr���hF��w�=������<���X�v�&����h[�E�;����+O~l�O�l^��$�<��X�.�(�6����t����,�L�����������?�O���-�f��Ec6���5tMxN��LE�;�����4M�5�9i���&�-�~C��X�`�R������[/J����&a��;r�WU{ryi�wvw�����n�~�r4�gn7�
r�.?%X��-����/�e�\��vQ���B�]��f;���[�N�c�F��^�0��. V���� ��6K��X��0{/�a��Jqg����\�o������<��%|k�
��[�;����\�N����"��_{��AzQ��x��)	` ����&,9g�v~f��q�h�.(��XBA��_�2�4_q������S� ~0�C}����z�M�],�z����,R5�*���sG��R�=���|�����|�H\��
+��]0Z�i6���$d���kZ�B����UpG�.'I���CK�jv���Z��V�te��2mX��q�r������P7�
��n�rmS6=ip��|w����������	��.��������NM3p���W��
h.�Y����*��hEPO6�"8��4�/Gju1d��us�H�O�<%��r�_�*)�m'�{)�"5�P|]������48�YRa�i������(1���i����` ��V�oQn��?;��<:>��������:�:9�����i�n��q{�z���LW�`���8���)�Y�y'�y5��Cy�&����P5��(�	�����/����CfLq��u���c�����)KG1A�j��Ow�o��.����������n�������4�W99�89�a�`H�mQj��
x���=��E���:�F����'������e-K�h�� �l������zj�?8�N**�.����
S�S)��HM)���yPH�nd�=��-���?�)x�� qnL��}O,��S��4�����a�&lX*j�'�*+�6`���x��Y����z�}�T�&,���I������;\?��m�X��9U(���UPV��a�<�8^�x\��K�0mQs����$��������W��d,*��������:��b��)����l��um�Xc��f�c��u��@���)��J*������p�VS���X�F=S8N�i$>���Q���4Ja(��T��]��U�c��cH�����v�C����!@�>�Z�0[]��j+ ��Q�'��y2��t���]$��.������bfHT�e�����=Fd�<�T���I���(�X���w*s�	��.��5{��|��
vS^�h�1]���P���(N�4�/�`���kY��-,i!�Ra�9��h&�W���p�Y���md^����B���,���+f�p��p���8��R
��.���Pf��)3�X��yu>M<m�����O;�v��f��.��qi���]��{�[�zy�����������	;y;W?G����qj�A^|�+V7�<@��l������[����q���g��Z���	*Euv��l~k�{�a�68M�/2rrl�����Qp�V�q&�������/���2�\��R�Wl�=����D����(}u+�"��E<�W	��b9[Yy <aX�r�I��u��7�s�Q�=����M��`�zKW�)aF'���%�q1_-����A�]�6y�A+p��Px\;�9������Z�������[�ka�d�A����L��L����a	K5�s"{�]X�C(�E����r�������\kC���k�7�b.���b�
s;,Q�
W�l[�������g��)��T�8I>s�%���}a���t�GY���]�����%@^x������������d[�i�^�����R��;nq�x�z~������0�w��J_�a�2<@��lA}
�
pl=.�oQ$�x.}��+$U(���a	�U+(��z^����.�$���L:�]�B
�N�^y�� G�]��Q#).����:���G5l�0��h	�i

#���H�Y��"���Q���"����y\�v��^6nLJ������Y���OG�YK�(���^Q
J�UB����2��w=�|-�u�Q��[9[��W���`�a�	ixY��^i{}��}; {~MGlY�wGF�'7b�B�|;�����M�������Q;s��0�w {Y�����`78�]qKc<,n�X�p��&���et�%���@�	�)jg���8�0�=��TX-������bz�����<�=�
<@	�l1���7�y3���tl6�%���u2�x����f3�U���Q����#-=��!�����b�b�cW��h�"?��8�����s�
��F��HkC��S�b^� �]�|a��������Y8��X�2���*y�4	h,�W����C�b<T.%�)Q�Y���X���mV�5��\�/DS����(�F=�l����g�d������_��Le�I����O����@������7���w����g'������?
 C{V2�V�=qE�;����p��E+8I6q�g��M���N�N0�=�L&�G���$�&������8R�� ��B����t�O�v�m2�T���8R�e��X	%yp�
S����s#�S�?b!��bN�Ug1S��� �ua���sa��pv?��<TWp�@f���-MTL$%k���{�c�s�=����j%PJ��bTraF�.oF��I?�I	�p.~!	(|�l�L��"�
�g�P��'����{�x��%K�<�61o����T��2D�f%`+�JC�X^L�z^�bH~���]I�����{��"�3����
���������/����7'WW�/o���) n{�����x:�,���&"�����0��)��&������np��&�nj�������������/n����wu�_�O�OoN��7G7'���o�9�[BFP�T��%`8��a �$�zSC�[��x���mD��=��F.�
]OL�@=YY�oCy�|���%��b��&�!"�_� n-�)������G�9A�"��E����Q��D�
_�+0����{={�zn�x�T��[[Zf�%.G=�^�����j+	h�^�(�rB����{7E�kBQ�B��Z�kH&�R0��W��/7����w������\�S�O�8�\t��E��Y�����Z��]����z=���5a�/�Z�����7	���p6���� ~���20������e�]���s.�N�6��&{��������5����.���"W4�0N�[�N��fyQ-�
u/��JjTy��s��������[�����C�����wTn�a��m�(�P*�5��{,��������${o���hi�9��|o����!@��82�Q@[xi*�`:�\�I�SuH1�O\xy�N*D�[�~�(����4�1�4%��
�r�7�&m}0O�aqM!�m�_j2�'��:�Z���6�������
K!O~)d��b����b,U���q��r�L�!M�����U�����5	�]����#C������P�|���6>�����_��>�����_�2������Kl��}k�m�m�%?^��:~{tuM�������s��;V�����b���������kE�+����g������T����t��Lvt�m�L�_���FQ��[J������N������|[D�uit���+������$��fO-5U5w�|����>p����K�E%Q�&����}���0��&Q��e\�.�;����������9T��S�g�t������z�5�^.�����m��F��y��['��q,�fU�g�W��v�L���<}�z�ms"��������/6@J��/E������.���M��>�����v�o�X&�8��YO"�.c���!r������l��z��R�t8�i��g����YZ3t]��\������Q�sCT��L�<p����qh_�,��<N��^(�.6��������g��T@3���2�SP�88�
>�9���W�+:�r���d����q'0<p(�C�����<�1;��7��<J�\�.������.h.��9�;G`�����t!'Nx��wsn�����p���M�#��-Y{��8��(xx��@$Qt�
�����A���� �����d.�y�L�F�(
�GZn��u:}���k���1���8N�[/0����������=�O�1p���I�O�����?rz�����!�_���<z�����������~�#����=z��8��(x%N��}m�4�@���X�G0�����{���l��tA��=�7�.�	o��5T�4�<+�*�V��>���{/��v��<@�{������V�vx����2���Y5����
�^�����{��!p�Cg������+�$���7�a��h�Nx�#��A����*��*�v&��k����>rM�?���^�H���x����8c8,���f���r��Q�qC1��P�o(�O� ���=�2@�{��77t@�A�~���������\x��T/����Qqv�K@z��� 5��h�d7�jZ�t`t��j�xJ�X�{����]�(CN�bA�>�^<�{�7�%��{���Mn��Tr�����c4��=�o.c|T�=���u��v]q����P���_oT��7����G@�{��7-@�{=�=��{6�w�L�Hx/���}KBW8��t�����g����do�~���&��wMs_4L�
����[���]�1#�GL����tc-6Lz����f������,�gaX$'����N��J������1v����
��V�Uc����t�S�o.���������Z1R�t;j����\]�����U������f��|0 �=�h7�n����zz�k�X�����N�����c��l��v������wS�Z]����0�-V�~����(�(�?������#���*�[�3*������FC���l�������tz�r�����=.6��Xx��7��+�qO�w�I �$Dc_-K���0]{=N������u#B�����F9��$0B�b_�����i>v�*u���D���<�F��y�=b��5�n
���-�!tl�0������o�b
��?}��F�v�=b?�������������I�D>��}�[�]6-�h����^_���s������\����J�mf���G|i��Z�S��F�������������o����������\��'{?����������7z�����Odk6���G��!K��yz���>��}N�}m��ME�g��~��6�~��/-w^�1)�P�m�����Q���\��.�(Ic�����������G�=C����[E�;����pg��$��l�; >��������	 �����j5N)l�����X��Wc<���'o.C��������:;}}zyu�����W'����s��=~����*%��%#J�Ue�f�.�}�s�����w�W���l��T��y����r������M~�
v~�^�QI�c����"�v�+E�
�er����6��}��7���n�7�P���=]k
m1M+�O'f��������6�$m[��0
x'�����(��z*qfiG;�3Q���+�Yv]Lz{M2B�$��_�$*iv��O�sZ���R
0����Hy�#���j��8��Q�o������������q�w��7��N�/~w"U�.~~����e�(�x������Z-�t��.�jA5�M�b^g���+}����I�7Wk�O+xN�^���k����������g�a���/���������SLI�1�N��1\<	�wW��������O[�W�<�&�$p�S�$����/�)�?�p�>�����Z7�G���s��&����Z6��)y�������!���>��}$5_��k}���}���s�1���h���u����D�����}@���b����G�{���;/I<��}��G1��=��#%��Y��������R� �K�l�K]��O�{)���TG�Nn�����L��������}����7W�P��c�sQ�Y:������27�����'���o�Ke2�r1���}�Lo�eY��`�`�s��z9���Z|}����>�}o.�r��/��m�i�������[��)��n>���>Tvm�/k��A�I����.����+iz-M�����a�fP~cE��C��e���%W�W�R������u�J����l�����%4G���k1^R|�|f�����Q�#����v�hR���M������	3\+������n�N���g�L����>������P�����S��^7���,������S����t��2@\�Q�xF�D��_i���e>�6@�Asza���J|����M�G��f�c��=����
�8_Q�����T��0���9];���z�����-n��P\�����y3�����tq+H�����!��\Z[<pf�<���J���w> �}D����P�{kQ���������|���#(��(H�`-��>�/n��S�H:���r
� �'��=��*�z�&nz�e!��`������E~�`�;����*D��J*/J9�L����L9�����w\�����)����@.Cj1�W�\@��}(p����� o�"�x2����.��N�z���j�����s��Q�f}����{���d�3[M�|1�j�������+���(��+WX"������*+�����V���Yh���17������t9�����}'7��s#��Q�����`����<~w�����?]]���'l���S����K�@��7�{���w�5���G��o��|�	zK�;
shAtbB��Ir��S���>����48���5�M0D}��e����Y�0�`���2�V��c%��Wn�������������������J�����
���U^�*�@��922L;R��Xl[�n�]��tjU�R%��]M�������e���=u�������W-���Q3s	u��>�|U�
�������mc�����O�i>����;!lv��sw��F���X�?�" �;v���j��,�H�o\��J�c��YM�j�~��u�&=p��zpl�����S����������7(�`��N���VM����zu����x'w��/k!M���E�>���R��}k�F������
�6�'pQ�\�s[�:#@�t���! B�z�����Rz���������4�jc�w�"����k4����a�K��:@���|3x�Q���aE����z
����
���N��H�8;��D$c
�x���T^��`W��w���f��}�v3�#TNW=������v%�@n~�%\�����R,��@�������
��Z�t�t6}`���6+8�6��������e��#�@�;?��*���Ni����XH��2��|���\9`qe��y�d�����f��zU9�b�T&�����2��4�^��|���0L�c�;�+�1�l�� .�������j����w~�?f��,�]g�F��~sy��Y���5��Z�mV�n��"FI�#�l�<K�DN��^�H�T�D����$���eq���o��f�K��!�Z��@��l�:*�����\�D��| R����)�Dk?��j.x��n�|�Z�9�$���yN10q�����z���Z�V�\&��RZ�+�S	k�
0�����j\p�r���;9\����"�S1c[I
�T?���5���mM?A&8`��}��6b�����'1m4+��6=$@�����7�U�y�E���H�b��?C��&(p�I?�/1_X��|$������������f;�"�J��A��1�
�����v0l>���6���
l���y���X�@n
����qQ������y!G����`�$t����W"v^>�*�o`Szn�0����)�8y�Jn"L ���t:�m�8d���!r�[K<�������e���������ye>(p56���f����v���QKd|��aSan���Ezk�D���i{�y%��t)��
l��o�����(�n�4�"�.�������,��6n��j��������CAi�~�c��o��q�7�6�����qR��S������m��*��0D��0��{ON����!��a��G���_�n�S9(m0����2��l~�L�e.Z���#���w�������w���W���9~���p6��2������5���FX`� yD6��E��}�d�O	��c��>����
���zH���
����u@�I�C���`!�^7m��t���;�m`SS6��e�TR'�du������XO)e;��T���':6<0����q�l{�Wi�6�l�����wV&mxd�T�}c�1��0�yyz�rL�Q��e��5V�C��Py��1�@D��iZ������s\�\�tB���(�{�BD��2��MN�3@�lg�j�| �r���B��T��~0��{��F�n�Yf9���+D��!�cC��l����{
�p�t�_���W���������]�]&���n��a�{L�^8������1���T���7�@\�����!��
@�a/��}��J����'�b���OR��RC�����h������F�>�]X�v��|�KZ�7���H�Hf� ��hx���~�}o��7��m_��k��kA:�o�k���
�G�V��6F����(6|�s��H�^�\s�>�z`:��V<pZC����W�Qn^�6|��qs�����
��o��v����o���� �T�J�k��o�=�S���!��/}j�7D��Q @xC�$�YpVTk�<�M�V�Zu���3������"p���F�t&��J&�3A5x�����gEz�tFE������ml6������V1�3)+O,�p*b4% �TesJ9z'��?:HSk�������������t����m�=��3�	w�����J��T�%)���G���l������gFLp������Z�!@�z{{��$���\�n�3�����m5�i�����\���Kz�.��{�y�5	���l8�K�
j�5�@��i����'H.6|�d1s���{u�:�z����g�����8�>���%�y���4�2GB��=E�C��H���B�cI�	'w'���j���gT�
���x�\R#�e>N��uf���P�����7����������Aa�$'D+����T�}E�7M���1���'HC�P y�Y-�'.�%���~��a��~�wXu����#�j6����aOR8�p�$����
S[n�|��jr�������B���lzP��U&E�i��CU<�q���i����I��nE�%E����?]�������;����3X-3�`k<�*��A='���R�~�U����$	g6�����a�<P�x���5	�� �!�$��7�;p!,�\_�e��y�V]:�4�!{�1��8��c��/����|��.����~l��_�#r
 �!�.6�H���#�{�����?{+;���fSY���d�V/j���eX
Uo���U0�V�����@���������Wo�c��IN�0�:�|!��qm����= ���)M�^���(�0|�������!C7���������r��B@����8��.��a�&����{�f������P��=��9�#�l>,pACP��%��G�D�#������8v6���6�/3�`s�T��pN9�e���T�b��NMU*�l�9�X��q���9����Ee�.�F>���w��������5�����Q�e*������Hk6���JT���~�No��l�9��v���|JP�NZ��j����{1!Y��k��T[��Hg#��,��%��#�fE��[�x_%i��vH�Rb8)���3�f%��g���l����@;D��-�&��I�J���z��Cc! ���w:	c��0~���,������~���i;��~����7�g�B��ZT�8w0�����������p����k��7=���:��T6�:"c�f�t�ywrvq������������\t����lV(�Y��'~�s���.�D��lBH1�C� ��d��_F�tx��.�9�d<����K���]��<������������sZ=1VL'&�.��S����T�R���
���#���*�?iI�:�|ZdbuK'�V)�(�#�pYW�a��}���Ub`S��d�3-
�W�{R8�R�01�xa�����Q���f�7F�C�
���;U������L��fZ���/g�i�������u��I������\l_1��Mj��Nv�$��.@.3����{-��F���0��nLjz�v��������#��������1c�(o������#��u�cE�����������L��vy�����#��G{���������a
_�|pG�����1r����O��f�,:�ae����o-50��@����@���SR�3zm�d�w���NuK���A����4��}&�I��v7���#��G�:��2\(��#�s�vG���Ximt�b+k��e���5V�Cn>�Mz�0������`��������)�P<-I :C-���]G�"p� �u���*u�UOG���d:u&KytiY�n��ZK��(�w{2�6�N������k���������z�2ln����n���������)U�����1���M��8�s���R@\����������
�������]�M��@����(8?��B��L4�iz[S7��RW�s�@N�*�-h:����?t?cp�-Rd9�Cwh�(��>�G���JQ������
�Y�R�����������?{�U/8j
U��H
%����T����~�a��#F9��x�<��0]s��������;�`���cu��K�������p.Hg���v�W?p)��]����2�S�H��z��b:�������[�Rw��T�6���x5�k����%"����~���V'��=���M7E��E��X��:i��m�'�Le�����:h����>���ng���)�	���L�������}�mOq�ZJ1�n�����1���s��5���/7*���`��Cz�*��)%���C8'�s"���_����`��d0�_�4gE��gE6����Q��H�J������ |����T���I���3�5����u����H���eeI���������q�D-��j���{�=�<QGiU���#p�#
>C"O�9����1���=���nhx)�O=��ZXG{��K��E�sGE�`����b�jm-�%��p�3��c&z9��H�*�oz�1�{7.�����8�t|'J\>t����y�h�y�6����K�s���c"��=��mp�$�����5�����{QK&w?�v�@�Z@�*���Upx�s#��g="���Qp:����ke������
,R�Gd�����E��'��r���'�#p�#��06��(���fY�'�����:���Gp
$z
d�8BN�D��
�����1�B��b��^)z���-b���1�K>����D�#���Q	8�E��j���DC4����8N�D�d�1��c)��X�Y�z�y���lG@LZ�����7����L���k����k��=�{����e�_�\������%
�,���v"p�#���wy6��"L�&��\���������l�5mIDu�r%����4�~�&�OXt
��k
�q�J)��u~J�$3v�����z��o�i�h�JJ���e��@���t���z��=��eDCe�Y���FA��G/�x��
����(����V�����)�d�T����5T��2(Z��v��p���D��0�%���e��6��R-��pg�r�GIxf��z������Ad;}`�h��(���o�&g	��g	�������t�g�,u���iUT���o&�N�6��,cpB!>�C�
/9��N+��C}�����=�C�68���aF�k�:�=�:���C|4p������BW�#]������JQ����#������>@bp�!����r��,r�@�����N�%[Yv�s
�r�q�x����Df��,�����:1��&���y�l%)�X6��28@s���koq{U����z��2��8�2��u���;B`���h��N1�#1wd��~���{������9>�(�������*��2�k���>�d��IiEQ��f�Q����[R������A�1����96^c@������S���������Sw�����MI?��HG��yTv96}�c��r@'"�/�0Zo��1^=����K={(i��5��"�Zw���"����5��9B�)��M11�����n���1���!t{�Hh{d�9�{�@��CI��>��w��r��m����;7�7��c�T~s�(�&z��o���7y,n;f�<�j�����������������N��0z0'�yl���4V���t+13(d��7�Y���gv����P�B;��.����M��J���ni���������P��$x����;����nl���������AG��|���oc�|���5G��I�������P���X��k�;��{ww
�k�}p�Iv��������y�&��u/���j�����\D`������7�z�.�:F��1�u�Q�F9�}
��;������K���������-�F�g���78����~��8����B����l�2�����r�;�D�
�!�s�"���\���N����F���k=,�5[>���z��$���wg���i�Y�{%rJ<�b�7�a���}��kS��V�YQ| ��s��mEp����zE�U���EI��
�?���_��5�ML74r�D��e_��r2�1�ql�W�����6��T������7s 8�O�x�Q(p	���V�.�N/���i�ce�;��9���e�R�9�x����d^n����^�����c������S�}�����#�ZO�8�w���#���N�9'��27�������qG���`�q4m�Gg}�~w�����2���������a�x���&��x����7��!b<�9�7��IL�Il^�[V��R���P
�1�s��Vz���GQ�s���Q	m��pcVG~_���4������bl�!O
r?�cd
��YIGt�
s�Oa�����d���Y��ZshmG|,r���`�;�k��:�%��727��\c�Pnl�r/�Yv~sC�0b��.�lr5���4��1����<D.���
�{�tEy��C:c@�������9�
��8�	|6��vT��_l�1�H���%�������rpop��P����ZsB�F9�;����������VX�-%�^N��DG�g���u8\���������Y�_W���(�s5
\O2��QW{����7
\�=n�<�M.�q`[�����&q�3Mi3r�#�F�e���y"������m6j����������
}�����c^=4���VIj�l.S$fs�Im�O7S��������FIw���[�������V�������A(�Q����&HMvR��C���	�6W[��>�		-WSc�( j�in���sb#D��d���G]��d)�v�R��S�=�q���p�~�$O;�'�4M�
h7F�p~����
��"��-��:��q����g�i����VF��IX��_O������tn��2'l*3
l/��
��d���!������_\�5*dMQ�2H��q�vx�d��-�@5�d��r�
A�&�
�{���h#�j��UM�� -i�A�gOh�M�K*�OE)Z�	E)1J�!�U�9�r����E�!���5
�M/:y�����]���<y�����{|�)G��`V#�K��@��}G�P��R�-"�qq�I���`��$����M@T�&�9�M���U��ih��j����7�2Sp����*[�	 ����������mu~f),m�I9����������D��t�N��<}}r�s���_N/���\�}}�����'#��m2D��=z�O����3:::zj�x2w����c��'>ysy�������������O��//�N�&�m�S�9hl�TD�q���T�����n@�&��~��;�6���NA� �V��y&{�p�K'	`jo`������r/0��c\k����5���pE^��0����5Gn@�&V�u�9.�0������n�ey�����S�:1�r��0�>���zj������'��m�&4N��RY)T1�P����2c^�n,"�u���z���B�XZ�F6��v�~���s�;b6��Y��_l\��D}�X��z�BF�)S���Q��|.�3�)gk�B������>��Xy�:�4	��m����E�hc�W�O�WK�e������}���'Y��4��^m�S��&7	����������6"@����?�����O0���@��l��"
P��j��Z���5L����'F��o��bb�^������1M8d���O<��*�ka�%��L�i�r�@PA5�����D������$����
�7�Q�`y��z����w'rU����\\�;9~}�����4a�S|���Kck*xjb����1`jZ��m�K��q�c�	I�p����72_�����q��L�35�n`��m�Y�M�������Q���Z����ki6����S#7�ib��5J��F������S|��7���B�4���o�z��YM�L3	17����?��O��6�`�����4N��w|��o3���+]'���j.����Z�{��e������p�$������S:�OJ�A�/�R�
m�������I6yz ��P��'��j�u�~�kX|���V�������fs��:��m��~AF������k�EM����|��g*(��l�T����M�W���yM8��.#�k2T�v��{�)�&Cl�)�cm��%g�0V����^�r�w����*V�8�O������=Uw*�;��d���M�WE��8��&=�k��&"k�f��MXF�'��,f�� �,���:*�p)����Y'����j�&?�?�S��I�U����b��g�&���6a�g�:���'�H�6O�*z�A��0S
�zZ+
�������0	�{���9���xo�_+�����-7��Y��Q�, o�\+(K�����1=5�l��+'�p����[7���A�U��5o�
��)M����G����`�*g^/�77W�D������k\��Y����qD��	'*k�`M8������T�J�����M�������;jr�xO������bT��&��-[�
��c>�]�v+[l���F�[<�-���$n�=����y��/�gu7����_���p�(�������$VV�[�}�����
����)��^�c����Q
Q�����=]�v�u
/9�v
�~�|3k�3x�a��W�����Wz���R3����/�zQwF��X�}�k6����?��zA���k>��o���7�����X�"}�*�����3w�Z;L��s�v��>0����r��lHvj�>�n&>���U������Y������������o���7�}>���Hvt���" ZN���#g�M��^c1����P�#^��r�PM[��i����[�\�_|}x�\��-}�p1�������e��]%~H�����_����F��e����M�����R&�~�-u��������j�����jqX��8��6;�g��1���������pq�����z�5M����?��<��S������UF����u1y�z��f�?���|��k]��<����]�O:������>���R�t���O������������2�"?y��s�����#���Wz��kE������(+ZE���>�z���W�//��������3�������t�~���{�����._��CO���,����'o���E������|���o��#��,�D���e��������p�g���k����^|]e�*�3���>������gb���/�D_�����V���?sF�$��M�/��]���*��eR�S�Zf�_���4+���NGX�5.��k��q�����R���������ZZ�u��s8M������s��u���%n�~�����o?�h����Ut�
D
�����N�L�5��|���o7�hG	^����t������"W������U�p���h����v�KL�����U� ����oD�F+-�:��7���ao]��������.e��7�yi[j@T��#�
�5����^6��|�����Q�v5���^�=C/d�;,����]fe�4���d:���ej���g"�,��T�	��}�����$tG��������o>Wu������G�FG�����u�o�����C���9u>����Y�g���[�+]|����j������J�DN���GHo��o�	�s[��-wU�������7h��?��q���k}s�y��~���#�wk n��ay+t�#M�:�
s��[�����9o�ly����u!"�~1:jjb�����%`���\m*����c�l=���HblSp��q:p2�Y)��l\����ZL��y���k�1���I�yS��;�lG��}��9�h���w_�]VeR�k��9��tl����I���e�L��9:�cRd�/���]6~[-��sM�A[�"j[&��h#��T�f����M�����!5�t��O��,��}���t�����j%%��V*�&X�H�/���o�\�\�"[R�i���o���h����������~��rtG����]���4u��U)'M��+4i����ULpiM)���T*������}��(��Y�o(�o�9����/{s�s�����3�~���#O�����e���q}������e��������/���7����������������������)�C��c�����%|�����_6��l��*�~������e����������}������e��������/�G.��W>�@�V�����?/���H����������������~����������4:����Y���&_khm^�����e�o{+�Ov��
��?��/{~���M���Uz��_v�����w��aW�6����`������z����X,��I��E�W���&�E(	Pq]���b�Z��,���0�MKG\�p���g����}�=l���b�Re��UU[�k�@���XM��b�ja�b5P�a����+�z��|�rhvOz��M;DOG���U�K\�j�v1V/q�J��7
���6U���b6��R����U>%�|Y��]�q��&qn�+z��r^�O���\����e�~��t>�~(�
��u���F����[���c�}1Y��am����r�B�c��o���4KU.�:f�����%L���O����\e����>e�i��:_L>K�*)���H�y�����s�^RBw�d���*�S)S�	#��-�q�������V��em�Q/�p�y�W
����R������q�NR6���Vvy���\�yYLS�m�e6-n���^+]��.G/�P���jK��D�+�7[�C�xC�s����I3x"6��>=�}^��#���^Y������4����p'�=�J�N�,�l�;��j�^���z�N-�����l<��
��'�q��N384.��0F=�'2��!�&9���*��uQ�_�S,h1Z)~���[x�yV�����PX9��8]�2}�xR��]g2��]��u�������\��������=��#��=i����r�c��x�����TW[c������������*}�fpb.��(���J&�_��M
�B
���������x.�B��R3xB6U���qO��s3]�wT�b!�������dM�����Y�,��
����)�[�����|[�N(���R	�Vh���4��b�]dc�F�!��l�z}�g�`��VpM.���5_S.�Z���O����9��vTj)���r�/��\���&j���ff��D(����~w�$^��{��=wR3��=���a���������EJ[���;9��!�R����RX�S3�Q���8�A{����������X��;����m�-��'���Mu7�r\�l���zzX{N`a�}�V{�CeRH�W�*�
��p#�h�z��O��C���<�g�gd����T4�b��Y��	a�R&���~�K%L��Jhr�n�3t�]��v����x_<5%����PNf.�qDx��&W~�e�Um��_wq�T��>���)��&#*�46��L�RwM���A�[�Gw����dS�;{?�������������wEQ2-D��AF�+�f�G�f.�<�=e�����jSO�z�v�j���^��l���UL1��Y��V�[y�<g�����4�����x�*���G��DP��A���6����%{x��B'�����F�O��m���X��K�OK�s�*�QmP3�N6W�����T�w�-�UM;�o����5�l�~��4����������_fKbl���[E^���t��.���4����s=U�=5�0����T�S"�f�Wo�v}W3�M.#3�|�R��\D��Zr�E�Q�iY��$]�&j���<:!�y*g�� �Y�&����)q�3�Z�������7�����J.�����-�(\��O��zyD{�ga�?�#�s>3�zDz�T���5j�o�'�n;�C��Ym�,��~�{�z��sKkxW(��|"�������g���|Z��g�5{�hQ�5���^�PC�\<5�v�wP&~�*�+��V�3��@/W�Hs��t9��u���5<�Ir�������*5��L��MA���l-_��h�������f����=���������U�l��:,�)�S[�����J�O��������\C. �S1��l�1���W+x��Z��N��(QpE��eVF��:�����@�cO\-��{�j;D��v�m�NZ�F�L�wgO|-��}�����f�b=;�K,�>$pNaO�dOO-����[��I�/����6����S���a.�����u:u^bT,�����b��W��\kz.���[�rx�r�������/�/O��h���f������<�����aSO�[��E���-:��EZ�z-e�=�(������n;*�TCF��k7�������k�t�������\DJa��a$��Q���������i��2�=�$Ba�JA��R2k���)T:��s�.��FLg�[�L��}{�e���.����t�te.���~����6��;����h�s�+1�q9sg�����������C3��O���@'��5s7[c�F�p���U�b���G��k��%p���;����|~���U>g\�=o��U?��n��1o���|��Y+�.C��}��2[>���^\/|�yQ31�*{�f���4����'x�]��P��<��U�=���K���������+lEBv���+����S�k7���oy��&(o�������<;������7����'Z�e{�eaP�q��uY����:��L���TCLx�hE���BD;/�j���7zy�T��D�������������������n3��	��7y������T�5A:/���x�������7�W��Y�)��	��u����G\���v�ol�w�����;�d��x=����$Y�����}���hSS��=��y+{�1����|�����=�8�#�=�����c��#.�q������U�'������Q�q�%A#~:�on���#K�u�����v�nn��#n
���%+�us���MHl\�����+�g���CF\R��~!���t��;�s�G��v�R�������w7�{�~��s�k{WC5�f]�4�^���]�������F������������v.vm����t��W�^�%�Y��P�YP���{#N�%�{_z;5,����{_+�q����E������@�Fj��o�-���p�#��n����i6����X*��{�aP����8�^������mc�������xc�9�Z/�8��]~#f���'W�'�.�^�t��O�o~�o��9w�r*�.�������>�L��c���'��^p;��\?���xd��_�t#@��`���{m��+ `�t:aJ���5���J�����������%�N�!�z1�Ss�����;������av�$�u����{6N��^��zA�[��yN���a���LO%�\y���\q)�+�l���n�:w|�������o\�{7��Y
u�)V�2�~���<����!MIC��l��J2@���S'%H�B�,T��7�C���GE\�<������)�6���iAk93p�{GV�W\7u�����t�3�l���Nx�W/x��D�#O6��������!C��B����0����D�$*�B�	�&���&m�A+�NX��^M�L�]iG�$�D��B4G�$��,d����L��"�s���M$�+��^S�R��)�8��#���fE�]�������|���*������YHyb�&�&�l�V&����kGXk�0*v��b�oIUt ��p��M#������Tt�}����m[���M+���
<����j�6���z#n����4���,�ONW�]�������*v�R���7��F���p��^�,�*.#�����N�{S�Jy�>'&��h����p%��y[e��j!��D��i_�
��X�A�/�L8������,L�����B��i7�h�P�'O�t�Z��]����B���|.�/I�2�$�b.��:������:baUyK��u�aP����R�#�6���x����K���x!��)���:�@V��X���t���
�@�;"��U���R����th��Q�=������n�,e�L2���pj��.R:b�RyS�0�WXdT=�ENa���������[
�N�mD/��rk��_��%o���p��_��AxI�1� �]�x��l�;��)Ks��*�W�����w������4�1N��������M����]@���������t��m������jW���t{�V����aw��s����Px�C��r��d���zN'������U�,�e���G3^b.�T��9#��`�n00����E'-^�����D�Z �],�H?f#�#N1
aJ�y;�c�P{���r������IN��2N=��O;������(\���RO�1�\�k
�r��i-���nh>7������'����|��n���J��� �.]�Q���!+�k�:4K����j:�JY[�q���^�����t�\����wPi��c6���j.`�]�Q�_�~���{��3\ �]�,�8�����I��]E��xK5m��H��1w9�lcE �.+��l ���������r@�Ko�e6-n�^�|����GM�D�C=Z
5K�t�Z��s��w9Il���9=l�L�l�����������Z�(�rv�����7���4���,�L�R\�������)d�6e�����7)��yQ�;��\a]R�7�dv:ac��U==�I�����:�
R�]��H�}�RLP��*��3�Y����2����v9����f)�I�w<m�n��������?��ly@K��U����m��FK���U��,���yq]Ld>�:�����A���cDw\���]�V���������G��B�S�.��]N��(����Y�(H�O;>����o�F|c�������?���@r����,��iu�����������O5h���^n�U���sJ�&��.a��x�������F9�� ��t9%]���i�q �"�54!���,��^oE�{����&���-�Y��Q�\G^��E����)�E��*������_���:���Ty�(����-.���Kw-��������D�����������������F(���c��}��H'�b���|���3�f�R�r���raj�8�w�$��X�{��O����5�(�@�7^����WI4���0n���
o���������j���Tm(�p��~���hk��������e����]���v2�
�i��G�%&�q%��T{���!e�|������]m�������WQ��r��\��t��"@��!@�t�9��
����]W9�����hKUv�T[[eH����f:�A�K?���>���i%�K��>Q��|�������V�
p	���8j%�{����-���>
�#���^^�g�� V��$��(M�����Z%�D�O|��;���*pBBCg�e������So��`U4�h���W�-����'4K[+�A��.J=�����j?&�����e��=�e��h����r>��:���=�b�6����"f�xz�<v�F������O�I@���������Z�k�f����{6�V_.�b�XPm%B��`'Ke��<��uJ�7��+�^P�#PjC��|rhgw�p��Q�o��(�����}L�����']�����C���q��Z����l�:�sB��nY��������Mu������$��E$.AL����$�G�	B&
�j��]*�6o
b���VZ��A������QA	G��a1{$�����u
�Pm
�2I��Qv��E>-��dV��H=x�KEWu��v����B�n>��q\�~'�n�^b�t5��Gh��&�S�:��w~rt���.�_�$~-�:�:\��V��x����0�KP�.Ie#��K*O�t��>��.�g���Q�-������@�_H2��GaK��{a�L������tTq�����5![���ChQ�f�~A	��(��B�wA;n/����C@��&��f�9����y:9`kp|����Z:�Z�2Bb<�9�j���
,
�e�u��By+����
fu��4-�'��$���Y�(��;
-��]
� ��[�D�� �]
���I\��7��p�K���n���Pw�K?v,���U�Jl�).��y�>O����\�mv5l3R����1;:��8��>A9�����m�]������WT���]C^�%xe��Yz���!)��n�i2[���`J�����
���Y~8� �(���c������_2=��e8��VtX���r�O@���n2�!:�����(�������Zy�7lO�^�r�z{��Y����[_�
-�Nv1:�jn+���A�U�!(e���m%��m2�Dv
Cj[��/�}�\�N�$F�d��VD5���XfeIP��.Rv�%��<�p����V+H�%�e#�!�=�j��
�fE]�Ev1���N��H�!�6q�W�I���`V�2<s����6DNf����Dtee.l��w��=S�g���ap��H~��J��:x���b�
����/�lY>�@���g���IU�����h?S��`�]�}�(�G �.�@+����]���(k0L�����K���i�k��B��M^�4�3�`�]��V�!$I�<�RB��.�*����.������py���>������t��������w��~�y:S��s	b����=a���yA��8����"V|��{N���t��
��x��=�������@���b��8\x9!�x<�7�f~n]G�8���$�����D�F:6�����������`���d���_�?cu,b���jZ���],�uYk"��"���j#\��I��!�\������e�N�j���##0g���Q��y-m�zZ	��5��]�vu��:S��y��=.B��ALy_v�&�@���T�r�)���u9L��_s����9<??=��o��5��m�m�!k�����d"�};J���]���w����)zi=� �E�����	4l���7�/�`u'��?��7�t����������N,��hF
��#(g�k�.^���7�`������R8.PD�-�����r��W$���r�����z�������o�J�]�u�`�=��V�������J�<�31�W^�U�s�������j�uM��5����.#D�;|��#�k������d{��E���H�G���g+	���M�P#L�a��A�{:�)�[�����{�q�=��tq����G���6>6���;/��_wX/J�[L�(�/@��
a-�~��Y//�P�"�G+{K<��S�fy�^/��Y���p�r�E���{�=���Q���1}�]�����="&��w�Cd�t�QY�x%�iCk�
I@��!�]i.��/��G����}���nfy��aT���7���F���#hm�n�*|0��@�gt{�����w;<90l ��3����G��Z�����V�� �=4��uq�/C�^L��7��B��w#4�>q�[	/��H�UD���=4��*8�a4����g�0������g���l�4L�\l3m��)�%���16	$�e��,��y��a2�v{pA){:J���G���.|s5F��r��G����*_��Z���������U�����E���)�'�`P�#0`OP�-���yV���L��:�Y���"*�r�.bU�s�v���9�g�J������s[�G�!�g�U��@h��FlX>�W�5I��9���A(�&�3Q1��fCP��g�=�����5MK��1J���������k~�B�4�����G
�

s:��p_qm�dw��)k�D� �=�$V�!�����ai{=PP�)6�������8��o���al���CXd���f�8����h|3OX�}�ij���h�.#�:�'Y��������(^����p{L�a0��A{�tp��
���.w<W��5��G���!+�`X�UF�dH{I��f2Sy�k�Ng��rM3������U
�������O�tQ����������go��N�����)C��b1��H�w�,��kf_�:�@�=.���$�`c���{����Ni%�l����
x*"����
��/L��������!�p]�G���+|uR���4����D�C��Ebx���S�9n>��2�����0�������(���J
�d��"�eO>�0�?����%����u<������ Xge���V�)T�4����T���k���I
�[��*{��kFn��9�AE����`�m*Q��G- $M�Lm�	
�SFP��YJ-v���������w�4�b{������s
�_{��gE�����02�|��#�l���|�	[O����I�e���Ue�����eB�S�zNttr[A����a�p�Q���D���a �z|���tZ����G�'�	������	lev�����bI��s�L�A�(r��i{��v���t#(lO-�C`�^��b��g���r=��#�1�_O��y�{&v��������%Xlc��|��z���]�mG��F_+�rd(��V������x����+�#���[e�zn}o]L���E2	
`S�����5+eL&�hO�GW��d���M�^��?���7G'mF��zf��G ���F^���v1�	������
xwb>;���|��������G�����=����/��}���Q���,������=f!m}����0!�io]rz�U%�R{J��C��&����ZZ<���B{����h�i����#�
B�4��WYc�g�P{B��jH�5a|�j�p&/����r\��W��8��i��xn�,�
;\>Y�������_�o^Z�'�g���'Pj_�R?�����������G�'(k_GY��QM��v>���}4�5�p��O�������������e��O �~��������cd���^�|]�k NNl���j�D����m]�~���*Z��q�����f��	FF�I� �}Q-��|����9��	2���h%B<4H�c��|����1��	�����_c>����!�g��'8k�^�������}���+R�e���W)���P�����1�K���y������^L��2�V��� �}������ ����J>�@�����Y$�C+�w��
IZ	��w����V�1\��PM���2�������&H����l0! s�$�4��	���:���
p����P��F��&~�k��&p���R���r>��c�7$~d��Y<�DSW�]�Qp�{y|\]C���	�NZ�	b���VF�5�M@�7�}�5��WM0�����LG)�5��a4!Xs�2-��	���������������2Q�|?�5��1�1��2����E�V�!���sj�'�n_u[[��W��>��=��nL��)�M���+#k8O8�le1?�+S�>A��Z:\	4��[/�Z
�	�� p%B���(@M�'8m?h���Z��"�������7�T������V�!�!h�OG*�	���bQ+�b����J�S��>�����x��#w�.�XLa�Xo�Jr����?�����gMy4� BF���������r}����*v���e���O�����,I�k<!�-G^�[�$%���t�O`�~�51G���},���!����Z�4��E�V�mS�c���|��PhU	�1�(��`�<vh����F���!�
�DP�>��,��ku�_��K��)�TLj���bJ+^D�ua�M#�����:W�'��3��t�q5�
'��9u�O����/��6�4=��������/>����W��U#���Q�4|6�{OD�a���>�3�=	����+�:��C������B���J>D��1��[@�aS+w%��2A)o�!��&�c�z7���	��������b ��w��P������7�5\��|���{��\�
(�A��j�����a~	��@���e����K�jw�A4���N5Lw5>���~�
����Kl�#`��6�"��g{'G��W��}
.��W6 ���k�����Jm�a���`������t
������@p��J>z�L�)��.�7�����������\�rPB1�Fn�U���^A������:��	?��h�������gl��M�uy#���3�tp~��\��Je W�p+���*�
��p@`���s#��@-�-�>�&\k���������Al���3�n��<��\1��"�2{�Z����m���s��=�_y6�U���J{$P�@�k�[*�N@0���fD
��4 @��6=j>�Lp�LV2�2i��?��p`�U�7�5����O	Y�t���UG*#���E#��|l�[��PL���\��������% ��@��n�Uu�j�!�$?��,,=@�0w��g�}�0���>WQ�(Q��
�H���8�'Bk�8?��<=?���7����2=�To�|I#(���,�
F`���*�1���������"a
I0����7���8��G��Hp�����{�,��'�f��h���@�
:��\�  ��@c��������%48����
��������
���0�Bc����c2���=�7��-]�b�~[e��d:���fm��e�d������^��>������+�U|]�����8���0!��
g��
G���aki@_��wH�Z� ��s�#�����@�
���zk��_��r�	l�����@������jC0��07&����n��AG1�����<	�\�h6�mBu1�������O{�{������g��
uNH�)��n�����`�{��2�%�����$�\J��Le����v��(mC�'H��o�1 ��0Fv@���.CbS6�I��-�Q��Ds`�Z���#����������r�Uw�`��v�V0: cf�^����G���[p=���M�0���U<g���Mz������RBQ���j�i���Taq��|�:���^��r���-��i�o������_&���lu����S��E��V+i����f�[W�5��(�t�vSZ����M�����V�!�L�f�~t�.�#��#n�b�T��m�Q����E�*��-�D�]�z������;[��0�_��"a����k����������E�gK��%����;0����B=3v��-��W���]�+V�R�i��c$6V@���ah�/��"8��0�6���N��7�e����f}O�d��N��a��Jq(i���:!D1l�Mk��#`��0�����*�B$�H�����p����#����fDHF���:n^N-�vT���.^���?��|7���>���~��&����m���:�1���oG���������+�j�A��Iq�����J>�\i"wk���A����_� N���5�?�?�0���%A�}C�� �,*7$~D��t#�L|�|�aN��A��L�v�q�BH@��&@7���u��4B�1�^i�����M!����Q�[��$~�m�b��4���������W�P\[������%�z;��!���^�S�����n��*�X��5I�I�oy(3��K��t�]0���`���@�C��R�Oi������#��|�?���
.P~H@�G���)�_�.��E�V<��'���@�E�<�o��GJ�u��p�b!XzX�d%GB�l�;_�B��5���[^��3+���s����vK..�sv!��:@_^��z�����|��cD_��/��j��CR)���O�N�bBJ@����RGLpE�"��d�\?4
�]M���Y$X��
�-/�B��54��z�)$���={I�.$���M���C�W�!�EG��CA��Z�}���\�Dyyq����2 ]6���u2�Y~?��&q�c]rW��P]ll��������0Zd��<�$��1�=�8�=�N���F�
	>�@���h<�e1�a���lOs����X�(#/��:p�R����CW?[9y��,��J��F����?M��_!���8�^������)���JY�){��.�
.Z����&-����'0s?��e��b!2�v��gg�<W�L�j�e�I(�6�����hzS�<�u\��Zl��NG�hg�;%�F�.�Y���y�����`��8�����0��s�_S��P2W�d�4�����p�<M��\E���9���S4NF���#�.x�\�C�a�xp|��!�������cl�G`��k7Z�����;��g����jR|�����������rpt28;?<?����������r��P�H��zk�
c�,�E��,�8-�|�&�e�	2�$P�P��)��|/�y�����7�!���<+��d���'���8vS�%-n��=;���&�Gzk�����
(�h8�V������������5f�kmv{�=����4B�[��V��F�t��tO+�����~�A�J>����z�U���e�2`o <apv����_lb'~M�	?�M�b��hq�_'��8[�h_�����@|��>��{L5�?����C���C]|rHkdm��j� ��[��Ch�����5���9I����y0S�����[��~�a�J>���kAY0^�b�Nl	�64Ba4!�Ig]-+D��1�^��������n-����R����S�[��
B�������$�g�R�]�`y�v9z`�?����}q���k���r�Lh!�>4a|H1�p5�4��x�;��M���E���$�'X[x���1��'�N.��#����>�����8b��&���v��OW�!��0j:��l�0J�K)�[��9���0�F���}hK��l���}��7���,����hd�6��i����M����q�ZYH�!��~h�{S1�����]mO��l3���p5��V���!A���ZZX	��-IB�����q�}$��h�w�a�� �C,����/�b�C��E-`3V�w����*����WS}S9OS�F��+^��\�a���E>J����\�|h����G�!�7/d�7>g3*~E��g@��!F�-xw6����w����}5����������W�'�/��u���[��n���*������5��c�,��Z|7�F#�]�1���|T�:_��9�?��X�|��������
���lz��x�k����}��WO�_����J��a���u�������2��g}.�n����'�y>{��;��l����p�L�|w����xw�������Xl�����"�����O]7�wl��f�R~?���;P���d���i����ebM{�*~����JO��������t��a�gtY
��`�_��n��d����o������������mCZ����4K�.��~���M?~wp���������c������}������������������8��"���<��Fa�#��O���Iq��q�L�������O����Sp�����O���}��I�����o�����|��M���'9�'.d/�3Q
�[�7�>�����_��?����� _6}d����,���m�O8�RX4��'�
�4�>���_�?~z����I���c���S���X�2�r�G�7,�|������o��Z�
`%X��v<?�c6o�/.����w/Ws��\_���h�`��a����>s�{�S��,�E�%+��oW	iw����'�=���zr�Z�p���z��?��Og�q!����������j�������K���C)z6���
��}������A�����Q��~��LV=B���%[� ����R���H)7<����3�b8������<�n�G~���Yj������U�l����6Yxaw_8=��_:�K?|`��){���k�M�����<�����~�0�2.v��Oj�2�L�Y]��AJ<���oA(�d�M�V����f��T$��c�V<�\,��V�_v��|8|=d�_,~\�t\vk#pu���ks��`��0�P��^E�����C��]�������-K��DQ�����r��6[�9���U���+9���8 B�,C�:��
&���G/��0����8s����4Cy��'�7���m���^�{����R�I��l�p�,M���9�-��S@��������.`&�0i;C��������m*����H����i�Rl�0;O�����k�6s$����M��#
�5�
=��F�-�}��e�fG�����N����:p.L�����=,��*��8��<��-���l����6XY4�QKk�p+���5X{��[V��e�����}�?����L����&�h*���\P���}YC��jhq�y;C{��m�v�����Y<�h+����[���T���E:_xJ�u�k1u�UJ�2
�lK�j�f��d��������Y�/NZv2��}Y��0,b{1�������=��9���M���em�_����2�l�
<x�r�
��/���:�e0���j��\d��xr4MZk���'5���x��2���6���`�!o���S���I����e���>�c`�ez��m;�O���pj�"���[�FRh�<�t��PQ0d-f�����QX�'�wt�mX�M��6�O����4��������k�
���6����1)�
���6������?��p�������!e�e����+?�z����r������l�6��������Q|3���B��\��a�����-���
��ma	zr����a��$��mPU�Q�
���$?�
����;�Q�[����2���7�����%��Y�w����	����;��
��5��d���
���}��}���!n6��=�_�9���9n��_������6��(������B����5u�������f��0�KT`9��*���x��Y�o=��,��N����2�}��'�������6�{��x�?9���i�����w��������)7�����-S�����
~��j\����mi��Dj	"��A$������Nj��H�����45 R�8�~��H���Y,$�^6�����r�0i-;����W����q�.���+/)}jBw�o�����o{��,��
��F(������e�/f�I4M��\���+��=9����	{��f�~�e�I]J�q1XI�?I
W>�t�C{h��l����`U]Q��e'�e�j���glbl��_��l��S�/�,&���]6���D��Ck?�E%����������"M)|X���wp3�&�:���AU���z����V���k�w�O����`��fY�Wh��h�����w�X��c������l��+�Wi)l�����@i)��>��@�-��CK�#*��Rl��ZJ6���x�3V���o����\�����������da�^P���7pv|�|}�m�a���$�8J�-�W�a�t�qB�o�:�6���B���"r�7���;����m!����fR��_��D�+&v�
�)�����f��VK��k6���y��)�u'���u\��CS=���V�0z`������e������E��������n+=�o ,�Ea
%0��wv��m>S�?J3iX��?�,z�^���u�l�����=Z�[��[���E�[\�����2Q>�T����E���������x'�����>���jU��&}e=Z���ZI�����g2+�8E�G�:V�C��C��[j1��d��N`��R�L���yu���<^��g���J^|�K{p�����A]�K-%P��o�R<�o�R �

��JiO:_r�A�*w��
s��[�N&��_S�9��/������� G�g���H���
���k�
����7��������_������A����7���������Q�����f����[�x���u��
�_����fk�E�o���X��+n�o��;���O�����������C^�e����
7���������0R����e��y�ts*}s*�d,�p����eka�M�M7j-������t�M�7?��������DZ������`����
����7��������_������A����7��������������:���3���o����N�������7�}����l�}�;�����_��7��������
���0����������q��
��!���AsCb�m0�{w���y��y�-es^���������iYG�?N�f�8�v�?(�0���'��w���,��_WM������P����=����|��l��*�������C���O9&�9)��!�4��-����]r����WO�YGy�%�_Z�����du�\��>��o�eps	�rX�Ur�<�y����\�&���z������������l��9�"�4?\_U��������{���k����?��2�6'ANR�W�4�*iV�ME�����3��C�q���<=Y���/��
�e�JV�a�,>���T[?V0�����1��d��]��My��+�g�����6�r����"s�*ZB�	��v�^<V��ue�C��4�cm��q�l\,���E-g���33�'��Q/o���N����N�L3+����/�Y<L���5��9?�U}�(���;��R&U�Xd��Yn���t�����V�)MFp��#k���	3���93��J��������nNH���m��[���gVW}"�>j��N��/���~�bu��-��)U��K����*[0�Ig������Y���������M�|��J>D���k�\�9���<:=��%z���`�����Atw����D�����t�&:��v�])#���H���d:/������H�J�����^z��m�'K��F:���q��b�d���y#�=�Azf-��>���I����:H��r!:��tX����0����������;H���zn���;��s	y�M����J'�l\���y���'K��B4K�g��A:��1%�������Y��\A���+y�w����I�a4�$S�)D�v��-d�&�#��]��d���[�yv�����s�n�"�\zNu���p��$D�5�E�����E2���y���.�GM�	�y!��,�G�V>���������xP����,O�H�u	upu�r!������oq?A�����������?>��{{����YZ{$���(�=�Ez:_�i��M�I��>���#����s)�{H/���b�[��y�c��p$�M�V�����!z����Y������b�H�-�`�7�'�����k+��+sF���6�f��-<����*�&P>r��������gla��e3}�b�U��^�7M 
�
��������}��V����6�f�*M��O�g��I�a[.��|�!����P�����s4b���Q����=lE���l��?����C�l��H��D��:�Ot`u�Yg�4Oa����Mi����2�Z�>��}�K�t%�ram���|�iP���7�E%l����u�-�@������[t�����E�l6�:�kt�(��o���<�������Qx��:���e�7V��|B8|D8����������J[�O����#�����xq�7@�zBi|�����hp�f�6�Q��	�Jy��#��\���mr}?`��<��H@�Z�9���e2�a#����*��A>N�bQ�Y
/(�U���AA8������@���v	c8[ ������X7�B,����T+�2���{]��M"�2x�\B*L(p���0s��"4��*�-%��>�K�7f�Ch��y�e��~��V���a=_(�A�i�FN�J�,BL����*��s���tz��kF��@^�l���"�BHh����n�t�1f}�.J�����	���Y�y��W��*�������AT���>����"��<\Yc��03j�!!���BX-d0�A�CB�B��S��&������Zz,�w��z�~q1�f�m����oV|;����%����U�>�v����?��qd��2"2W4�Y2��Ik��8`�MOBB�B���!��s<<���8�����������=h�X�[�J;�z�����<��y���2�k-s
�2U�����G�_O/~u�%Y��7j�=B�z��I�"����*�.Q?(*������W�k�G�YO�f��Cn�<�4�����':��L���r����0 E��&���'4���l�+?����F'���B�!��������,�E���[�5S�����?����Y�<9��&�r��/CS��%:r��#���r�D/�����$B~���&&R�vW�!���j��s�w]�������{(�����{���S?�'�Xs������#���#����<�;�����������:��G{���������F+��
6��{7��`���g�4f�:��r�H�����
b�{�~��}e�ew���>7kP���>��������Q���@ia�:����9��l�2�I�w
��7���
�V9+�w��P���e���4jc��?�|�>��>M,��\[����w��������{������L���H�8������oWY�
&d��������/=o�:l&�W0�iy!����&����i�"��0�?��za��|������$����Lv38��d�2�PG;J����O�Z�y�l
�'��b���`�*{��8_�Y���f�W��sf6\j���cq|���G�f�x������&q�E7��l!5��*.�����������R����T���X��FSG"�W_�T���A���c����
���#�e����VH����,����`Me!0�R~�������R����/�_��	��V9�O��1{B\���h����S�6�t���`Ym�e��!$�T!)�O��6����e}��EM��~k4]A�p����3�gC�l��/b^nu��l��1f�0�����m��3��V�N��|Y�~�,��j��=E�	p�^���
�w���0��3+Q��k��z��^M����Q*7�h���WDl�����J	Tpum1\���j�CIt���V"6���^���/��B_U�k{����+�C�
�k�h^yp#][�BRD�rm�+��uy+N9C��l��Mn���GY�RT�J��UiNp*5fs����"�\[�F����X�����
W�q��� ��R�^a��V�k{�Zi�j�5P-���=\����:m;� �`c�!�!�EH��Z��"�f�p ���|�Y�Ml��hU��{2�B+���������m&,+���~���u��$���x:L!���6	��
�+u�V5�6�
a=,�9aO�`��
�16��b��7�*�V��K��?`:�y,A���+gCHO�Zz�26x��D\i������s�(����r6�b����*<9�N�.��-BMC�o��p�{����;��X8]�����2^���%!��1��U����~��,��i���B`�v�Z�@��~~�NC�=�^���OON�!l����2.���A���q�@h�`�I����o����7w:�>���G�-�g���9w��$��jmW+-kc���
!M:�UY�,�����~-��U}$�i�������>��������������9'�����������S�����y��f�`����u2�r0h����*�h�J��'y���')/G��[Uc�=>4��Y������������������G��_�Gb����F����1���bjSlh�/����������Y}�|��Ntl���!�t�p���U���U����������6�^�9�TK�8\�����������i��ic��&�����pt�6��$z����&$��(���i�x���8���<����h >���G����b�bU,������q���\\��_��R2j��Q]UO><P:�*C����������:\�,I9KBF���k�#@�5]�@Im%��!d�������.��G���t~�����{�=�� xR[��j+����.�����]�l�~z�w�������}[��-��7t��KI���bj��X�*�Y5��!L:�UW7{�b=<��>��������������N(�;��1V��P�
���Q�/fy<�_��;��t
w����.N�q�0�`����e����^v�����J��<�b��N �J��k���{F��L�!�3O���0��C��F���9?��8:+g�W`�m���X���m+}���x��/E ����=���U���}4]�n�������r�����EXu��(�X����5���
��b�Y����g�le�KM�[b���Mp������d������j���?#����4|���%��:��TA`����9U��R�u�H��C@�{�R5� },l��
�U�T�����]B��t�mli��`�n���.x,,�s~���]9��;����%�O�4t��Y�C��
�j������/��l���!�g�������cp���<�uz�q�����Xy���Y�\v�7�W��������!�;
\&�����a�{��P<��9CDx��q��!������8��<�P�mJ&����S
^�G\����������/�����m�B�t�0�Z�J������1bI��q�9�ju�40vt����>a�p�l�q[K��VKL�U%,dc�r�a�r6����HL�&�����M'�����$$���l���E����u������k4NF��F!�^�K���.����������$���A�:Xl[u�Ip��v�B���x��,D�{�.�bs#$�Z#j�a��i�t��H���tx������C ��g��WVT#��E���D��`[9�C{�+��[%�*�3�;��a&��/B�6�!�1	&��,��P�`k�u�Z����::���`^�U���O������������)A�:XL^%�fs��hk�A��~�G�+S:������:X�^9BpL��&�x3�c���fLP���J��O�:^�`au�l1���*�1��:(�Z�*�"8.51J��8���E^���:�����b���/����)���x<�d�9A#�nY�<����E6���w��	��B��c?�:�Z�� �Itm
��(fmb7�U�ou0�U�	���E�Uz�~�X�3�������c��X����r�y&��on])�'utQ[���OJ�|�B��!z�:,�����P�2D����`�U�l��	��;�R��R�>�b���
A:aK
����/�9^��#&u0�T������f��jg�F�������"A�:�1Q1/�X�*�7�G`�N�h����t�����x�����)
�E�Su��r6��8��-�i�	���
���y:�f	/�8X��:�������8�,������~������T�����4~|ut��<���C��{]��5��4Bt�@W�],i�X+�����Q|������,�/����� Y
��U������N�)���m������?���?!@������~Y�,qc��=`_D6�/���=�������p���5y2] �E�zu0�U���"]<�������i��6v�{u�k�� -U�i ����=_�`s!$��Z��!��{g���f�t�4K,C�=��f�*y�K:�!k�F\O���q<�N�]��DXz[	�l��6
�����k]����j�e1liY�)!�utH���C`�������_cu'����l�A������q	*��P�����5_L�V�����x���k|�{i��z�3����@��%�Y����/����O��m]
l�+,�.����K��E��C�G/+.�����u���X���\��r�<*��d�V�c���-�D�r+�M2�����V��/�/�����2�XXXe�w	���"����.A��2��{���5����L	�EC�"I�pM~!	X ����Z��)�=���O��R],(��
��5�(��j'^�P�u�Vk�����k���r�j���w�����E�u	���p��}�4�nR��V]g2�>`��@�9�~�O������u	������x
�.SNp��\�xW9BF4�*Y�����	-r���T5�M�R��>Ey�h������k��K@�.�������4�`H]�!�����b��(/�J?�,qY�������46L�L���OD�Qqo�d���h8�-����::�g%��<�$�SeN�I%A��q��Nw	��5�O������>���M��I���q��T�U]]�[����Q]]���l�,V|�Hy���`U��U]V��toXuQ`��.���9�^
����"�������.v���KF��R���z{�_`����"s�oP�k���r�6=Sx��O.���>��������*�`^]��n@z����g]]0ZY"��@���}F�%X�[s��`��%�Y�2[$l��p�������wm��l,�_^�s2E�f��
�p	����
%9=���K��n���.���!k��=(���<��k����[L��l����tr��u1����	���`��=�Jh��������H�'�ZW���$�@�T��$B
�P\�@q]Mp��**����;�&�JP�.J�J��H���������e��e�"�^�{���y�����*_N�ui��>e�B\.����i�����/}7���}�i�_�/v1�X��>�]��m��fb��-�!���QB� �������~�����}��,b�0�V��D�@�kqweVu�V��]y���u�l�q��4�����@H����f9B
Mc�V��|�P^56�!����YB�3�4A%���,@	.�E��������a����t1-�������iycDq���$�dG��	��� ����(>.'v58q�b;��:x�7*�9/��������o,�����U���/��~-^������]��{��f��K��.������hp�-��-_���7�a�'�,7��L+�6���\��r���P��B���!�9,g����.����C�Qv�����a:�q����R��M���t	t���]f�$��Y�&�ba=*.��L`�1���p��$��p��K@�.
oY�d4}���,��L_�|ir�}s�����"opm���#a���s�Xi�mg��(=	P��thd3�q���Qy�YQ��
�S�Bm�h�p��M�o�GY����*��.<}ep���tj]�9�b.���t�����b�,��_�(��4��2�������h��#!D=�mv��J^�KB�+�%��{]*���D����[tZr��8�A���+���b�k!]�� G����N�]�u�����]��u1RV������em������AY^)�Z�I��}�����l���<p�}�`0j!f4�K�����E��Z��:Y�D��3��#PZ��Q��-�5?jU�wQ���@�HQ	�~UX��e��D<&������h�5j��<��L����h��mV{;�G���ihZ�5_.4������K����	�]=�<�x��Q�k��TZ�+�#P^�,��G��^��L���
����g�V_�H���O��^qLt	��y,g��*�4J����K7����>�M�x	���
�������YpW�sI`����|�g�z$:�G��i�[3���Z� {X�[9B�0����������
Y{�QE�&<��8>���cC��+�i���������.�D��.N-3���f����Un���B�����c6A{��l��F�#�Z��������8��8.�}yX"@a�n[��$���(� ��!�h7��8���U�#_�����3�I��0��p��a�i�l�.�Au2c���<-}�����g�� y=��=���m�)z�N�2eG�������������7�/�����n��	�q'=�#��&�~>�xW����,�{�amU
{�v���,�>��UL�00Y���5��2R)��+]�X\�����7%$�rn6�.q����
:�UWD�y3uY�=Z?����[�u��;~W5k��k��>FPn� �=I-'G�Sr���7(�����Q��� ��#n�m�3�5�b�9�8F��X�D���MauJH�7K�Y�ZW
���=o��j�g���.��5��A2�]n�y-.Wx��E��CY�����3���r�;R}T���U�2v��g��n��/��8�������k`v�D��>�=�mh��!��4t�����\%S�9�i1��{���uQ�]{�i1��{���X��xZ\��3@7�Y1A�{X�h9B`��f����F���>{�}-B_}���!�E(���	�W�z`��k�4i���������C��B��'�u#�&���.�������H�����P��z���w��a����=����q6V��(B��{A��������!�r6������tK�z�ny;����w{'��;�@�=]�j��L����~�I���`M)*��k��G��h-4�x{A���������g�^�^���Bv0f���
�y��A+!Pso����
ni�����e�F���c�a6�t��S�Q�4A��d���1���&���b�$d�X�`�3�=X�L�u��=-���x��~[��s�z�epR9�Z[�c8��mE�}��{8�n	]�G��}V����.�r��]���ah�\��&�w�����{������������Z6����Z�������=H�G.�$�P���@X�`�=�	��!:3
�[
	��{�����i���W�@�Gy=�-b>������	 �����Q�<hCAoYW�t����~�}d�2�dV6aO����{ZD\����M�������"�g�|85/�8a��CHc$h�G@��&�������q��n���!��hB=Ks4�����=����
z������j�*c�����(>q#Hw�m���M�����;����7]�,[�rD��Nz�����N?�0��,�>��:���<�[�������4�n��6?�v�[a�%���(���c:���-*��|Z���hI
�->��W��R��3�9_�VNt�1��V>��p���-]��Qz'�>���X�f����^C|-	�eq��6�q������t4������lWW���/Y��F�e��@���h�1mJ�,�����=�@@|^���8�0���&��7���r�����rnz������j(���8?����Ib�@�������.��G��	���j4J��,�Sa��#�n��Lp�7!���r%)�� ,���_�M"4��+&)GV��YNn�lv+_��������;A����^5G5 �c�	�4��N��2L�|���������t�2�F|p�g�+�~��#�m�p���T�Fd4��Op����%�ho�O`�>��K���OQ��[�	�e
s�+6B��qz'�>�,G�������q�3~�1��r^�����Q�����41B��`�}�}�$����*����cL��
!;�^�����Pm��p����M��+<����'�t_�?�(D��>��!�G!����n�q��Q|6����:�l�	��G`k��>���X m9B�\��DY�����z	����Z��'�q_U[Vx��1�[��P!����`\OJ���k���n�����M
��?�YLd���>�p� �l_��!=L8����H�1x	"���l9B�t5��_�9���i����T�u�^p���o�J?����}c�&`�6��O`��������^ou1*������������.N(�)5-F�,>>�X@n9B�0"Yq�L�������,oRY1�M�<��[�03������Y-���" d���!dC�Q�'(d�-���.�B+���M�1����r�*��V�k��@x���	����c2��{�c�s�[��
�e��ey}M���.
����	��������/��W%{B
� b���}
DL]s��a\3���}��!� ��]�$`��l!�@����`C��������
�)���"!�&A�:2XUN�
�5l0Z�����4����Y��8�_?/OZ�-?]\����:p�����k,��#@-#�c����������-��H`�~��3�`!D@�~���&sg!�uq5Q�� �(�����},d6�Z_u�l�~Mb�'�e��!��&�\��5�ao�u��;��c,4$�?����^���������(����'ic��l���`�I�����~�x|�����c]u�����kI����t��o��]����]^S�wS��������l|����h�x���i�Q��u����������-�k^a����q�R6�����*X�rab�����o+�� �C������(�P��x�;|��Y@h��F-�������Z�7��mi�"�OC{�v�-�z���p�I$xm���l��@ke�L��>��<�Yl���ne������WL��"+Hb+�����y:M�X�����1�����tX�%�"��G���^��Z1[���m$>k�(�-�c@��
joYbk�3,+�K,���b���@��b�z�q�1��5�s���(���V��!���o-�l����,��/��,��F���f)���8�Y�OX���[��\@���g��I���N������V*M�����iYq�)c6�H��ZY�j�g:�[	l�L�%���u!@�l��R�jfp������c������XX�����@�j��R�z5	�������C���m�#q��-������U��b..SQ�E� �[��$`�t�3�t.��D��atm�[�T`go�|������<���uj@��F^�C���������
!U=����
dKl�6��P@-gC��.n�<�������l:g�2J�a4��wk)�G��\���F&~CI<l��s��
W�e ��)��E2�G�$���_��e�[BK0��C��\6�[�=�eu�n6%�����9�1�����*�s��c__���'(�9H(P>�����g�l!-����(��<����j��Y$W��L����J5j����h�:�,gC�.���7�8���0]\M��x���>�)k���t�b9��A�z�X6�G{�~jY���MGp�g�:�0�.�[K���'������_C��P#��O�@�l[>�<�o��
4C���r�hKiw��8fQ+#��	���_��:��Ov�hG��������,��9p�N��8I������t ��@G���(Zx�-��6��a�F�a��:�UN��eA\&�
`_� ?����!z�!	������	R#xG�v�����
��[���\C�����!]�7��C�G����)2�o�Tr������LBB�H�r6��Q���x��u8Y�������n|�3������S������SA�JFhp@��[|e;a3]6�~@�5������"~���$�e��B�<����9�x��=�J��j��uZ<��T��k���6>��2�����`}C���;�)��Q)Q�l�����E	"80���}O�&c��H?j!I��I	b��i�
��#��0s���I�����Y�zrr��^� ����
1�H�p`3����;f��-�W4�����S`L�r>�
aD��
�B>��Qc�
P�w�:���[�c,�V>E�����:�{.�w��0�_:��9��(���u�`�.�%x��0���p��q���HQ ����A��IiN�z7��XD�@��b�'�j[�0��q����z�fY�~' ��@/������l�N�C���J���&p�Q��n9f�c�lB�d�Y�����7���J�"�q@��U�j>��,��
���r6DO�!Ui������}(]m�|����)hjj�3(�T���AL�s�e�-�Ctk2�����l<��m��"TD�����9�����
H6�2~�Ve��`�&wH�
H�V��
F��!�x��1d���px:�� @��=��x&6h�����������y�����~zq���C��{�`�����O�B+tT����������Jc&$�K$6)PH������XM�*�!	�$�O�X�zxr98>z}����������_��2��l����!
\��-�h�8�K��hXM�n�����(���`a��=��I��Ao�@X��<*�Sj��!�����xWg�tg�L~ZVq�Am!��V�5���*gC(�6��U���f�	co�<��O|�47�%��j\	�X@cj/ ��g�&i�g��}G��"�=�=L��(���5��u��W�t������Z�w���G��Do�li�'�Dn�4��+yr`F�`dhQ�'��G�d��n�����y���z��o��ul���v�WH` �R+D���z6!���*��B��5l�A�P����E�����������������	�3��+q�5�n�,������g�A��������kx��x+Hg����-B������]Y#���y<��zq���8gh<�������AC�a�0���0���4@���!(Z<�y� $��c>�l��"�*GP�!��������������Vc(���<C��T�'4%<��[i^�9j���iD�F��-�����9C�2�v16�����lN�w�������58�25Oe����"�`��Xd��M���IX�I_�4��������MDWGIK�Ix��������9W��f������1C�,�v��jw�i��$%l�4�_UuW�X[p�gp=ZU��F���`T���p��r�rx�g�W1?k$�[\���,��C�G7Q"#�!r�(��3�(}	o���.4���"T>l�+FB������~`�`���aaN�s�F1��4:�27$��c"��!S9C�k��FO���������BU�?2��m	�24�:�)=Eg|.?����'�L�`��^St�F�����iR�H���
������dh�M:=7t���ehP�
C)���� #C322$��PGF�nY�"h���~e
t�Q�!��V��j������HE�bk���c�!�T���Ai����2$��P�N��BP�!/�LE=G*����b�)W/�H�F�����p�6-��%J��i�U�v��� �GH ���b�
��'�|=���
W�e������E!AX�XXV9B��0����������Cl���DH�n��XL��!uF�S�/���_P��24�)��$��+C��,�,{h�1\+	�/�'9����Uus?]���TIv5�"A^��7���.W��V$@$�r���(6Vl* w��b�g����#TL�U�����/���=��8�������rA%[��������CC���B�t^!�����b���
��i�y��o�		b4�D���fp�����@GC-:��Rf4Lc����*~PH �!nV��
m��"�F�o	�4���%����2!,�@��}F���RLkS�%��0\3|b��K��!F���BCR�ER6��44��*�S�k���d��M��i���t�/��G�||ou�o�%Y,�D��q$�-���� �*�
?}Z;�����bn���Q�
��[K�l�(f��5���8o���3��VCcjU�%N
���ZMb,+��a+w�sd!������US1w}ac��)�s��!�p���e2�K�kH�rz��PB�
{��''3��<�%0���d!��
�
�P�r6�.iXWr^����$[C�l
{kJQ��+�`��+!>.+�L�5�p�4�L�J�5\'V�i�!���Bb�fCP�a�P
hr�Dw�`V��	�5l�mm����?�8��Y9=��@jC4F�&=�N��!��HJ� �
����#P�^��G�w`�c�^�H�����k��a�|{��C)])=j���o
��%�m���{X�W9�R��X����O��2p��|y�#���\k�%��J��o�����p��	���B��W��UQ����`�R6���
9����0�vw�$X�������0.W��z���p�E�#�������o{}��0��m�6b�]�����x�(CV	�[�d���^��l���Q��������!:�m�a
��������.��xK>��b� H�G`�=���/+�Lg�\��9	��	���b�L[������_�&l���H���D�d�'���6�kCz�pO`�=�h[Z~~�G �=	��!��}����}{�w��f��x���0���l�������p��<� y�[y` ����i4�nSq�Dq�$7|��L������p���=���l��=	�AC�*#��0����h �I��t��W����<L�(�=0)��Q�Xk�*M�VOXS�'��/.���'�m��3����B�qc��o��
+�����8����@r�E�A�	���j�z��G`�=W���������)'j����;��r��VY�J�D���`9��!�V�� %�$_OG`n�%�=���a�Q�l9�Z��h6cS��7)���??��<����������K����O������R+�� �c��l�^n�Z�l��d"H �=S��v���\�+�&���wg��{��w�a%L�	F����aU*a��i�]�\H��n�0i�E�T[��~sl�n�E��HD�Be�*�fw	��H#'R�������Xm��j!��0�������d���������u��65��6�u�L}�pnO�]F�m|6�hX(���99����&�p���e��M�(f�C�o(��Fk+�PA4�kS���8����H�T"���K�B�J�4!����&�a0q�d�5A�t�c������Uw�{������@��=
���8��u���|>?�8<��p�a�X��������C(����r`��.'
��o�ht����G!Z^��s�K���)�s	)�x])��i]�l�����$}.���0x�G �=]�Y���n/�b�����~�k�{��3%x1����)�Teq���	���6.��U�@�:A��t��,������
�o5]Tx��,�Z�p#2�=`�0y��'��n&|S\����6�p��n��	S��)�cs����N��=,��i_��x�M���MM� }{�+gCH�a�Z�|��"��V��k8�� ���zkj�rc��5ct��5����Z���a,/�� �������7b��xx�A��0�W���
������GP�=����!�0(mm��y�����|���a�ge�'�F���k��S���8���:;:W�M�\��
��3�Y�Y$h�:V�Y$��8��Y�%�A.gIt�>��hT������(�{p���=��a�<�%��^�#:��������Xc��_{�*��>����0�r6�������}�O@�}<t,��f�"���l����a��Y�*�T�	����r���h�EQ��O��������Y�Ba.y�tV��g7|N ����{q���]4�O�b�@M��vW)��5&���30I���WaSZ���l�����f��O�} �T
]�Z���O@�}*���c%o�4��u�����$�2�������P}�W���*K?���7NQ��XyWB&PH���<?�����sj[Q�'�	�Vy�(���;6����"��s��
���3�@9V��x�=���p��!z9T�����clep5�]l����� �t1�����Y�A����
g�����he����Ji�g}�q�k��:����G��J��tJ�`5N�����[�*�	`��F��3a��Z2.�8,��QQ���q,��its�	��o��VRe�)��s��j�U�li1�L��Y�|����0�^}3��o����Ik'�/�m�:bK�m�.�L�	&���7�U���G��Z�V���J��a:�W�qN-��v���k�1���K�c�/l��'3��h���Jm�"'��1�;�]"�@�v`��������h������G��e�������� .��h����|~�i*�m��F��	)����lD0	�o�W1�!��r
���f0�9{��/��_������j��|�n�I�r>�jzY�o��6�i,e�FBl���r6�v��^����L�Q
��G&���D�l>�A�����x� ��w[K�����GscMx�>����sj��W�j��`���!T�W���#�b�Z��@�Gc��/���%�w���Ko�r��z�<�"Q��s;�6�L��� �2��b$1%*�uOG���&[���]gX���������������������sx���v�o�.�	v��(�aq��R"�	4�5�/�{��Cp��2�s@��+>��!>CGdY�{�<M�����]e��Y>���.��G���E����Jk $��.�tq��.��5����}3�����q� ��A��HUt�MEr�<!��p�$W�	��kz�a�����`�x	�����r6�rP���&(�>�w�z�@P^�9"V��`�m�L#~����g�%a���dj�=\y��O��}����}����E��g��zD�������h���Ln����}_^2�Y��(�
����{EBK���Br��T!�|���7�(o_��L*
�'���!qk�v�|�
���>����o���	.��q�r6����\H��m�9�������C\��FV"��n���HU������(!�;����d<���7�tRBYg���5�V!Y�a}����"��)�h����U�:B�����	�O����0���/f
�K��)\/��q55@K�� SXc
���`
B��}]8_YF��Mg���\��k%sCDK��!��>�W)BZ���r6�`��2�#��>���E��j,�����0�������x����~O����E ���<�#�>
�*�Ctm��e�sy�d�:=~�U,�O�Dp��)(#"C|dqE�U�)����C4���TU��B����:H� l_�����>�
��F�J����D�}�P���7��"��So�}�k����T�`�'|�.��{����?._�w;Y��8�������}����&
���IV�
��w<H���x�Wt9��!(���!q
��	��O��-l��Q�?=8����������������������GG�������CeW��q�=Swlmcf��%�v���	4�h����������M��}���!t�o8AY5��W�4����+�c0��
!b��"V����|	�0��B�����!�>�����.�N���i/���D���<fy�My�6���P�*�?qk�����efa
��!
SL�P���� 9��������7���jNF�]=���tM�{Z�����XL���������gt+,B�P�K�����+O+E��a�ie>oWy���`,���f�[=�oPw� �����k������V�V6��vU����g����3|�E{��������P������Y�h��E�7��5H(�m��r������"!<����,}�Z�Q�J>�v`�}�iw�:����r9�};p�<a�����X_UZ��������?<�]m6��h��>����%VyzL>�X�3��z/^U�6	{<!H��$=5
��5j=6
�7����7{�{��B���g��s�'iC������z�`��=;=��Y,BU4����,�Bi����E�
�^B�4�{�thDaK�r��z�>7��z&>o��-�����|	)� b%BJL������^��P�	]����H(���~�P��w��o�
��e�M>�M��q��@8��vY�InE�����c�Cfe�,���AtS���*�4�
:6����u�X��O�x�v���j������~g���	����������<b��l��G�WL��6z�>7~��b�y�����6�-������z���ea��������Ma����3�|���GA��Yo����L�y�������{�V��������x��C���]�]�t?�-i4�A����:$V
;�5gNtb�8��G�q��t�����x�E��z�>7t=�
��}�����!ZO����oB�iM�:�c�+���������p�"���z����R�/��4L�S�����h��{�_�qg��{K|2��]�sb�Q� ���,����|E�"�*���4�l�a����
Y0���)�R�\���*|��%��eC�C`��#/b������,�b�z��m�z�>7k&z*>_��T8������TG��������i�}�w>=�
���������i=~
����nz>_a�C�������@���V5����X��8��6e@�N`(;���m�{h��`��gV��uL;�*zO:F6�d����Z����GO���m���������A�	��%�;��?�������H|����W���j���+������KF�-��{��Mv��x��k
��=k�V��,��wE�,��Qi�|�y��drce���';;���������7O����m������h��z��,.�Ur��o�N��?��-���,��=����������������8��d�b>��$�&�����0eD7���}�3��=���}��<�Ep��O����Z�s,{���TkV)E���U:�����'����O��?(-
k��El��J@�]��p/� ��w������gtY
��C�[��l�����^`���:��lZ����~�Xd9�
�4K�\���Q�����N�/?;�~�|wl��}|�o=y�������=�<x;]���G�,�f�wwO��<@�[��i��x*~G\��a=@�^����~Fc��~��8�;e��4����Wj����o^�y
�$gz����u&����c��F��x�~��������<�1������Q�#����d��8�n�8gz��.�f��B�k�h/|����(��#n���`�����l	�X?�<�`�����S�/Y�|Y���WE��7���
�x%(�x~��l:Q��>%����T�!��s�N_�7�j8����\�����3k��QV	�?���i�,�P���.�v���5�4/�i��O`�����
���ca�/�����j��)!V@�_���P���#/�1���{���I����L�hL���a��<r&��i�

�2�r��556U��)�����ufb@Y<��������\�i�G�l��v��c���/�(q��t_��N�������X��%����p}U�m��]�Zz���>^��7L���������N���_V�|P�K�[P
�YFS�L����_�UtS���{�`��r�U|[[�~�e�����0�C8?)S���om��NWVCxm��� &!�,�WQ�|%�e�N�I\���o]���qY���0lO��9[b/D��`3=�\��.��a������Vl�c3���h�H3��9/2�k/��������(����mi����O*/�/�X�
j�{�����3��x��f0�n��^}�� �s���n�x�eL�����.����@��i�v��^���}���7q.���,a�i��|��cV7������xI��i;��bA�����G.�Y��{_����+sy��\��-��^������Q��Y�_�Mxpn`:�)�[��|5�y��Z��v����Z���b����
����(�/�^����n7�,��&��_�f���
p�~��gl���E����"�S�����f��W�f7��I{��	D���N���j;	��f<���e������H���������Wlnh����_��m���X�^�U�6W�����t�� ��h���
?�< r`C�L���_uGI� ��=_��w������i��������o��7��������f��R���o����������M�.'��]^+�����n�������fk�����d����oq���6{
_��o�#����
t��f~�\�����-�O����m�_�f�l�o6�7	R�������3���nh����w��f��k������f��a�-Z�f7}����M����,�uv��?��<�(�������[}��A��xf]����
}�������!:����#�8K�^Z�5����f3�=
�}r~��������61�G	D��2Sn�^�c[+"�������R^�����@#�����+6�A+��gh�">L+��A�\EX��g�\�}H����G=h��b(&��)�����Kk����
>����`
�^�����>5�5��vV�U��~|:}���X#���vK�������	W��'g�:�?a/��L����+�e����se���V���}���������a����l�������|)��f�es���b��_��l��S�o�,&��lVj�Z�4�!���,k��v��������Uw�&R��;/�����P�� �])��x��?_'m��������m���7�?��7��������������=��������Hf^�5�d�<��m����8�6����o������+����_��go����6����o�����6�?���Q�����g����o�����6��
����Z����m�����W_��7�����^�0��O���v��O���3M���?��/��$����3��_��m;��������H�����mw��x��f�w�������HDv�6�ZM��2�^[�� �i�}��������^�5�.3���;��?�a��j�7���������&�����U�����}5�KnU�9�JVz�����_����������MY��Lm~�e��L�"��dZ���� ��s�E��U��jxi�n�x�c�����i����g���X6�����G�������q��3���n�������X�(��F1k���M�`=��]z]�$ ,	���O���,���	dd��y����xZhV|="��Y6}"�>�����~Jau��-�k;/`���.����+V40=Ig�������n��aw���&���tc�2G�'�Go���.�NO�l��h���%&��=�&:��v�/�G6��m���J��Fz���cm���g2�l��_[���f�V|/=���6�Q�\�~j#�te�8�a1M�|���Q�$����_i�!:��t@�<�!�_��
��"u�� y�B�\��h�<����u�.�d�4����k��1}���vg7�!�":��tH�Y���9���:HG���-�%�3�n��u��������=�Az���$F�I2E�B�t���<�)��f���c,���g��A�����x�'KHks	}p���a�C�G��k���MXf�?�y�L�f2������y!��@p	�����y!��8��\-�q�L���2����xT2���B*\3��g��'o��@��k&�Y�x�2\B3\3��b"8�qr	�p�C<h��]�O������|R��McD�=B&<D&�\���!�w����OQ��1_��@&�t�_�<��	�	���(���|��lA%V���|��/r�|�9�x��C�����HgI���$���g+_���r-]�����3�l��f���s(�Ntf[n�t����'W�4��R^��3�^�akq���5�����|1����<���g�C:�MmE��'���tY)�7��M����>��>c#>P�������f��M2�}3�|1>�}t�-�k��o>���k$C�S*n�dH�<�����l�K�mlyi���hE)}�������o����a�y���}z����F�m�[q����������5���kzts�=���x->��G�q�����D7�{m�����8���+���g
���EGts����?��=0�������=�t�x�z�������0$C�������> �}���D���} ��/�����^����f== zz���|1)~�	v(���������U�����b<n�>$:p����|u6`m��};4�X�D�����\��/���`>�[W1���q2I������vggG��!��C�M?��\[U������:�����x�/����������/������z�m�������n���������v������,���D��(����^��h�_C������Y�mkbf��:�
��U�!�c!"�A�PC^�����������6}�p�����\����s���	�'���u��J����m��^�abC���^wMl��6&-�acf
!f=#1�b������M=l�i����x�W[��d�Ks��U��V<-6��n�@������tM�h��`�"B�z����Ht6i��#����(r�#D��ZD@��������%�GhD/l�|H�&;������\e��#5����&;��g^=Buz���&����m��l����!�����h�u4�b�x5t�����:�O�C�,>KI���+����g�����.�o0?�K��ru���5c#o<��l��%���	��;k���y���)����Lm�@�'D�����!L}|����jM8`�]�MG�@�������Gq�(�m1�����~��$aD�)�����	y����f�A�����4�OV,t��'$�o2��`�^}x?X�$���:���1����}n���z	b��yH|p��=l��<��+��~������2�����Qq��_<u�?�^V�K�s3#^�,Y�y�b��<Y�$�s�A�����
�?=8�\���h����rpv~�������������>7���'�M��	Y��e-Yl��2���b+O-����6eL��zb���%U�������+
����hw���>G}�?�cV��c�0N��1��y�.2��F�1'��tI����U�}n
�p�����`�a^���������(�4��M�����&�7�<����Y1���&��m�����d>)�SqD���@LYo	��F�J�N �F� ;�������,U���b��dl���O�o����~�S���?-�Iy��m���U���,�"�y��g���#}��h��'[{r�6�zl��1D��x���K�7����1[�)�R��`���t<�/�g����r����Az=������� OH�M�1��8�'3��8�q
$���BX*k�~i-jc0(�=����1S�h>W^�P���]�E�����7�(���V��	�z�������W���1Vp��`,($��� Am���!��1����J] �?��?m���!t�1^�x�+��+H�N���i������O9B\cA�*��������G'���;]l�B����/Z�;-����1�SN��^y����{����`��������l!(P�@�T[���?��8'�1�A�$#j���$���_Y�
�=�d�`p������!����r6������6dm����&�K�gC*'��,���MF��V��>a )oH�L�m��=|�H�A+o�j%�'���#`R�&�	����$�F�`���Km�d�����+�-�@PmA�Z
�"�"tK������\�+m��U��U9BU�����6BU<\U<<@���rZwx$�W�^�T��4?�a>D�����#1j�y��|�w��48c�i�zkc���
�C(Uk�-5>��g�������z�8��Y�� ������N�iSB�|��S�s���fbm���?��������u_��������z�k���J�{s�w��\����gg���M=�P<����,�_�}�Q%&)�������o�p2��p�s��a�,�{&!�FP����+ws1#��x����������;�V�'l��{��"BD5���[��F�H����m�?�������	F����Hl�'���GT������h`h;"0g���liE�e�K�8���U@���W������r�7�bv;����p����0r�}��k{�^*C��T|�����,gCh�cf���"����a�����������;?{��@�	Z;�@��k$�l~�S��h�-�h<��{e����������������CD3	z��\r�������+�{#k���i�[�@�m3�&l;l������^��B�o>���/O�>������5+W-�����]�o/�N�E�j d0D�r6�xa$u�SA'�����
C��mm�����=	V�l1898z�aFK���p��,7�S�6��	!5B(o���<�"�i��e�T)mc���jS��R�oL��>�i�c���%���u��r^���Z���x��/7���i6�� �!HK%@k[Z�-����\B<����KH��{k-!%�MQ����*�=:[i����Z���
U��+TB	���`���I �����L����xO@�����F U�m���V����!�M��(�J�7�Y��m�C�	&��x|���\��]�vyzt�������	����N%j�5ZM����[m���m����a�mXO��=�mD	���Xf<����A��<�'(�C��v�0���;]�3[Jys��v[O-�i�f�q��0�������x��%�����svL9�������Sl��b�&W�S][PA@�`c�C�F@�����1e�{Nu����P�?X��C�4����������� �2�C�O�XN�P�|�'U*D�+��*��K������8�&:+��K�WD��v���l���#��-7T�kFPK���c����+rn�f�t�)~�mA����������muN�����T���0�`.O���4�m����#4
d��!VC-gC(����\;��j�r��*�I���8�u� ^��pD<ThE�e�&�*^3�%��f����Dh�9��`PN��WRib��g�v�����^��a�����r6���hl�t�C�����q�'����������������]4.b�'b�zy�d��$>fO8:��
!��[r�{���g��&g���]2[�J`_��t+�Z�y:IAS)������!�k�i�����M/�L&H�1���S����?D������0j�!`mGk���C����+���$����f���x?����(��	������������G������q��
��r���@����L�b���p>O1��C���W�����t��|��������&I���<�&�n�[��-����V�������Z<\CS�`�w-�Rl���c&�����BY"�V��v�)��T�g�G
���5IbB&?�� ��l�J��O�_�vvf�M<`u4��N��`��YG}>�*�%,�p����7�xgK�h�;8�Y��R���a��5k8N�X���Lqd��#������� xt�^���t��S�-�������;X`�2)Y�/w�����e�]��
��f�4�����5��XK[�b�'��A������b0�g�����cY{0r//d���M�[&.~�F~
�`�d����N.�N<�*F��g#�E���u4����Ur������������A���-]����%�@�<�'XvGlY<Ax	B���u���n���z��m���4U[�����l
@���&9���M��1�n5��O_����K��>:������0������5�!�v��j�l���8em|^���5[n*�2���4+��U�%�
;�I��ZD���8��"�9��E)��I��m�Y��-d�r<�Q-�j@h���xd}�;���#�[W1�C$�3�-�;���;X�k9B#��\[r�n��VH�N!�x���l��ve�%t���l[�0�d���&gep��C��|��G;w|j���La����5L����	��!�re!`q'0<_�=i�N���K�m�������9�,c����N��S���b���������o�n�F������)2���.v��� ��bx�`�����j�`�yc�
]J�z/�G*���C�1�V&~]�EX�rQ�*�<����7��y�#hx�E��e*=�dJ��Z��pbX�.A��(�}5���":��2�������F:�w0�^���Lc\�'y��h��e����;Xq9BV1H�L�HG���.d8O[��#6�������r/K�q0
�i��c����u6�	�E/\��G���$�!ty�Z	7;!a�{%�x	��x��5�p��,�0��$�c~I3���x.�2sv�������hL0��1�^���{
��q4Z�/�����-���_e2������!A��������/�d���S<�&<O�P�7!=F!���>Zrq���d.TC��8��`h��
�8�^������1�@�~S���}45!A�;=��[��n�qlIM�y8�8���,C�l��h#�[�.��;�x�E*,�@�-Q������h�e������"��G���I����"a#��uD���e�D�*"����h/$�g�Nk6�B����	
����o�z��i����tB`��Yq� ��x��L���� w���K6��� ����m�+���l��������'�S���&Bh($��I�Y0�)���B��F��)zo�CP�F���/�;�;&=�-����� ����c���HR��3X
e�JEV��������8���;&��2P��a�&�x�$08^��S�5DD��A.A����H��������a4�G����������-����kj���e�'8�@W�^��.�3���I����cU��D���uRFU�WN��o������<xszp������[]���R,�r~z�s���r6z�s���E����t��p��qe_�K0��3�����/�<�q	���@v6c8�=��t1��,�qzu��Qc4`k��R�M"4F��&�����I4�n�)�m�+���'���g��Gd��:|sN&\��wuq��"��x��wE��g"X�]:�oY�q�cf-f�<��T�n��l�&����]9GB#��������_��C#bA���m��s	������B-ta���?�1�8q"\�w�1��[��\�;V�gp}�M`����6o���$��r������RaS���T�����������k��z�c�����~]d�&@xW�[���%Px�����B��������O��8-����1��HP��cD������2)���],$y-mYG���R~z����t�.��o��`z=}��3�&Br0�]�i�?�Df��y~o]	�������38�?�"w��T��P+,��d��&R.n;�P��1�g0��%^
B��-&��'��b���&B�0@^�i�����jy�}�;�k6�^�r�be6\��!%s����K�d�L���Jm��YR����`�����	+F�$B#u(~����)��O�������7'0��{��=��Q8aO|�^���`B���"��H��QTvY+�5������*���Wj�Z�v��y�����~�K�hD�oY�g�1klp��c��g)[��pe��\H��f��@�24���Ea����s�q�2��w�}���L���qa�[n�*�D�����C��5������;������b��w)�^^x����r6�
a�{�v���y�A7����	�"�<�~l�j$�5��^�"�a���w��r.���F/gC��gz���.R;�Y��b��������h{�.��:�]i��$��c����u�v��w��3n>��y�:�`�]*�:x�Y��!��e�t����	W��F Xsc��lu�M�������������� $AG��-�@�]����������O��]�(:�K����*b�vx��V������>��1T[����`��b	��EQ�-~���1]��Y�S����f� ��"�I�;���!�)�[��v�p�
�a�������0i���[9 D����	\�q�����hy0 8f70����D^�YFl�DY���@�]A�{�f1�]�u1�W�yD�`��<�a���G~:�6��#.�����<��u1������k��������������r�#8]74��������F��G��.���}�z��lG��l]��U����(gk���iq&�d�#\74�
�%0Z�h�z���=�����q���e$`[�,�A�uM�>��[������@�u1�V��J�����	�b$��	�E4�����_]4J�����C�rS$�S��k�����bD��		����q��C��\��q�r?$�T�,��K�n��;P���bV�3�}��)wg�������Y�I�e�M,0�{�r&�Q�F�>L��.3��{���N�����c���2\��qd�w=����f}�#8L���\���4�?��S<���������m�cz��zZd6���+�A������2Q��:h�V� 2=
���a"�0������X��U�����lV�k�M���k����r6z��L�'��L����/��w�+uD
o[����]�
�N~����s�����s?����y�R�z=�(��)�:��wz��u
O��
�;�/����������z����M
*gCH�o6)8�]zv+A���

8�`z�)8A^z���)!:�������Y�'Wc�F8��*=������N�E1�C�-�+�Y��]�?�������H|����W�_����b?��O��d���zbw�O��Q��W]k���Y�����Y4����Y�?�����b���������WOvvv�������Mo�Os���^�E4�_=a��*���p	�R�����4��3�W���E�-��WOn�|�rww��vg�fp7��4�wyl\V
�c����v���#�!{���M�'?���j�����R�Y����W����6����?m��������]�0��h��k������8�v�;�����3�,�������_:N���vC/�]�v�?u?�?Y�/��,-�5��?��,"�=��o4�����>�d�|���:{���h�z�bw�����{py >�v�:��;������<�F`��B���x*~G\z��l��������	�m�����Y�DAS�@0`q����3�J�/��`36��I�$����LT�Smo���j�v��K��	������l�z����8����	����p�eO
��I;(*|�e����(d���n���,xd��["Z�~y�!j�}9���T_�����og����ojY7<�b�J��`?�c6I(N�X�����j�����9�
�}�F
5�LG�g�vO~���5�v�#���T�h�������������5�/����V���6��,$)1o~�c���s�]����US �<��`����
�(^B����{���I����Ls(:�F]�E_�3Y�M�kho�q�#���m�����H)7<�N��P��4��?���zk���������{�N�z�XE���q�"w�N���}��,�~�x/���e���Vk�{��j�7?'�A������q|�������?���D3��eu���0�������e�7������Y57	��GZ��(M�����_J.�_�?�ML.��!�:9Y
��y�O�
l�r�s�^E���"����!�����.����,�nY�'J�M-�5����u0�B?]�m��Yy+����H'��LD�F���������z��/f�����mi����O*/�/�X�
J�{��)�~�	�H��_:�G�������N��V����|sa7�>���$�2��ic���:��MC�����_�l�&�y��%�������i��L�f�����5���j��x������jz�bn�q����S���������vp�e�8��pm66�
��!���$U������?t�����5������o���������o��l��T�o�����}<��_�-47��{
��G�'x��G[���w����o�o��\a�]�otW��t�C{\��.�7���w�7������L��^%�=�lW���3����=W���������o��������������C��/FVkw�����`eG+�i���x��nd��4�&��m\���g{'�8��=���7�����7.�F��:������W.~���?� ��x1z��T[�t��c�j���jO*���(,	F�����R=\�>��ea�al�&�`GD�Xd�-��2����*s��V2esw�C� x-�
��t�������0�'C�G�kaYXX�-k?�����:����������7};N��;�����V��<]��6��]Xs^z}x
��k�\�9���p"�\��i,��R��O����4��F��gC�����������)7���w����]�5�����i&�L��n��k�5DG���4�A)H��B�������-�E����i,�n�a�����M����D�hx+�s[2���h nN2�����h6c��K��52�y�p��3Y��M�Yv����ED��.]�����fq5N�[f	�sO+eUX���&7r�/���V���FBnd=���*$7/BX�+
��EHz�`������,��t�(������2\���r��w��Ki�%������G9\���e�"�;��������`�9��#q���h������u!f\J��;<�Uf�H��%�.|������B�",�������������
&�cq���d����4�WZ�v��L�&�^<��
�Y��
��l\Hx�-�UR���*��B��pi��8eR/�k�%��"t����5w�-��d!��E���sed�b�L\
�,�;VCbX���Xn+����#b}��z�������n*�W~-��d�{���X������QL��H����z�����U�zU�r�-���� �x5<Y��d6���g�$~��zGaY�{��t�mg9����d9�����m��y^j�4��"$����L1c�����]��d	�C/\
��n�����5�XV�5q����}������	-�j���A�X��oK(z9!��E���KM�����Z����W��wt)_TN�z?�r���*BP*�(Q��u3�S(��5�Y<O�L���8RH�*����6�X�6��o L��
���h�
#�(s����/{&���M6���Pn�]������"Kp�����|�P��5l���,�����U0�&%�g&�j#K�g�<��@\]haW�Z�r��*�8.g���X�^Uw�X��[-���5�����j[L�K�@(Y��$aX�M���z�l"���K�o��|%��Ap��
�I<��>�M�MY���D�����^�"h};��a��@W���x�.KJ����X������GM���4h�=�������+
��v%�%�(%�,��k-�Z����N�q�_�J ��/L�[f���L��[XQ*�rF���s%NB�X����r��T(��n���|i%�C('zsci��w���m��p���B��Y�,
�v�n���[��%��D/�\�S��l�� 5��G�GU���A��Fa�EZia����I\�RL��������v9d�o%�)��RX���sx�������P:�
H�8	��I���h��@��lq�
�	_���J�)��~v�<��Gt�.��H\ia�B�h��+x^�N�d��c]��XDMO�Q/��h!��v�������/��x�C�n��}O���N�@c�d�6��WO\:ia�N�[�^�9!_�	�'B�]K�4.B]�{)�����	�C����MH?��	���e)qo����R9wO�~��4�@�~e�5���=)WB3�k)W�Z�D�H���e��FHF@M����N��%$��z���J#$"�f0��q:�/
_�bGz���k&��V�������hQ��[����~��N�PX�I��+-�����D\=i�WO6���/q-��]K�>e9���3�����(![�������d#���YD�
����#-����@_�B�g�JI�R�����v��zM�Cii���5i�!�����$�/������v���c����
�N�L��uq3��3���'.����%�k��I�g��U�&����{#)�&����{$��"��I�l��Y@T������;)���Cq����3���2��J��R�������#: �B�l�Z��L���b[���E��P����K!�y��R.D�����L����*�)�\�/cwF6?�s4���}�TRzN��w��s�~��B�?� �"�8vQ���t2a��lv��'=��Rz�$���h�L�����R�/pj������X�m�BK��tt��7��������RzF�8���m<�����H�`�T�jbw�����;x�0N>�o�W�k�,�����Ak�i��x��o�������u~�4�
�����n�����q�{�`�d��*6qE��5���qv�:>k�����bc�K6>�������"cw�DF<���
�"	����^9B?0�$)�kh��<N���h���W��h�hB40hW���������z�x\-���'���C���Q�r6D��h[��h��dm
${��y���T��Mx�����#��rND�P���}J���#��0��Mx�����E�����\�����Cm;{�@|P�Q�x���Jm[�1l�r>DGC	T�R���(>
	N�L&I���
Y�N�%��zw���D�/�E4f�]�i~1UR� Hm� �����S����DW�+_`��$z2
��g�]��>m���W�� %�V�(�\���q�9�������,)m���+���e �VSy�}��� ���j;�W:�����8�����0+
�F)P����"�8�l-@!M��]�7Y��Ev���;����M�`.m����U��u�o����q���@��vc�M@�6i���}V�����-�3]�nk��E+���m���&�J�*+������#f�
�J��q"f(o���"��Y��v����k�y����J���
���E�J����(�$T	%0D"c��d�yw_��:�gx�q��lr���|��MP���jV��B���a�r6���d����M"�����������A����Zk���1+&����c��a��zF2����|�u��+kB1�R���A��e}�	7�h_/��VK�9������l�t�h�*�����L@�6-������<��m�#�>Dl�.HN��N@�5��8:Y�F\r�y�X^�D��!�i��%Vhl���STo'�d(�& C����N����977���W�
�\]g�7���5?�'Bta
8J2���L,R�(�W���cV�JQ�g
��������Q��mq���%��W�m�2�e�3X��Q!���k�Tx��"�t�,�hc|`�/'�H��v`8p�,�%��l�����"Rj��L��l��qK}��,J	���HByAJ��6����-�U�G�����r6D�F�@K��������"�?��,uMO@~6
�I��X[��*�ItJ��+,�����q6U�AtJ���!zd`�_��L�?��y���|ptryx~�����@���r6D?[~m�����:�s��"��:D?]Q�m�"��z���y<.1��1���(Vh�
���h2��l����
m�7��!T"4tl���E���.��>?<\\�]�������	"��h�������jG��G��.���:>5Ei4n���Tk[�dW*��)Lk�1�Q��%�T,[m��Y�r�]�p��%�Sq���Oa����k�hX����[K��?���{��-�y��u>�f�3��y����M���{�^��iEQ�jpI\'���
��KJX��?��TB�zk���@6��Q�\�h�J�3T:U,�+N!#�%�n�j>�������r6�����t���b��<���5�m�i��ic�&�>�aE9�8���	�^�;<}_|s������dp�npp�w���R&�O#?�dw���m������'�����G�{lg��b��y������B�6#Dm�����jE��w�V�X�$�Q{E��>��A�,a���
�&����b�/��F�	)��P�oJ�6|��������!8M�4U�i�-�a�<U�^[�Bpdk�����Nf�*>P�'�T9�Ao:��_"hL�1���������0� ���{�������{����������Zq�*/���_*�V�@+�D���O:]�U��U������54v:h�`����r��A#�Z3_�d>�A����X)gCtT4����	��t�����'tN�A����O� ����r6D?�M'��-����!�J�*�l�^��-�2L��-��u�h�	������t>g�yd���	c�-&"V���#���$�� ;����DW6��tqs����I>��[�$���x �$��.����o�o�������y@�A��.M��@5���Pk��0��V�%��?�5�P&��������r6�~h�M�������\5����J�f���1�)�T
k���r0S��tt4gaL�N	dK��s����	�Y�|��C������� �b(!BhdO�������k��:��h��
D�(.���S /�{�"�S�=Z&����ya.���r��
jc_�	�T���j-�M��t��������
�<����r6�
R��2zT�CR����Q�u>Lqn���n���*G�~b$�:�HnE�D4����+6�A���
�C�������f�[���XipM����\�!90�xk��|�4�������F�lN� ;���7�����:`*gCL{��f1U��9,u<��e2�d;��x�O�4{�[u0lU���V�t�C��v2����?�|m��S9B0�TYh��@���x�.�S�L��G�#�!HQ'Ee�&PQCE���z:�a/������N6h
���;Q��X��f�;y���y��c�`������!�A�F���8�u|�e�xiH�S�M�B��)`*L@T�`D���!�-��*���AM�����h#���i:����W��M�qD ��&�d�p�J<�����5!#�i6��3�����#����{L<z�p�A�:�8��y<��x>a1Rj��@-��{�E��mk�?V�H���SM(�e4dxq�M�DB�`��D������9~�QSN��,����t�7��G�
+<��Y6���:X�JY�	t�	M�n9��c-{�7�F��@G3t�!�Q�]���,�Vw|����'�<���$!2X�K9BB��w�����5[���
��4����J U��f���u��4����6���F�,�e2�������8�����`��
!��h}H��z*G���c�������W���YG�F���ze���_�������"�P�)gCh�)��Uz��V=�t��!�P��V�M���l$��7}���� f@��_k�j}3�q�F[	M�`U9BS�ki*��~������Yc�n>L���0$�#��X�Z�-
5�I,�)$��EP�F�B��i9;B�L�T�^��/��:}C�"�
�su�8W��\S��%8WW����Uh��r	���W9#�,�8�j}����D���g�+r���p7r2%����_X�o��k��U��K������U^�����ku�~�oA7F�����8l}����?�~�I4�n�Z/��M���k��s�rFzqq(��W[W��o��Iy~
���~R�Aw$���
�k�M�1\�
A�����-���T�������iy;'�����X`���/{�+����T\�4��W~Wc2����b�@��M�K��a������^�o�x:�/.*���(/��6�y����X����Jj���;���`|]���*#�L�S��^<��x��Kt�{B	8���`������o�q>OX��/�E,�O�X��D����%	t�E��-����������dG�����;�v�B�P�x��c�����5]=����k��R6��>S.\���C������F	m�������y����~��mVBK�|��x�/\�V��Ce#�fE����E��C�f���L���4���(!](�,��]�Zv1j���t�:f~r��zz?9o���.�/�X������2�Vo ����\�04&��T&�bWBV����[�r�����k�R�1�=Y�Lf��#g�x�Gr�#��P2#:�.:�q�F���V4�>��u�U�:/s���/�^[[R�d���]vua^���c6�g�1��
������:$��H���� ����w	�������������M���]��/��p]���.A���>"�?���b�"O_�o�^��+�O��
� .����!b+��Q^��@���k]]�XYE	����w�����M�?�V:
��J�"T��n���5�#?��h]����������/_Ey�X�U9B(������2�j���������!j_���
`)>���S���D���:�
iOhv���
��MQ��[����0�k
����
���3� \]3��%W
���o�1IpN��q�R

=�@[]�����|����j'i~�����;E#�A���.A��h��U)�����G���2�|`�H�p/"k��)+������5��R#1��*�
vw	k�ig����2'4�7�(�{U6�R������esA&����a�$O��5E~{v�����"�E�::Y��8��4
����`|]�P�T�X;��;��mk|��/Y�?���2�<S�X5T��<���6��m3���]�"�"�!�G�d��L6�L�q�7#���lw����������*UD� ��j�f�l�����X�e�����#~2����
�r���62�T����958	��EbaT2���tn�'gb�c�N�{fIyL��������l!&(x,l*�������]B���4 �������]� �w�s		��5��k�%Z��A�bPqm���y���=���
���]3��%�c�;^$6tfq. f���]���[t�$(c�4`m������=<��ka~��K@��)��Z�f�s��:j'�K�xr��-��`���y\\<t��x����������!�LG1$�
��7E�W����w�?����O__�gg]�vu8�<���k�>����Z�;����#��n���/�*�-9����W��Z��.28N/[C��6��������b�)��*��<����ks��K ��a�Y������rWwW:�q����^�1���x��U�C.+#6'������#(b��"���oL5�@ ��.f�,
���!�.���X�XH�4����K�Y��Uz�J�;(A��:�V��{����7��������F�
�}��H��W�r�v��W�m��A����z^��`��P
3B�%]�o������'�;�����`M�z����o��!���%w+I��A�z(�[M[��� p�0Z���|��p�L�_�J�r��Gr=���P��nQ�{'��?K��T�3��� �Mz��P��n�������-��<>�4�O�p�����'�yk���s����5`-���n�`�t$�e���C��JRz�^b<�����BR���n?��u���i'�2�c�z����F���u���4���P2WJr3 \��}:d�2��]����q���M�{���zT �2�!�K��Y�`��m=���NSA=�IK�@z=����R�T����V�<aI3ng[��ZA�zd�cKmz�Q����i�B��<��0x�L�����x����IZcXp��x=���A�z�I����GF����#��7���3'�7�����#c����6����L��j���{,eC���i `�-WM�1{4�=4 p�T�o���gF{=����U��_���d.H`��
W���C���j(,]L��.�l��qb�\f���������@��6�b�<2�jYcL�UY�����s���4Y��hR�+��F�tM�D��+'�8�J��=����F�Sp�
�['�����$�/����������aHtc��.�A����b�W5��Nv|��*���c������5M��YAY{�"+�D���):��UX&�+�n��&����36��L@	1��L����Q��E�;��k��#��3\���#����^c����A�{	.gCh�XN���������q���9�6�$�GI�������\��F
�((�L�(C��e�!�Z�������K���������h��=���n��g���y{��e��v�����l�F7Dx��aa��q��YYt]l����Hn�zB�L�m����s�#�m���l��?7���9�d�����C��a]-�{�<��A��R9^�Rwe��B#K7�������BN!T�(��.�@���=���t,�j��h	��E��-�tx�����������OTTS���c,Sg����{ZP\J�����=��!��8P5�T�������������(	r���V���)�H�E��I�WB�0�\N
{"n�G��e�c��K	���b\k�����`���	�	
fD���gJa+�?AR{T`�e���A��l��
��Tv�,����
���i�3J���Pl�w��xW�I����n�$����U�GR���k��'����^�:�r
�t���0G
�w{�Q����,�*Ad��iw�0�!�q�,D�G�������5G�����������3��$PoO�z�KV���!��V��SD����y'{^F�d�w
�=� �t�d����,Hc�i����I�^�a��V=5!��q��5�>W!��,TduD���.4t5�c9A�{�1�m�RW��M�f����?��]#$��7,H'���[~�N�@����M��6�����E��g����5�8��eU����h��2��:���O���	��M�P����ya�y��0����r6z������}�s�����"e��9�k�sV#5Y�e_>H��� 3����q<�:�����z��I���^�|#��,i��>.��WL�����!��W;��V�'9���q���;M�&�#�x�	��7��M�����G��zY��A�G�C�% |��_�������9Z����2�d��E�	���P�-
W56"2Yk���������r�����������uy:�����#�
���!k	/gC(�m�8R\j'�'�w_G��#A���1�9f��;�IZl��!�;�=�n����
��t��o�Y���V]W���M�`�}�lRV����o|���1�^����C/�2|���Q~uV|�fY������bc&{�m�3�>��hll������1�j)���}C���C<�|���>���.�	���Y7.�b4����u]��]��e���I�yK�x>A��:�]���	��w������xOqg���bb-�o�!�G��>[���,�"|	��!���}N��@����d*v��O�~���,����nu�M���5�`�}��V4���}����* �IH�g<w��1�>�8�1����X�8�����!wcm��CT���}�������}����2BE5z�u��}���=�l���&���D�o��'j�8��_j4t�N�
����G<��b9I��}�3e������k���i�%�`LI;U��������"Kw���}CoC���]�"��%�j�fI�����	���q��,�`�}���>8��,gC��i���<��|1�V�V��&��&(JHRZ����V^��Q"x�:������|�c��W���r���o�O�q`cx���d���D���!:����6�X\����9_�������)?z����@y�'`������cp1�'2l���
�NA
^������N�X�V�����|�$S���&�� �O�>��OP�~`���7�T����8�����&D��8���>E�e�'�]�w�l������+Y:a���:����~��'\����i�Z@���5>���f|n��x�A�>Y[��xj����-��t�U"�!������`	
�!��w}]�g���!x��n����L=��s}M`g2�NY��H%<�������T:������Emj�vc3.��C\�����X��Z�K�%��i�`}����^�'DNvZ���zS�O�����j�ch��
!H�� �N�-'�����u�{t��vU\���9��	���A��|H��3�
��/��M�?�|���������{�G����?^"����}����l�Gt��{F�R�e�����r��}m�\�@�},���
��P�V�>���!�\cr���.�s���lt9��3����ajC����+�m��X��E�������S�������VC�5�;����A�>��}#�F�k��|���������d���./��K���B7�x��[���	�O�����Q����Rx���G��0u�^|D��&��b�	���lF�i�V���bDW_Y���V3���M����������A��f�����7?���������
��[S��������]��
mI1wG=A�f4m@���M�����A�f�-f`�En@��Aw��RS�i��f1��B�2K�� q33ZX1��k������1����{�>��f�m���_�\���h%-��
��Y��Ez@���	���'������i��5M���r�>�X�]����.�yx�Z= ���~�~�M�v
�5��U	6�8X9B�t�+$y�, ����e}(��L�y1uc��!����4B?�����0 (����@[�\���w;�79�_p�
��������
���h�p\x��-{�����!x�������l
�%T�i�)�G�O�_
;�����
�.��
!��$Nuc�m[v��=�� ,��������������c&cp�w�'(70d�m��� �q�FW66lz@��.��<��n�A�r6��hbV��b���`l�\��Fal��
0�V����XtgE�	&6p�n�Gy�����@�+6}-����L�,�$c6����[~$[��2�::=��!:!J�n��@�E71p���Q�Y��F��'���e4�2�`k�"}�`"T< (��4�����c&Q7��@��i6��n=�����Ep���b�d�6�58f��DG�0Z�9l�{���9�t�929^�s(�LH26��?p��N 
'���`d]���l_B�_�Sf<D}P�'�3����)[���7����/e����m[�?��W���r�
�#�U��<�/M�<����T�xtr��/^���l$4E�V�$�t1�I���l���9V����U�&�3�	R6�����4.Y�B�����
�B*L�
md&_!&Tm`g8 x�@����p����,�C����<����� c�����K�I������	�Lm��!�C7��m�[�/��/��"g����%�T|�1���/#V��&$qa��9Skv3`3�h�������5,M��@���jm8:99<������d	��g�0����r��ZkD:=�:�VV60Y��{�l
�nU�3��������+]z�5?���6��W��wr`-���py�M���7%xB<'y<���a��;��?��K ��>�x���R� ���4Vh���VdC��K��I�x��a7�&�M�;�$���r��	�>X�c9B/��z�C���2��"��{e}�;>��up����������K���w��@bqZ�sx~~z�,O�9��������Iv�]j�.�#k����	K(q*����w�f;O���:^\����K��u�&2����\�Yem��z��ey��k9B�L��"�QX�{'�����/�E�����c����������xL�1���hFU���;nP���xp��;�`��6�\\����9�<v���6�Y���^�z����m�[�.��}�;huY�>
=���,���
����^���'���)��m��^��x��:���]����m�E�+O�^@`��qp���{Y���]�`���Bj�k����~���o�yS)9��D�i��}����_}+ �2r
��^3z�}Tx�Q�r6���U���x0Og��7"���z�^Lu��8�������5�C,r��
!��
�K,�gE��������kk���s�@����������s��'�H�<��'Qy�{�qL��Ao�u0/xp~?KF��	���r�'OR����F�F=]L�$����Iv�<��� ���������q �Z��5�Ah�q����o.�T�@��V�l�Y����z����3R�������A�|��Wa`���9Ri� ��p)
�����#V/_V
.��N�9�x�F����NN�H�g���|`��7���_V��E��	�l��|���������e�4�q�>h�Z]$(O#�����J��6g����Y�N�����.K��*
kYO7�4�Gp��+����$_�%���y�	L?��d� ��bH������7���|�]np"����C3P?$@��8����"C�5X�C=&!����Q���:C����������}�5r>�v5�Vbk�N���G�� %&�W�"�����-*&l��me��!�
����tqske���6B���N+��2\L.�9����3_4��S��[V��=fZP�3��`�C���M��9r+cO�����j�����l�i'���!�!���fH~�5��c�d�A�_�?L~�a���C)��������B[�C���E���X��(�_����,+$����GQ�w����",��~��8E���4�(��y�Ps�@�lY<�z2�����C�hBh�{�R#D�4�w��0[,�u���
��
�����1���|��a���d�h4Lb;�olsiG���u��!qx!�b��	.R����n�r�d���8�:�>P������b���1��&�����<6���b�P�Q�4�&��~�7���<:�_N��'��g.'e�S��g*B�L��
q�"4�u.5�D��m^��m��8�b�1 �!�E��ax�G�
�3!v�C���w��inx=�!7�!E��9Bh#���3�8�v�8 ����N��A�K���j#���jH���3R6���;�!�f
��!z�b��Og�V�9�r84$��^[Q��u�kG~,���3��D��b���lY�F�"[f�t*.�����T��[�����=���<�� B�z*$N>�^��������,O�!�C�dD�����s���PD�[?�fM�V��)���ly6���A�r�S�L�R����Pw��<K�B,8$��~
�e
�}4J�Dg����o��<����g�<��l��_����Y9�����c� ]K���-�N����\��$l�.����r��>)?R�9�E�
����*��_����bA�^"3
�A�#��!N��Sl�A���@\B�d��-G��G*7h��36]�;d�h���bA��dc��XA�	B:m�� ���,�y}#s9zA8�Z�R�w��,����Ts�{H��a�.������sE�����W8�|fq�C������Jw��X����D���	9jO��;�8~��_���`}��|tr0(� 3?��5�}��Z�I�0"�a�if]�Z���/���5��5F*I����"'�I<*&��pV�,�FeYp�f��u���2$h���!��t�C�R��2��e�a�i:}!n]���bk�#��Q*n$�������8y������]d�l�M!���.�e]��.S���W��Sb��-���"�[d#0����,]�-��{oY���B�Lx��W^Ms�~���������C]�tHro"0����ly~KV�P,Y���
	�>4�U�a0R��za�Xu�f�]n�-[�6��	n?�q��dg:�������*�)p���s��� �CS���n��q^�u��RO�����o�6v�9��� ����H3�;���f����0�X#E�� ����������R���/B�K���WY�Y���|cM
�8�:`Y�7)���C�����"8����l}���|+�����U�\��7!����l���:�r-^d0g,�IG'm��������ht���
f������#l�<���7�����<_Ls���}�c�����@�G�F=��u�b-�.��m��_��������	i
=����'��e�|�$S��x3O'���|������m�-���z��u�x���t����u�rw4�%d���{f�����XEi�E�W���#���Y�xx��������,�6W�K�#���9��dB�@�{�������p���G�=�J�mv?������
|`����
��#�����rFDw����G�0_���1��g������Kn����_+���-�R�]����\6���6����z1��B�h�B����B�J�����e]�M.�q�ey�!NP<c?fM`���!r�$���{1��+/D��]�E��a����)3����r�BcW��kU�������}r�rfA:O��6���Y�5/�z��"��C�����p������!Ky�C�=]�j�l�dM�y)���IC]�#X���5�T�2��8��1��qO
}	��wP�&z6�	��#p������cu!�)��m���8����ue�������'S��@|{���j�{��<�fB����$���`�`|{��o��4�m�_,���<��-��p�"�/����gq� 19z?�k�$�N~��P4`�Y�.�j�VVu��k�%CJZ�)B�����&�����d
S������@R�J�,�\}IB�����!e���&���i60	`��b�i4��Y��g��K&*q�=�.	����r6���"�++"�����u�4�������^,/��9[�No�D��*��)_`L�;� B�P�Z��F����Sl	 ����I6*���	�s����5�<��u�k
@el�5��m���2Ag�tq��93A[�0��LC�[�������{������������=����O�k�~WCh��n~���?��~8�D��=������i����
���
��#��8����ZeS!X�����C������!���V���{���l��ZW�d<���j�oeU�<�C�T�'l�"@���w�p������B6�Pm<�-��K96+����!2�+����b�l���(���l!���[���o
�.D�|��:���ew����g��}����9��e��YD��=�/[DH�68�V����<�	��$�������a*��������
YY��V��'m��-�>K�l�v�����Qr}��h���8B��A������{:��������E>�;���z]�D6�����l����U�:A��i�7��	:�gJ��a	p��t����4�.����le���Kz)��Hq,��r[~�����7�w��������w�m�i���=����ld�,���t��(����-{,��������(��R��XeY=��k���*��%������tdU	�	�>��F�Mvq�$u{��1����<��1��s�|@�gn�yh��N���K�"���*j��,���f|V/�d�f>���@����y��g�U���i���PG]5��WW�_��������f9C�z�0�R�B��:�v,T����w����������QL���������Tvg��^���`a��bL����>��Y�a�Z_���6L�y�>���
]�I1�'��g��rZ@�3��7�i5������'���2�g��'���F�p6�U�vrzr�����k:�b��m���
���"�:���G�e�O�?��&���������`�s�����m��d�)}���KW,���w=�xt= ��/�INv�h�J�u>���AcE$��W}����/������Q=7$�)O�j�l{�r=`�����,���S��p�[����b��7'Pl��T�~j�F[������Bu�JY��Y����r����G���1��xf5����w�l��d����G�{����������>��yf5�7����h�~�������{��4� ��F�Y�n�"�q|N��l�/&b`W.�'�GO����w��n���%�����bQ�D��y�D�����0�(�&�e��)c��f�/B7K��d��N��U.�m�������B1�������EE��&h�"����F����$��C��.���������x���O��yUH�B#��Lxo�?��I���h����u8���������z�F�����N�T>����d��4����:m����6}�'�]en���L�����?�?����{�N�����{��Mi�i�f���t��^�FB93v)��M4�i19�H��u�w��Gh�sY������GO�D$��������j��&�f���7j�����8�|���|�o�uawj�|���X}�RSG��pr�����7�����Sn����������`�G,�>xu�������|���o����s�}��C���9�?b�x�:HT�����v��<KT?�z`�����oR,�r��HH��]������Z���r?ru6�v���G�>6P�����2Z-@��ld=	�~�\q�-�[Qh��}�����j";���h�����5�K�$5�Cg����5�!����j��o�����,�Za])��T�0J�m��`����}ZW������7�SV��>��l������GVq�����(�,��\�Z���j�\=b*P/�?p}8���������������<}��.`c��mc�l��t�f3^ow�7c`3��m�����9���8k���Ewx��5��1[F�o
�����p#����Q�vK��j��{o�Qn��#���!�u'��Jo���v,�d�Q���
���H����top����z[������Q�w��?rG���u��?r��G��E���>��Yo��G��q����,�/�N���,������6$�(�Z��	��j5}����r������q��z2@ll,���0��
���}����1�|��_�6q�5�'6��z�@�#+�o�uP���+�R
#��,@?���
?��s�����7JA����(�������l>q��d�n���l����qz�'Z<16�WN�)H9~������r�+�};���\��#�_���B���cpH����9�cJc���{t�4����}��(�W����s�����l�vz�����������H`��69�1-�/�#0��4{�^����Y��O/����iA?��RgE�x)~C�m�������3����g�e%���W����<'���_�|2��j��i9ic]���B�_�^{�b��������]�w����g��B���uM�C>]�z&��"n�5Ro��&g�g�am�{y/j��,��W�n����/_N����Z-��(����X�W"���x9?��o��D�M|(�LNc����o����k����~D������?���&���7�����������9)����|���M^v�^6�*��/�?��Z��P����7O������A���c��?�a%~�a�OA��?y_�����&���I4��k1f���>������wG�����������O��99���x���������������������(/�O�Q��r�M�(�zV��o�\���n�`��j��]�O��)�����Ykj����*�G��N�����!Z�j�RTRY�_x��k8��7�M�O�T�r��+�M��>�E��I�M�4�5�W������MQ,���8���u������gR5�������<�"�`��1�5���zj����O������I�������9�}:���Kz�"�G��e!����'�SY�}�i/Rc=�������u����r6)>K�{����KM��Q["^���_����N�T}Y�^=������|�G�W�Y������57���L���tof�,Y!r���x��$��~�z��������ZR��m)5�DOdSS,�n�����)aC������ery��oD7�v	���w���A�l�G�l�G?���<��S�������w�E�����E�y��m}������Z��l�>�7��BG��.���B���]|�Yo$��i�/��!�z���q�
)��,��ih^���n/z(K���h����^4��
����k}�y\����>I"���:w�%������3��zfw ���r#�m}�Ni+k�
��u&z4�y����G*7����r� �Q���Q/��t���g�����M-�N�����s��E->1nZ��_�'�����o�p���|�y��|���; �Z~?cn�W�����X���>R��������i���]�A����S�l�Y��o��	-Y3�L�7���jZm�P3���4�]�]��Q>�M��I�z����7;�E�Ag����!&$������m�6N5	�WN����-��� � �N���x���X���z���!�[��2����Sg6���-�<���S���K����o�p����ng�����6��������oq�>������������>������������>�������n������@00]����]��v��O��cN������9��.�������?���?���?��?�R��^��}�����s9��K���[v��a����������������b��-��O��>��l�]�l�j9���{����;kY�����}�_��
u��E���������Zh�=���>�����e��l���9��i�73��t0}���4'�W�0��"	7��*�
E\g���6���n������n��y����]��S]���v�
���5/
�;��q�?���D/=�M��q6��'�y[MVdM/oP����i�E9�C����"�z����,���|S{��O������l��8��&��j!�]��f�V^�{s�&�H���[�{����Q��5<�s�h�X��\����6�vh���3�?���7=5�^7&���=7<���hb!�Ju�j�rs��U��S��m$��q���/��i,�{>���j������}Sb�
����Gv'���&$���E�V��y���W�z����c��1��Swk�e^S���Xe!m`D�L�+����_�+d]��;T��@���������,��$�"�bQ�\W��w[�D��G`��G^�4���n����W����r����o�r�j�j+r:��=T�t����@'  z��u��<gWz������<�!����B|d4����O���N-�t>P:%`��A�/��uC���/Q�oW�9��;4�*]����h���@����������2������{����RL����jA����X'�z�v?w�D�t�&��v
�Y
����8
��j~�
������Q��F${������h!^S���������Vs�o��s��F�y���(���f��h^��o��i)�xVj��Y{��������c��w������'��I�{-Y�F��,V����{�a]��,���H�^-��
�
�G{������t%:����p��Q��nj��G4���8�h-Tul�����x�?�2��v��;0�57dj�����-�d�r�L���g��f�j;g���Gd����Y>en�z�Tu����i�n@BFlw������!S����������������5����k: y�������[�n�m��`�v@CB7
h�����=�<�r�[�E������MJ��S������p������A�w��Y0[��h
��K��.��R��t���=s/�-n���v��l�iMuR\�C���]�J��/D�	�
�����d��<�7G{����<#��:����d�n���i&,���r�J�y�������F�L���xwz����N~��6��l�_�z�����QN�,j�c
��=�H� �r�}����QK@��n������9II7|^z����=#��]!@�?�#�8;b3�E��iB��3#)x���*pV�oN�/�&$�01�0fHgJ��&*�L1�LfTM,o��)�������T�r%�ZP9����	M�+Q������������`$@�F���<���b���1a��I�S�!YM�e�u1-�&$ �����/V�����YX���������Y\���XpU.n�".����O1qVDu��L;(R��NZ�>O/h��8�����1�����y�#�N&��r�gt
T5eT����b�/��d��v��zt�N�e���zP-�s
�p��!�F��6�8u`�?OX�����6����48u��&�� �������qy4W�z�����'@�SF�5��m�����Df|�����m�/�5,��O������{�����&;��{�������+�i1�EA91��������q��"z�9������O����d���9�V|*���Uv��fd W������[�G.�.
y����h��d��{S�4������r�Zz������)��{u!^LK��<����_p+���3 ��b{�t�/��~��wZ����e�
���ToV��\^�On��c0�a�/���{�-6"n�������$r��.���S.>S�\)
�jZH��O�S����Q�D�i���)�7�L�J��+k�5!jb|}3t�2�_��#s=�y���/��qb���0B���>��7��w���������_��p���*��M+4eM%gR\�b��'"��Z�  g#>*�����v����I�H�"�AMELJS\]�cb���t<����f�(����Z�Z<@G�,�����V/�@�F��i�U��D�,N���Ws���T����"
�U^��~����Y��b�����\t^�������;��4�vN�5�*]�sM�)�����@G�.j����h�������'eW"�y�nZ��07�:�M�k_�����1|�`�-���
�`��9\������QO���s>��xb)Z"#u���]:l�������)t���oV��W�x��Q��z��M���\|�v�?$&�7��Dd��C"j����^���]9��N/���pk��T����@G�Wsj����<`�}����"����yf��g������Nr���/�[d���IK�=���=Wj�h?���F��K��!������
�	�V���L*�1���:L�f�F��w�:JT��
K?�x.�'�9��\�ot���>�3��������~���NO�O������:��i���������
0���v�i��N����e�<�)����D�/�
? @�}eF���~3���A�~%.!�z!3A
�'�_�D�q0[�4�04�:-�>}�T�gQiq�����C��d�h��'O�S�_��$]K/F�#�NH���;oq{�P��|"^�)�x���;7J������'���������c��>N��+��}���^FCpi���{���A��;�/�=,���sZ���d�y��vn��j����x(u2�����
��}���H/���9r��D(�o/~y�������.6L�`�*�Q��� (9D���7k�CE������6�K�O���H�T����`i�=��'P�C���G�;�������l�n���b}O@W8�Z���#���tv~x����xU���#�^F����j��9�-�J�Vz�plo��>�J��I�h�m�x|�>zW��o�g[�D�h��UK7��) !�CWc8������z2@l0����^�����%�8���{��uzO��c�i���	�>d��9��k�=��������gC�
@g��[�BbNCo�����3 $�[�b "�;����z2@8��{�t�K�.���~z���[���?��
����EyY4V��� D������|Q�d *W\������c����ZL�C<���O[2����Zwq���a�r����g=���	��}^�������P�@	8YO���7g���n*=��7x����f80a����{|8~�v��������
�;�p<�@M��]=��~JTZ������t u�����~6�rj�s���������/�e�������^���a8�A�E�Z~��}{r~q~6tP�92U�����o�}8aJ���2]�����'CJ�Q�F���]��#O�����Z�������f��P?u�e1-��C�!}��3��YKdf�@�s��e�l���9"����!���!�z�;=�e��*�������qsS��]�n�����>�5��W����S/o���]��o��W�k{^3�^S������@�>���/����!j-����L�@�8�UO�G������Tm��o�k����"
3�`P14'�e��������|��q�P�9B�C_k)�����z�����K�N}:���3���f$;��g�|��t8����.��z^415p��h���U9[f����G��!$@M�%EHMcx��_�������.���.,��r1r��>��i-+��`�>��Z���\S
�T�CR�d�����M:9��$S���,P*�����P�G�v/������?��JC�r����x���0��������7����*�����ip����
����G����������?��~�������jp��y3����'8p�4
���Z��E�O��I�����;�&�
���"5��T=��n�3����oH���e���w�oNNu{�����h�
�h
8�w�z�����������Y������w�*�������M��*�5��ZH-
Xs�~H�CbQ�i=K�jQ�������~�������p��w��HE�"
�����z�"}���{��4�B�������1bX���
�~>9?��H�x+�K������=��qP�Y�Z��?�=>?� ����CT�p���>����p|!%��
��������f��?��8��������kN;t#qP��qL�mS���(M���j�����s�������>������r�������������������o��%��M�/p���5���N������j-1 1��d�g@a�[Uo�����
�,�r���>�f��av�����	��n#�NX��H�?{���o���\�nr�	���(����p�U��sW�v�J���w��	�`�g��'��@�2�����]��Z���d�/��cE�zP�^�)�}F[�#�|v/�AtnE]����t��.�rJ���
4�sU����3���*:��vG�e�9�QV�v�B����������~:y���������������D��������EA[0�����nX�����O
8�.�<h���n�E����V@�6����tg`���^��Qk�y*\?�@�ur7�[���R���2�����<z�d@%vD7�c���9�ZD������b��|���S���b��	�{z��3��A��!��Gj���k�`�@e8�TO�F���`2����w����_�U�!�F���A�����(���]�'l��r�J=Y�~3`J��_I����d?���H
8"UO(W���e�%�<�������h�u�|F�v�[��	�6#���t�s!b
� V���x������cc���Y���q@��`m/�v���Y������q
��^��k���z2@R�}q��������
R����d;���Cn�d�����d�3�I�q@�)����usl���z����
R�n�:6O����7��^�d�r�nN$��
�Lo��@7pt� m�9�t�������
8���������}0F��
l�l{��8@f�
V��V��\l�9�����z6��m8�l�V�`(���p�d�dn�(�����7o��	`n�9*�p|�	�^�O���7�<f�d�t��X���$-8�G��	Q�m�o@��[�v
����
F�XR�?�{p������5r��f^b����n��#�n�pl����������
��k�����! |C����4��"X��Z���!bu�{�<�nhs��^�x���!oCg�6q�M��������}�<�o����
]lb���<�px�w���<�p�{CmxP��w�P
����:��!��C��VO������}���y�n�C@ ���q�`���<hr��d�~�����#�����������&9��d=�F8�K�nB��/����W^�}d�.�8����Y��Ig8h9tr�]��W�C`�=���S���u,�[��;��B��.��]n��Y������d;��B! ���Qj���j{��/�C������vr���7'��O�����
P���Bn/��h�0x������Q�{�]�r�)������q7�(��'4$p����l��U�p�a�6Q����f�K���{���!�C��u@V��e4�\9r��v����x�T>�2��I���""]��S�~�rS)�== ��)���s���|<^�������\���K��9�gs U���
����������V��9DF��;����CGH��(���e$���C��K�n�(�����}��ZN�N6�hN6b���a�(�j��a��I���7����!��C+����my��0���s.�3���^F9���k.���:�O*E�5F���lK��R(���by$��iu}t���������H ����XQ�#Gq�����#^���thfEj#Z�j}���0���^i9��-&�h����v�I`��bQ�����s-&�_(Z�hN>t���28�0~������9(���ai���H3�}X�S0��33�/�����@��.������I��=�<��d�l9�{���Q#����%:�/��"���^���������y;�����,�B�Nc3������3K(����u����9�����|����=td�-��T	��!2]n/]c�rh���D����P��=%5t��QG�&�������:�V���S9�jW&t0@�<�,��-�2Qt����������qd����Cd��^��M���\�]M^h�)��x5%���������+AO��
�
8�}��?��W�tY,��z1�T9/�g���>|FO���2�����},�)XRe�+��C�X�z�]���~,@1��u{�m#�C8�Tm�/��+To��P���+}����!G���� �>����d:�����G�"��=�p�#`���u��``�7��Co�!"����B����{=�����z
���|#g��p���P	7�>\}�YSKi�9gj��is��#�B���n�|�r����S0��Co���O�{*S���!����?`�C�����N!��CG��NX�P
�:2��X��Pr����Q�S��;���9��Gr|?����[�����V'|d#��KO�.[��u���[�6v��\�v#t�E�$�(&�/��-�VjQ��?���Cgx&�|t��c&\{���?�Z�[�O=-Y������d
��v����� ����5]zg' z��cM��'1e/�N{h��x�+��E1�O��JD����D�p��+�N��lO�}f��Q
�<.~_/��[N����N��v[�K=& >N^�*�q5�T��b���y�Nz�TT���������;}M�T-q��f��Tw3=(�b�>�Q��M^TW/(�>������^����[���N��C��|�;!Q)j��W��nJ�T�����[��9��W�
�B��!��N<��n_E�hy���Yu�����7{������oB�e���r3_T��]��m�r"f~s��z�@�|�)�aM��P�/��]���(a�j7'���\��_�4������ffXfM"@�G�6���������3���Y� ��������7�f��#��G�!�����e����\��f�������{K+�V:�P��T�zD@M���]nc������C
��o�@�������%a-��kx�2�}d��=ki��m3q�>��{�t�~��D��`�M�#�G��'$�#�
�{����Y��������x�Q=�L�g�4#��G����h������E>�pF3l|��w��	�h�����a�z�q}�-��e�?��EN��n���.c��0�7���SLGj�"7�y��w��y�S��6��<r3�!�qCw���=��+$�k�|��w�O'G����hXpi�/���Y�Y��4��+���q/�6���X�	�u����Z��z@[G��S�0]/K?�0������! ���Q���{��N�`���-�k�
���&;r3*�t��w��aqP������T[G��XUO��Oo�O����0�j�y��1$����[��n;r4	O�`�����f����`c��u�N�,7����c��dV.��Dwd1������8[q����|qd�Q4;Jf��������<��=��Yqw1��}
��#7;v�8�����]:��L�b�[��������W����H��-����b��Y���K�d�{nk���bI�Ox
;B�^�?����\�1�:J�<"?G��V�p�'GN|��w8]�T���]��z��M���s����O�>�&/k�\4UC=(�)N������_�}���G1\����Ux<�1^��
�������/��9J����UD:�k��\.GN��1z(r�l�na�9r3��qq�q�2j����������z�r�r;����iu��"�o�k�^�C~Q*�_�U�2`�#�n��t��(q�L/`�2�\(c���8B�����8>�����s#�G��[������S�&@����M~z/=�^��#o�^99;����o@8���
-�^������a9rt��������rs1�FB�F_p�H[p�j� ��<�������0�:���z2@F��B�87��u�G.x89/�7� _~Y�����QL�?�e��\|o�����:�o��n���/�*88>p�o��D�B1@�����~dO�*49��d=��.4r�����P����>�}l�*,8��`y}�U� ��1Z����Y�v[�c]�c
���b%I|�������v�iv���KC1��c�I��0�m��-
��9,XO������:��/
����m����7v7�V���5�����aDm�kBO��7��T"���������p���;�i��Z����v�d���x\��nc7k��*��G���Oh+aN
��6�6��i���E�Vuq�������\-����Rm	�������q���������;Q��!r�|E<��B�="m)���x���Z-�1Uu;�^N�������7����������ffD(������n��h���LW��(�����VA��N&�����T�'Y��
�VB�� f��Q�{[n���O��������!�$��6�I�o<��1���C:��?#��)�*�[�*5��8+s���H���������^��x9����I���_e�����`��[�l=�J �"�n��n�/Z�h���v�f0�.Zh	���0p^0Z7��[�u�O;��
��s��{ |����e���������s7!�������.�s7���������s?����	�F��q�@6
�����u���?��w����[�!�NZ�������������w�V�x��!�v0X����.���a����^�Q���������/����m!�-�HH�m+����B�>��^|�|�%��<�}�}������@t������Dm}n���}���v��:�F4@<�)}���\�=��35o����
���~�'��$k�
3�Vh�R.6_���x��q5��mJ�a���������3�a���2x,0f��8^B�
�g�r~����Z������`�Ch��`�����u�B�]<�����u�B?�'f�C�5!t����	a����Z����������-	a��V��������`\V&�"��n�Z��\t64���htL�>����?;����C�_ ��?�aU�U�y��1�>�u{9t�#  M����m^�������F+�+9�I��P���H��������
�C�� ����g���qg�n��5��9Sv
��]fO�������N!�:����9�����9��HTl��~�P5T}�������XH��)�
Y����^����(��%�/kjT�P���GN��l�:������9��!���s���@��y��S�H�qMP��H����Xf���l�����aC���,n/=)���"�j*�i���O����e9n�M���#v�)#�	��r����s�
�����H�Is�V��fBzSL�W������(15O�
�6�����E�(��:��/����������U,��r����2�������`�!��7N��*s-��a�:��;m��Y���(��6��
�r(�1��|hC����^"V�#
����t@%�\��u��*�C-�+�����rI �����?��{	]m���"��\�izg��������O�������=���t@����t�c���������~������N_���|��4�F�� ������jQ�O��v@���U�DuDzL����~�]�1���d��U;��2����b�����s�@����^T�&�Tn��d�����e`+�Y3�z����z8r������#����v���t�$@�F�@Y�����B���C{&t�_�f:�����#^7���Tt� �2����k����%d2?�zd��a��%y9cG���#���H��e�������)5o���s�x�^��O:����_���|qZ�
���-�����5y�9����%�y�-I��n����\�1�j2�pr�������Z���p��1��kJ|�����<R�����|6�i����UG����<~P�����������AKK�;}�P������Q
a����9��������������.^o�"�aG�+�����C�s'�����+����l��zg6v��`���g�|6��_�X��u�
c�:�]QF�8�r!�l#�{�f����
@�Y�z�[\�����z5�%*�F������Tp��6+7��#�6�5���`��%���m�N�5Lu%~X�������H�Ts�W�OG���}f1ts�N�F=�Xg�]�)��^�g��6"�G,.��
�w���>M8\�?��@�fY`�njK>�(j�cg��}6�������kn��F�q�@�Hn�!��������U�f�.���L�5�C�F6�%-I�	@���M	�5�NsR����I?����=�T;r�}�5�����G���\K��2q�=�zu;5��6�U����������_�Y�l������bc�s��
@����ts���6��2��#� 5��m�����`���.�����~����z�O���U���vN��f�� U�|7��7����Tj%�F�^��W|OW��s��rI_����<1��6�����bv��1�{^5�5���n
���:.;@�������^3�Tm���gd�>)?��:�+�%;*j'zM���v�G�8S�_a7	0�Q��z�2�h�m����QS��D=��?�f�J�H����.����v��qF�
��Q�TW��N��Q�����y@P�cK��?`S#�����qV���Z��o89�>���|p�wU� �;�������S/�o0����t@�� DsL������G_��������`"�������Dm���V�Hc�
����� F,�(��:����U�zO���<selG�Yz.�H����������{Uy���A8���wEN73�����N-m��V�|}�}�=wm�gD4���P�G @��u�/�!g�	��4��p:7s
����!6F��5�\�����5�KC�C�=��^�R~�����A�\nj���������qv,��-k�Ot���8��5>2��H# �h�?o�,x�:��m�7m�B{
���,~#^�7T9JZ��\d��Q��������T�? �
�����6t�4:��cyB.b�9.���`H��G�0#T4T0����X���K-_0�h�/F6�a����V��������Hx|�/�_4�_��f�B�l��]x�jb&
���%vb����y�������������������6�����}@��Y�vskQ,����P�f*���������rFO�l���ZQ��L�\G�0�a4rl��8��Pn�3������6����U�I�/'�fy���o�@�������:��6G�����/n������FVO�]���64����~��PF�j�C�e�3���fmJ�"]�����X}qiKu��	���h7�-[�n��H����T�����]}o7���v�1�d0vDc�����`a��[f%P�$|yu�.T5�{���9;���rK�:c��Z���y�HCW1�e!�`�&�rFk'���J�T���>���Fw+�Dr��Q���+#(���6�P��I���X	UPsI��Q/��]>% �.o�i.2������&"��jO����~��B�n�y��Q����%���Sb�'�v<��@����m�iz5s!s���������f��u)}��//��e����r>Xqk�[�X{9�A�4��1G#�m��������Fz-� ��i��>��=O��/���B:���O�0]3k�����+��6Y�$����+m�.�p��������b���(�����/���w�A2�Rg���y��j�;��
��Zh+�P���\��],���/�c���r����8{��2�m�^��l+69��������H=;���L�{2�Pb|q�����#dY�V���8x��lk�6�p7�c���\dBm���\W�����B��n��2��v��7�:�G.�� �B��?���u������M!G��1 4c��4�8���b��l�H�5�@%b!L���c�a�,��\f�������6��-��j6�^ �u2i�A�,�n���8%�-Qy�y�6Q�n{r������Du�v���r!Q��H�+����!���Yhs��"P��<��\>��^����#��gc�d�|jO1/����i�!-%������9
�������N�#G���������h�fw^4�	W���@�8f��O�fL^�f�Yt�����/1���D(�X��(��r���:���4�����99�A��'�
\��L����+�z`�wq%��E�&���O��u�?<i��������L5v�\�8�7~��8���5�\�{q]W�vU�%{W�xK���(l�q�F:@^l<���z5�[���.�����TV��[�c9�0Q�������h�e4�"I�/����X�J9'��������}������h���=-�T�|*�7��Z�euKU�&t�Q�a���R�H�m�i�%o��j�M5���
H0m���M��^�D��	��F�(/K��iTW+�"yG�E�{o���FP@��sk]��7-?��%g��A���-`oc�5���7�]Gw��*���y}��O?��Tw-`�~��]�F�(V7�X]#�G�K]y����3W���I��Rn����?�����t���e=���q���P���5��S��_���_�g ���{Q�
M]g�������=�O�8��b#�6Z�lq
s4p�?���?�bU-��}#���i�q���^�{��Qkkg�?W�w17�����MT�E^���Sj�_�]{�z���S]�Z���l�;�:S��%M�
�����KA\O�
mTO������4*/�Ll��8A������[�Ww�>�m���b4�VWUSW�'�+|��J�(+Y
[iy���;7�	/z���<tl3d�^[L��9���v/f�nQ�V�
<}����<�A�m�d.q2n�]'�9p1~2�a<����o q�$u��j��;c�����O�r������tD�B���l�7��oE	[��q����tZ��g��v�O+?���0��hulA����`����>�:(�p8@/���R�Qf��|��|�;q�X�<�r�����_��u�x�b������������4u] X�rrzr~q��(�v:a�;���3u���Dr���{K$\{������kgG2���"��3�:���,7�]���/p\f��8u����?��Wx�s���Qg,}�P���x~�#�s�B�f��?�O��@=f�t=9�&���W��6��t�)"���8{�r6���4tB�(4���hf��6��,�5������XE� �r
��lk�|C��W9G��x{# |��������m�}��$@� �c���E��;��7�����t�lg���#Mu\JXP;@�'���������p�]^�����3�<��U�������O�������&>�~�����o�A�����7Mr�����*��U����@�����	��+���K��o�}��'���	��g���'��$�O�VZ�'��T��B�;+,�����aN�-uBs�j��\6�jT�����;�&k�����D����-����J���K[b�$0|�o�����������7���?�;}����?\|wrz��?�60�����&vE��M$��?9�e��S�	��+�����l��'V��{1S��6<b�6���H��3��Y���t��:���[����%	���_���v~><9�8������f���'��������%q���H���m�yR���� P�m��+ �[�uw�n-����T���y���`����[l}��iU����]�N�.����8�A��%e�|)[�I)�,u�Tob�)�nZ����D�� �&_��flH,�@[��H���k������{n�<{�/�' {�����=�c������S`k@�m
�K�	�? q�`k9����WO�6���&`���C��K�; ���UD�	l`I
J��Rn^t>P�����5�^��	��X���1Ds�O�]��#l��R������v	��X�Y��������r��S=� �Tb�e��<oIQ���#�<+C����jjd-��+��K}���RlZ����n�)	�+�o�'�+�����{��[�u��� r�S�wN��_�f��S��M,�`�r�&��'�O���X���(�Hi��
�UQHT�������]7�jV�jZ��G}��yL��Z��_�t�����:�O���������������O\��~?~~}�J���I����t?�9V�t@�'�{6���?qD���'�1��z�`+@��L[�f���D�9�[��X)���E�0�����^L�Uug�i@b��N+���hk!���
�����6�2o=WD�:\]��
 �$~��wN5 �	����������Rc��
Uct�8B�	���x�����n��?qt�Nu���{s��� �?�m~�������%A}�y�9!]�/W��,N=&�'�_s-bd�}drRL��p9@�#)9D�F�2�zE����6�Q��@ ��v�}��a�}��{�{g�sbh���
�z���O����p��+'����=a����E��v�Ai�)���������_�4��I�M/`��#���=I���l6��4(��z�������t��{��'��O�B=I���@�'�6^~	����[^��iu�1���������V�=�vq�����A_��������T�P�)��JZ\�����ZF�8�����P����vm���;�<��t~�p�u�Z��7&�N�s���i�[]]��b�A�f������\k
����][/�i�&��N8{q�%du�9����\�M�Ps����z���c�#���_t
��BzL4�!��I��o��5���P�&dA��}���������
W�d��j5m�45Z�+k�T�@�	�8n���snL�/�j�'	H�d����^|���[�'�{N8�����9������a��P���2�[�'�TN8R��-��@2'6��n��|9�����
���/�����6/p�#=�'�=����6C��}��i��E����2�7$��9���N����=
zt�z��Cw
���B[����[	5��JH�:��)�������k����) �SW�x�������)��S:6��KN�
o����Kyw����<��v�J9Sq#�&�����*��6-.^����������mY{��:�{�����PE�x
`��?z�0Xq�������a
X�����E�3�w)�S6�Jds-�qIQ��o��x���n�Io`��v6�{����Z��/�P#�	�2+��<)��Sn=���Y��W�
i������HM{����W���fD���C�����&+N�+-��}S�j��E�kOn���O
i�<���P��~;_CozL)N�z������Hp�����@��
���O���k'�{�Ls
�4p�
F�����i���n������M�����VS��9�^�m4�R]m�l��%	\;9��h������D%p����[qEg�~�{g:���l!l�N��{G[�8)���q�x�Z6ns��W�����fC��i��i�@��A�b
���#��t������S��l�����mG��h.~8�����,x@�,�����r�9S����n�(FBNm����hp�)�^,O�O>"�qj3�K_�J<�Z}���c�S�������W$�[|��G�^_�><?�8z���O���_"�K�p����k2�����j�I�Z
��Ku���i�h�����vd���_�E'N9��.��8u��N�Z�`K�Ju5����W��#��D8��z?~���+,�u�p�1�����4"��)
u3���S�������lE@�hV��X/��A�Fs`����{�R��[z���3T���9�)��x��b�2��_
X���W����S�0rj��;����8�^��Y��S��yk�5��i�%=��M0{h�V�9�g# ����]�X����9M7�on�5���)�SW�yc���y�Wl
���h~j����)>����g����5d�:u���"3��N]j�����iS�Y������K���}[����b
i�����c���/00�����t�h����P
�����6�
eq�~�Y@@c���^F�HvjA��p��i��=�*
�n/�S����xv�������w����e�a�;e���)f��f���$�e�-��y����������i�����u��g���
����2J3�����]��q�c{n4O9��H���*��t������3oPm��X����M�}�._L������@����h�!�������C��5������&^��s��tu��C�6'�.�)��SW��/2T(~�j�
�j����0�~�L�?t��(?s���
���t�3@�g�O��E���Y����e����� ���oAhd���F�����Zno16�X�!�m����&�p�O76�#�Y�'�\g���/Bhd��\�6��|l��>CD���������S�c���CM����+>���Y0����?A!r�e����=s�c��
a q6��(���,^�[
��7�}�q�F:@�lp}g0��#T��|U�i7F]@����mF]N�
e������0��iO3�����:v�qO6d���W�q���M�l��g������{o4�U��;iY�e�g���9�m��}��v,�L������(�q�X)VYVf����
�U$��/X����\���*�� ����i0zF;v���������=�-���x�?����a�6�\Vq�������f<��O��F9@��=������?��lx@dB^d�q�p�t_�_Y����?E��1�;��3/$��-����e����O�?4�h}�����)"������P�)������f<`�@�?��u�6��
f�J�vz�f<�^��S�7������[�����?��>���c�?���1&��i��[6mn@�R.��QP�x���s0)�]S� uL�2?����	{���[sIBoT6�o?JX�����x�Q������*��N�������G�RY3�w��/ �SWr_��mc�.��|=����NX�w�������o����)��Sg����`li:���?� ~� I��F��p�ib?�R����K7*<���%=�e��yQ.�K�C�GD��������E@����{����(�9YOG�?�B��N�>�z� .�D�����zW�����[
��-f��oZ-(]�pk�/��]W�/-�����R@���������B�~@�W��e^������lt��c���fN]�5��97�����=5�f�(���,*J���hpoO'�����UO�c�5O\��������v=��<����k�������V)����=�R����%7{&`�S{�p��L�DY�:�6en�U�_�RT	ex�t].�����<��M.f$���RF��P��(�w�1/����l�@����L��z��^����[ z��zV�z��Z=�e�.����g���U�*��Z��#@��}���: �S.1�QP#p7��a#�o�:��C?�Z�����:e��Q]���K����o�5�Z��t��Y���n�D���o���(�Z��'�4dO�������w�����$�Z��C��r�l��#C�e������n�~Om��c��I/��d���q����y$�(;���T�j���>����+frD�}�\[�<T������������S@����y=����dZa6�(�����X��|���(���X�����0X>��{J���/@=���k%�ap�VM�e�7"��yO�3��Yy���W��M���k�d�B�N��WmV=�`��h]���B*%�[�����4����h������[Fd2��gc������h�vK��y1}W\L��v�N����G�K`���C$�zf��]<���J�g�V}�>9��v���S�o������}_�]�2�����3�����}v��������f�=��i]���c.�G |���F9@���nW�2���O��H`��%��Aoi�I����i%�4��c��|�CO�����������@���{�daZ���7~{������75�J��&�����nj��P�� {��l`u$}�����(�5���vq��Q�.p��5��v��3��:���
@�g���)��2V���{D����)�q�����=��h ����#L�f<�r�����+��I�J�Y-�qiZ^R���Z�Z�[��4���BRg������y	t��<�a}U\{����?����
�t����e|a]�Vro�Y�����}C_�����������il�X,���B]q��f.���U���)~FI@cX���{��wD��Z��T�������cP���5���a�k@�g!7�;�+�"_l�^E ����hq�|�f���Kk���Y�VU������������rf������'ut��w���&]��������?��3.��1��wfc�;'6z�����0�
@|g\�v�x!M�Y8���%B��������g�re��n�]�gJ�zEr���0o��
��r���|�4����W��.i���1��5W�����'��<^G�u�(��F�P���lBJx�C���(�d�:���=w�y����
D�������V���#!��L�<���,�B
�i�c�����>�P��\�|nH+��3+'m�+`���q�$�.���x?Q�`V)1`whe���|�F9���R���8o��+����J�>��zU{���+>�*��������JD��R)���w�VzJwV]/7���M6a�W�����t�e=7�:b��KV�j�|5/�����W�4S7�p��k��]���U��JZGU67z�9�7'��7������p9��r�x���]���x������r����)���M��,i-��"����eg�gddk��(�6����������t���l6Sa-����p)�!�[��t���!���C
%0�
��������������8s$�3@g���#�����[�)����U����n���6rH1��oc.��3���������L�z��3���%��H	���s���F9@G������^�N�n�5�]9����'�o��e�Xg�)�3Wg�����
�s����������M���������w/~:h�q\����Y��� �3+��
�;`�3Knq�U����/~��'�m�TZ����{���Z���b�.f�y�� ����Y�C�Q?�v6�T)�u��	7���a��e��0���iyz����|u+_5��_v��mh��!m	�of� b1�sp
��|Y�keS��-E��U��(?3���;��,����zyL��6��FN�e8����Q@���uSQ�������*�GB��6����sN1�E�
��U�r����fEA'�Q��,�����.�_����*�)�xw����m�]g)�&#=�U
U��������j�Z��pQ��"�.V�#�7�"���{���<?��k)�]��u%��b���qg���+�K�������3������;K�0sX�k��/,N�~{������v���U��c^��O�
���P6rcp�vf���s5X��eg���p��e�jwX�m�m�`�u@�,�s7-]h0pw�R��1 �3��6�b�9���>D��'��$E����N?��"1�����`U�:�'���l4]���f�������k��������++=��/��613����x/�g2��8��t���8g���a�\:62��������9@�3u6�Z�9$62�X�+g��������(����F������H8����
���hw;���u�����=��N�<R%��"7;�+Z�Wn�n_���5V�����iv���t������������V��mb���.�Z�c(ne)�����F�V�����N�!������&D7�
}����F�V�������~���U��sw?�.�4�h�����n�����
e�1��*}>�����DyV������Ypb���s��7

�2|�����6���y��qU�5!f�E=�#u�����X!�G���g���_+���/�V����7�_��������{r����$���x�b>�O������g3�SS�(���Yk�j3/V���k����O?���%}�x��|��-c_yy�����'�����F����/��?�T,����x~+\���ZO\����?����u�YM	��,�����v9���������j��[��Cq�3��q���^={����X�?�#���UF���y5����Z_�������������}LG��������P =�����f1=��q��p�N�$���)���������p%~�a�'A����|�GW������;^�����������7��������y�����W/�y=~��������?WD���T���Y������S�xs�h?yP,�o�\J��������>"$���2��b�����x 0����n[v��%W�������ZjY�]Sv�G�[�N����G3�X��/G�]�mj��q-z�,�nB~�p,�O���yQ_�ZH���1oZ�9���}&�B>�?�Qi�}��4�A[��ym���_hW�{��m���$��}�����>�������=pQ��#Q����B��T���X7��F���X/���|#�v;�tk�\��OR��i]-�95:A�K��A�K�56�i��!������)���w^z����cp��KW�e�kn(�DY��]��M|,U!j���x��yOo�g���?�\������5U��V�������z���
���E�fZ�P�2%k�o�Z��+�3�.���b�^�>�S���(���K!*}�O�?�W�?N��*�W��?y���D���;���m�����d��6��7?���%���U���Rh�s���?�M�T;��ew�#G/u9�"�����7�$�"��t{�CU"�wO��������7q��S�������X�?>I"��7:w���X�s)�D��o�t����w��v'�5�s����[���q����1���ANA; ���p���b?�>����y�`���P��}��'���G�������|��U7�'�j�1���I��s��_���n���y����O!�Z �M���������������@W�gs����
]�l�������P��N�:	��o�LU����3!�������������z�GD^gin2����	�yo�}6/���O4A,�5;���o����3[���
"������T5�<%Ph���0>	%���n
~^��Uy�Y���g��`U�{�?�ZFB������<��W��?�����~?��={��2��IW������f���M�
��b�fu5����[�/Uv�=���;�I���������.���5�=���V����]U�2�����q�Y�����&���$��H�2<�N~1;!�O7��:C�7:�t��Y�;��2/�����������
�O�n�{�|��;���G�;d�[���6����I6`��������FA�r�	Gm��{��z��%?ip��d�u����|�!.��&!f3���_�r�W�������n�����(f���r9�1K&����.���~�J�|��ko������O�h�O�L��?��������|���?lm���?_��������O��?~�����h�{�d��p����
�+�4���������%��.|6���6]�����������/n
�K�n�(~��$�h�����0������0F����� 2�+C�a�?D��[qT���BGU����W��������4��v{�{X��$��p��?�$����w��_��l���?_���S����G
�����v�k�	��l���B?_���������1~'_�Y���O��3���|e~��s �]��
�;�����}�����p�����X���2�]x�\���:U94Mn[���������br?>|t_������,>����i����p(��k�3���x��B�#�{Z��������f<w�����Z���-{��2��#������4����<��Z^������,�6XS���Ky�=�5}���������/^����gO��|��_�5�����������t�i�<..����kqJ��������Zi����eM��*k��}�����'��
�;/�7E�N5����jS{��lCgw�)�5��y��5��R��\UK1/�;�����V��]������f������[�����������fZ����
��v��[W^����z3���E/kU�,����@��od!����T�o���b�_o��UM�4kK��^��!�5@`|^`�~V=k����12�^��g��)������%�)������`����b��b#���B�R�����Z�n���H���w?�Z�����<�=�����F�����p��p�K���f%[s%�7Y#����&5���`}��������l��3j	��oU��;���V��q�(�Q� l/l�';�����T���{m��>X�������'
�<��f�_������X��)[8Q��HO��
�����G���H�QH�F�v+�#e�\L5�Bv�~��+Ih������K���(��`��n���=O�j�@!�B6RCQ��x�~��r�L�����%�&T����i6��)��b���e�&"�]�x�>�WoJ������"XER{�Z?(^�*��r��������
�N/�(�(�S{�js(s��p|�~�f@�B���B�2!�2��������0���`�#��Y�W072�����r��t�����
��z����s��C��M���,enzs��f������
�v���L_������t������Ub���0��C�������A����E�4��A�����;�c��@8"7���)��,��)s#���rto��V�-s7 ��ll���~|����G@@"7Q��.��?�[��JB1B��Py��i����\t����p"b���YqA�<n��v
�b��V��{m	����>D�>hG�E��G�s,�'���t�I{�1����7���n%Jn�C������r���8V�Y�>�����w�:f�������t����Z)���L/��f�/�jf_m��/��K1�^�7b��M����3����r�}���^g�^g��Y{����|u�Q�3=�{g�g���������K�>��HK�JC�,���wX-��t8��=JR]�UE�0�L�k�6�2��2�j�j������a�T%fT�k�yAM�s�w���A����������o������nO�'a��k�T��{���hU�r<�quT�1v^�U,a��
��.�P���(�=U	���n�$����/g��rv�W.��QO@@F@�M�~+�k���5w���M�i��]f�T+@_���u�v,]
n� ����l��z�'.����N~.�&�r&@z'.�+���S# �����]1W��U����A*�k����]�e�������S-5����t^�����������c\���8��uQ����/ ��;�tk��lGj<^�F�|���{2������B�.J����]l�#�?�6�o(��Qp���F��|G��v�)��������U�9���ik)����`t��4��rUj{
�4e�t�V��=U
��r�{�����g��@��!�~��d������	H�(����V���N�8����Y3�:��2�7P7�9EM����0�fx�:���k�}:����{G?2z������D��lI���e@��;���C��zg>n�Z�w��Y���u���'����{� ���`���Y7�|�����
�`W��$�g<�������f��>6�[�P��Y����C����,�6���bM�W��s~���+�U��,5�b�|)�����U��wv�u������5��<u����o�sG��?�K��|�$jTzQv���E��[S���07�8����6��$�U�������%v��;-U�'��+>�Dt��:�W���P�|{����/���M���j����[��	>�^����Tn��D�����mEx��C1���B�����qU]?/�u8t{��W-(����u.�\����|���-�>z���b@���������n��������|�=Q�g�^�����7?�|�������|�ov�����3bju����wo�1������@B��n����go~��/��?{������go��x��_}y�������O��xx�[�E��KXr]_(F�������������O�H�����x����xw� (;F}L����y�����x9��<�^.>��r���y���;������}����6�w����ls
���D���S����Px��u,�����$�8t����:���y��;�/�Gh-�u�y�������������XKq������^�0o��J��okO�Mz�b�U�/�W���.����x>�auC��-�}���L�2�H�H�0����r�Q����D�6�/����w9�Byq�vA�����u��@qxzc��^����+�c���}�[o�z�|Q����jM�������B|�v/�
)T%3����������p@�8�}��T�j���LF;��_V},�Uu#�e����s`|s����v-In���Z.��l�Z��_-~#����)n���=�y���w�0���V��:V��*���l��[��S�*���f�R>�����5@�8�~g���Q����G�rUm��+5a���3y}k� qy��}@4e�s��*Q��P����W�F�k^����UE�B���q b����w n�s��~��/������A������w�_<7����~o\v����V7����+��\.�������@}�����w�|�c��b��p��1�P��A����\���jw�
1���^:�@C(�������S>y#�&�"a�Z����i�b^�w�����9��T@��y�DJ9�T������~���w������{U���}<�R�/�(��9@��:�� �5��@El0���A����]U���+�����m|��%Wn��9����gu����+��������E/��W3�~@X8�����D�D�"�����l��m�h�+�����]u�m�^�U��������[��a0�"e��b�b�m:k��K����m��g��g=
6$���=�
v����vn;�k�C���q����t�"f#��V���Q�������j>;kn�?��}���o�-L>`�}����"���t�vE��<������tH�'�qh9�E
�{��r�5)f���>G��������]�������g����}+���I��k���xx���].���2`I���z7����Wo���x����O}%�1/���Sc�(x?q�Bo�W��S������r=�z_4�sKv��>&�5��'a��p��b8�fk�C�*����i���x�G�t�����>3\���e���Y �}����w�*=�W���Q|n�#R��)e�9_9P�����u���Y��CoQ��S4u�V��1��OFM�����7]W���j���j��v���k$/�m�;�����t�:�N~F����a;/�����oc��6@l����Z�%�k��.*��ZN������)]]����O�-(��YL��Wu1�H�P���o�����M�@�YM��t��)
�Y������\����Lf��m:����M7W�&(�Ff���R��&���i����e���0�>�F
��B��fX^	]�t����c%�i���N����wGR*$���vE�h]S�A�5��kz�qq�0Z����������Q��M�7��vy�y%��)kZk��9���������-�[iiaI�x��-\�'P[�,E�"���E�*f�"hQ7N�j]N7�|e�t R;��s.�NE3�3��������~��~�U6�z�����t�=�����XX���a��x}4������hx�
_��p��U���'��]��o���p���	��;����x��S	$�)OO)Q
��H|x���g�~=}����
0b?s�������L����|��\�8�/k�oO����<#��A�������/_����/s\��G���������|yI���������'����9:uC��]1/�$��5�v�4rT2]�������b@&'��Y�.�:�M��f�#��?����Tp2Z��GX���x�]|�=rW��uqIth�}����^y�����|v��'������-���������r�������~}����_^��������������G��+|�k����Jn�tX���9�d6s�mD����mSg��]�������xv�N���[���^t1�jh�u�_��E���j[.su�h������p��3��5A5������$3�=����`������I�A�M�0.1X)������Y�����&��j�j���v��"H�����D%���R�;��lF	���FUj��U�8 �����q��Aa��>B�@��*|��j��Wk[��0{��]�����yl�,�L ��s��.>�}��q
����u�i�~�$q���E���M�T[J'�X ����=���6ca����3%������A��� ��W[�y`�1�fk�rmSb��2��c��O�Xw�:����N:i���[��e������|~DY��y�2��p���W�%��u�c��^U7�G��*���!�|�Q|`MT�f�����S^D����Y�����oi����/�����5@],����2����������_h�rJ�u1:=��a�qx�`B��StV��c\����Si������� t��I���m[|�k~�Z&��'2��������P��+O�c��56�d"$���Yh��k�9X�0t�I]-�9-�M�QW���(6
0��+c�lZ;�p0
e�������Klr9z�tP��m����8�kn�^�^v+�7�5�?`�~�z^���z����Y�������{z��(���I�h.�o��Qs������.��Z'j��1O7�/���1r6.��^�n �a����nS��{
�^�Q���u��}z9`KA`�R`x9`�@�I���#�j��N��l,b��1u����q@�b��00nh��`���q�=F��<`B�m@��g�`�<`�A`�k�
�i�o$���@��*l�
���u'������gZ����zn�!8W����J5�4:�$����8���JA@��n�8�vd6���%o�OD���u%s8H�L�B��s����(QK����/������*����N�s��2����/m�!�V��8sH�B��(`SD��)� �N��fN\�H�$�����-w-�%�����zNw���P������}/�����?�b�3���
3�K<\�[�H��"���JOoN�1�ql���r8�����^93��O�e�G(�X��r��!��Q��{��zQ4O���m@����E�2'`I0�[�{;|;��g� ����������}��n���W������5w��^����z:��@<�����;Vn��^��to��n8������#=�6����IG�1ov�l���l*	�M%z1@q����*����6�&����z9@�M!z1�Ks;:�i/���[1�D.MfV}#��:^~�`�;�' `�E�n��l��m�-"���4>/*�+�F;-(�%#`��\���wKQ��L�����q8�-�-���Gj:\��a��n����#�������>���W)�<oM�������HF�e������*���s��dH��4_��4�6+���l���M�\�)<.n���y%m����^��K	���x��y9/��&	�p\���������K��{�����������IZR���CPm7U����#���
���
����2AQ9�o�M]R*E����
��������*jWK���������\49���"_�U�67�������{�Qp� 0�����\�f���������r>��C�K$�������fl&����=��[���t��Zr��s�������g�����f���&pGu�d�-pS��f�Y��=�1�������7�JO��
#!�_�B��#<�%��JH��S������m[�y��gn���^P�Ro.��L����!�+������]�������A��V�Lz�X�U2d����������"r�@�7h��������gM{x���R�#�-��Q{����;���{^���n�]�B6��a{��6A�^m`�G�&��#�!��r0����`�E�;F�Bu��"�<�dx�I�k���C}����72�m��/�Tx!������]�|����/����}����`kF�%����'�����}r�~��B��"�G�{?��B��"����U/h����p���Oy�mO@�}g8���!�g�.��}!���2��E����F��T�@�l)�����`kAhK��OA�������x������%y_�������E�K��=!�g@+p����%O����^�/hJ�]c��a_�Es:�r�?��]3�F���5����Zm������������e���I�d�n'ZF��f�/s��,3�,���LK;Bn��^P��q	�_�M@��,�M��������zYL��R���5�
���C�r��kB�����}����R��������|J�.7�j6
���m���e���V�"�*�\'�B�n��&��F���E!��C.��q��D4}����]9�d L�*\N���!��_/[��Zg����3����ww���F�/9I6��
����Y��m��Q��C�����UF�&*�q�k�KL��?���y�f2J����������b�F��A�wS�r���e:uh����v���#�E����*���eo������V�������`g@;��}�����]���F}����?��������{���~~����`�@�|�������v0:+��NNH��SV����`< jv�[B������st$ t~�����x�eY�<���E��K�[�6z�S�h3�FL��q]�Q�!��C�O���E{D3���'�y�����i�$��������������V7
(�.@yZeS��V�,�\U����zY.�2Ga]I�]�V8]�64II��<�a�w��/�b&��G�iu��3H��Dw�����C*����<�5�TTp73��h;���$/q^Uh�J�L�c/�5h
P?���zi�(�� fnG+�`A���`�L4��V^���z�Y:��``�@��_ ����]���1d�����b^zq@��|�qn�������wsy���>te����h�y�]��l���W����!w8]��*�����gmg#��4=g��y�d�y��(���������qtdQ�u~�e\o'�( )��v1;"���N��U�{a�j�X������g(�f��Qu���L�>�0�S9��S��	�W��Be�W��E�U+o)�2���������|~L��7������U��;�����V
nQo�W���@�Q�u���q)4F���u������J�6$n,o�]��Wb*q-���pH�%�!�T�&����S����u�����l�0�)�T�S��T6j�J�������>\���������Lx���F7	H8T���kb*���������lv��������]�*��n�9p�BKlz���J�������f{�����R����t���)fVwH�a�^����a�f-����n���vI�7,W{#gN��M��1Z;�9��za@���
�����
/���*_���#��	�HF����k��X��G��O���nP-p����-/���w�w�b��'

�y;�W��,0i&�?�+��h��r�s��[z�=kE:���S	�Q
������}A�M��r�/�K����j���#�Y&���[���B�,������Sy;-w]��"c/��n��eqIFn�D7	��d��(7�?�I����}�$��Emk�Q�CF�	'��@�D��o�[U��l�������o	��.�nY\\��3��cs�'�R������Fj�.]`�@�n����d�~�T�6��C�9
f�	��a�8����zHx�����b����di����pz����S���z1v1�\�V���]h��r�
���{3\�n/�J�%qf��&:��G,���.9��k���h|���"�8n���d(���f���\�z��}�[#�G��`������3i�\�L�$�T���o��hG!�``S@�m
�������1�E��8�_/h�?Z����K����\�o�t ���O�	�b���h����Mq=-a����S�cF����������$���O�7��F�fK�nP��h~suOb��j�?�h|x����oT(pS!��G6�������o9N��f���P�����6?�����S��
����0��q�\���=����1�[��o�l����9���#�w�o�_����`�@�zc[��M�����@/��v��
�kj���
�������~�*����W������ ������>5J^�]w����r��qg��]��}�T���Y���t�t��U���6�\��O������iy>m���z?���������NS���8�(
6�}������mr���Pr������[�J7��df���hN���$���b�����e�<��Xl�Oh&�>by��$y��<w�������)���kF�p(�WT��>�z��������,r��.h�w^���E#�Go4Y��G\�z.��^<P6�}S)z��"�R�F�Y`���^2�T�,<����-��Z�(�j�>�
js��Q-@H���F�`��w��he<b��7�bn�z~����iN�3�J��M��W
��#��vO���t_������������q ��n��u�Z��������$���@�f)���&$v����)'L(������0��#���� �#6G�g�	�:b3���0��#6���Ls�%Hw�	J9JQ��	���:�I�����dTn�l0 ��dTr�	��R��a��kL����
v��;�`�1%�����_�%MFV�'�J0�h��_�r���f�� �#|��%�J����D���%q��^���~$�r��Y[)7]���DT��iG-�lh����P����]����vY�l���~$�s���;trd��EK�v�{!�R6��@�������6���"Pr������[e�2GC��r��s�����T�^�7�E��c��b�"���_�����"�r���x���#+�|�����Lf��gJzf9�2���|D����G'�����O����O���=�1��G����{���k��j�eN��eU������uq9�f��n���Jm���Q tl�p��u"9>qt���]H���Zk������'n;1�c69��vb@!�l��m���E�f�n�C�����W�h����y��N�[�=|�pQ@ZbX�b@$�l���9�U���@u}]-����\f��{�>,��y1���Zk���\i��Mv����;�fm��#�EG�F����rFw�����Ns��!�5mj��f���1�|g-�4[a�F��z3�/jk,ma�Hy?:��n�]�b6sx�z��������K����L���d��v1�ZU�jc�#���l��Q���X/��]1�P|[���1@�cz�g��_���O�6
TG����=��oG�/���5M���4���;���S�9L�wl��:�4W(-��'~����:�s��8iClj�7N7��Sv���ZG���{
�������P��&yG�!�slKe�;�����z1@m|Gk��ZYjnQc���T�
0wH���l�<u�xo����b��u{�rqh3�X$�����9,Z/(�-�������+��k��GW�f�]a��$r�:�r�J/SDi_�Z�M���(��i��l_�:\t8�b�+��kPm�D���sI��b���fT�=u�{��'f�\t�q����!�&�1�+hJ23�����9�z������ud���%�dV��r������%��ukO����H ����t�t����?f*
(&�m_��N�;f�o`��"������#���O�Z'@Y�|T�����c/��Hx�.�n�
m��s��>�hw���<`w���F< ���nQ;�z����a^*�:�;��&/�� 2������a�����u/�1x��zr����'��i�3��n,���;��o=D���-Ux�p��t*������$�*ax��������p�������o�����g���>}y�����L�Py�A�z�xx���z1�_sp��!�[�ojI��4��7�������S����&��+$�*WI������y�[�����9���n�;���&f���\�d8eY�qn��	���#���	@�8v]����n���>j���s��V��c���d����8��odN�ms�8�&
�;v����OZ���,��v3=bl�,�S��9\/��s~��s�
�E����T��k;���h�s�V3
�*��c.)�������>��!�<��r��>6f\����j�d���������F�������^$��+)��c�m7�%>T+���v�����J�%�U����{*Z����c����a���6D>���2�,v�����������`0�@���S|X��&�����.
��PQ��>�;�8�u�(HL7e�	Ta��:��5�L����+G�����!�C���zS\U�����U���vsQm�mn���|�0���P ���@�8h]+��q�W���Rf���w]�����9,=��a���������T���x[�.��B�>m8[V�����oj���~�@0-���+N�����\������%�����:M��?���������o���|�^t�_��?�����SG����w�F@��P����+��{������[ n�{4���3��/3E�rF��F���O*�;�\�&Jc��8�����b��p����u�&y>�~\�Tvr	�R{�i������6���f=�n�1@sX��p���t
��J�$;fAlc��u��|��.:vM�<		����C�c_	��.��_�r��;qM����.��h��1�S''��|������4�4	`�������@���yW��������XQS:�����2��fp��Bw?m�~P+��g�]O�,�;�v�������	��l�;3z|�s1�u�e��M@��5��2fx�K/���gN$�����\x������I��N"<�f{�DS��gLat
�x��H�(��N �����Q'�p��U9	����������^�6��-�����1'�S�0#��NN|�Q��$@'��&l�.�h{_>���Ij�Q�X21����i�~��ZY�6*t�v�\�
 �f�p�t��<t��j��R���?��������_������#�:
���6'���f�L�_L��������X[+=�r��,'YN8d��s '��m�!���0NW�%dp0��=o����f�hz~�B��C|u!�o���n������� �,�M,	�G*97����Z�����6/����N�L�z1@*B�	V�����3�8����%`,->A:.1|��i��Gv�9@}����������6��dV���v�
q�S`:���0Z����	�h>c�����M8v@��8��2*����d2�(�g��U��I���hH��k�z|���D� ��&��
0���t�b@��\77m-gjn��r�
��cx���t�K��:?�������e�U���|�b�����<#�����=z������$��&�$�r����x��YQ7�G�lf��4�O��K)�o��������
7��x�����YWU�?<�&i>�V��\�^+��P�HF���ph����M����`
Kb4�xF-�{�v�gsI��b@��c��v�c����G�^WXW5�����s&�b���gmW>h{������t���M���<js�Z�`�������7I}�����V�+4��M,|��9UOfw)Hf�(�I,8�R��������SN�0�`�	�����|��9��SR��t�f���

qo��5z��l%��G�k���*q\�7������TZ�-h^ro�_�����5����o��EqL����B�v�C v	���:����Js2�K^�J��}\�wN,���W������Ng���IB�p�^�O.�5���9q��%����&����o������K�U�_��+��������j���<��.70z�����7��CO<4������NV��\����r������J'
�����(��9�9q�}���&s��hp
�m���8��L��X�c�M���L�`�f��P�#�K2?�)
�����������H'��xy~P�\��&Y���twj�+8#���Q��`d3w�������PW������M��;�uN��UL/-���Mo�5����v��5�����2|S�V�8 ���[���'vq��+C.�������+[~g#�����5�������3y������qLC��1�	`��!�S�'����[>����tC|o��)����Z���	����,@�[�o����hk@�8�.uU}����5�6�d�K��/f���k�����i]��n���&�J�p���b�}}bK�-����kf|�1P\T^�F�D�M�(��@�	��''�K�>���U*R6��`�W���oWY��tp�������;W�\�Hw��1�Ev9����;�:�s[S� �����.QW}�61m���
<T��[����bW}��vs�0�����d��L�����������x�Q�/@�O8
^/��-g7��.�����'��}��
��5U�1�	`�'#�D������L���9-�;�q�M�;����~T�P�Z�wH��u��
R�������e�n*�m�t�>qM���'�<�4\-Mq���&
O&�H�p���b�Rp9�������`�����v���	��'��W��,�����'�<����7���6}���z1��������������-�� �������p��7g*���h��5��q�a����V�`�'�|���]~��O�0��E����|������u��>	�7Y�����I��S8��`�����g����~:�.�q�1.?��'����	��7����`��+oV�`jKJ�	@�'\>j� 4l6jo��X��+���/���h��0^>`�'n,����������:R?������lJK�Kb��\��+/��=����k�G���Z���l�Bap�U@x��?N_����~ �nv��O�����
�����o�O&������I[�MN��Uq��{��?1��@�z�'q�f�8��n�.���|mz�f�`��6��z&�*���p��;f��0�>��oMix���>����W���}f��Z�$nkJ�p�8]�1����1�t�2�f��1�82��}g�R���K�O7�����^
mS��\�2�4gP(G~�c��M/�>�����^L}���3��5����B��&�?���K��4or�%�yO1�N�q�K>��`���hI4v�vd@���$�-�6o��@���ik�@�"o��o��Oq����M�[oW�����"�+7�|p�	��GM�@>q�����zp|��7&1So�nO���z1@R��5�AOSM�*�v��>3�as��	����v���P:�C�������Vy�����X��qb��	���c���p����.c'�/�dn�	�'���;�n��! �'��d�w�1�t������l����61H��)6�;81Tzb�Px����"��!e
 /���t+����h7M6'\0�WL�<�Fk��M�<�\'k����:�y�����f��c
���������1�t��N���N;
��)��SG8�7�s�OS@J��������)��SG>��c@���l+Y��1��P7�I�J1d��@������9=������)`�SG�y��w������_u�a�%�.uK��$9����J��5S�-��h7��s����#���r��dN�@���)2�s����+��*�Y��s��f9�����3�lM5��)��S�n��^PWrY�-]�7�L����9���f
H��#��9�L��r��^.G6]tbg�m��]&/����r�Y_�5�+�k7���U^������l��z�9
\�����vDI����)�G�������0�M�����Q��R�O����q���o�p�	p
X�4-Q�{�n;uJ�3��&�)`�SW�{�������+�=l�^`@z���w���'��NC7��ih%�@��+d���=&��N]�j��q`�S��h�]&���N9~[/����ik/��M�����\�P6CH�@�������V�f3�v���|�f�O��M3�9�R@����9������c�
�.���u ��
���7�_�����x������g/_��}����������/�N�?}����nP8�/a�u}y�=��mD3�o)�6�_��h�9VR���n��S�������#,����7Uh��w�B��+��*O���@�)�7�:*0��4����[`0�i<J���@�SI��b3:�z
���5���� ��*� `{�%]�\.XB
x���WO��:��MU�u�q"��������WO��WO�����)����;�Lq"������L�'�{��.={
���9N@��5Y����DXO�u��Nb���x�S����e�1^x���g��|�"@���I��m/`{��3��/p{��`�g���"��$��qrW``��#����=�E�qO�#�����:gXg��L�"��H��ob&�E\���v$q
f>��V����9	�#@���}��p*?M-�e{�=|@�4u��;&|4��`	������?�X����F�������������r��^�3�o�vT4	���#����$������C>g\
�����b��r�>M�~���{^@i�WU3��<�$@���o�:��\�u����r�����SG���m��1�6�>��5�����N�����S.W�^�	[��P(}���w���R����iQ�������k@)�|v��?���P�@��f7�;P2��R�W��f �m�|!� ��k�����R�g��&+�mdH)�~���|��}BJ��3��=��)e��� �@���aH)�~�H�w�m|H)�~����4bTH)8�����#�����}@/��o�-F����<6���-�?Z�#JC{L)�2��bqL)2��\���cJ�y��vt�11����J�����`�����������Q�l_�\S���do����n��h(���q)��b���2��+e`�B����)������V��
+e`�B8�m3�� ���a������~N(E������%e`G@:������+�`��C3��:��P=�C��A!�A�����ok���-e��8��c�ch)�fC�G��2@�g�S>�~����^���)T2@�gn��3@�g ��#9������*�-���	���0$����}fA����e�lhl�P��S��9��,~��Y���fy�E.q7���B3�������&���9��3��o�
�f��E.�^�����1.4����<�d�M/_�^%6+<]���B�~�D�����}6/���XU�������T����(��o���T�)���7����X{���R��v���|N��jsy�P�����g��y��C�U-.���n
�"+�O�3-��Z��A���OL���|.�M�������'�{u#���8��~��Z7	H��7&5���������wS����Qx�E]����~W�8�-�����b����K����q|�^P�xo�� 3�T������)��`O@��	��
�%��kl`��Y��5�[`���S���IL`�Y��~�0b�6g�9s����������^C���_�_�b�-�V������cT���Y�h��L|�����7+t|��~�����{fK?6�������b��8C���jT�P���Mb��9C�A�����:Z��`0=�����Z������q�^.9;]�4<yf����s�<s��n�w0h�� k����c�w��=�F��R��{�$�(=KG��]�FX����Ak�
@=KGk���Ft�8�]/h^z�� �3g"=��X��,-������������ec�F�C���g	@=�%����������=���������8wS��9f�g���� ������qg���V��8�Nl;��F��;�N��<���U���{�#����E���P�H���}>\7
��
}��F�"A�[�E�U57��*}�V�~L��.���X�}�l�������a#u��	���'�x�������������a�US�s�����UU��1O�����;N��W�H�t����(��oq�����Yk�l�n�����\p�5]�/(�
D�]���C�����iV���"�����������O���������X�b��������"��0H�5M�p}��M��j%=��B#\"%Q�b�#���)����U@m�\��UB��s<���X��<_u���/��|�V��c� 5V��a��4}>Vpz���F<FQ�H4}�8�{IG�U��2OyqK�.������(
h17���p���mo�C��V������^���]�Z��go/���Bk��,zM\���K��W4I3Z����G�p;�L���p�6���q���4��|^����L���y�w�|�'v�>w|��(^Rt�5	�q���w�I�"���@�\s����!��������R�&9C�;��-.n�-t�<72.nT�5'��9���=�t��h���w$\���-�d��0.J��s�j��[���w��nf���sk���r&�r�D^�jP�J��������?����%D7�j�%.ojx\X�F���������t4}�X�@#"G���n��Rs�a���p������z���*�N-�������2}�+4�&i��X���Y�*i��D-�*

 �4�����	����uL-S�
�����a�U���?�A���$�������nY��tj�L�-�jOXu��rI����9^�J�RC��/��@��4&-	(�d�\�����O���V�a���i��T��i�"�o )�������X�b���i)��S�����"���>���H��&n(a
(���R���3A�7<�#@r�d��V�:~���PN����T���G����H��t�5q�$������F#�IS������MM z��z�8��i������A�j��"�����zkj9^���P��@x$��0�6���{;t�����:�v���v�������{��v3���FP���T�Nf�����(d���%����nH�����8ud�S��c����NE���G'�G�/�bS���w+~hmq�oZ ��o�<%Sa9�8\o�9n�fa�����Zo���R��A�N3a<��@��M]����� ��j���5g����s���:�(��
N3�bOK���a���YPM��������Sq�'k���8dn3���
�
)`xS��e� A�\�r��.��#U��q\yjL`��������5���T�[j����7Y���C�u?�;������y�w[��uc��=C��������7��t;|�x^�t��/��E�u]�<�]P7�����6�L�-J�4w�x1������F��K]%R@��Y�T���
�U�PZB��^����5J}*B{@�=z���4�J������y[����!|M$���R����������w�<u�������Oo����m=��k������e��$���~��Q�����uNB���k=���yC���[:��zVY^�]�2���D�?+�������$���9�Zz��������{5�]{2����o�V4�<��/�HR�i�������WM�o�#�����+Y����������#�+���9��m3��=b2�)Q��aW�L2����0����Qy��Z��4�b��<7Cx�6�n������tF���{�9����~��N��
I�x�����W2��f6�v`��6�p[��F�6�]�Da$j���H$y3w�`�(�vHu�e��0�`�J��5�o@s3G+������[x�L�qi�Q���R�������3��EQs^WB�N7}{y���@����}9�v�jV�j�Z���<�35�9�gQU�M9+�O���� �����Wmn�_�P������.KZp�/��JA�j3�W���1�e�9I��x� e���lg�4�1��#����D���e���D�W�V��41|)V���C���c2�����M�7G��7���E�N,, &�y�*������ON������NUT��#���!�)<s�\'0�$6�X�[�
j�}�=��0�W
��<��b�%Z�������-IYm+9E^Q���1n	��h<����GFE�Wa*;e]�X��4�G�Ms�	+(/g�,�JH�^H@�u�0�I�����G�7[7����1�AP]�ux�j�������'Sh�.zwM5+z�p��uUQ� �3�XnA�k�B����K�t[����&�!�>{��9]�G$�U9m[��*��B�O����*���3&����b����l���C�W6�D��Nei��Z�H:��:s��\��ULjl���Z�Uw��Q����d5����n�yA����Xi��b�����*:��l���Y�02�COg��2��9�����iMQ�D2j������jP�������o����Z<����*�h�����Y�jx�����G�t�C�9�x3o/$l��'��x�o�f1����������v��2+9����w3n���W�Vt���4 ��Z/�����XK'	`�L���1��T���d`���,/7k4�z ��DG�^�?�A!�8XL@iE�z�c0C�����[��~]�:8���>;��:��������2���F���gzE��d~Xo������l�L��Z�S���"�&���X1!v&B�|)	0���X��F��~h�(�u��[��>�o4��,E|���w���6�3�Wg���l����G������l��nk��/7�.V?�B�4C����������3@Yg��Vc���$6L��Z:������9s���V.�h��S�������N�i&N���t�����j��:st{������<:n�5�^��]�fHn�G��x*��/3'��q��J�3`���nW��j���F�	m�/�g.g���s��?�9�C�AoOd�^j����X���[�v�J`�Q{���)!S� �3�������3'*��<,��.��3��fk�����0���Hs��s&�1�����gn�PJw�4���X|\-�)DB���:{��`�
�B�j��0���4����3'l9	��g���R��G���9��F��L����S1��O�[48�9W���2�Ig���xic��|����B:�i6��t���-���.���~@(Rf2�e0�n��td���{��q{0���h{����,u���Q�<�V�<+%'�g��-��e����3@ag����ag�����%df��whv&�/���$�����k�N{���yq���l_�e�����vU�z����,�$�/{���P���Lb�Y�����m���t]z
xg��'],z'|���F�Dwj�~��?ix&��p7����{�U��g7��7{t f�F�u�R�(�L2Rf��&Q��W�w&b�/�U�,�p��h��;�jg��;og�����t����K�������r�z�����O�r@z�'���Y�DZ�&<wf��DT��Y�����:�j�V|�vu�O\�L/���}#����x�*n��bhv���|�qu�]Swh�<?q�z{^���I�O���F�t�9@�sg4<�z_VK����8�g�.�V���	0���C(�����}����
UH����^"+�r������O�f�p��dM@K3Z/Z&c����������
���	$Zr���#�z��Gf��M�:��� �*�fp6/�6���Z��\����r@��ah9`�s�Qc(����9�B�0M
���hj�E��vu��s"X[���O��z9�{.a�l���<pT��j�?v�����/���=�Hwv 26/��F��>�j�X�IP����	��x��
=��v���/45�0��9���[A�_h;��;����R��<��L�y�d]�q������d���&��{_fx�S��mY������8P��D���M�8n>NvnA��Y8QC	5x��� ���d��
�W�Av�f����GY]�M��AJ�x�hY�kmF9�Ac�����:��uShSvJJYH@�d�I2��s���
&rR�����,�S�t:-e)�t���\d���AJ�;?�����54�_	�Fd��?�-�>���������U��@�D�y�\����N�a#>�8���F|�K�m��j���p7p3o��
I��6`�s���m{��>[y6+4L��F ��� o��>@�,9���jA{��Xax������c5��j?�����atk����'��K(s�>�d��$s�p7�Q����&E��&qd�Pm^}JiW����T����{�u�����T���K�3���- ��b�bYg�5�7�,j���{7T�@����N��i�VF��:�ji����������+�:;�����������=?r�L���H`
�U�����@�0�>�>:����}���g�.)�j7&Z�s��<w�xw;�r�v��_�j��<q��/�6~Yk�C�p�$�����o���M����W��Pqu������-���]w>\�@oW��������y����3/�^:	/m`]q����� ���L����\t�Q�ziW���$l�����WM��xU.���/���i���w�3�rq,�[d����EFz���G34v�*����������^�oj���o�!A���n�Z{�Prl�`���;�S����a�d���_r����6k��[�6w����(@��/�TS�5�w��6gQ�~�8~b1%��S?��B[�PR�l���>U����=�@ud_jv'��-��|��3}���6���q��(���a��v���= �s�x�O��\b��],��������b�o_��
�7}�Z�2��HMn���~D����z���;�����-��_��>��C�:E��6xImq�sF��ZhDt�jDM����u�/5:��n���ap�����sS��:��K�3�z�eO�H��;�p����}����Z�m������B^����6@�sG��n5tu��&��4�}cvw:�I~�,���r+�N=_�VzI�l{�mU�^[�
f�u�
�[�mCh�er����K�}(F0�KY�5��p�1�����
Y�-�I�X[��6K�b4@��b��6��feN����GG����yn��F�
��`FXj�qC^)�k�X|��%���j���Ww7N��6}����'D\���)����{�
�������KQ����&}n-jc`�=J:ga�~3P�V=���-Dm��[I,_9�.�tWf�C��I9�V��{�\��H�;��U��s[���Y��>�nW�%� �]��Um�������xr��<y����?���W�7}G}�Uu�sk(�����c���f�����PN����M�>�CL�����3=�xE�N��XTV����Q}.�%��'����*4R��~\�E��#<�N�Y���=��.��Y��J������J5�X��q�����jS�J};�B��P���b5)�Vl�m��R��������A��n$�7�ts�d|L�imN��UUy7P������=P-��v�~h��|[;�jS�V��d:p������.?�bl_[��}�Ii��M���}d��v@�����r�[q1����%1�&&�v��>�W��xV-+���_~&i8E�\��n"mG���}��,BX":%�t���f@a$�������[���4}�V�vx�>�������v���Uw��J<����y���]9?�5�rs�����5��+g�'��Y��fV8���
dG��s�;�d��7@7���g�R�$!kE=��������P����~�{4�6jF#��&wt��]�o���<"���oz��7M��$G�{������>I�/����E�-v0�>w{�v��>NQ������[�s��K,�c�qh��9��D����y=�E����������p�.������&;|M�;<gi���G�����e	�e@�D��u}P#�m�/f�����]Q���K��[�B��h��a�X��d����v��������P7�
���zn@��,"�H�3u7�Ak4����8��\���3}���M2�[-X.���������l��sW�yuH�W_4Dtv�]���D��P�Ky����bU�N�N����n,)����_a#��R�����!j���Q�q�@�
gE�[��e7B9�)�'7>:8�����sz;�L�;i�Y�25Y�Vt����g!����<���j��=lf���^������yO\9�K|
N�0~���i��]^�x�h��#z����]�_�_\�}�������)_9�������3;M�?�;;5��Z{4��?w��c������c���s��4��E��om|v�I|$�2!����j]_?��Z��p���e�������v�*����m��|"�
�\������������������G�*����_?������*���k��%j�zT;_M��5 ;AM�?��V��zr�������

�I�����c��"�}:������������7��^�z�z���O���/?���~�����O���J%�����Al�����C!;qM��A]��7���
?��<;�'Z����F��W+����w��YP@d�2���L:zc�P���Q��w���>��NRU����e���b��c������h����Ev��>��;Yf���s�)N'���������:;�M���X���W���g�>}�tT�N�f���W�z<
62�T8}����G�����k�����������(�K'�mk����]�b0@2����i�M��,�<5N)R;�N�����U���w�U����� �n��l����4:}�
zJ*D�5�	u���������Wg����J@��������7�u���Hi R�&6{;�N�;�_�s����Tf;B�)���I�������8�*�S�NW�����0�����/�.]��)�����n������p��_2p�c���^���tt�t����f���#�������, f�~C�m�}o�k4>0�?��X4%��T��B��7f$���q9kh_��SW�6q�=U������'����}�������X��&Vk�KY���sZUY���BEoF�,>��������4C��}�G�Mkd��2W�� �PI��\*/�-��T�����mz��Uq�a�*���7/�$W�(���\h#������������q�t�]��T�
�YYP@O%6^zs�T\lfS�B�����s5���G}0i�(��|�#���^��SU�>�I���uJ���z�:FUf�D��;�<�d����>�j#���8��
�������t�/�������>��}'t5& �pyV�����0h~E��p��H�w5Q�������cgB��]�}W;u?�������R�v���5U�[�:k���.����A������*Vr"-?(�.�����<�}��_��h�!�V*�m���1�~_� -��D����
Cc$��d�R_����X��2j?b��u��f�\�O�/�����/~���>H(���v��N��������|�k�n�R�hiB��l_�6`��;�q�'�@����S�<h_��A����j�������(���5z�o�Z*������L_���]���	r|�|�s��7�����������D��-~�&�����U�x@z���I<��A�/����U�X`K >�V�Q�B�y�ws}hC��\���'�&\���J�M��������|�S�����#~ $c������s�)y_��eP7"$�61�S�����=	��%q4W=��
A��jz��n�X���{k��}WS��Z��������cz������5�����8��x�����-�A��~s�_tJ��W��V�������Y�z�O���E�x����6I^�BL��A��=y�u�@i���c�TCs�#�Jl����OuN���i�kRQQP�����*z����#
���L?��Tf���H}�]Rv_�D�<���Q�I�?St>�s=�t����/�^_�����.��Ir+�����������v VN�����AV���R
~�����x�@�B�����������w35�pdi��x���)�[��^��K��=�r6��"����S2�~�c2[N��K�5����f~�k���� I�����}�DZ��.>i���~�y(�����zg�����Lg�����~3�\Q�4�	�f��]�����b�b�}�����$e��fS�0�5���,�%���^^��DX�R<|)�C�F&� ����kj����Dx`�AW�k])����b�8V�0I�AJ�����B
O������"2��dR_}:�#�S�H}�4���mST4�x�D���0��U���~��=�����u����������R.�tq�Y&~��^��;y�-O	h�:�>~;^o��'�������p���6%�S�P����L����9�R�]��d��V|���^P������Z��6if��l&�e�&��+��`!�V��U3=�>�n�cY����*'�N�<�^8�+U�~���� Y��j�XP�-���X@@)�s,M��3�AA��o;g`�{��T����0mc��<�m{g��}���;��� ��wJo�<��u����@���}^S}��	�2q����1���������,�R�\�k7��E�S<�`�b�]�K�5�>@�\�hZ+*}|�)1����4g�,>�*�~�*�Y�i��D�2'&���������1��@��/]0�F
����hwG^��W��R�TjM�X<@��C�a�2O|��"�#�����L|�����
nA��������)U�rF����	�h���(�>�'LVAn�/������q{m��&i$�k	]�R�L�5S$�m�_��of���=�X[�je���=(7�+d�������.��%#�Z�n�����l@��?|����f8�A�z� O���D�}������. Q��3Q�rh�b������_7O��5~��m�=�5Z�e�I\u7���)�������6c�=S4L"xu�����v��������Y�`Aq�3OC���y{\�D�
�m]�Dj�<
�j1��1��1���8��W�>V�W�o�&��v<�g��<Xz)���X��X�������2��w���i���X?E29�~9���|�m��vNH�l� ��-Q�(��b�>�����3��p1@�Ep����g<H�5@>C�O>�O�z��Rt������	����^��p�b�eV_i��)�a��d.���x���� � 8�^{�+�f:����D�{����.�
������ �!��6`�Zd=��w�FF�I����2�����H�Vr�k�z��p�K��cl�����(��	�`lS�$�! Q pM�L��>�~�f�p"I��H���d����U������u�d�����:�0_��C���3[K��\
���W�*	���yl�{��1. ��$&���
m�o�����D��HP���\��@����F��R���v��PTT������`)1�����+��gS{���QP��������D����9��������|���lM��g�_�[�)�'��!s}�F��:��������wI��3?��H��>@�B��g�?+4y	����������L�:�����������)��9�l��d6�*��������n��l3���n�{{�D�����?����?+PD��&Z�[�|�@�g`�J�rTD�R�K
������P����8	0�WuL;�g�n�'��7�>hU�VFD/ dR�3.9"I�����D����g�4�!�m�	������7�i�'�i �!��}�ZEnHh��u���Z�\L,A�A �]��4�5��(������W�&������OV�R����q;[�V��.�N R�)�p��	�������5�
�������������'M@��g_��f]����������h;J��������K�����eu �m _{'
X���\���]���A�@�w��3B�D�{t�R$n�~�lm� � ��}�P�&81<�/����.�Y4�j�:Q���
��P3b�@A6@�8NL�H�����j�x��V��^�����Y%o���� q��i�c����n�
Z���bY��]p�`��-D�?H�e�������_VEA����:�b  H��A�C�^K�	Kw.���2{�X>m����M���$){��a�����|mV�z���'����A�C �9����'5�1�H4�K4P����9���������C�W�D��&�����0��$'�~L�x���V��Z����t3!n�V�����j����8������������FMY�^U�~��>Ck������c�Hh��;�����1��B)K}(;����L��L`��L2y�n?l�E�AzC`Io�{Y���<�~��~�_P�A�D����L�A�C �1tQ�`�j�O��@�����:]�Z.�$i��)������@��`�r��I5��l/��?}NA��Q*?�l������TAt�&B�W{�=�YgsO�������� �"�W�����o�9H��>@]\�\��o>�z����-����� w�W��T4�<N��)/�������sy{d>R�����I�l��*����)r!H�������d7�'nC6U�cv��U|����C��J��>v������a���FB�����0����JY����xJYiRZ�f���L�5u�:�sB�
����Y����
 �Aj�;�^�](������\h����p��	����$O���
�>v--Y
{�@��6����5 �!t<�!
����I����4���&B��~!H��,�� 7"���u!Hk��l�
��4i
�cZC��o����^�����>6
A~C���-NR6��S#��N5c�n�Ma?�7I�>I�{�~H�]�K`�d2����
��H)�>���S��u�Bx�e��:�{8svU<��y9�;��{C��t8�����+�u!�!���:q�~���)���s:�[�����)�k��o�;��,$�"0s`�����t%@��������(?K����]�F���	l����B�C��6��!`��P��]�QC���	�b�?�N`����
���������Xikf^�a:�����%�������E�Yu��	�S��kK��a��m���^T1��8��!��Ct,�y�B�}�����^���W��#:@�����������I�_H����B����8�i�dGo�������G��h��yk�|���u[�S�s{��z����#���Dw�s�Z�j-/Jp=tq�����bO���������-�;$w�*�\�*��,{�
���-s�az�.����p��p^��>�!@�C��l[��@%]�qE]Pt���2�,�G���l������������q������=��[���!��^��(F���;2_�*���fE���!�n���N��\�A�����tH�V7J��,��bz@GH�P�0vMd	,UG��X<<�k�@�<t��lr@b�U������������
��
������$Y��42c2�z}`�U�y�~Rc�b�����h�����9��9*��PS�r���;L�v��?��7f�y�������������y��Xj^��-��p����������h�BKA����%��?�5�w�H������;;�n��&���x�)V���������4[��������',��;�#�w��Gh���P�o�}k�����LR?����StV.�_��?.�>[����t�3���:*ZwD�eH���Xv�:N������
B�y���������f9T��(9��z���O�d������\iBfx=��f�Z����U��js���5����e��~1+�������g��!.�T�P-}]N*M�?��)M{��S1�������UQ����f?�#�YL@�$��VL����nn2���X�uHk�$�	v���qoh:6�+," U������XU�.k�_��d�N�B>�7N�9�N���q���?b�Dt(��>@k����I\�������S1�z���6���!��CG,�� ?���`���-^�Z�����9|��0w�
�GY}������S�:�S���
s���O>�����<���rs�T�lJ.��>@m��<�w�B7�I;9�k
7N��3qC�����\T�� tA���{�I��3������r��[)��,����@���{���0n������@�������7���7����v�s;�2@- �<���\]kvo�8.�5�lN��5�p���A���nVsZ�^�1��`D;�W�k��]����P��#h�>�~�wP�f��s;��|{�xux�g����w;���N���� t���Kk��������C�vp,$��E�}���@g*4�m2��������z��0^l|T:���hw0�Qsx&�w�*�]�"h
?�<��A�$�qI���?.���U/�D3EE�N�5�sY���I�rs�w�"@+G��2Z0i|���-�X'���#GJ9�r$�����t�v����
���q�z�8r$������K�bq	B\q�!,q���.�\�ofE�F�|�0�Q�8�]m���n6*-�r�,�W���X�dP/��@����2��#��8F7���v�)>��o�!��^9�Y��"z8[�5�����oN���&�-#�5G6��5r@$G���~[�������,�����xv~x�X�'�W����P~T A��u�;[���t��7*V+}�k�����XJ��D5��MA��J��>���7�L���"4���y�����t��?����X��Y��2�(p����6uc;�T(U�BG�.��u��5t��k�9r�5�<��/b�A9� �#L@�;=
�<���jR�K�G��������ZT������hO�YL@�B�d������)tf{�f����d����M� b�8�;G����5y�>G����J��~@x�H�����s� �g�D�r�"�b:�����8E/6V�!z��0jv @��4��6 "�MG����c3�n@D���"����Q�/b�9�����3�MJ��>}z��-�uzqy��\��?}x����7��z�iI��������,( @���I�f�v�;��l�[o��t�%m�4x����!�#��C6h0��)a��j<!���������+�4&�����ST�#�;B����;��5L�]�M��?g�����~m�@R�Xx��p6Xw��u����
�6'��MN�k1�3Q��A�i����uv5=�bD_���D�O�y�:/��+h�\��~���5�D�"��d/>�e@������j��)���t��UH?/(�l��2�X�#����y�4!����l0�'�m��L'�q��
%��	�s�]���Ooh}U�RS9������C{��3y�r����Uv�0R�?�1�A�����}�Z�������f��6y���b5^=�B�����2������e����"�%}�<�y�<kaL��*��2X�g�q�t��q��,)�����Y�W�T���ev��k3�{�%S�	�D�����2����`�Q��B������w��r���vs^u%�����G3�5�bg��{F��z=rr���0��#c>PU���TI^������J4���3�P���G����<��;�]���]�o������^^��j���E�G.��l�	�������F��#��{� �Q�:d���O�9`�>��d�Q�R�J�fvz�}�}_y�o�I`��#�������+��y�_�7���9����7QsPg�����0������_�(������r��G������l2@Y}��T��#����2yS�4�#\�v�qY�s��_�?�]_����?��7��#���j&@�#	�P
G�l?��GU(goi�R&@rG�|����h����)�p>�`w$��#�c��]��&��9��O����Y��V�����C��iC�#��3��#����H}���n�zlV.6����>����jA;���n�z_�;��Q���V���������G;�4\o4�%��1�L���Edo���x�:�T;:u�����o3w,B����:��z��T��O
���y�������*[�X���_Ps������5��mg1�e#����5�����:����m����E�M&c�S�OMkJ����(yi��$v����Dt���+��g}Z��(oXuc�Z����:x#�t�GO�����0���3N������S1��^�\J��?.��r`V.;��l�uU>��<,�����V?��
������d�aU��T��=�n�D�S�qa�����c�g�6<�!�������l*�v��l��F��(�w�/��Zo?�O1 �c��f���l������J7������������K��q��QEQ�I��#�=B�a�
��&�t�w�\��l��
d�g)w��c���y/��������}���8n6���[���rx/O�{Tc���+����Vj���h�Lk2E���RY
�G��W���y:V9Z��t��L�����4��M��Pb@��V���@����'��/�����u]"��)�"<��U����jQm<��{����-VO�{*��
���G����#�����������|�p�fY���,C��1�1���*�!���������rUL��'^�@��6�|�����n��.-����n����K7]l�0��������U���c�'�����C���N<����
��/Y����%���(�����`�c+k]�e7��^OV�cS��L�)A8:&��mR�Z;�U���6Gf�����fyH�&���],t^���I�����������X���Y�S��LR�4�U���~<�������������������#��c��5WG�c��F�'�IK��������kCRH"	���fl�D`��3���E�����wz�Q%�����j������]�����5���
�f�
�Xb��}��D�~�[��9p�MA*"�X�c��D�FE�4�>�uqc��X��:������:vD�[����u|�����z:v2��]mdk���Z�Kc���c	�f�J q�|���X��GJW?����Cu�&~�n}.�fL��M�d�qq�3i������R�+K����82o?A�%7����mh8v���8����TE1�T/��iU��J�%�~�N<�60q,_��~c	���`J�O��4 �c�f�M���5o���c��H�C���G9���u+TGzdv��?���;L0���H:o���n���/��o��j���	_�����7����^&7vdrc������dA�p��Z6i�Q�(g���6���1�h�";�.�S�i"�XY�7���dc�6y�k<k,��C�L�]��O2�5�kG����M�����\��f�k�B�6���* ac��^��
��X2df}0�[c���	B5�\��}�Fd�vZY�Q�-za��H5F��t	�������1�Vc������g5��w7sZ��5�������e�����]hi ��a�}q�����8�8w(�4v�8V�GC
��u�Q��U��z'��������N������6�<���`��d���tBW�`���h^<�:��;�������[6� �U����l��Y�o��o�zgb������l�
��X�_�Z�5�����^h���5�Y��
-�d���5������c���	t,n_���p�	�$n_���x$v�I�q�j�����.F����	@`�����2Dlr�f���59qM�zP�u��`��s0W���	�XG�0���I}��2�i�j3�{�n�������]D�������6 Ip�������W��!����S���U���;l����8��X�4��F���=+�������o�h����Q
��L]��C:����BT�L�O�H���	�Y�m�5,j��������M$�UM��t=c6�\5�p��l"�j"���>@x$R��&��&"^:2���mB�mG�k����Z��7�dU�+����4��-��x_�3��S������Xv4Q���#��/4i"����	�A�`�T��X��������d�h��J�_|�����(MBA�`��
u�V4	�<`<�)&=��CX��&�X��!,HG�4i�@��4��#�t�8~2���su<��J�5?���{q�xNvG�-	h�
�������oU��t�b�T���[rc�s�8�x�ZT��������0K,�6�t����lc\>�EE�z�B[�1��hb�D��4�8�����.���u�\)�4�~�p����CoP�$���(�p&�<@�$�NU��t<#O���o�����:��2'�k�p5~(�7��?��i�#[Oc���9�iw�j����5G�h&���G����}�7?4�����W �@����u!@�	���^�	(�$~���vTw������^L� �H�(��3�������8��	`A�'.WK��&6[�����=�?�J���t������u�k����������$�B��������hx��jq��L��#+��n��������"dq	����(��lQMV��Zn
���R�HH���O�4qv���}fd�LZ�*H; H� EWcF��6����	�I��5s%~Q}h���c����C��4q��e!��C�>5q��e���C�5qv���p��O��H8+]?5��j��?���	i�P�`�I����|�
��&�s���n&7I���w�P��D���9L��C}Xk�7>P?g_����g�sf��T��b�I���8H&N`����s6�Y��@�$���(U��������p95Po��*X�F��5�@����
 ���"w[z�������	HW�*]�Y�e�����'�1N$�Zv R6����q<{��c	���l���7>m,�q��q��v�X����<���=�;(r��"��N�	��G�8�q"y����'"<2�	��Q���rQU����Y�6����EV��\�����H�0�e6�i���!��5����4�f�1N$��eW%2N$�X=���k��V����^�c��/��z�'���=(��c����U^���D;�������jp�����rVN�u+�n���m�_���Vp���+wM[��?����u����aU�IR�6����������Z�IF��w����b�O��v�����buW��������w������Wo��n���/7+�w�2�kSdZ�3��t���v�Y@v
K-������/�P��l����S��{<���z8=q���C�u]w�.om-�����)r�� ��AN_� ��AN�0��+�l�]��9��r
���AN���1�=��a����]������}�q�)��SW�x��tY��p��#w��8��n���>�m����Xy���X�{�\�yC�.�O�Xa�+'�q��w��^��)��S�2Ob�P����V�j�~�	|���Ks7w�L�S@����B
��!�lJ��7����5�S{J:�]���$����Lz�=s���5g�1��$��Lm���[�^=�iQa&!��u�P~���x��,U��}���>���8�������e1o|O�r��>�.��X
����n7J���������.���(�������jJO��T��Zx�Y�:�ld��j��cb���4p]��4��G`�D�~��HB�4����;���Aif9�.?5�����:�Z��_n�P)��S,���R�A���d�FA�*��z(����x�wt�j�����ty�4�t���iNmH3�z��J���F��t��	p9�y��+��j�Wc�=-�.&`��h�
s��}�9��'�E�4�2u�9�~���[��.����=��S�7������d�LO����rI
���n��Z	����qoj�J��<��9��m�:9�0dv�~%
��9�!���[�&�3f�����4tQ?����������x���X�M1����9���U���l5x���R%�2�����[�+��q]8���xI���zMw���8���Z��?K�|5��*�FS�G�H�z$J�i����hA��5P�a�J�T��VP����y����7\\���|?�F��h7��8�`c6��pj3����j}��Y�of�uII4����P�J@1�h�6L�7�6x���������Jw�$v���nK�"f�k��ZYH@9l�
��M^�dG������(�����@62��_(iS�%���ANm��Y���)���,Y,�j�x����E]&�nJcd�����jVr1rj�UT7:�Ew���\S���-z�I�`5���g�g�mg���8��L�v4_
x���+��6K}�����b[�����d@�i���~
��4q�z��IG���O�������T=��?,��^!k�����e�����k`�S����vB��&ni)��Sg�8�1oj&h^Q�P��I�Gz]�N�8�lV�0��i���x����<*-�tz���bR�~v����~X�+�E�?6�x�>Z8u�����]����g?��k����!�V8����7�������T2�e��#��������7<��
8M�O�����N1Q��.������NEw�m�b\n
8�4������m�N�=�H��J/���������.�h6��q=��xW�m�-B�N�V��w�����{��m^��mC�����XD��K�,
r-�Z
��TBg���gS�m������T����hL	Qa��W��	h���.�B�.b�`���������b��R����R ��#��`�5�X��6u`z�Z��2�'Q�)�@62��S���|���2��Q����/���-�������=h���[�G��Tc����5�t�MQ��9�G&#�����[�t��l�kr��D�S��@���L�lu[5�C�K3�\�L"*;�TU�%��~�c
��T�s��P��D�-a:6����0O���A!�%��T"p�����H�z�s9������"w���/����j�k*B����2@�f	��c�����+Pm�f�o����m���>��������<%>I����3��f��.����g�Z����j�����_�,z6��r6;y&,�\-nv����:6��������nk'�7�1��6� Zv %�>��g�z�|��������j��5���g�zg�#�� �t����
"q�uT��eo���.�A�l�A3����f��|���6�Q�:�I���uR�f9m����$tB#��K��zX��6���R�mEq������i����������3;2�7s�q���W�r@u3	���r�2�f6�w`9/�o�m�~���N�}q0���8pnf�s�^e�
P�Y�6?k5�g�*g���$������6����}�VI^���k&2�#�l�[�ql�e�Y��[��6k�5��B
!���Q3W�7H���(��+J�&Mk�j��j�{l���_r���I�+�Nn�U=���5��O�;<$�3��[f������/�d���D����x���l2
1pm&�4d��f&;�h�T63S��8��
��l��R����7�������0�v���p�u�@�!	���%e���BG����C7����L��hs^����U�/?����(����l��z����l @�Lu���n�J���,�]�I�G��f����&Y �� K������M�M�=ss��������92�`|3����w����D��.�3�w������m��;I�+�5�Wf��q���#��&�&IS[�����>\V5�jb�����L�����=`�3�61Q'�&�����|�$��7�y�6���Xo����G��J�Y<@�D���3/����'�}og��(���b�]��p9������bG]����0}c�gmm�����]��%vZ~���9����A�:���J���T�;(P?��G�x�2�N��a�wV�����l�^����N��Y��\�@�q���Y�h��w*;�6�*��v;�O?jYu}�uwr�1��7	FM�3	Hf��#!����o&���C(��w\�������+�1�:�q����8�~W�
�1[hlp&��#�^����������pT$��������k��@o�{j4)�����gA�
�GDd:���n%�,������iE�O�>���0�@>Dvxk�T���E���N�G��n&��]5�U�i�*�w����}����������-'e1�<{W*a��vC2���A=C��������L%2%dh�]�?�N+Z���O�B�g
$26��t���=�
��*S�X����ms��+TW�f0U�/��3�C6���F�b����`�3�EZ�r��!Gav����z?���oH�dB�$������(��F�r6u�Q�Mv^}�OE�]Ag���D�S���[��7��kr��!�^]4��X�{� �RK����q�j��������"i,��C���}�++i��Lr2f�
$�|�p�L��)Aw�*���Go���q�{E�dA,�>�g|K�m������#�|���]����r�tQT&�������{�����Q.���$�Z���O�;�|N��L��Y{Dt&�C2X�,sC��@O
�d�tzY����b�,:�h��vG^��@l�����R�n�����p]�tu���1�72I}���bm~A	�(����$*�����#p�I�C�y���M��~i��q4��=U�g>�=�t������������_���/��>]�����8�x���+)H�Mg�~�@���m9[�_�kSl�m�;_�~L��P��������uxR��L��}��H�����rf��,�^���e����vw���s&{�;A����$=�#�(P��D'�����v���
��
���b��^,_���.�_��1K���������:�r.���uh������A�N-Vu�g��E���9]|m�^�v��\r����9��s�c��)] �sn#����eu����v5�%_v���x���e������
�:���#a��8�����8��u�5�~)��{�N��W���!��-��q��(�
����r��_����������M�O'�����z��q|f�H7,��� ����R�����sMh��N�����R�}����o�����G#56�|")P�NV�\R��S�:w$���	�k�wn�}G��q��M��
h�ylS�P�y������)9��sG\���EYn������tk��+r�aA��2���$��7��xs��<�[XMuW���� ��,6�o�������`��\B��}�NI�20v9��eZ���Mg5S����Xbl���9����,���[c�o��f1��)hv' 6��}�v������y.���� ~�;����2l�8�H��� �s7{���f�|i��x���M���Kt���O�v�s�m
{!)���d
��y�4�i���o����8��cv�"n�q�I�K�u�3��N�c{s����b���+�0zz���\:r3�r�+�V��`e���p�5Ls��47�I�*��sG89pr�l@�09��nr/�C�&�
Q!�e�L���as��X��]�Q�34r.;�;��}�����)}\��U3�I2��E&�������X��k��-6m���{�4�;����W��M9�<���9 �s�U���ln����dm�SW�Y�@�����nt,�2�[q�9��������Nj�^5jW:}��Z�%�c^�x�K�1U?B��`�����'�_%f�Y�[c,K�s.���W7�z�J5�-E_�gOtf���!�W�h�v�^��N�k����A�5-��S����O��0�9�zLl�C]H, ~������+�ra�z��b�V�k?�w�����K����PN���;�]��$����Ak��~����(�r�;�rnuN�\��+[1�sn�I�\z�b<���<��� :��h6�sn�7�$�����eA@"�"������a%fW-��],��-�9J�G�+����T q��u�^ImM
,	�9��s�'f�z!���V ys���k��������F�����m��s����X�s�JN����Kor�S�s��V{�Qg��{���x^����3�p� �u��W�����W������������Y��Y��S=,5�X-���uZ��%2�|'����?��O�@�b8-�����3UU�m"@�.�p�d9��:���%�bv�1.$o��yE}�J��@a	������'M��F��f��n�G���z<�*��h��Xzf4���1U` T���N�h�wG#�������	�I�=����7�n|?H��]]o{�]�wV3���y��B�y�?�o��w��0��s�X����r��.8��di'���p�i���}��p�������p�~����W����i[w�� ��#���7wA|�D��	��A=��A�m�;B�q����0�$�?�6����n�����o����/����&/����vv��2+C]#�_X7wq2�%��]��T���.�Mq��,CjW��*�k��	t��k������z��ws�l�@r�E�n��G�b�g��D�2?l;t���W�
O�0}��p;A��+���Ug���|���_o���Eg��Yu�>��uF}�Ug����$�u_����N:���������Ug�sg�Q�l����rX��>�w8�C�L[�Q[o�|��[%�>~.V@v��>w+ ;L�������Y�{M&)�����m9+�qa�rz��"�&j"{�M�_���r����TJ�
@	D�b�,�A��z����/���fZ�U�!�=@�R�����`�����\�iZ������LRGd�q]�����eYP}h�"�;b[����6����L���,}��k��M�
���ZW�p���V/���;GK�\�S1�E��2����E�$���J���x����o��1	C�#m`�5�`X������N���G���M����_v# InX�_����LIH-�E�Sw��f�~(M�7$�G���#�
h������
��/��D��yR��X��Z+��Tv&�>�f��5���E�]@����8k,�Q����#Fk?,D3��������9ESk'ET�Q�
������Qw�?E��S�-b����l"^��6�b�@YC�y����}��d�����U�}�*K��
hk���n�5���������c���L��P���fT��7�
�3M;L������#n�8����i��(��]g�v �>�C��9o^�J����������7W��������������;4�c���P�j�y�g�T�>v^>�g.��7�XH@N,\1�a�fc��X=�k����_�M{4s_���+m
i���J)�,,�N���c�:{U���Hoq�������5��c���Y��}Ua�/'h`�5�y��z�f>�|�*�P]�����1}.���*���U0
�*��hK��3&�	�_��+�W��s������A��!TE1t����sfQy�@��|�J3��������U7J*��l������K.vX�>w�������K.m������Id��9l,34�%��CuW�U;�L�?'��[��c������5��_��`���s����9�g6���_��[e����dv�.���mnt��!v�>w|X�-.r�aw5A=s��\>�������;�M~F�'nx��= �����A�x����N�B����l`��7,bA�|r�_�Y�E*�+�a��\s
�o7�������%��{0����v�n.���IDi;��'��N��I�@tG�7%J�\����p1��b�������I�)���N���= Kl�1�B4c�N�"��1}��j7tH�Tv@lD?b�A���H��6�� a�*3F�*�3��_K?��N*z;�L�?'B���4�"#�rh�t��a��*s�
���P<L�O�ey����x�^en{L/�t{�����pIh3]��S�,�k�vB�>wz�Qr��U
h���q/B�VE���:�P����D::!����n���N���b��n��������6�P�&�m�;zM�;����p��rb�����4�~]������6;�M�;V&�i��[{k�����c����mOp�� �1wS����.��Q��
���8}�m��}� q���J�����!���[����P�V-�^�;xM?X�(���tv��.dy�|�����z����E5��]T���Euxn�T'?�]R}G;�:aS�\���Y�~�.e����U>���W=�N�Kr���zk{�5���qQU������Fo�-��u�5�dAy��GC��������v�~�/��v���� ����P{��t�/���m��'@m�#�n�}���R����w���D�U���D��'�>`�}�E��s�/�H$:�h����'���M����:����k�H��Q�m�o�u)6��������\�`Im��KT:�P	*�j�p_������7��3m���fF��W�D~�I �}���2�v�����������QN/�l����h����/Z5���X��b (��@��h��9�����%�ev�v!��^���7�}�n���N��k����>l�_���w���z���_�vzy�����HA3���h�d���#�,0sD�/y&����#L���������mm-��r���+ �}������������c��-p�v����[M��o��~`������U���������4��j/��(��8�>���t��,"�:��r�
=�.�{z�5������u��������}A��������A�� 2���W���3�_������4L�
��(lv�H��������_�b��m7����7S����7���h���5��b"d�c6Bwls�oT�^-�R������1�7����7�L����H��y�!��d3������qM5����-��jJ����+���}�/�\G|&�%���4��q�����F|�8*�p(���}��5x����q�tW6u?�=uJzN�(.����YC0���a�����D���w�C�Yc��7���!�u������V��Hd_"���>��}�6f�!���'<����b�>}�}�
	
V�����_�pc���9��pW�1��Od�u�?Y8@W�s����K(�Pp`?qh����CK+�W������������
v����K�/�h�.������Go'���s�3�}^T�����K��t�4�
������T���buG:;��m����#q�;v� �Sy�p�� �������6 �}����H�
V�n\��Z��'��2�nO�z}@K\(b���>�5�B�;!�y�D�������@��e�q��}5 ���f�������[�V4|h�w���E��_6z����f�~H�������|��(��~&'i�r��7�]�G�����F��R���������%��%(����Y?!&�p��������(y�ZS_dLG������_��C��j�������K�PA��&��Z`���.,i�d>c��%\���T�:�e�$��\�K��I�KF�lf�O���=&]��Wgx.6kU�����N,��	~8�����q����sZO59�Mh=�'tU1���BK������M3�y���0���I5(�����zP�*����"f ����.*�#����R���,��U�.M�#�iW!��9\��"��%B�Y����#8f���2�x�R�����[�����R��E�������S5�:�w�@3g�����8��?���A��?��O��I�����n
:
T�a�������z����~^�������q<�GW�����|�{��>�ue�#�5�6��WK{����+���@H@j�>������1A�fT}U��h:^=��G����f�T��r��AH���h��>���qf[w���;�9������3a$�r1_��b�4T���.�G��X�k�8�;6_�j)mW4V�c����nW���/U��[�._��#r09��Ua<lf��N������r>)���x�*T5���iN�m����h��������\,L��l�4�x�z���a�]r?�����
�������
@O	=e#�6���N/��<�_��W�W�_j���*��_9
�Hn�t����������E@z�����4�������R����s8xi��9�~��-]��rm���GG�H������"5p�Vc����=��J��)��.���&�� ��5-���� 5p3���o%���'W�o?Q}!P��m+'$j ������&Z�Y��V��z��n(+U�
�!������N@���
�_�4<��n�����)��vg��
���a��FY��c'&�7T���	`Y�ee���f���Sc��4�4��y��_��$�9QH���du
��T����d
��D�
=m?[�GS?�?%	Fe����D�����/��|�H��B����Y�n�~��;���K�,)���Sn����B'
�����AU����XU�LJ-���~P�I
,F����i��HM�t�W�����*A��Z��=��E��b�{kVb�A����QzC�������S�4���y��A���e�p��@*,6������R;���zi����i8PC�9S]�5��	�d����b������qo�h�n0qr��,Tq��
x+�0x�@�YG�T���b�Sc�6a�2�BQ����;���������h�h�;����u�d�t�3Q����s��O�:��*�������_Yl@%E^�(���,������\�X�������o������:��g<�OG
=H�l��f�A�����Z�����=���o��_�gM����;����j3�lOll)LY9��
��
�����e��S����1�h��^'.��L��j�H���>��~�H��V��+�J
���h$<�G�mmWOM���@�o��t
��<mU3f*X�@b��/�6��j��Ar�Z<��Vw�v�9,no	�UA�����1_�>�(��<n�N�������m,��(}`��=�r`��I�WOHu�2{&�G-���)��Wj���`T�A����*;�az9p����bf������1��R����l���1��G6lU�A�6����.�h�W�a_���'�������q+���m�9��fv�1�L�O��"�O;���9p45������C��7��P-��?�'�[�������,P���.�&��L�����:9p��@'�����=������S���A�������O6+qy�A*�L�>�?\-�����=��P����������O�������9pu5NO���R�'�c��;�%�����|/�{K�
����.7k[�a\f���Y+�Z27�������C�es<����:�Y����`hv�R����N�m]��D���s���9�Pf�.
H�@��yo����%����r�j���,����ZM�?_��~M#<�+s"w1=����O��)Tt Q�4�-'�;��5�8�L����h.�[��HYU���C
��{cM�V�;�X�2hb;N<�u ���N�MK�u
u���4m���Y��������N�i����>om</��TB��b6�Y�%SC�PR���~���2G
~Sw�"z#��t0K�s
�u�4khR����\��:]��)���+Yh�P����>,wx"h�����]�kU&Y����1����q�=���%�?�	��XP�YB�����6`uxb?��f�da��j�W�	������(��jO�^�\v�]�]M�����@u�#y�II�K����a��Y�v�Qf�"�NBa��wn�>�E�\o��,�u(Z��5��-K���>��W���Y�%�/��6A�<�t�:�x,�.���T�����Ej�e�
e]N=m�Vm�fK��\35���!S���kC��"^E%C�����+V��V��C�V��������w��mi
%tL�[%ux���x�-�#S@t�b�}��X��"���U1�L��E��������5�����:	y�}�t`[��Vi���:����hx�=�v(����L�x���''A��R�������q@a�"��������Z��V�[�o��F���D�CY0��PO��t�4�5,( c��.�Z���u�K��^���d�lQu�aX��v(�o�a5��a�K&I�_7�I����x��l��NN��!��C��\&
��%J�����	���G����O������J�n���/��.�Hw��u���b��@�������?�W�^v��y)�~@��nT�=.a�X	����%$��<H9: ��^�! �C7�������k50MP�������U�@�������#�Bu�V�co@*�Z�������������������>��I#���6��M��U��V?���3U4w�?��\��tgR30w�s��#��5�q8�0rt&����+�rj9\n�r�,��a��C*#wQ��	������������7l���?n=���b�`�C�9g�"$R����@y�6�b}���[B����#FB��<-��}_1_�JM����O|����.&�af��T�D�kBUG����[������[���;���cm�!�}K�������=�y�^���=�"��
�, ���!@�C��t����{�#n�bj����g�U��a�6)j��!Yw�1�^��C	�f/���3�����c.�-���^��#3i't\/(w�� XW}��Cq�
h���-Zxh����#��>s7�t��e�����(��CWj8�p�J
K�kv�]��P�6�}��F�b��tmC�Z�#\�9�����N����/�����-C���dFzY�],��'���e17�����a�����
i7)(���8B�R]�B��"%���(���Xx�B�Qo�~��������,��h��~�J8��8K?�E���`��8����U�y�

tv��d���"�Q�������'�E+Uu�����W���@����>NO��#U
���aM�o����t�i����t����!�}ywd�g5p�a�f��7tz��a���o^�
��fn��^����	L	H�>�pX���+����U7������3P���U=��������t��9G�5�:8�Y�8M���5��C�?y���3���y���:,�%s��8��@����h�o���M�n^��q�8�cv "3\_|*��Pd���Ry	a!��:;V��`@�/�����2{����Vj,��L��A,* ":��j��WGE�R��� ��{�n�7t�o>I���0Mh�����p5���8�A�8����dE�w������V3�Z��N����L��u]��6����7W_�"�G'���g[5^����������~<�����bR8�Hav�:D.���JN�b~]m��j#���D'���zKs��-�B��Dg��R�
���@�#	Ef��+H�h�<P��%\-������P$y7����'������������^d�5V��'������w���_9��������f�]`=2�U��0��p���C�EU-<� ���0<��UMj.�Iy[N����L5�/=F<����6�~'t8��]x��L?*�; �;��?}*T�<��8����3]V"�GV���j�,�d�8�bv�>V����$c�"m�J��]�`H����UE.�oh��l��k�T~]�/\z�@��t���p����������?>}�x����?�]�U���������c�E������ p8n �]��<:��\�G������*=%���O�>|��0��#H���b�"W*��^����xB�����7`�#��^=TwfU�8�'�"������U���y�?�k�z�wx�+@�#���@"G�-4����a
��~��3�m��#%%��W�\8r�����9�� w�]�}�\\�HY������3�Ze5�	(�D�Y��f��\a������?J�8�9B��������IF�t������+���[��I�D}�������(+�n�#�n��iE�J��R�h��F���4�x#Go�^���� o��?������Q���P;��5��b�l�E���#��F��[R�l6Y�!�2�S��a|7�WL@�F�.��"��:�%�;?t��G�?��>��������x�Sj����(�Gw9����#Go�V�����&����T�s0p���jF��.�����}9*�R��p+�a�N$
"�G�h�@T�og����������?.��/�/�>���;��go����9B���#G(�����C��F�[�TOS�[5�p;��yr���p��������a�4��Hvz��|��O�/�t3�5`h�7���YO �H�q�A�����[gC���4�0od�y���2��Y�c����#	fF@G6�0�4���S5d�r@�F.�ow��a�������@�L�!��F���"q1��<���[
���[F��$r���G�0$�>��!��I��v��R�00����!�e=>2q��H����VW3����0ev�C.lr����r=���;cu���(@�$��.�B=�l�qs�y�-�~������\	��~�m+�QU>������� �#�L�d\������}����_~Q�y�����g��j@y�7��@yR�f���?�=�:��4J�+v���g���������������w�����e�+u��bUp�����}����W�^��2.n��@�\b�U_=��7K��H������G�U����x�I��}����A=����&��=�R'z�%U @&G�[eX����X~&S�s�z�������l&.�09���KRd��
Gf>�D$��d�������4����K:9B>����0�Q���gQ�6b�K����+�K�Yo��W�U�2��gkP[1�n��(�����@H��Q�Ci���(��I�������9r�Q���/R*j-��K�P��h���Xk�s��=~��C������Z��.�d
j���bwm-���r�)�Ci���(��A���xk*%A��>@T ]_j��O'�6���[ZO��?�%@KG.���R_���:����}�b��H�t&�k��@��������\��L���M����:���(5�^�AdS�@�T�����S��8��V�F1`�c������U���v�)���p����}�������m���l��,����h�C���]c�g{���������
�ct���n��+�n?�)=. �c�%���-X��1`�c'���x(�X2�f���xrhF8�v�:��7�.���}�=��kG���v�;v���1?K;�K87�� ��=^�����M;����no�:���#���b���8x�����}�	�����hb�-k���./Oe�	1��c��n_L����5����1��c�Z_A6�������-�6g�U������������"�9$t/(v,?��ruk~z3�o?��l����Ry���W�Gpv���5����G���+� ���]]����ESk�z���wX����y����5��c�����on�gN�#;2�B��):��c�A��h��g�)��UY'����g�is�������H��P�%�G�k6��cG��� w�F��#��h<�RVrW(�1`�c�9��{����cG.��V��9� ���]�v������! �cG������k�:�h��W�J��?�q�e'�N ����w����,��cG��(���;R�1������RxJ�`�1���f������N�������W�.�4���D��/6�,}�i, ���Q�v?���4��i�����
����'�T^�7-��N��;@�bG���(u����a%�j�[�����cGsm�	��z�2��(}������]Pz�i^4
g������I����C��5�H�}>��^�O~�9�����4�Nt��w=�/_<�������O��nU.�#����w�y�|T��W����iwv={���x��|F'��'?�:8�������~D��f�Q��?�(�wT���������/�#}�
D4�����|:~>�g�S�����0W�l���;��\LG�T�8�=�$ID���^����b?���Q�����$������_�?�cA^������j�����>���_�<>8���C�������������{���������_��TD�{��PP�
����'$���P���|�~����B��:=���(��xrPNk���f�yb@����:�PK%�N^�^	y����g���b]�B{�J=���j8��W�����D�r��+��Z��q�&���&��+&}/�T��$����z)lmLU��Fs�D���>��(_�?�W����4�I[����=�Sqz���V����W��}��������*���YK�'Q��+Q��sZ�}���b.���/��oW�@O
a�����d!�v9�vK����R���P�3oB
��IQ	�?�~	���O�v�\ ��^<���m6~��{_'\��eN���d,���U+_���"%b����a��p
����z��/E�QV�Tx�F[
Mo�Y�K��iqgb�����e�W��2����+�)W.��pH�\<�A���=�*��M�����Y��{_?T��.�e�{���gA����A�m<��B���W�(�4���l\U������
�U~%����������TC]����Oc*9��&�����7�B	�Q��t���"��{�)]�G;�j��6M���e���(�-��$���6Lw�����[�@s��������+Mn�����O����[���q.�-?R�A.&��Q.=���)���������
��J�u�7[o��F��N�����F�������������uk����O:o=4����{�@v����\:-�
�@s.g��"������,�(��Z�}�W�������A����Q���g��������7�k��Y�������M����s�K�k�{<ofq�~��h3{������9lf��X���T���M��.��lfy���}&��=��{Lo��
�3������y�M���R� 3����������}�f�b� ���I����~j�4��:�������I}��4�l.|�����j"Yf�UU\M��M�?�>SG��H����V�~V���f��m��@o���)�����
�$M�������M���
K|��b?�l�<�d�vK|�����_���n �I��E�����z����6�;H�8�Lv�=v��w��_7���}�	���L�&
]g&�p�����B8���b����������Q��{c��z��p���?����2����~�6���Q���{����j�A��_�G��7U�J��~.7k��>-������&'����
��C>Z��GL�#��?SSI�����?���~;�������U����H�
u$����L���mf� �\S4y	�a���0H��4F��x?��}��a�D��3�
ybQ��x2��z�A�O>��(�~��M�f����(q������6���Nua���v��Zl�]3�^���m�}���@^��f���M>.��C��Z��6|?���<tJW�n��&������0��_d�h����x��s�������}4�x������4V��?lhu��%��=
�I�+���1>������c�l�.7����g�5f��7��
����������7y>>}Wl:g�\�fI�/����+��p^��}����t�g�nH��{��M����5F�^l8���t/�\�����7E����f�O�gR�A��ri����>W��e��5�O��n�g[����F\����?�>��.c���o������o����j�}C)�\�v���Q�}����>��2��y��������xg&6�������������3��:��Ea�:J?�D!2�����{9�q�6/6�3}�YO������|�g�k�UHk���4���%,�!������z�e�$��L�2��fx������G� �\�1������H"#�m�~}.U4���s�
��������2z�e����m������'�j��lnl�}V�z_�Zup�����'�\��Qh��K�i�us80�\3��@��'�l2���b��Ox��M������)��X_��1	HQ��sV��1M6��?����JD��c1�����i���Q�9L��|���:j����3��wd�h��������?��Iyu���8N?�B��� a���>��P��>�K�)��r���'���%�)--��^�$�+��U���-��s�����$N�������9?���R���w����N�E)�'�F��7]O�
*�N�
��e��{-���;�%:���p�/~�M�>����	,y��oa����K����|���%����o_��}����������5�K�7X�_��������o��-�����7H\O������@q�n��o�!����Yx|8�c[��L��zt(%8����vs����|tG�gt�G��=��*��zl![�9�G]���z�Q1+���H�Gzl�Q�?{t��#�A���[=��b�.B�#<��XCw<�[15�8���x���0}��<��2�=���������>��s��R�G1�1��ztw�=�H��7��#>����}?�����r�Gu��y~�g��]��2��
W�H�C=��[�\������<��c
x�xC�>�8����p���G�hs=�;w��G,�Q����oNx���qt��z/��
��8��{���<�0�����G��#����[��K��/���<�g�P�r^\�n�u���-��pO��&~������'���m��/���{�*����(oj�����ZJ�s���8���Dy��m=pX8�6���w�h����(�to7��Map��z��U�J=�
�Dys���"���k�7s �]3��jx_Eu�����������y�2j�"
t[�o<��xN�G����_��A��P�Xj/��� ����Y�,����%t������g�����'���x���� ��Kf�J����"6=� ��x����K��T^=���Q�Z����]�8�n���d�~:�2�n�m��z��>��`�;[�yR^��{I�q9�!���y����ln���dR�U�^�9;���`�����z�H5�U������g���s+�a�����7�g��Zj���=�]�����uQy��z��b��^r�{��dR��������>�9����WL/����/F�BHe]�t�iISo��q+�z4@'��NJx�b�����_^�Bt���'Q'o8{3Z)��~!��A���|w�V&������n��i>���]�w��>��>��O:E��-}��*��/E��z��Y]��w�E���o.E�/��:�������l:TpC��r�#�����/�WE�#jMe,���:��i��~%^��E���EQT�E>�7_kAMq]��F��7����+��u4�c
�������#S��b�kk��W�m����
�Z[N���S9]�V���
��*�&�Z�ZBV�R^�5R�U��)���n�������I�.F����RP,e�&����U)��-&*�����K�w+Y-���d�.*r(��4��R�R���*w�Y4mFZ�5oM&����YxY;�U���RM~�4���J_3�]�������������|S	����\Ei����'|(��������h>Z
Hf�K��7����o��[�$9j6#^^�}Q}�e!���nA%M�����^:@5^5[y�K6/����h�5�ZV�X���F�� o��j�j9����N�i�4�q)����m�Q���Q*�N�x[��������e;6�y����:�y�4�^�hR�GS=
� M�7U�)������E@1C^1��rT4�2�M���
/EKQ�,������4�i�BVYu7]��i���t3�SB���7>�7��t���V6�X�_���2���JnO��pU�c9�N�G�P#��\�����(C�����}C���Nj�

y��	5�9-.��j��4���{��3o1�uE:����v-��YL�b��a�8����+UMQ��#�S����������qe�<�J���R�1kuyu&�^�x��4!o��j���+��w��������7# �!/�����]�-Qk��h%������M�]�
��b��Y�9d�Y[	���*�[m�%S�7PC�p(d9J7����!]!�0;[>s��F�����I��oT�[PS�uJ��M)dM�o#����jANB�
t����F6U�s]YB0gB%f�����Q�@�	�}O#4����f9�UQ���YV����y��u*&-a �+�Z:@�"^�����nKE^�{��XQ�LJ��&$������"�]�]Z:@r"^r��� �V�������U����&���4z-��������]�wr='��:����rzD��uCN4k��D���jxIS**�l��Zt���l���M��e��Y�����������)���q�-:�:���E����\�g�,�����@���4����j�j�y�~{E��x%����/���|7����!�u$�Fk��y��V����MLw��H�,��������:����B��E��z!�*o\�^�i����D�g�mV=��y�]CX ����Q	��y��-���M6z�����i?s�ZYV
��X���Y������N���5����6��u6a�$�wr������f
����u�yq3�B�����b�K��Db��S�8x;/h�e��LL8��@���cA�.��3�D]�b�"*��+����I��+�[TM����=N���=�O��I�h&�hz�P��<�v����x���-W������P�)��)W�bJ��,���SA!5�P7��	5��r2�f���Tr(WPe�i�Ll
�5�3\�xyghx�4���$@G-�����`���>q$.a%�������8- K	#K�UL�x;�����EY�������Z&@mFm�mX�F!V�N��������+�������%/�\�iQ�7��L�R�p��C���#ZCy~��T�+m:�����]��|���b�4�������'W���@0=�?��Q��:����Z>#��y�_&/��n���K��4u��t�����S���<R��S��e�7E5�o������)��2Q���<����OL^���L��������,�6�n��C�0��H�$��_3-��(D���B�[��7����o����E���lgwb���4#u��6��1�
���M.�l����(E��m>��+��2 ��\,j1<=�X1�c�q4#s����x��sY���$Ce5��	*���M8TF�����E�&*Z6��P��M)����2�f�!��1����}������';�Ijwm�f�n�5�X���Zzx�.$h�������)���q~���O��\��W|���?�t�]S|���x�J����O|�'&j��\YX���7_m���/��mYug�o�s>5Qaj��h�����NO�����.}���w2��&uf�i�sXg�n�\�M���M�%��n�=5h�y��Mj0�>�����W��I>z[�����^����9�WOt<���D���,��*�$��o.�����^[�k|K���<�O�Eu��T�B�n��,��j��a�j2G8�i���.�Q�m,ci����r��EE�No����Y���������Z�]����v���}��s�/�k}�U����%���N�7�h��)�)�����W\�!dj�i}���B��,+j��������^��k�9�UO�OF��D��X�TU�A;���9l[���<���T���Ri��:A�d!���=�%N��|�?���V�G��*���I�d@�d	��*��b��J������|$7��F������fo���g)Ko�A/�Y�+
��>�K��p���A���G=�)9f�l
�OYP������`;B���7Z�s<Gh�>�V��b�������@���>}F�>���<�9��|���\d����gvU�F���qD���c?�'����&vP�G�2�s����@�b�9��&Gx�-�����SIfe�D���,"���E������?{S�%/[��b>���������y/���,��yF�+����O�?�����������_�������� ���8����f$�0�������h����a_��.�fk2)
�q����i�/�������#�4nV�����(p����2�Iji��O�:��Ahg�M.Z�����#��[��=#k�1��'D�c��`b��(��p<n��
����
[�X;(�2��j��v���oa�zeIw���������������^�3��-'V��!s���JY@����Z�B�����X�����z�@IxRK�����9V�L12���5�e��{@�|��������X,�c��0�\��:���������_i��VE���zeaPx>��|�2+�[>�n1�+���=�2��|�����c��:����[RQ�J�3����=���Or�Dn���Ly>u�:"�[�0�y��g���f��9���,8���7���~����^u��p(Xs�lyJ�9���[�@@8����p/V�Z��h��sx���WFS��1`z2��st�p�^���P�+��
�R����.���T�����&�����������������5������������O���$p������j��}Y�=�[�
�2�%#�-3	S�O��;A��y.GU��o]��Dm��}���i�?�q8r�H ��O����`.�������:/��/�d'�����8��@V����r�>o�]dO�B�n����0y!�0����s
�F��<<?���:<8;s|v~���������ztz$z��COwtK��p���Kn���2��I��-rycu������:��[��J���������3-���+�l�V����u[�I��������L������}*���4�>O�_��y�����~u~v���t��$�h��W�9^��aZ��2~G[�l����nz2@'��t�_L��s��|@����'d�����c;>�|t��Ac�kD�����}
L��7U>yol��8D�S�g
QpD����8�NO�.G��s�0r��m{��5Y��6MDz����E��Q�vX��x{/8�N�Nt!U�.�c�h�Ep�WT4�#�h���*�MQ=z^�S������"`y<���#�`���PO����]g�]�A�|���s�b��4���%����W%�?��8H�z:����7=���\���m���y�N���\
�j|��l������Wo�Y��~�����HN�\���Rac�^����}(�s}�.^�xz2@|~Z�TFk������<#�s/�P���R?�����������\�J��x�S�duzW�b��E��'G?��}��|w�����i=4yX����O������D!�``�u��_������l���x�W�����g�g'��5o��8���Ul�,\����3:���Dz�Z8����q�O(�`�AQE�Q�������A�v�OhQ���p�\b8j�*si���alc�1&���yy���������>�
��A2�gm~��}����������
#���_�]���}O3HS���GR���
7���t�=��2�J��\2�pI= iL�X":4a�p$����N�	�^l���&��m,��p2i��v��f����M��4��8.���B2�\3mN����Mi1e�Q��+X��
W���t��d���$7�Xm�����l�BO�	�c�����x����[��MUo�!
�s���{I��,�Aa��\@������:
�/�j�U�#m��m�%%���N����4tj����do����y�y@*����:v��Wo�_U�p�����b������������z2��1�<�DI���
�G�6������5]���b�X�s~rQ�t�T��:�(B�$���H}��9�,��
��f8������&Z�Q���S��;�Wu3�������X����b^��b������{L�T1ZL�s��@_l�?eS��
�a'����cb�]�9��^�n>�KW���5��[T��8����f������<A^>Z6������Z'.U������L(g�������h��.�7������5b�e��N��||t'�_�5�T���[�`������nP&��T-���K�����eEi�z?�B;*U�e�Z�0�u,�k�63���4����2
q�b:�g���
��$_�[�
��,���u>��;-�y���k*���]/���P��������R�z/�Q@2���(���~�F#�o�����X�\���GE9��D��Z��v5������oHh����5`yV��>4j�8������i������Zs99������8�Jt|�|"�7�����0�^�������h���
�N��
D�-fa��`�:��+0$_[}�?�<����-�d���|����3?9�����?���� 6`�X�\2$k����#��g��L����u��L�u
��pk�t�{��j�
����}/?*'�P��h���5�����`�����:>9�������g�[\WHl�!�z2@VRWYY{�fZK��%a51�eV�p\l��/�+��������4`h.b�������������H�"��x?��<�o#g ��_��|T������5��Fp1�d�RXxWK��	�g��DJ9N��[b��1/	8�w<���\�������)�������gT0�g�xd�Z�=���>q8+~����F��k�EU���d�U-��D3���-�*����z������Z�d��d;9�,_�N�P�����	� 7�4�i��������mk����p j��� R�H�>�r�S6�b������7vAk*�v����J�|���h�r��:F{pp���S�N]7�*�5
r�k�&�t�-�,f
�'&�t�[]��t:5d��5�����#��x��>7��6����w��y�������#�j�F���T
�6�fj��[�������*$��8��+M�bc?��dDkyg��P���=�'c�p��A2���
���9]^r.�����F��a����4���f�l��e/\���Z{�"���Vj���l,R��p���[N4s����BF�|~D��iXn��7t�zMK.����+�vC[pMz���4���i)��i��)��������s���uMy�N�xH��w���{5��U�����X�����
�����!7�����Z!�oC����=Bb7��i��A��=9��d������Gg����<|��u! sC[��g�������K1�T����������
9\WO�+��jUjto�'��
7t�pW���v!�nC.������
9rVO�SO�S����f
-w�����S�j5��v���r!���^(���v]�hr!8�X!�G�����������x��$1��2�r$�?�t@c�(��Q���!��z���L����0}�cT�d@]����U����V�d���y�����S]xLw}�W������.�@S��\����?��C���6v�e���9Q
eK]6g�u�dk�^�-3�`�a����n;�k��z2����V�A�o�W�D9�i;�osZ:�)p���U����e��,������;�fX��	/�o�H��y�i@8�uN���v�yp��w5��v��0v����� ��*������%�������O�\��l�e4�k;�Z�2
��+|��E|��! gC9�;�
c���u��P]MX.���o��+W�@T8�VO�I�������� _�{��V��a����X���Jbb�W���v��B��j��E�],���m
�S��Yq�u��B�����*`kh[�b|kh�[;�y[��qOC���cW�d�6$��a]��V�u��0���z\�KZYR�����9���t4��������� {{|���zL[��T=���M���@��J�!�F�@����e���%F���n�j�����?�k���CN�G2�71U�-��k:�0��������Q�
S*o���.F����t��|�Q[M��R�S��]L<r�w���Zr�t�B4e� ~6.U3����s�0�����<�y[�tFSa��B���n�C��vL����+�~o*Eu~��f7�:#h����S{���D��&eD�k�$R��T�E���;}��!���u�����_�qyF�[�u%$�k��F!@��\�X-@��)�l9���w5�3��]�L���[��0�a���:�a+����~����?�9;���(6b`yr��}w z�p�NF_�B�?g�<�V���l�F����
�[���J����;:b���+oR^y�s���c��~
]�W}���j�09�6���F��YC���! SC�����L���|�"��c��wxk�����B���H�tC�p�@�a��a�Yq"��S�t�}�	���S��,��KzN��5�3(��������6b�����;���������iC��>���6�.&����0Yz�o+w���Q�b���9B�%�`�0s]e2��g���!����e��^�V�0�� �k���6��������������E���8���/==��E������9���Z�}|����3Z�4�p�b�5QN"k�]�"7p6�l������p���w�C��v���E����[1��������Kd������]"g��9T��K����O��U?`V#�=7y�c���]#��a�Eu7��^�}$�p/�-fb��!���)Z����c��@�l7�/����8P,����Z�L�������}LA��M��-�����'�l������-��$vf�����^O����e�|#�=�m�vm}p��~m=w����(�|��>hw������v�_�H6�����:�v�}���c^�P�{�=]|.��Gq�C��+"j:nq`y8nh��2�7�/:��=��Z��X�q���f2��7IU��Ze��;������+O�6���m�y>�	���{�T��1^R8_�������m��F��8���V=)���[/�����Z�U��O��XcfF�����P���P��n#uk�R`^�+��p#���};t��v��X��b�����Sy������]��v-���������b���MY�s���0������\���+�M����c������-o� )�'������6�"�7���� �Q�j��0F�6�z�P�Q��f���>�]'��.�G�����������8���E�=��t����l9�����qY���x���K���	���������\p1q����{������Lwx����'t"����NbD��,�/��w`#��F���>_�8r��;�k�c�.\n��6Y��2�r7Fq@G�'�"�'��H��2�`��-���y�
��(�F:@#���z2@#8BWOtv����waF���X�V�SFHV�W}����/#x6b��OY���8��\�@m��z2�7s��i�w}��k������:��zz}��P�[t����q���vR58�(��BY��G��<�����I��Z`�K��Z��F\�T�l{��������t�p~�^L��r�${�T*�oIT��7�$W��z8��66���;�s)S2kx���,/�{(	��n �j��qQK!�-����<��4��qqY��E����#A�E����2�y-$����bD`E 6����5�hTe�.8J�,P*�%�e�jg�6E3����6�uM�[�4��R= -������p<>���vD+��������������u�U#W���:�@�we�H�&Xo�_o{]��zQ�x��h�R��n����������]�27���@��{64rdC�8IY�CF�����8�S�9�q�'=��������q�����LgS���kg��pv�Xu$��~��L]O&0e�Sv,7����j���'��v������j%��?��|1������u68*�fl�7���B_B���m#��L.;�>8����YNk��
�MF.�����0����3v����W�H����H��i��'�"��=�S��x�Mf�;��Zw~-�HN\����l�W��j3��������
cL`�
M
G����W�����9����������!��@]��������tFn���%pf�E>m���2v���o�����i�*�j�||�]��
p����g��8�N������>;���`�Ksyr���"����db���\�����~�~�sC<c�x���)�b��0�m{o�9�9�����YI�gy����u��M��P�q��Y>��4�& L�t�dH�0,{�z��dn����^��d���Hk�����������������Z��z�
�1"�����P}6.�wt7iQy7��u-7)���B�*�T����[����
a�R���QysC7(��au'Dk^N�EeDK����1\c������lo1��]�|�e���I
v������`����[���o=�a��?�����Xa�?���?���S���K������5�����_��=89�����t����$����Q��5Zic���NK	�S���n�W�h�ih��\�-H���K�
8�
B����\����v�7e�H�2�Cq���[�/�����!@�����!�m4��|~#������0�����)7�c@��!�k�����q�t�����q���
��k; �|Kji�-K�1@qc.�>gm9���e��2�������l(<���������������]�����x�8r<_�o�F����Y�tWADy\�+�1 uc�k]b�n����Jw!���|�D���.����3���d�+�p4����y���js�z�d��D���������6v�mWf�c[��is�����xc
:zstv���-Q�����1������n�+��7�����(�y�����b"�r1�Sr��1
@��kP������/��j���Z��Sk����~��;|s|���k�@L����d���7G�B {��1z\K��ub�T#���H��������,/�����s�*�G}�u������M/F�7�����p$��y`7��]=�(���:�7�����Q�E���|:��.f�����q�+����FF��F���\8��Zz��M@rl�.��\�� ������9rW%���wcK�^F^�7Cw$�����h��
�I5,�;8�O�^�;#�����p�H����N����������?��?������$����eW��[FQ-'��\�����]��+*�c�FQG��f�**��c�h���+*��c�?���c������u��b*`������UOP���>���c��m��T]�	���'5�����#rz���	�;W��X9F�p���fU��bL����V%M�p���#��$�!�S��X�d��������7|.�CS��p����j9��RMO8��o��(`���F�u�`-�Vh���K0����-b$X5���T���rG��9����w1����]�:8~��3��I���P���1��c[@]�O�y�����M`rt�=���N0S������W��z���/�'�/���7`r���M����67�������1���LQ�>��#���GN�@��	f��FGEh�jx�8s���4M-fx�������������g���{�FAYqc�c�T��L�V��i�%�_����)F ?��u���B1��)� �u�F['��N���Jh����p ��=�}*o1-�|���
�cZ���#�V�@��#TM\^����F��
�Z�R�_6��������:��j=��$��p��p}��/���'N�Th�d���V��d��V��;��hw����r/g����'c�d��*8�������o�	��v�I�\�����
��������c���
�N��8G���2m��.��`'nv��]�$���r:�k\<��v�pb7!�/�Wt���=��:��%K��4YT���+R =���v:ql��`�)�u�FX��2KP��g���(�J�\���WO�z.@������== "d��a���B'�8�p��e�+ub	��ts���W��$��;����m'���8�}%��NX.��y�����	�g[v
,�u�����/== O3�'�'��������8@�X����!�p�����q������~�P�	G5��}`qd��TF~����7��)'���\?��|R��$�x�98�CN��	@���}��JTk{k���@<XL�3=@',]�<�mD9�����@��j~�����%�����%�N\#	�{�cN�������������^������[N"��'�Q�����.�0h���S�=�	@�K���
@������������Pl������=|W�s����b.L��	�o�O%�X�2��pJ�HTr;����uss�s�j������H��\�@��[����_��9'
P�	�X�3'���49��d�nMq���1����;�:q��w9/o��f�r5|���|8�D����`a�m�!(�qMh�������q�������\�������r�.�����'ob^�d7��M��v�rQ�w�r�MB������6����;���W�t��S��{o��NT���C������7�����p>�v���D�
�
�,o7Q#	��e���[��D�+����:6}L���Ya�fC���p�����	`���q��X���lr0<r�E6��C/N,xq����I��%ENl(r���q�.�qbC��)Sg��NE�Z�h��q�eU�}�.�q2p<}���[���y�p��QNl�r���t2��~	��.���~_�@IN��'6`X�n��n�*6.Ik�M�)��H����&=�q�1��U\��K�������������_�x��g=�-� ���
�����as:K�F�_�H�XK�����e������+�=8� �I��^������7��������ROtr���}9u�[t���m�&�0��uCD�N�&��M�X����[��n���M��pQ��d@�v��_^b����_�[�e� ���GsG�Z���^��	G���)p�u;���SS�����}'�z��������������	zX>�M��k1��3f���<H��/�;4���UD��y�	�6�����������=T���nU��;������4vy��<K��~�<�m�H�n����M2G�	����T�zA�����K$��]�0w�������80fK�p���;�4�q���`�7w0�|�1y�������%CV��A�E���j��(��#�������i�������>��I��������f_�w������n����T�P�[x��i��i��������u\�!���s�����`y���^�	�!|�
%��.Z��ad ��]��q1�����0m	(��A�	�*���g���''��=;|i�(�N.�n�e7nQ����9���/�����}��h,��8lo�;�y6BX���w`A{�!����;��%��������N-.��.3��w�=/�r����"�[������a�\��J���WL���u,�_�w�)��47���>�����
c� {�����`����)$�H��;� Lj<&�tVVUq1��?<�� ����q�����MQ���}1���P������D�������n(d��@��t�o��Y<����[���TO�
��3�Q@�������I�`~��'��
���1�I��8���\�b=������\�D�\�e�
*0x� t�i��:���	�;@l����&p����,���]����w������D��FY$w�gx5��O�O*����j�s���J�	X�����]�j~��
DQ�������
�5���&o���v`#i���:�i�(�2���W�����p��{���w�J�vf^0��p���[x��c���ml���~���}�v`�l����eZ�A�d ����7��8����D������>�����n�kuB����	T�cu�d�;H\#6x���P�������[�������fF��;H���B��nw`�6�=��������~��D�a�:z}tFAz_����w�����%�b:�]��|�i�c1�xuU�_
b�j����z2@�8�}��dT�����.��zHh�M~�w�c8��+'��zb����_;����@\��8��ofg7�Oo��Z�����i��1PH.���PHW�x�~�f�:����V�^H#�TH���F�������vs�P�u�|������6�x�l)�����@���{�;�@9D�����
e<�`�(g���T"�>�\���f�������P�[-nL�L����Y4q
"0��b^�Z.!��e��
���/�������?�5���9T1�E.F�F��-~����3u%����-�q�#=����8?���n�����3G��m���������:P/�����;��\�c�y��@Zl1��f������V���n��`��6�24=��/�m���}�e+�T���8�7���AO�G\z�����`��=���J���K�n�F�9���/��WSk����W'�Go~:>��Gf��� s�e,=�����V�E=����]����&�G���)�NL75�
9���H�������tzx�����+�.��a��[�{.����N���C�&��y�� S�2��,�VO�O;�������;����n�s
@��o��h;�7�NG��,Z'������N��rA���	�	�NsF^��p:V�wt����Z�g���+|�QN�#�2@�����J�.Y����<Jf������t��g�O�v����������:����i]Cht�~�<9~�(��SZP@2��$��^�5"�R��x����c�[7�s�;MwR*������Jt�W�g����
����d�8�����>pmI
���%x1=z_�r�1����_������`N���i���l%��r1Wk%��\
����r��E��G�iafN7�Gl�2�M�R����0�u�R
��4�Hk������k�@��r���}X���n�3��%�=��^�l%���#���9Z��Nm�����SLmM(a���IkyY��Uf��XTl�_�ZwC!�Z%c�� ����Z}�,k}�-������>�H�@6�rm�s����xl�\�XY�����mKU�rT��P����t1Z��9�,���<�a�r3]��P(OW/�k�����q��"��Y�y[e�,�*!������0M��Pe8�Rm���n2�5`#o{�B�"\.&�;Q�x�7a�(�"���K�*���������
����J����u�������)Jw7��rE����m�I�~C�h�a7U�O�z����V����=R�I>�����
RO-HS�h>����������L�����O�O������|T{[
�1C
{=/���jv�tW���t](�io7�[5���s�>�]��e�vP���J����0�F=cm
�"
�{���5����
E�b�u�����d	�|�}=�a8����f������~�g����"ZF@S��'xL����������,I����{�t�.$���V�d��<o�~��@F���j�V�.�
|�@F8�H����/��[��}��`1���F;ua	���@���;\@�r��%�����C"`�G&8Lp�1��&/�������~}�]�.�}�%|M�`���������P��O\�.���)��=V�xK(7���+�U��AT�Cff�:���HV�����x���k���3��|����5�pX�����s�����7=��n�i-���� V# �@��M{�Q^����4�2��^����-Rm��@�\]# "��1�6��F5�����pFM�o�����<��i����t>�}_�6�Dj�����tXX#�n{��#�^9/����J�kS\�y./$�,�KG�����UAm��	��5��6!�2�oy�����D��X#��N`Q�lc	���*����������
����|:>��������m�����
��i&6Gp����v%
9��H��)���#��n|�=0l���aC���d\\����\`\#���{;6U]���r�����������W��U�$7l�$�2����o=v�����0���]�B�5��X�5w������p�!@jC���!�dC$�|�{��m��y����q���������%�w�j�������
���I�.U����X�����d���\��Q'�n��F:@gl�(��T|���������Z�8���
���v��!��	
���p��9.3�l|]_���w��z�u}!��C.l��@vC�5\�@�!�.��N_������B@��l]�X�������������Z�V�����W�bZ���x��:����������o��`C�n��YT4��g�T��bZt|N�yW�\�98MJ&�a�^bYnH�(��WN'wb�A����hOL�o)D����U���b���I��������
����.��0&J��B,D���6�
�k�0@sC���oC����Aq���o����r�6�Z}�6tlh�p�{N��m���j���1�m��5��*|zh�|�^_������Q	l-��_��"��8�l�����|
C��W������gMU�v��3n6���k�On��soK��7(�V�v���n@WC]5�@��,��t�����u	.���i(������XT���rv��������!P��v7����4d�R����d�=�xB���lTWst|gm|�q���9�j���������:��
�!�����vv7�Fb�i��i����sX�0p��l�Y��+a�<�
3�����n��ti�>�����?�
@/C��}���;���^��/C.�����Lin���B��k��)
�+�t�g}��q������B�8:l1�/��
������&����@�!�H	h�%����7�y��0�1|��~�w4�S�cS����:
��P�}��~��9�y�y������QS�����ZNb�Sd��C���!C[XSQB�������p���F:��&.�i��_,&
mq������z��O�c�8��D.����EW��9��D=lK�&;�P18�t�������[%Y��p��1�����L�����'����| �htz9��Htry4:5�<����������dP���@j�?@C.����C�N�Lp��ng��A��b*9�)�C�g4�*��%��I^-&4J[����	@CM�gm���pv8����P��0��D�)�b6�%�9�o��9wXbh����6Gt���g��bh��yx*�%z�s�0����j�)K��6iO'���V�����t��2�6cb����! CGr0�`�D.�5=p}d�D�0t
xDB�����h���$��a��l�i��&`�!�	C'4�}<sY��o,]�I/4��*�4�?�N/u�t�&��zKZ������F�EU��_��w�m@V2'g�l��z��5�-8W��2
u9�[������-��>VF�<��A5�V5�`d
�I����n��Q���.r���C9���~]����bL����ecW��BS�e��R�s[
/s����������8F�/�����^0r���vy��CF�������D@����U�#@F�H���&�a��4��+Ed���|�;����T���hQ�C�����"�FNH )S����	��s��V��F�e�:0r
��f��V�"�R:� z�t��[�RV2�+#@F���f���Ykx�W�9����&�!k/�,[�O�G��gg����O������`#�i�b���2X��r�o�������wu��Q;Z�{k�DZ^���W���,�c��4 b����H)�-j�r1��V%J�NL�h�~�Iv0���p&|�z�F����q�M�$c�D2���Eu�/m8mL����/�����qaN�t��r�bS�+
y���F������P�x�%�(�X+�|�
�C���\�����K0�Q���R���c�Bo�T^�;�IC���)2�i�e1�������J�B���x��r:EL]3��,S��z�A5�ed��*:�O�}z_�C$t���Xv<5=T�I�W����W��ZFN�U���fb����%��~�'�
��_� ��Go^���?9;"&����������Z����U�����	���\��H��� A�M��jO�C�1Sv���/�����s�4��n�'�1�0h:�� ���^��s9+�'��>`L#c�A��
���<�x9����A9�<V�dq�Z�t��q�(=���H[�!n�?��x������U���4r�K��l�n�;SYN�����w�������w�n���������)��$�]���,�v�|��3��U}����%��D���<��~� %��)N���`�t9�Ln�����,q���u��8�6��Fx�n3!_>����Pzw�Z��P���i����"�N�a�\�@
���V���������>og����F�j\�G��z��~�f+����P�`6�h�Y#�g5&��q��q��*��������D����E��q���brH�(r�c/�����c�w����|�l�+��e��j���#���D���8V����l���s\>��8:/�t:��w��Y ����G#�F�JIT�ceR�JK[#���	�52�2����d�T�[��CH@�F����5��2��������7jd��W���I@f��d����6�HZ�:���8
�\^ld�`��2Z�R�����V���!/<�Q����-:!��Tw�%Y_�����3[,jh��;s1
������\��F	��[<�F�������#3���^l��_�q���8��h`6q�l)$��g8��3���D@��;T�6���H6�@Uw2
�^;���pq���u�(p��W���]e=Q��O��=�����9��j����b6����C4����S��u�\�����p�eT���6�-��V� �����6F��l������A���Ts�����%r#���k!J��;=�r�Ix���u�T�kBw;:?>����_'GW���W����w''��������/-8��N"��F�k��Y'���8/h#�(Wg�t���eg8�&���H
 ��sI�%�m.�U6�3F����@Zn:k����W�3�8��j[���S\�������Z>�O��j9�����\�8r���.?�a��S��rdC���u���
e^�F���0{w����@'m0����-n'��N������W�W*��ME[�����������z�����p�'u�p)[�Mn�f��G(OJcjC��2�����3Fr�qv[z,�Lw�cd���bG�����#��6�����8���f�����p�y"QG�'�*���u��:?>��{}��W��$����.���f���g�j��8r�Q%��������
��Z�9jY���V�m��6[�b"�6�}�P����;����s�������qg��N���e�S�����o��Xs`1���l\�F�"�����xlc��7=�5�;�����6\]��r��M�E���p���Rr�Z8�\G�����d��AJpk�{������v���8{&7��E���:��
�i���������g@�������������Vm���E��y�`p��C��M��@A|'�@-g�WP��	�m��I[�b���i�gN����Y{�2=�0�1kSZ�G���5f=J7����+hllCc���{���Z�N?}�WH��B�ZC�71�P�m��>��`s��~����(��e���������,���C'W��:�����]1�FcGl4�h���mw������H��O�%�4���|�.������i=&���.� �1���&�]r1�_c'�U{���K.�k�D�6�,��b����Xl���uq���5�l�Y�VJ�����J�GG�1-q�s)���Y�&W��R�orra��\��z�FP@-Y�WT>��Ht�hs�
@��1���W���w:�f���^f4�b�Z��pDo�*2v}p��-g"L�:@�B:Z~ ��}�4�m���VY#{2P���W�<���f�
���U9X������~�3(���[U���j�_g1��P��UA&��jx���E?����7��^��|�7v�y���������ty�Q�*����	��E�B����=�������<�R��_��j
��p�,7"�V+�����t >���qC;@���K���7v"������kQ-����xQW�5��Dn\U|(>��9r�����p���kDN��J*>�t5�W�=�d0�~����	����7�q�D��K��!�����b6����?�|�/�b�N��y\%��5���;1M�6�n�E�W���j��z>z���`��������!PB�V���Q�����wv��������������7o����|����c��v�E����F�7�d��y��7v[
d�t=)����9-�#7H�u����{��|������9����?g��a���N"ns���e��d�l;fI�����hk�:@�����3?�Q��������$+	��9c�:@�����`��;vt>��'v��W�|<R=�l�����m�6���p�F6~&w�@���kr#R���d�F��i����[}UI|�����8u:'����_��9X���0
n���KB�D�2�[��_��%�-,�e��3�_�,�S���<��`���"�z��G�T����Rs�P�ZKk#�[\0b������&��*���`��E��e��_���1�9�����������<vD�c���6o��}1b�����m_�p�1�������yD_�`�qw��m}1bv���_ �qo�����1 �c��6����$��H��b����9��J�;��n�:@8r���QrT-�1�c����p������!�6�[�b���9J���}1�m1@�c��5������7���<A�� ����C7I��pp�>��<��J���>���j��&����qC���ES1��$�XOl����_�����Dy������V>j�n=�q�u��F���'WOl���P=qt�N��t\���v�q0~7L�}5�������WE!oO��<���-{�����o'����	G���dn�x YE]~-�t������T�q���=��$�O>��Z���t6������G��%C��F������Q���G�[�as�����Au���nPm\
��oe�N����	��+�n
�f]�,�/���XhD>.
y��������f�%���b���f��x]O[g���&�<O���E����a&	��'6>���4=Ah�'�z'�FO���8���t��=	\g]�����l��<��ky��p��xv���������.k	 �������$���D��_�������L�@�GT>�|bC���X��)R���@�'��|3;
c?��l�i��)�L�>�pz�,��	`�{���$�O8fJ-K	������=	���G�y�s��I�"��w�&��+��k�X�$D�4���:�yef�I/�<��G����%�_o�}��������������Dx���`Apxb�L�Q���;>y��'����$O�$y�x�������e�{9�Y.&��$�"O�y3�|�f�c�sJG�X9�*5��V?���k�>�}����qC��p��q�,6b�j��x;q�����Qk��	�f��N��f�����I��T$�`c@d'��*P����R6�-�0'���z�������I����o�q%P9Y���$�DNX3d�Rx8��6MO��8��a�C�l����$��M8�c�������nc���6�����:	�k_��J��6a}�����{��&��1%����6�ZJ�f4	@l�����r�eGZ6�l�hi\VkJ�&��f��6q�jE@A���8|l�hr��5I�����-lb�&�)�qP�S���Sp61��
W���2�������%�k�
�nb�${�`��6~�	�PD��=>��&���Y��&�yG{���	K��D���3
�n���4�0����$6M8���1��i���u�Z�<�y{��mcl�&]����:�b���4���������`��4�:����O�gIT��B�����Y@�&��;?.ig{�P�	g/���Y��&��Z�b��6�m���,�,	�m�5�����2>�V���	g�l\����f+d����s�����'H9�v�VN����c���igk��H��r��u�����Y�)Qut\7%��M;[k�g���T7�8�&=:����������
��r&���7��p�����e�?|p��wRv7�*���FL��l�����`m����p��	��N]9�ZS��'1(p����u���8���$��M]q�M���'1�n����M��;c��'1p�����n�=�)�S6�4���z�(��IL�rp�q��T�������J8�(a}�)�p���u�:p���=���}�����)�7�/;���M]�Z��/;�=��FA��y��[kr���V7��P���d����7]# +��/8���MC�AZ��h�S���e=�e7`
���F�:�Lf��0[�0�m�q��u����Po
��n�K���ZR���i�^
���#��q���<������.���:�`K�b�\
����#:pj���No�L.�E�e�P�i���z�f��� ��
�5������P���8�Ei���*���u��X�����(��`
x�4r�4r���r�:��[Q�<�mq�F
��4�H��w�m�S����[+�&UZ7q��
����O���8eI�z�w%l���/����}	)`�S�6Cz�}	)@�S��,���R�'�,����@(��l\�Bl_���%��4N������6�R�(�������%�aN��������9�f�:@C8D���R�"�,�,x�@�+efPj�"f�G!�7��ds�E�E>�����n�����O��@�g	�}yO�dV�g�NoU����UA�,�.���M���Am�7���)�:�{�E>����=xB�E�r�e�lt��_WtW\1Lut.�s�������)k���1�z��lQ�OE���WM�s3
��|�2	�1@�)1��o4�7.	*���Q7��+��n�����-)�Sx��u��(��N9_.m1�H�����r�M�5��������\@(�6B�<���~���HR�,�6f�o~9��e�:��;Y��+����^*j�/�mfYT�����]�,��l\(��Sf\����.�i�r�/��a��e�
���)n#�Cu�(������=�S�o���b�������r�^4��
Eywvzt�?=>y{uz�����gg-�}@:���s
H���4���o�������:��������jN�*g�����)��jUM���b��N9*z�Dk�2�5�o4*��NmT�s�O��[�Vr�r����fh0���T9������2��Z���S������S�6�$�	rnh��^����l
��S5NQ���)g�k�K�rx��5%F'j���j�J_��jLt{[�Kc����������S�O�������q�"��h��[9w���s-���;�kq�Cz-�}�� p�3i��.@�6g]x{�%���� ]��.��&�������l!�}���n��t��Z����R�W�$JuC�e��4F��]7Z���
�J��X�e�[��6��u�|{����$X���7����E��.����P4j	�7�w�����9�J�����������W�>����*�+�u4�~��x$!����.�u��E�\l!���:��]��v���f%�5�+�h%�������e	����q[�R�������Mt|�����2�w���|,�
Tk 2n��%�*{��)�ah�l����X�i"���l(@���_.}7�F��r816�`�b/�
1����
k���`����P�]G��(�.G+�0��C�e��kji��&Z����i��]��_���~�]g����0����XK7$
P��`kI;��2,�}�Q��x dn���>�f�.����rW����������XR5��R����m�>O��������pKy�g*P�X�������xE�8�������u����Mq8s�fL�732���7�>�n6�&��"v �����X���R���6���Fb�+�Z���sOu����|Z��
c8�?�������7�d�Z=������|e����|*�l8(���.���V�bS���
�V-m��^-��K��V�-w���ojC�d�\@�,��������S�h�w;.�vH�u���'F����f�������#�����ENF"�+������-R���F�������/��o��yE��S?8�k�����+o��G0�n@{wm�w�W �]�E�1�p���+�k�d"E� �{�c�c ����y7���n��`<��U�����k#����. ������]LA���rF��u�F�[kT!�������?P)�m�
`w�M}iiq������{��@b�#��D���2�Z��l��?�����.Go;'(����:�Q�����$|`�P�~Z� ��6��FR�(F�v����4��;��(B������w)B�����6'k�d?�,��6�&�=)��Y�4���Nn\�������H�w|�V���8^w��ey�Z2G�����w=)�+��Dc�r#!������$�/>��K4��N��C=�w9\���jq�gEsNU���i�\D�?�LJx���W>����}���=6�"O&����)�<�����&��p#" BN���U.G�52�N@m�`qs���]�������eM_m3�<TC5�<��V�_�������k�b ��6�����^f_*�!�.�"C�*9�5tp����e��������-�5h�]&H�wX�u��,�:����m�wS&������v��YM71�P����v����rf���]<7�$�#���b�H~��N���jJ�7��,��]��w��s������m��������${��"l?������p���QG�]�w9P��77[o�)�kM��]W��=j���tz��N7z�6�nc�-��P:�n�����O�L���osyI�r�pY��)�]�w9��)m�����{�r��S�����H������4
��.K�k���}���]Om���rT�����l}���`�{��m�����=����B�����Oo+4~��n�x�:�����������oHz`��/ V��z�F6=��l�������h���v��#���G�������);������{6x��t��\�����u�0���z2�@�=�HdEVN6���e>����6\-Z&�z����Xy-���KC�<���y���,���fm-R����)�s�9y��,�6��l�s�������Yy�Us���[W�z���qf��I�m���z���9�^��������q.�m�����������t
�\x(����L�R��>����3=��8+m&������������� �{��U�Py&��l1�P�v3�g�4�u?[|8������Rs�
���b6���=��l�}KTto��3#�l�y�k��R�X�>�fN�h4����A�+*=���`�q#C��������������m\�33�*P�q�B������5^�R���Xm�{��qf��u��9�N/�1�ud�z������b�7���}/E� ��b����@}������R��=7�r��
?�Z��/���c�5�]<&{��Y��Z2�\��Y��Zj�a����jDn��B��9��%��cs/�Ru<���u�� ��Ln\�_����qu X6�q#�{o�����p��",���~�j�e�s4��m�;�s~gS�g�
�x����59{)������wK��F�����(m���XZn�Y��X7a��yogtd������\�����p�����+O_���
�2�Jd��[��-��|J>��M�/����j:l���g3>7�1k8�0g~����I�� �{6b^O�����L��OG���a�kF�x�w/��h�[�"����NE���O��3�O��E����{N�8��Unh7�Z��qx�q ��s���&�Gms�4�y�5J�������f�E
�����da�N!-,��6�x=�K�#���h}-]L�S��v
��!�`�@��M�y�����-_��S�u���x>��,�����:n���(k`�@���7�d��������zzQ���������U�������}�K\�-��kH���-7�]
[��k��
{`cC�ic��1����='k|�����;z�'�%
Y��il`������0��p|:6Kn��Z�6f�_[����9��`s�����������	��X�o��0m���X�M�{`�D�u�%:�O������w`SE�qSEl��u�X 6�����V; ��4�C�#
�����������O�8�
��=��|b5#f�(���	uC�(�r���2��='k��nS��m�8��%z�O��Z]���:;��V�^�����;��������Y��
�b�z���=���r�>��j�5*���Ql���>���fh�{Bz�-���p{9z=~���.��'qy�O�
bI�q �(��NSx���u�U���U��1���\��Pj�m[Fz�[F(i�?�������G��S�%^�b�M��Q�?}-neUF��1�m��Nqg�T��;u��V����&X�m�RH�?�P�b�D��}�/���$}�H�����U������VI������Zu�>���x�����e�.���v,�Qv3X����R�����Turqq~qP�'ZM�pv;��3U�|�};*7�a����P�\2����zm���F���e����>w|���;o5Y��)�P���0�����y{�����9��G��!����u�g�@��V�����������,����x���vOL�����F����[7�`A�f�\����{�����|>[,�k��������WA��6=�G��?/pm���Z����jJ.�{����1�t�hy������h0:�����6D���3�������<W�����O	bA��1*��0P��wD�#9��e���X!�n��(����aU��ou��'��3�0����8o�h��t�pH��&.piJn��hF����l�g;��ny���^���F����>wk�����m��fK�.��J�g$�z0-������!��V�w���DI�k��'�c�n�:������g�b����v�[
�s��*�l��~
]�k�n�)r������%<�V�7M��@v��
�u�x8��'�����C�����fNpU����>�e����7�������
g�����H�7��X�E_P<.�������I��N�!���X�p�����F�a�6:�G���~}�5t�f����w�O�w�s���������BM��Il��	R&K��l���s<�`&��=}�0������,�m�������J��"n+�i��h�w����<�k�
2
�i�P1����j:�I����I}������������Opi�gH;7��94��(R�w�`x��T��;��S���=���.;9O��i�����w���hj,d%���j���,���d=}��7�)��k�����t����c*�K**���_��������%�#2�m�����l��gr\�g��H��{�|��6�����X~�en�tc����""��l���>}�[4beO|�����3�v���l���Z}u���~���'7x(�~���09H��m�����w�'z�j���@{4^��.N��p��������������q�����	W���<}�e/�N����P����c��\\�R��F}�F�{����d��
���w�"_��R�Q��_;O�?*���\�x���KU��jA
��w�:@�R�Y�Po���,����[��\��`V( U�N'�U�\��T>�1��g[4;�N����7P��(2v��>w+2v��>��L%��O��u�ht�6��vj��;}�����4T��w��1���te��~��]��w!G�f�l���sWbD?��^N����}������:r�����cF������6������{��Sy�o6���*h
c&P�?��u���.��<
��`!
��f@����7}���p����S��hg��s����NW���R:��������������'���7��E�� ;nM��=����Qk���QkU�LN��[L���j���CD
l�����<;M��)�y��?������-�i@�z�x`[L��*���pt>��}g���1~����`�����JV���lv������!ks�<9�W���Lh<�m����w&�yudg��@���;�g���y�N3P.������1��������������h�bv��9����]�|�lt�|�0�V#}�_��M�I�UA������	F�L�6���d�x��e����$gcE�1���:���|����@��\������l��nV��S�����+e�w��r��FD@����~�<�����Y��g��&�6��������:��
g���f���?���f�����K����������������BC����9~������4�B����}���=W���3.��
a=	c�!|@W��t��j��?��p���}\�D��������u�/���u�+����m��F!����������<@|��m\��3��x����������(gRo\T��Y���d2���f+�JN���4�`�@@�~��V�<������%��Y����Z2����d��]��7�g�/f��p6^�������P����
`�}�]������hc~�(c���<J{~�]o��2�e���d�3/7T`��V�~J����nXRV/��_i�����~ �����8��c��q�x�Bn����#q���>�S2�
�}������N���@���\{2��W���6}�
���p��Wy[����|j)K�#�������`���..�����7�W/_�_\��< V/����A�[����^�P4���������f��xa��\��w'�/N<���������������?�������6�Gl�z����
a���;����h�������g�{�@�\���7^��j��`��N[^�'g� ��H��m�}��7�8�G���
�w����<����Av����7���}'JX�:��-��Y��}'0�5�{����u^J�T��x�G������7 N���idJe�j{o@P��Kk�����b[����@��B�p\���Z��
g������]y����l����P�w����"+Z��j}��wwa�;���6�a�:f��m-���+�kFbw@v\8Z�wz�V��#*V��?��7 �+���a��=�m�
p��;'�Gag@p\�X���/N�G������zk��[d����&�j�7u������~�� �w8Y��Z�������7�?�������Y����7���8���F���%��\Y��E /�^�f4�w����W)�����9�X}�cm����7���	��
������y�hW��v��@n�`�>�\�+<4_!�\}g��>�R�����Z	���>��
�^��[��|�j���%�A���%[^"��k&��<���=�����g��~��w�����j���5�A~��5���5pgP���5�i�P�A��0������Atk#�@��
�F�"3I��'�`v�	\(�*�e����;����]�_�����xz���O��� BD�j����{�X4�~���q���zc�y�G}��a�C�KZ�I�T����lC��:
:.�$U�u��pL�q���j``H�!��nW��#ZT����/�e��f�*����(��+�tFg���v��^p:�ME��`�����$�%C���g��R�!��C��'��zK�����L���*�f�*�����k����j���'T���C&�\����jbP>�??<�N��2��{3 �?����w���#o���Iy �'.0��eV��J�Wa��_��H�}���1<�~����b�P��W�^a����,�UQ,��}�����gj��m�PL%�9��zB�g���n*n���v�%�|W�;$�z&���,��cA���A��xU������,���j��G��TG�[�pi�$�����B��_��������z��3��B���^~�^U����0���V�<�V���k�;
�������W���
KM~��f����~U��~=��^d�w���/[���+�m�SH��#U[����7J�S��@rFrt�)\j`�RGj�J���KR����U7_���Q���D�uz/����k�Q}�r�@_�������d��l���G9U���r�����_�C���1�p�f��n&�Iw�F)-���d6�o�������bJ#�E]cx��K�z&O�6p��
�n�>$���]�i���2/n�#t�&�P�L������F@��G�������[-�8��
����g\WM���SE�B��::�@r��Ig��=���(����x�����r���A&����� r��	{���Z��_W�a3UWJ��	@��R���f�eT�{��@����8p`s�d�^&��b}�b�a���	����o#�{�3���r�B��YQ���r\8���\16T�j�(�tFT@X"�1j$^Q��b����!����n�j4/z����<��e�
�����.�j�C��z@}��M���7@����1�o����U�����/�D��B�U����L$p�d0�%��e���!3svh�>��]��`���rB�
��p�% ��z�@� �g(>@�:l^���yp����<8�N�����oT���~6�/���C�[�p�0Jdd�6�T�}�f��A�O�^.��K���t���BM��L���|QY�/���V�`=�z�%�E��p��e���A��������4�tW��"�n> z�%��W 7���j\�<�q����G��5��������3����*�r�-&�)���:{��47pp����~��!����dE0�A�vhI����LKU�\����-�@�N]w���*`B�����T(�X��^��g6�i~���N�����6~��:.�D7�Hf�	���n���21mz\"�������H^�:@g\-m7O���9i(������Kl<<u��������n_��(|�w@
�%&�,�o�!��u�:qP�q�N�-mg���TS^����������l��E��p�����o'Z��W�M=�/�.�M�z_���T���|BNO�������R��@�_9��u&�p$����n�D��[����Y��%��CK�<�#[���8��1��]D���
@�A�Ep�C[d��Uh?e�#�e_~]P�A�E�a�,%��Ao�����#�E�}KLV��r,�,����\4�&��bc�@��	$^ �|p��r�7��_sX
���Ez�=�G�&1F�fs}��7���������
9X��k�����`dS�e����0@!nY�11�J7�0=]LB@����E��;��l7�Z�vK��v�]���=S��/)E�I�!�kC_KOy&�~z�y�q��'��
M^C��[����
����4Wk1-��
����3q�����C�D���c�nL^u_2C�k�-[
)�m�o���2��wHS��V��-2��^��aVfB@��,}�eR{K�%�n�d��p�L�gg�G�'W���������������XB
r��q #67W���5diWJ��������o�'3a�s��*����m jm�f��/i�{2�k�! dC��UQ��)�~��e��"+�%��R(��h��Y��z�p�����
���MM.{�UV>Fm�m��;$Uq��`j5�vC������f�7��xB@��6B����/>����4��L�������ns[���j=�)��[�	�_�'Z�c�|���:�:������Bk/�@�8���7����P)��*d���
m4f$m���v[�d)�/���D����:L�7�<h)}v���0���uZK����C&O����w?�������q����'W�\n�q�U�������QA�r.�m�gS��@B7�_���9����Fd'U6�x��z��N6�n�u�r6��L�	�*��jbF��	G7n����Yw�������/��o�N..���������
	�����?�'����
9��P
7�V��<��~0^?V'{��m��kf�V�g��2!`qC�����7�V<d\���1������&�I"��hhI"���<LI���S�p���7�8���p9����Z��C	[jk���8�r-4rC ��@ N�b��K�a�N��r�~����K��}������s���Nm��m�F�&~l8hV�f�L6t�d����K�����_�7��ays@P�w.���2�������*�������7�ZC�����������'!���z�t6����MF�H}��#"�l��f����eZ��P�a�j���b�j�
��W������j��47���
]�v)���!�����p��27d����9>�XV�f@m8����*������A����us�:��r}Y����#��b7t�����6���c~���7tq�����iA�0����<-�T�]��e��Vm��@P8L��.�?}���0u>"8tr��>k�\��KO��:����,Q�q���j��q�����6P?'�D1��r����D1u\$p�~�n
�c����6,q������o��Z�z�C�}w����o��b����7t1������:UT�m�]�&!s_�]�\k�'P ��R+Mk�H����������o�^�D��D��D8��u�@|���>:�Q��1����e|@��������)�&�P���XN�'�q�al���S3��
%j
� ����������f?@M�H���r{h@%9���PIG�d��������r���{�$��Hr����(����	4
��!�+�'}V(xs��YE>W�����{����E��8<�-�����2%��CG)(t���\�w���#�]��m�RL��V��6,�������yo��D�A�q��D�IsK���c'(Tt�dB���m�]����l:�O+[-#qG��_�����Fi��B���w�SJ�v���;�������]7|Y3_�n�lD6�zA+���|;�w�	�'��YW�����uWnIF�����|����7#_��[|!�����|Vd7+2��o���(��Tr���C�����]����+�����k&��GG����}<�<����
\'�[s�q���Q�BD��l{��q<6���q`�q �r�^����&_L$%��
��	)]�TT�QG��9���u�����t$���8,�������?� ������~6�,��<����Q�;\������d����N�aW��@l����Z��#2�$G����iOm��Y�4�^����������VSQu��x]^y�J���������(����W#"�/6��ZD�����(w��	M%<v�f����PGe��l�<��h������(t4�D3<\����R�/V��p������Q��qw=	}W�Ta$�Z<L�w��4�OU`�w�����E�����-2��+��>#�)4���f1`�#d�l���"7����k����>^��t(
����;r�����e��p::�~�GBXG�l�U��Q����������A��r�:@t�M�Z�9����:�$�������*�)�
P��q�I3{��Rc&����#'�\�m��19� �#��l�tA|��e �=ud&��RA32�%��W���oU�=��H��@�����Q�z:�b��!�x!�2�Bc�����"���y�NV?�H�w����	@�#����^
�8F�%��t��q��V#��#�Td�9��sI���w���������io�p5����6����<m ]YEF��d\NaS7�z�#�R�J����v:=B��G1��1�@�Q�*D~�����0'O�+������v���z���R��f�<������]G�����1�Z�����N�����\���ttv�="@�#17�+q���,I��|W�����|q)7��y��f�ND�@��!�k���r�r����������7����������+�oF$@��!�����=�"8��������M�����g�=�z��G����`S������t�?��x��?�������tztx�?|���O�7��o�}*����qT��	Qo�JQ��LR�,q����fIM��=�z�:���Q��w'�!��V���&=rb�[��j[c������d�O�9ZRG!��r�L���)�4��E6�
F��i\�y�1�P
����[?"@�GsiN9���z����w?�/O���;�8�����������A����mW!6�(���\xpsx	x������&����,M.��XO����A&}�.,F�o!5q����'���qZ��w����q5!�����^p �b!m�Xq-�M4���q1;�B�/�+���D�iScp ����[�o>5y��~>������]��L�������%/����#�yZ��]�X�{��4��d�x��r?x8��(.~c����v�|1�]�H:�*����/�g�x���7�g�y'9q��B��nbw�n� ��;�
�z�c0rG6c�*�������_�)S9BU5A��J��������ty*����G��zn&��oW��s:.��P%����>fZ��jS2� 5���K.���BX��c���hN�9���OI�N�Ng�[�8����:�+�Z����9?j�:v��;�����J
sop�������Z^���7�x�������bm�93��;����R[=3�b�v��m\�.E��uu{��nP-"�xYF-b�z���y3M��"�������V�y]�R�#Et�&�t�T�y���n)]���V�K���5M��s���u��[��ki���Y��e�F+0�:�-k�1��cG��R�Y{6p����a����+�u����k�ZmK��?���&�w4�����;vd�-���F;���6�G����6��$����O�������5��VWo-��\1���]a���1k�/
4���6�$�	��d�0��cS�����1���m��@�q���8�����9l�������h��co4�f-�2 ������
�Sb�����&������f�
�b@���j�:���9���6�?����w�f�.���K5sW����3�&:�;%`1i�@Me��}�;���g�wt����\
e��.�z�G��o�Y���9���di�����@ug��}�i2sY1�M��:���2f~�J�ZX����1�S�'j9v��N|i,�������u�'&�7�>=���L�Ht�!�F�
p�qhi��L�8����vUo�0���Y.o�j�S�di7���3��g�m<st�)b���%�����@8"K��^nqr�?��y5s��������*�U�m�r����<3P�6���
���c@�����M���b�!���k5�h��+F���-3�@�1�����]@	�VJXn���yU��,+h��~ �n�R�h��r=P!�r��2	>0wh�������'-.+��<�f4����Z�`I�6P���6�i����dk��R�Z��S�;h]-&���<8p��R���r��Lb���)�q�
��pA���(����$QZ���7�N�����F�V�YD�o���t�����K�ol��1�����;B�1�~cG�����Y���E����'kJ��i����4��48��`�:@ol`�)����x������r�\��w���!3�@�)�P��!Y53�c�w�������M������B��q�*�K*���7�^�!���9^]��"��@ D���a����Z���Q2�����6�������t�q�~��-�w���#�������@e��P��#��e���l2����@����E_�6���Fl�AJ�N���[8_&[�Pb����k\�	��R])9�F�q�1m����3�'-;���'~;�������X`?�u�������lp�u���h���
��x���u��D15�mUE�+1�:�8�qB=��Y����&�~���6��Q�W��[�r��'u=Y�U����t���v�A���"���
P������
�wR9}U�_T�&�7����u��9�K7@����,�*@8����2��cW,xs�-����7���������\b���p`����q���7g�'��f��o
d;yG�s���-��]�����4�c@��{8�p�F�aD��Vd��kj�>6���88����8�����J�Hz�6����n��@@Oz�C3#e�ER�������a��U[l	`�7�g�����dl��������q�����*6����]�7j�56��V���C}$Khv�K:n3�r�$sT��+k7`>P�"&�s��
a�?p6�����z�Sp�)��(�z��D��H���S��i�������Z�������1����ImS���e�j�@�yo%�d<��J�.��>�o6�WK�FvN88�%��p�=��t��]�6ob����c�>�C`s����_�)'�l\h��\�?%�Euf�t5��e:m�gS�-M��p8�q�=n�0=s}v�^C\�m'��tu:�s�5��>]��8�p���hTU�i�����&�2#�3�6�
&Y�'N��Fv4T�ir4�q�;u�^e�6C
&G''��Y$'�@r��$`�cy,�p��t<*�jzuP/'�K�kb��s���XV^��}>�@����8����Y���	G�$`�������I�7P�{su��	g��O�$NKg�m��h�1��l�/��i����1nP�$t�P*_�:�$O�"��������	"�����,��e�/iG�,����]i�,�9����M�<�m���@�I�6UT�W��7���G
�+��?NB�.�/����� '�l\h �/Q����������u<�n�E>���p��M�Fh}�@0��e�jT{���W��@��q��	���_��[PN��E��w.�����h6�D���~h9�\wg6�h��eW0����)�2'�l\�
J���-��L�*��d�-���o���	��7;�z!���@��#��H9�\�,�O�ZT��^�e��S��/��:�GD�����X���	G:�ja���-@��,��'�*�f�^��Lf��+U{:���)Mw����������9�:,�{�����-�C.����V����w���W�G���N���l�I'VGf��N:�]�F6�UJ�V���
�6Z�7'����V.g�?q���N��q���
m^
j&�,7�����������A����F�@��S������l.z;�����5n=�E�������J!�[��N$��N8����+Em-�t�D���{V�:�$Y2���N	��IbW�a5.��(/������:'Vc���������x��?��O/_��m`�;+m�
rN��;/��A��l*��k'����f��psb��?��\s������l'�p4�\���.2
���t�S�)m�P�:����%K�I�On(5��[<	 �������]�$��������k�sK���in����^����|N��g�&v�E�	]��](�z�F9��	�;��_
P�I�>���^:�k>X��.������d��j�I�4U0��k9Pk#��lsb58��:������W��r������O��@��f��q��%���X.��y��Q����A��U>��y)��ocJ& ����4M�%7����*�
��LAv�AV��5�u%G��s��Ej��W��:�����%��C>: �]g9��G�s������N���"����&��j0��h�����l}��/�����N�a��b'��~�]'��	W���tb���o9�!�uk���O	p�������S�;����y���s�T��(���(n�G	�����-�ub�l^�~R�S��W����u2��Y[�������_�R���u>b$�>�	�����Y����������_U����.�������f��E�1��|j
h����a������N;q��G
����o6�c�������A���:[<4�����6�?!z��i��+�s�l����Y�-����IX
������D��1�6�8���r�r|�q������f�]B)�ST��h��FK'yhx)g������bT�4���w�wY�CJ
(��wu�_g�e�T����Z*�&�������N����'w�N�-:6t$����?��=>������Q�����)z�pN	����5r���L��9uC�����H��-P�`����J���Q��p�	�s/)`�S���hS�r��u�rT�1A�,8��`:�e0��8���z�Q<�	���^�L
P��C�[T��)���m
����~�+gU���W����z�x�\M����@d�R������,t^�����W:�q���FWP��#e\�Cn�f6�5�q��r,�`L�S� ��l\Tf56+!`�S��Xz�]�&�OX����8���b�Er���/&��<@��oQ'����bT9np�F$��;;�N-��E�tU���6����r(���e����B��5�n�qf>P�{1M��5i�g0�<�L����\���,���fmw���06��?K�{���+cXN.��D�]�9�pO88��I'������n8
���I�o����&P��+�HF���{���RP�k������0�)gdl4��Mmt��&��7�����o}�������b�����L��{%nf����:�|��gV�f�M�k�Jn����������>��B
���#w)�Z�������MK����9��3rl���.�r��Q�O�]��R�0��r��qP�m�.�(�S�����'����j,�/���B���Mc���M.��Y��B�Tv �E6X���x�7�sS+����3�I�Z�`��������!�|x�s����c��@4��I���^�Nm�d0"z%�b��P�;U�_�S�;��;e����R^��R4��Y$�T����>�O�k\��
�5����Mm~�k'�j5���$���7���^)��4�#�Q#& W�}kA���N�(gKy���pnj�s[
�s��q.2i����Q�g�|��t��M�S���6���7�7Ma���Q��L��j}F>!�41Da�
(��m}2 ��\��hb9�e���_��s���M�i^#�K�o�F�s�����&��l���I��g���,�l�x�p��!���D�9e�= Z�"6&R���n=(�O�6w���)�wS�k>P G"7Dnj!r�o��v���Y�����a�o9��B7�]�:@v88����D7�!�4�-��������LU5�Q��C�M�G����	����q�y
��4�����E!��m��$n�����F�
��4u����/d�:���k����x^�y njq=7C���)G����(f�\l
�z�i���V��JS�K�-d��U/t0\��B�Q��|>��V���*�E���B�� ����h�E�������Q�Qh�w!�Q}����.�ph�&���pj�|�}�)S���T��0�od���3���5:b���Q���C=���O�����%N\tpM����W� ����p
���O��H7�XNY�V�>���?,�FE�d�������[�}S@��=�AT�er�\�2\���;�_�r��#�;G�5k�s�u�AtE��VZ���U<�����Ym���g2�s�6 jjQ�7I���{���J4E�A,#�D�	�j���\������nF��K��zP����Y�����f'l
��&`��e0�]�q5�c�����o���g=�v��:��v���XG]�S�5r�\��[�j��H�������so�	�.�X���7]��v9�]�:vu�r��������[Bc\���Js�
-��
���0j�@�x�-�^����f�������r0�=�!)hW<�j�\m��:-�8	:��8��p����%�B�
gl�X��;����g��=y�����+�Ig;|o��]h�.@A�
�O>t
��PPJ[�]��v9T��-�����i�.�C�,�!�p�.k=���]��vYCY�mR���.�}�����T5���8G��]W�����^!����4��m��
�D�8��L2�� U��j\��z�����n�����M@���qBc�����t��eyV���.�W�,���X���.��:&{��k��^���b�U�eb������Va'�l�ra^3�^���A��9�t><�&����C#"�C�:�.�~I��v9lU]���^�F���j��*U���Mc|�����:��k��^��!p6�M+t`��.b��V�]@�v��l���������?�t������������+LW��lAC����2��l1����.��u7�2wiL
�$���m�jy�%0��}���;�$��;��c��u2	T~\N�qh�ae8e���B���`8��b���o�l���6R��e�����@mm�=;�in�5�MD�v���v�n���m��{�Jrm�th�����vT��]�s�����s���:Z�vV��a�����e�r��o��vu�����^�h�"���&�+9F[�6��|y4�<~���nN\��o��_����V��]�v���aE��)��r<Yy=m�`}.@��Ay�����R��R0��d.��r�[��vr�x0U�FP@y�.��}r%GNg��s�|�����R�/��bDt*�z��z]�iU��
�:�mo[�����+8 }J{���A��`n7����:�V�}x��gb�ow[;�Zji��fp���u�z�G�=]@w��. ����p���GM�<\�S�2����x�"����?pyD5���}\�%wmXr=m���r��.��?�����pnI\��]G��.�������V��i�rH��������u��p���1�r�e�5�@�h�K%`�e��������WzA��O���,�lvm��hc�
:���N�qui��{���9�����u]5�o���]?�Z�����+��W����Y�(����7��o���:r�]��vmF�/$�/w���j!A����Km�z�����Z�����\z ��h[uV��������:y��ET��FZ����.�����q�"��^�q5  67��h���c�����)��6����j��`��[6H��1 x���b2K#�����Rz�����xi����D�At�w2�����d���E�g3�5�Z �=�!�*kSo"�|t�s��"����\M���f�` �BbV��+��&5���������i�L�-�$�L�d���XP-����$ E�CN��nS��D<�D��Gp�=�k�@o{n�m[��W��j�]oz��� ����6�����������]U��?�A"�\uS��L���P`Q<�����=���|�i���������r��=�6�t�f�j�
z{,����rT/�_k��k���\�t=^,����)MQaa�^��5��0�io4������p�P��`U����L��h\E�2����*�A�`����)�"R�4�O
C�A��buo�j@I|{��L��"���R�-��C���2;�H��v�*O��b�-�����M��=�.���64�z������n�<EQf�lu��@�l�q����8��Vl���\rmV����X?={���D��Hd}���q��S��0D��_U;��m�[�������
^�	.��{�<�/�qW`�=g�����do����hfJ�,X ���m��x�^`?�C��I-�<G�|eq��I��jKjw[����+���	�Es�M���9���1��X��2���=��8���Wj��
��Zt�
p���A�f_��=����e���MX�CHK�d����c*�rhg���u���cl2X��l����Q��w�bx�����{�n��u�q��[�m
 ���f���r���k4y7�9��4�Tmih$�����P���y��`�=gL9��	��K/`�&P�����-�	��Y8��H��3����-=^���1��U��)p �=g�9������W��m�+�f�q4�q�=6.�l�v�����{?���p�X�$������k�t�������FD@g�,rI�U7����X�c��$p��m�K5���l<�,>qj�^�|��nI����^�
��qpF�?!BU�X�r��@���*9���w�����#�G`.�s��=��8��hh��@c�:@S8�XNe�U���PF

\��Z�RG6
#L�Xq���KH$����X�`�{�f�~��X���/�Y�_����x� �	=G���\�z�s�R>�/�������6c���v������W����o=��\�a*sR0�=��W��t{��XG�l���c9ZjkV���8�^\=~�KY���Y>��U@��RF������myN��Q�������Q�T�\�n�Cp�����X��
�����_����9���j\�A�M�J��H]Z�������_8��Tl�5�5�`�=�����&��Y�,����M��>����B(�j<X%��Z���%����8�M&���gEF���>!5���a�4P�5�@�=��	��k{]�R����`u{�k�- t{]���6P=@��8�v<�.��u�C��gi_�+l�8�5
Q�T��az����\��D���O6��t�gCw]4��`�g��uI������O�{M�����
���
d��q)�r��~��}��.	0`"UT[J$�����D*2�
&`�{���q��q >6bW��quBQ���6{<���>YH
^�r���@����F��yG���c���mcA'��s��d�}�BH�o�v��cV���w��������X��>w��{V-��]�A�f�MEP%��<�DbTq3��������GV��X��>w^_��}��D^�_���H��*{��.�<��U��V��w	n�����`��w�e������������)����J���(�U�<q��q O��OQ�om�}��!��$��qi�,%,�'��K�����g�w��Di����~=�o�nn��e��-s���n����@%g���FNI��r�^g

��O~����7�n�v��0}�Ft���vU���g����q ��5m���f����<P]����~� ~���,5��("��_���O�eY�d{��$Z���`1#(�,},�huC�h;��6)|�g�b�>����0qy '�9�Q��@1}�>��giW�<O�:��D��\g��T.H�x5k��0P�������*��<#(9Y�-�i�XY�c������U�u
�hWs��3Vn���ye�|��7�'@�\��D�r��[��������"��=���v*�>w��;v"�>��8q :6�cd�.��Gh�.2�����>q�6�b8�(��%�U<t=������i�jN�5�g{�v�>�1Q�����q�����qdd������W��e����uZK��f�_������|��K��������8W��M>~�T��)���3OU�;q�A�.INU���:���������>q��75���J#5��{�����0�se�7q�>�j�����k�xv�w|���O~2P��:����jDU���tCPL7�Y|�^d���_�N#��V�V���b=8).�'JT���6�/���*Z=�����nVT��+1���-�.����� ����~���A7U��i��(G@�*e3An)�@k8����K��as3y����������<2����k_#*�:�}�t	Q������(��� ��/�{��A�P@��TM��%�P�2��EN�>��'���F�d62�v��>��J��d�|�*ANCK;�L�o�<ag��s{��98+�T�;^�U��X��=my��/g����8d�:@���^��Aa�t��Xsq/ V;a�y����j.n
T ��������;�G��������a�N%�����I����]�0e�:@R\��������.����Y8��k�����)u�';dL��T�7s����{.�
��
M�*��zx�����U��h*�+���6��__125��9����|�(v[��Is6n����l���h�84}�I�l�Bn���-V��04}�_�v�����^K���3���vN�83};��^�����Ce"C������7�_��p?x(����p$s-�r,������E;�L�o���c���[����]���j���_�\��m���].����������<9�C���1�������v��>wk���3}����7��t��3}�M�c���s�L��a��P��c�<C2�oEx�X1}nCx��~�3���H���]:�4��Dv �>wE�01}�+�P*��6�Y�[5�@�\���1�_��� a��6u����9	����]��U�5
�\�b��|�����@|������.b����l�]W��g���]E|W����l�!na���9X|�."�+��y���#�b���d-#���zQ�����N���"�>PW��E�
-�#����]��>��
���F��d�o!s������m����[X[�.�d����fO7xk���a�������� ]��^m�D�i�fs��.r:'�4W8P]�q6������|:�L����o�t���������������	~��N�~��a�=����7F4@Tl��L�d�=<Y��7�4�C!-+��qu�#<�ktrX�s`���[���ws��Y
�1EE5_.�JyVN��}Z�sh���p�����<Tg�Q��+��>G��*�����m�@�~��)&��?4�:l�	g��[�ps��`Cj_�k�r0��g��� �6�����P�J�X�;[�(@o�p����I�{����<����1,f����*3G���i�C����]��L>B������Z��l��D�6���o�Bw���INOd�A���tK6�*qN��u�*����"���b���@m���3�j~�<,���F����k\����Wd#uinru@��"��VV�
S���&����fY.��zz��%�[�;��nU���*�1��u�h8y���"�ldE���c�E�=�>,P$'�v$7����-���s�93]s4�S��N��,=�6{��������X�E�6���>W}\5+�V}�ZE�p�/�X���	� �����>������s������
�n#��cG/�cW_�^}}LmKo���P�9@�����&�=��j��(��l�k��:#l��:���+��Y�����1�q��1�G{�q��wv������*@�8����h}��:�.��X� �����g��#�vA�g4�7e@�>��	7 Y��Qq�G>�/�������`��H��?X�r�(X��`
���8��zJ���n��xF�\>�OD.�`o��l�
dca��Bb�Z��wg��6�����������������6^@�>��K�WQ.��������H>��}��s���s�O���U�[E�xTJ����j�����,����y���������-%�>��%Yf��eS_=����n�Q��i���e-� �ju��:B�S��F}-��2[�^}���1�w'z�c������c_c�{�����]�����w�����C�x=[i��E���*' J}W��=����7�(����pEJ7�5~�]�.L����JDD�N��FpR�s�5�t����CN���6T��P��o� �Mb0}��b�Z�����\���{���,�(~Q�{�]��FT@klD)�z#6�
���gN�x�S��<\��N��b�Q��V@��<�YO��s0�q +6�����r�L��2[�yq�V��b�u��h}�����S*���6�y���&6����K��D��X~��y@�\���Xb�����[���T���9=�[H���o���~�gB���r~=���b�
��|��������T�����vF��R��e05��vq�����>g�k\�����vj��tz�B
������Em ���3�Rqe ?��{E�G�Zn�q,����_��O1U���[~Z���t q6_=���������@��?��������a��z���Y�:@�XRV�k���l(�M�Gs�����]�m�Kz��t��V�%^����|�����:gN��}r��m��[b�]~� 7`���R�(��t�5�c���#t)]�����+����Y������'��� �oW������Br���I��|A5�~i�d�����+Y#��������m?��������/
+
�tl�dT����+������O��T��_V�l�v�:n���q�Ut��h��O�v��T��Y����7aW��������6g����#��C�Q������1I�� }��uN� �c^��Z$@�lF�N�����H��������gW����}ri*$ �WR�Vwh*���q4�L���2�"���Y�.[����q��w���y�7��Z*7��p]��)j�X���`��������M3df! �s�R���}5�r��}��3�X��M�D�{�X������PA�i��^�XL��j�U��+���Q�lJ��+�>��W���)`��ufReEK�O���hV�����y��BoG��3b	��e��d�O��	(� �5�Z�h�����/�:l��t��{����gIl\hZ���&9[!l$Ul��
�V�������R��\����+�y6%�����:��)"0���A��9�������a�$���
�nK��<�Y0�4�����-�T�U��:p���[��O�ua�e���u@N��t�����
��tx��Mi�a(��t^�p8[(�_�t��wt�dq�r���� '��s�ob��X��|Kq�/�/q����f��'�?xux�����p�5�v����u����E�C������.��`}��Fi*S����� �zx4�e�-�_,.�!��h0_��.���<�8}wuz�����}�@nvw3�gL���?�6�@�A��4~3)@�i��)`�G&;Lv��=���&����E�,����\���A�z�x�l��v`�n�&�������?�o]�$��]:IWv��
=��������n��������#��zO��EZR����|�f��$����?���o�j�t���������#�����?�Go.�����m]���c��B�{A��*�5�z�L�O����oo�L���WDz�B�N�����<y���S�;����5�,R�/~x%�#���L4)���k8��S��Ft.������/�t{*�E�L�����1&�_����o�${!����a��(�!��ltM��|�����cK�4�EY�{[e��r�\/���UT�����K�!����?��*���id�qQ��#Q�'��,�u�^�g_�:��\_���4O8�x���(�U�����D\y<uR�%TP�������d�{�j����h�hm�k+�}���Av���XV��v�������j��)Q�/����!��z�?��^|��8{��_�2*2`�.�����������u���3���N����o�R�����NS�b8�f����N������6�y����c����5���9�Wv��	��k���X�?��j�rw���7?�_|'dt��x�~�\	3.w���R^Q�Yf��+�����TH�2��b������Ew���#
��{���������s����aB���k�$���F�\wKV]u�����a�F��j���4�lO�;1o���[���1.�W����A.����4�9Q���i;z��S9�O%�:��-��}�����x�����E������hv75{�����C���x�~������U�0�~��U�?�D�����x�O9-�q�i�a�YrM��������������t�+����~=��DR��=��z�R���/\y�}�y����|���	"���0;����[��78������������0�7�_�2��g0�R2����~/�=��y��VP������
]��lQ;p*_�9����f��=}�F�q�����t<_N��$]�.5�!]�v|�"�m��g+��k��+p@;��c�s�9�>5{����<CA�\kZ�~�"��!��=�~���������{��a/Qm{G�o�x��\:�gM������v�N+Y����sydu �m����}������K'�g������v�*�Lh��^����;+����y��-�5�����d����t��tA��!�����\�\�Z��O�	�-���O����OW�{@��u�#p�p�|���?�=�����������[�c�l��r���\�����0���a��!�D����<�zY�xZ��3�k�KO%>����4�����g�:�L ���W<��g�"�,��g�j�����(<�5����>+,��-D?C2�y�@��(��3x:k��>��������?�u��"���,�����=�����=7P��������H��GQw��[��
�(� �=���:X����&��4�nK|l����
�(Z�� �3v��-�����������}w�{ly�6����U��
v������f�)_����ER>L�q���k�k4��H�y�F�L9����zxN]lN�����*��u����R3\�6,�E�G�C�*���QaE����VnD#;m�W�"s��T8*V�y��3�M������6��H\�g� �Q_E��:�ZG�������3|i9=�E��^}���dv)��
u���-y�?.f����m�W�����G�s���k��9~wq����!����U�Y$.Rs�j�GF�s��7e��m(�0��z���?�U� ��Y
�f��O��������s
�7���M@��NfUr-�;�U1u0&iz����,[0�������JT������>�K}V��4]�(V��4���o��U�^�������^�����h������:�{��;o�?�U�lL�Wn�����`"��p0�a�RT��#���M�;rU��w�i���
��NE>|O����E�Hi
�j2�N���'����nY8���"#�)��+��N��(�����������T���;��?[�r7���#�����\�Z�|q�������G��G�o��o��t�6��N�i�4T�k�����/�.����y�����Q��p�[�9*ra������i6Y����8K�Y�;�@+3�l��Yg�$w(�D==@��0�i�Z3�	Z3�cK�g�
+��,�$FC��'?Wa/?��r<,�[���5�����\`�t!�s�;���N���v��q���z�M{d�������f�$ 6�Y�J.@l\Fl�u�rwZ�����4-��l�$�e$��c(v�\t�v�r��f<U-��*�F��?�S�����|���u�Z�w`}���;+��sVD����(,Q �J_�?�S���+}���M����)9���9+��~����!*��y�\���*���_^]%�����f�����#�A��>F
��dZ�����cR��U�yy�%;����������:�-�����J��n�?�R��9�IXx5�5��&�Q��`�C+-�7<��U^�ixy��w.L~�`�9=P������'[��:����GZ
k�����u�������?�RXaxpn�?��D�;�R�����J�O����~g�k�2����q^-f�R�;�ZR��;��6%�����Z�?�Q�dx��GM��}wK��>��t9���7�4?�����-�;�Q�l�A�"n.���F����|F��a#�K���L��f���}�d��u�7cN���=
����(:Z�7�{�����e�'��?�[{�`"���������bzfe-��I&*a^��F�=������Z��!?e����
��c���b��]��(��1��=�}���Jq#G�����p�� G�i�j�?mR���M����o��\��gI�[��:��1#�v��?"R���@��p���B���F
{�UN��0�$���*Z�WH.�rO��C����<I�����-b)D *>���B ��,C�TC�����&�Gh�����e�.3��@���f"��#�'B��&��	���Zj�|d:��P���" #J.@$"�9�eA1.[�wd:Q��OmXk�L)��v�:#����Tg	��hm�x��8��RwX��RD@a"#g �:1���wZ��B�����Cl������.AZ{l�x�����m��9 �ci�#����C��G�8JG��k������YN�)���b�s����@�M�tY���|�j� {	v���1h����E�H4�����t�z��o���c
_��E2�]O�K�������y�l�a�������n�����!��e�
q}�����q}�jQ��z��t���k���P��.~-��
$�?Na�9�a��r%��cW�`��2$��Uf���f>�
�G����g���M�l�F�<�68�4����l@5����%-(u�
����E�a<-`&���H6�&
�:flQ�D�w5��m���0�5�l�K�.�f�m8>X=D{v�C��@��*6��'�@is�JE����l53��9�Q�]��f�Z��l9Y����J>%�f0�'S�wfw�W��A���ImP��['S��f<������D���I�%���aF�w3��XK.EX?��Q���6GR�7R����c�m���a�&�lgq)4�T��
\g	�K�1�,Z6�#�kQ��#~������
�k����8��
d��1�*��3Q��(������Q}T==��J�R������1���@8X��g0����p�*�/�����S�����?����~x�)��f�J���J�C��c9���W�|&���[���/�|i��)�4J�j����Ei��3��>@Cm
-���I��BaH��t��*���KU	js$h~����1�Z~%���P8�Me�n����]Sqr����o��������Cm�T����d"���t������-���X>e�%��C��P����J��E�X<�`@�8NT��4��=)	O�����K!���B��3�V2�;�d�;I��n��=C�������WY��.�?���Gq������<�\T����������j6@�8l��
F�h6 ZmS�U3����(�XU���i+�R�
�S��t8V[<��������0[M��j���&���4 ��o6�XU���He��D���6�O[�����fy]��
U�_k0��:���V5���%2������c5.��=��a�9�����6�3NE��#�L����:]��S������� +��� m������x������z;A�M��H)=`������fT�mJ��jC������t�/jfs�	�9$����@���H�bI��
i�i�Z��z�$���m�����QU�z��=V���]Z�.�SmNU�����^�0��i��b(�[�pm�jY��- emSR�������9DV��G��+��;�������3k@��j0���yh^�����-k�����N�p~��*fa���t�t�����M�U�f��vh6-�X�#c[�(�cm�-��/�����`Y;����� ��<���{���dw�y{6Z�G��+7 hm�����1k����cm>/n����~��4��K��9��H3��Kf<@d��qT�`�vd����^�t���Y5���\����w���@$�6G���m�x��z53 �fY	X�x�����Y�,pm��[�������*�������{�`�k�F.�@[����������J�zyj��N��������rN����f^�J�	��V�jjMg�WM����������\�uL��U�`1�0��3z8uz��G}���MGY��1�X[��bu���I�O��<0M����[[6	7�d���Y%�:,0k�/:�u��;��2�������������q��kk�&�`d{�J�f��lc�������1�n��:\�z�Zt.��k�B6Zto�8�k@6�����q�j6@8`�� �a��bPUk�)p��&�SL�8��u�����H�F���	��c�H(a���$?��Y��+��o�����P7�(U��pJ�����utl}=�R�6��h���f�k�pb������������������_qv_���U����������������)m������������M3����h��Zevq��Uux�*��S�q�f���_j�Uf���g<$���z��R�(�-[�����XPy����r���U5�c\��ze�|��p�c�V��H��`U�h���cVu����,��c������S���o��z����d�u8B�ZvbB����AUk8�H�#m/5;�u�5}��b�l1�
$]�E��{�����a�\������X�u���Au8�|Ke=�������iQ!��Q ������v��	mk���&QK�sz�����O��l���N�_����������hb�������V�%T$XSE��<�:fQE@n:�aE�n:���r:�h�f�x�t�#�:�t����Z�����a&Q"g������T|�!�]���O�j�I��z�l��a#�v\+�u��t���:wt8��(��'-��U^��t�
b����8Z�w���J�p(��<���H5��9���pjt����|�����(r��G'2^������1@�h�v ���;O%d�z����_i��j�Y	A��@E.�N��&w�Z�������,���de�*rJm����2�XI^U9���7T��p6�������d�}���Yw�W�$��_bJ$�ktb����	�NPF2Lyi1�Ajv:�:�fd4l�~T�rb��<5�5H�m��|<J��X�
���$��}�j�B+���$?�I����t�,��XF�(�(h����^�Z~V�u5�Lfw)��]�G��@��i�Q��JR���VGv���%�g2X��=�r<���"u�rx������t{�["^����W�3����������{tY�q��a<���o�jj�L��=���\�+�\�����*M����������h������7����p����W�m�
�)�\)�E���������%����n���E5�8ua;�������]]���U��$�dF������f��m������I
���Q���	�f|�Z�����@hp�5��*�����c�k�,9*@�n���B�V)/����@2�E��c,5^`�A��
)e�@]qT�"b�Si�rr�T�N���pd�?�|�0�E�t��r:[��]6:g9�������
�. ]� lI<�]l���t]������
���h��c]o4��������H�Y��dT)���q��r0���5�Y�J��������r�;@2��9�j6������?�a�'1�H�d1�W?&T�&r����<;�pzrt�?ys�������7�?~<���������b_i�~�xz*mj��;_�����{s�s�G8�k�3�gt���|. ]3��^!7��s��jg6�p�q���s����t��� �k����^t�����p����k���kv��Z-����1�<���vg4tMAC
~���2�����<4�]��^�8�D��dd8�
xF���Y�v,f����Y�t�z���?�/�=~^ G�18���(��?��2�+]�����_��E�t:��k�V��4XL�_[�mi������0�Y|����J6��t9��5�����=_,
c��@(]6�g����Q��9w�O�|���'������4'$��7��_�LwuL��1�L����M�S�4D�a���Zeh�H�S�7e���p��W�\�,���.]���.�.]���py���Y���
QZy`�.�i��9
�c���O�b|u_N��K4��h�R/��W1���F^B�e�?���mp;�F���]x�]`�=��n�{R��{f	���n��D{�
�P��P�#B�Y��k�<��E�,���&�"����-�:(O72�Pa*������V���\���������X@����x$�u�G����Z1@N��p;i���o"4%[,��d��f������s�-�`uu�3��L]`�1H��k3p���P����;3�PS�,0��P�=R�R������p���vIL��/�jx�����'�����������$��A�.�D�xm�d�z�����.oP��Y����W�#�r�.%���q/�l�
��B]��,=6:e�B�����l�����T%���a��z��������G��E>�6J&�c��{��t�,�R�
=������������g��,rJO�^�����X����W{��t�0`L�z��S��N�l��F�E�6����t��p1��`R=3&�L���GX�����������`V=�h��m����{�w��"dz�Y�@��5W�=�zfq3;������R=�JU����i���q�=��zFx��2o��-^{�w�tg�����^=�C�C������h�6���X5�?������c�{{���"��u/�z��8W��G��������ri�}���`o=�����@X0����j�-3A;��vO(W����]���������e�=>bwR����0��S�<����p������MM�:=��zf�R0�FK��D=3J��X��^��#��E�����D��K��@=����[wV�l�g���M0_�d�>�C=MX���s���)�N�������R����^.�yH�<3���g��6��z�8T�����E�T�R��l9X���X�����8����p��xK��g��@.=.Ze��.@$=3DR�����l��;�_1��V�@t6�'��[�f�nK��ZS��R�f��8��O4K�������h�,m�&9�=�����_$i6Xdb��T�0������	PLO���F�fTW����IIw0;s��pK�,���IOs�9�������J��Y	�Vzf'�{����	��1=�K��ps�!������
���^��1 <=��T��`|��ZikOQ�0�$@U�N �@8��p6J��&?���0��=<�rzf(�ZM�*P
���x�#��u%%������7�i��2=�#���tx���1� �	�k�������E���?��Rb�-F�����<sz��f�������+� ����p���o���R����n��E��	�5����a���)���Wf��3T���a�F`������)��	�Lr��c)��������q����a�9�(�K�i����������l��5G
����4=��T�
��t<�C�Yz�������f��Q���2P�u.O��=�2�;9e�����������t��sA8�l@C6>X�*�iU2�������"jz���8Z�R}>|�H��>�*}�(���k�W������L�(f��Rb���m��x��|i����}@x�=S`eX]�X���2Q�?>����u�7>�<��u�)���Rj'u��j���|N��rN����Z8[Q�z�s���^�|�s��jU��gN��:�s�B����(��pr��>�c���XJ�
�YN�W+��$.�s������K�
��Ra@L�1����5��6n����Ea�Kd3M���mv���H���I���3D9~�f�����7��Y�J�����ll>�����O/N���K���m���F_p�q�c}�E�fXd��7������r�V\����&c}�7�f�<}�5��!��&��hE�1��[MdCZ���o������5d��KQ�D��VEI�w"�gm�iut'����U���+�b������W���n��5�������0���w
G@�jW?�����(���>�-}]���I�{8(Zjv��b���]���|WPt��`q��
".�h�������K�b��Wj$������J&�U�����4����)f�>p����0�����>jS�MH�^pm�YN��2��5T��X����b�Z�L��2�6�S�>`5}����i���ieh1f����s!=���1}�������U���h2L���������v��c;K@h�f�;}�M�f�;������Y���������j6@8v��A����T�	�Z`,}����B�����)�1T����O4����2���'���-�����N��>�w��v.�Mg��M��
'Q�
��Ng
����r��������.u���������)<�*>�9�~
i������a�j6@e|��J�2������9 T�������j6�s�f�������\mv?/�i�	Z�q��q���F���.n��u���S[w���L���%=t�r�������0[mP�>G]vL1�������r��
6�~n2�LSK���<�Ot���T�D�W����q)��e���?��Dw����BT������������]"W�3�����C�E�a�������B��te4����_�k�0g�J{B��]�����>�0}��r5������1��J��t����;�KS�r��t@V�f�.}H������K��y2Xo������&�w���?y#����av2���A���Yi
����7�/���Jr�wN;��w�K_C^��)qH&:��/� ��I2���T�/����Z�q�4���\�Z��D��%������������~O�hN���������[����7
��R�Z+�hG�������l/������PQ�u;N�\��������\���7:�����Vh�G��;m�X�X�ov.��R�,�g����I"��s]#��ot"|usf���>�S�(	G������c����j�(g����>w��V�sxh�� �������o����S�~n�N�W��������j>�#�+eFo<a40;L=�f����jS�U������|������1���������@y����na
pA�l��pP�*��:6Vg937(b�5�����'DD~x;K3+���4.\��Af�gi:�lEw
!�;[m ������Z��30����q���,��!�$X#�mm�3�lW��i�a��:�4�E���R%�P�L���yU��PJr�[|�i0Y&���*�xX�\��.��_Z��^G���/�8�.'����Ry+������oZUPJ��G��L���S2�2M�p��W�4pL���Z�D��:.4�X���������Gn\�����/���i6[
ta0UI�h`3�����Y	��z:����O(p����\����K���d�,Y���?
L�S��j�@i8�f�)wxI8��(��,;�v������}������y�1�N�t5ycE�����f
$����d�����i����I��v�%�������q�w
w��u�
�{�q�+����d��NV�D1�C�%w���dj���PT�k�:�i���Z;��~V������f���X��S��?J��-���/�u(�G�
�}:�Af���,
���E
.��E%�����@:������[����=L0-Scu�k4�i�JI]d���nd�(Dh�C.o���O��u��L%�j��e��������:av~|`��(nh�_#
����}�<@�����R3*5Tj`D��*�b6�Wr@0tG�SRE�����������&ZT���O�O��?����~|������3o
0����e�"�E
����	 ��"
D������j�3yS�_[��z<�?�^\�O���?_��;<������?v��
QY��
p�@w�{Wm@i�F���"�K^Z�]��T����>g��"�hVg��v���W2������_e�\x��T�:��,f�Q@a|��f��2jD���S+���7��q*�:���if
�\���8�s�K
����x$�����r:M�I�����K��@�����|A���"����}|w������p� 0]IjY@�{�|�T�l���j6@K�b�J��5�b#��R�5�$�
/�e1�%���t���&���L��1��*G��&�V���_��Jy�rv9�����m�h�2��3��!
~�+�tD�E�(CUn��U����?-�ro��.���R%�S����`~��[�J7@}�p��M�x����~<��2�S�VKO��9���"j�4=���q�m-�V�����)}��R8�H�z��	��u WE�f���V�%�����w�g����������6Qw�-�P4���d�Y��i>>�g`��,Lk��r�A�����{�S{���+���7g����
�����y�D��W,���
];��_�����������E�9�8l���W+l��?k)�R�x���^-��a��TF��<J��	0������>*/Hq�C�)�#Y�F�����,kk������fd�muV�x���^��U��/�	��Ad"�����S��d���T�G �hk;�'��]
D
b�Bj?wpdc��WmnJ���@������Z`���lr����*�BF!p��:��r��j6@�8<�5��q�s�$*����@��M�+�;>4v`�����#�*\8[�bE�u�0������;�LE���AK���g�:�q�a��txX@�P"7�V��Y5s}�
{F�B��=��#�W0���5�ZE.h/f��
����FU���;�b���x���!�evX2+)��r�i�,^
g���������,$�j-��4Y��B-����"�L��F��a[1����iZb��,��v�]RD�i~��z��kiM�p��4\.iW����U�F�5����qJF
��f�y�Q\\5�H�\��*�K5Hk�$�(*�h�Q�;U`*!`�C��v
Xu�2��������q�! �C�LI?7>94���0��z��Z����x?r��C/�8���@���������zDzW���B@.�F1q[Ft��(NaVQ�:-:�����C�d>��@��8�X;d?�x����RR0��C�`xH�c�����������Ih�t������l����)����c6i�8��PF�R���p�(�c����k���*\�x�Q���H���S
��n���"nY����F�m/y���r���5({�lS�������=�M�������NZ�O_�V�`���!���z��a:]$����2;[���C����)�s�h����Q�\�(�n���}����������q@�� h��/au�_���8s��� h���~I����r2���uT ~����������]����3�)��.�l��|z����~�e�{�B���_�]OLW��	�� �C#R����i:@k���s���c��������
s��@��g���i�;@���X��k)B���������.����uy��Xj�vh�p���m�
�&@������Z}��C�tU�a���F����o(yh�b�7
=��v�p���&���&��(�,'���<��>��(�6&������z:�[��/��"I�i�/�I���/�_�������h,�����dQ���y�<4����������Y�����i��`�!�\��[���M���)E���E���!=YUa@p�\�d~�����69�./����L�:��,����9(���R:����Ee@L����W���s$�t�}����v0O������t^��S��� ��.zr=�LuP�7���`C9�shCyU\��V	��L�����_�T�����C�������CB�������lY�S��>g(����q�e�Q|0������i�d��O��P.!���H�:��0��4�q�9���\B��f!�C�� ��UB���� Y8;�pzr���z���������2���4�q���Z�@�8+�6"f�K��?�;�C@.�F������E�n��3����Q#@�m�����(���%������F�;�t������������'o?�����/
�zr���
P+�8.ZG����p��4F�#u��Lud��Q294"�-��|�+!����02H�4r�������Z��:�Ph5�F�������_	&�����f�:���f6@�#.Zs������UoH]�u��)E@+��Bnn��4��`�Qo�K�q�Z���4�s��o��u���e�:�
�g�/��6����mB��z������x�f����������J���*0w����S�e�|��I�x�iPi�M&����/����U���=D�8T\�F�Y�i������I&��Q����<����hu���(���������J����z���;D������j6@�����E�<�^��>E��J�v������i����Gc�����n��l�QS�������%'la��������V�b`[[�!uer�d�Sp�4�6�����#���T:��<�K6 �#�t��#�;Vl����k6��#6�s�]�9�pgv�&�sd_������Ms�^�@�nw>�t�!�W��VWY���9���h�^�H�0�6����s+@%G����3wsngZ�Q�d�~Jo}�G�T�V�K?��#�h�C���5�]=���#@GF��6��bE{ulr�vY������ZD�5�=�,\r�XV�J�j����\���5����[�"	G������7B��[�X���7������s,���$I��Z����xy���7f���>�}g�q#3W��fqrK��b���*72����*���%�?�F����O�F����"*G���<�k�`�(����U�l�=���?�+�m���J6���|<%j0�8l�:+�2�-/B�}��pR62���5��\d�n|=������dp:��(N 6fP�Z�;4�Y�����*�#��F��f�@��2^S��\��jg���
�����[
:-���BtZP(~�c@W#��mJ�^�!Lkd9�;3���#^�l��p�kk������?��Z�B�����v.���?��\���l0�][;��J���)/����F�<��F���W#���TE�4��i����4��h�Z��HXu���(0��j���Qi9!6~�^
�����������c��&�b��U��*���z*��F�H����`^#���'��Ff�~�=�w,���Y_+��U��?�/�-6 )\x`5������Eu�-���* P����1_Ik��1HShr����6����j6@_L���K�X
-�1WG�E�����52�Ys�d���F��ft�~��y�QokA����fL��.��
���BSy������W[Q�H�1g�4�U�U^�y���N��X��t�SF�@E��`>#����3Q�`���������� >���EWV�v����y���<h��0h��Y�g����S�;W�@18T���)�*�E�-���8��4pkU��fd�iF������-�Le����
���<�m��X0'�0_6�niO��rZ=��`���-["�2�x��x-�dlLI����Z^}Yf����>���u�j�t�L��m~2�~��c�������m2�^���v���W������v���R{�v�U~B>s,�:�8;>|+�9<�����y���� ��)��z��4MD�i�����_�%������1p�����k�.�jQhyk4�#������[:��2��O��h�z��M	�V�qi����p��2�q��Dmr�S�Hle�-����h��P�/���dN26�$5��!*�����zl���
�)r{��F1�'c��T�Z���U;���1�6��F��ypIa���Y�n��9�d�F�m�/�6�
�S����~l%J�q}4�Q�NrW�%���'c�T;K@O�=I����f�Y�����Q���}P*:�fA3����#*�l@c6�(WAz���y�������YSi���������2��K5���J�����\^�;-�Q��0.te���j6�s\[=`!c��l��l6�`��2����{�A�\���cS��vN�Fl��2`c� �1�0����(
j�\����c~�t��
;F�cl�>���u���#�l�sPc��x1�����{�������@��[�
��15���16%-'������v��"��l1J�/�����������)3sx���
������d�![X���I:U�Jvs��*�;������Mi���1@c�h	/a�����h+��W���qb���#�l�<������M��Z/)%�� c3�0�a�#�?-e�al��*Gm.�^Z\a��C5��u0��?��y5N&#k�8��4;���������[1��[������,OY{�F@����f���Q�;`

��)0������S�)@�%CS����,^<1�rs���1�.����x��K	p���
�i�fWma���H��H�M
�z� ����06�	[�e}f��[6�����IwE��=��x`�qh8���X���>@�:�m��Z�E���F�\)��p ��Bf���A�L��0wp� ��4^�������/6��b�������AV���C���/�D��2� �xn��K:J8��r"�&�u�@GF)c�����UEe.�
���.W��&�u�B�8J#�_]�Gk��jW�x�e��py`k�����l�H���������X�9bQ���b�E�T��a�a�-���1ZR��<���3	i���u�A&;P:�&]^��w����LT������DJxa	�e���I�)�4���E�:�1��h�-��a�/Dr06�
iuo�)�WY�y�6���,�sT�>5��WuF��12�n2�c����,r�O&����KHcl?����F���C�&�m������_*�*�S���D�
]�&t�!������A�7�?F���_+8t���/�k�V���:&t���\�RC�
+�V8��Y�)����n4��K��l��S�t����{z����vU����n_Uf���U����W��]��(��	������Y���4���h�����rr�j�
Z�T��fmQ�)��ugw�9q���`���po�3��i��.=�/��~�_�� �?�����wp9�_��R|B��O�>�n��|����+��{�0�LD�C�
�����H�Vd,��6�&.��d!n|{m����������;������n�u���o2��`��~!��,.�m��)Y��������2�uy/j���Y$W�_�d�����I6�Og��0C��d�d���`p�L�O����+������`�L������U����m�J�aU��=������o���w���6�/OUEU��r0�E�PE00��NM��Y6��_-��}���~z/<���^�_������n���������y��V��{��8[����4�^$)�=t���o�x��������.��Z>~zrd�xup�'����������/F.�kS��������d t�E2�?���
+k���J��z��h6��6����'/�)���/���L*�7rk*d����D�"E}�	��j��������<�lo$�I^Y�uW|�j/]J���l��3����~�f��$�I�����U�6L��&7:�^�&S>������;#i������l��O8����;�*��g�F��,?�J��t������u�
D!�D�,N�D�G��y��qr���U������H���'t��1�#���w�y9�F�K����%P�?-�c��:����h�(4��V�����Tw(�WV��F�!���j�juGS��_��7Z�A8s�������q�����eTX�F]GE�P3Y�Mm��aI�15��6�e��o�R����p�S�b8�f�������E������w���X_>V���%����+����,��������e�����M^�F�����W�����.�b_�+������q��������
	��,��)�����_����"��{�^q�r@U���>�0��?�<=b��pE)��N�������'��;�����=[)r���y+�7\}�\82�ea������4������,����yb�VO��w����~*��	Ol)|��=9��*�^
��������nj��7�)�'�]�=�������Cd��a��`����a��(T��$p�u4�G4��a����z<]�N/~b;i�Y�3��44
��P��x��d�d�z��<��n�����4����v��|�i����Vk6���K�	;��l1^�a��^|��v����v��O���E������w���|@���|�O���*V��k�u[�7,�C
{7�������/�f�@+O�E�f��
����'65�;�+��|���	�������X���Y�>qeu��nK��k���	�'V��v���
|l;~�&�v����>���8-����������u��&nv(\�'�������n�g�v������_WZ��li�m�����m��n���Ol�ow��������D����o������7�l����7����bPVqu={��S\MS��5{��������a����,�6u���k�q�����������j_��"�4b��B��'����u�����������= ��s��B{��}������������R��[�o��m�?X�[�����[��1��.Z��0�M��\����<��=!��\��g�n��&���������|��g��=�u�g��=�u��W=�%�g*<�� ��y>�l�����X<#X�����f����������F#=;��Y�|F���� �G1�9�}^��s����`�o�����L���[Q.k��^����\g�������-��e����
���������|��9��
�������&���x?�wq�}w��m:���Y�����gB��������b����D����L&�;+��/-�V�=�W}�Z�
{����m�}#���'nEU��sv�$������:��MFc
*��+L���y��-��1�����+ng���7��$f���� ��,0��]��_C��1D���GQ��]n���~�k��H�'���c#vR�f�[&������+;L�jQFt�t���d�G�;�`��]n=�r�����Y�Wk?������Cw|c�(�������I��4������cjV�/�C���������h��N��2����^<W���(���^��]w//���w��rH�����s]�$������,��!����e:��K���-o�cY��tF����l�g���9;��U��@T��d��0�O�Ufp{)�����S_y������?��������{�_w�X�Y7����Tg���|��T����x���%�~`\gB��"2"����<��&;���%_T_�������JS\����q���hM�����Zc�<���&��U����4�~`�T�A��4������X^�7t�U]Bw?4�.N����au�
3���+�o�������UE�K���jJ�[�YQ��~����+��0��pz���+=�~#�aBj�^���;k���T_��U����e4��vu1s`�}�L��q`\�W����J���\��Z�K��+�d"����e��o����U_��`.0v����@�e��c�������DrHEU��[mVe���\^]��k��o�������Kn2�.��p�-���Q��}uO2�h?2�;")z������r9���5���7^k|�0^��e��!�Cy�!9�t<�fH����R�����������Yij��+M��83���&���a����&�z$gPl�Ys.�v���a��n��E�����aui
�E�`>��a����i��<�g����5k��O����kj������F�y=���Zk��B��Q�z~�����#m��-�����������[������������[����Fm��vE8EjFo��E?�m��������mY�-+�]]��.=�����VkU�-4��f��2[P|�oXs��������3������[�j�W=%^��*��*�N}��J��J�n1�n1���([�����v�����E����zz����?K���[�w��n��-�������m��-�������5����������o��-�����E9�(����ry��Qn!�-�V��"[b��i����~�Xm��-�����e8��Q}�ry[.o����f��f�����������o�����[��-��|����l�������)Q�a�3��L
�k���;������_��5��u�6���������}����?G�_�w��o�,���_������wX�[�������+�^���%_[Q��o^|g����p�VT+��u���/1��n��Xj�KGu��������s����H�o�tY�N�9�����2�������<�MLu���n&�j:���_����"�_
`�{c�Y��J������yQ��������H��
�5C����pE:��8�j�=���XI"�OY�4Y��v��VQ�U�k�`��c%�]W�8���k����q�l��D#G���9�/)_d��V������O���i�n>��oV�&`�>�������K����+�����'�K�Ho���o�i�q1[��6�����Q"�M�>����w'?�^����5����j�7L�����om����-�fZ��T-�M�f����h�6����x:�,�������e����7���M@���v������W���p���A���a���3�f��m�Y+w5u0�]��LN@0w
�f0w���|yY������LcWo9��LG��(�Z$������(����^t��
�a�r��-���� 8� ����f�MA�w�F��i����v@3i�l��@&F&�.��$=O���
��LG��|b�P�LA��L��t8�$���~8f����6��?0�����F~+�T@63��o�i��m��wr�~��~(�Up+Uh}K�|��'���c���7O��o<9���k�h��h��h�.���_�d1�>��#�/��[w?w����8�y/?� ?9L&���������
@�w��4n�i�i&j�?I�R}Za~�N1�xC�����_3���{��A�w���.����<���;�g{b�;�X�T�t�<�$��$t���2����h��e������jT����e�p���>��P��J�����+�R���1�����c��5��b�qc�M��YXv�?d�0w��K����Uw�����:%T��&,�/��b����V���IS+�d�����^Ho����p��z	@!�M�{
�	�������Z����/�}@6=n*��^�MO/��K�^P������Q����tz�Ig^������U�7����������������7�O-
�PP(�����B]�n�n9�7��G,( �>#��������ED����?WaQ�Qf��E2������r�Yk�_#�eK�U@}�u���$f�
H����C8XG��r\�6��`S!f��rw �7�S�i���j�����X�k�U��5��q]�E

6�P��K42�@#7�L ����{�]I40q_Wki��`2�7�0�X���;�� 
[�>���@�BF�
#O��x�����C�8v��E	�MR��C�V��{��]_No�G�/���7�lyk
���J���GJ��h�t����c���O��
Uh����t�/G����e�e�t�5������D�����vq�
���1�
�Rr�2�s����@&B5����~Vv�X����L�����+�Mh��uw3�To�JE%���]B�5�W,Z�x��N��X�����'�����%b�~K8�5���:,�������B��;�d�����b����d1Q��]��e:\��\wD��e;�	��h6�*��I���X��\���J9G@�"C�(������s��D��Y�����u3�K>�']�GrC�l:��Y�X��6QZw�&��%
�z7�n���T�+)I�W��W�����[�����/�z���*���X�I�.�������*1��[��t�'��#����7c��D<u+���$�]cQ4���[��e��y5[X�x�m���|E,(Ed7�*T�~2m�f�x*��-��in������F�iq�k��q�� �/�w$�/Z8!�`vE]Pj��������,��f�7�����
�f��$F<��*���@cC+���F���p�=�+���+����E�O-�se������RK��b������{�g].E%���EoG���dB��3����$�f�T-M �1+�
������F���*F��[t~?��Y�������+�A�5k��SW&��!���+g���V!���Pr-=�Z��P���X79�k��{Cu�w�:���\^J�)�I�'kc!t��4��*l�o����()J�T=w��`�����i���7�k�����JB��-Z\�s�#�"�<9��������B�
������+b6����]��3�8U�����en68�-kK���b ���=���f���������o��:�����x�.�u#�<��f��Wq�\el�O�,?����'o�##���U�f���7V��):�t����`�=��m_���I���(vn�%�����I�������{7��9���I��)@0X;7�#�cX�
�����Uy�3���b\Hc����C�s����3	���sV�?��m���T���r���CW��#��l�ph6�v������w��������@���+�-��WUw-�zA�[�R�
j�l����h�j������+k���}<=�����Y�N	��6Gs�t� ����%F�t��t��?X�[�����Z������p�� ������i&7(]�{x��C�������FjK������G��PH�y�������ms��?�C�V�@'8N[���a�����o�f��?]~o��K�H�]��
�Uh@`8:[�hG^���	J���0|}9���6��-��i�Q1m���S�6��m���1�$��r�Z����h�|��mW�������uR+��>���cN��������tOvzR��&����z�4�e���\���D�c�56�W�L�l���?n����
�s�>�]�6��W��t�Q�d�i���Ht%��P�D��]����* ����%������@9����L�f1�]�|��W��-��y�<�����v�
8n����}�Y�X���;n����sK�
m�C����$@�Z��0�q���I���B_����v�M�c������=���b���k5��W�����4�|E�#�~D�#�,u3P�6,��
�^��|�K��j3c���=��1�4�bM�!g
��pK\��r��bL�1g�>�n����R2����dB�fT�_K�jOUr^���Qk�s[���s�aw�w������4�M�7M�$�9#�N�P�5�����/���!M�F8�����o��PU��c� �,�d�g���)��W\��-l�/h��M5�N��z��%��l�jc
����q�*����Q9y�i����7�Tq�v�]�5�0�����[�|mp�v���m����1�lKeew�W���
r����p�{�
��������v��������6��mX���|�
%��q�9E�FJ�����PK�{6/�bZiWF~�&�b�}�)<������5��97���������=�k�c����+��hs���T�
��C���1`�m��.��(�?�;��8]	
}��U%S$��������h�W�I�I@���Pq��zcf���Pf�m��l?oaF�x��G���T��� k��f�,ZK�(R������j6@`"C+p{��W�!�;���U�/3(j���ER�@�vd��x�W��7���$��oe�\%�Rv�w_Z���"�������)�V��jK������`��,��������X'�a���u�-��`�
G4+���f!�"	���&�����%�#�#�I�)z|������������e>�����v^+Z��t���H�O�������c�`�m��.����l��s-	gK��$�7�h�C��l����#����_U�'���A1�AE����{N\����s|v���q[Cms5�v���0�M���9��������}i�K�]���c�;���mz�S���^	'H�X�-���4+�����l������v���z��Qw���m����l���������S
�Hj��������(�������U8��vz���X��@����R�U����������������g}���c��8�w���
��Fluu �V�@�1��;=C��`Y�,�w��������=�����D\d:G��N�)>^�����1�\�������Q���Q�������L��_ |A���G�y��w��&uJXG����A�j6@�����ejM^�x$��z�r��� ��-�.�?�������F��jwI�x��EV�C��� ���������~��uO�p�T��*�{G�w��\O�����J��z:��w��g����j|O@�;a�f��X���y>�����_��E>��9�_�}�TH��;:.� �^����@$^$�U����~.���;�y�Z%��p������o�J����W�U���
�I�
�v�1����������A�8�~�2�] �8����F'��u�c}�y.�g��D���6	8�&.������=�M������k�&�+H������n���
P4���k?�j�G4������������qw��p��T%����*����UKxb��W����g*f�G�T<y}q�y&�M��>������I�	�u��=�q��`pLw0�+G{�Z#b`����YP�"���X�iZl�zr�,8(,�*g`���Ey�*�.5K ;�-��&��u�6�Cu��*�*�iQ[���,*���RT�������U�`��c�� vSl�K�p���l��p{��\��p�����8����<�F�i���6�������17����z��Z���0��|.�x+5�Qo����U��N��N�:�!������>7��mM��`2�
{;t�����#��K��;\�p5�\�oJ���r�k�\X�r��Ul��c���'�,/�>��V���Y�l�b-6��������$+Rn��hM��1�����)�����E�0�����"��;\�p�txz�����������������{|����p���
��
���=7�������0\;���V�����(�l�GR|���W�EG�����]���������z����������?����������C�H��p!��l�t����+�rGK��G���teW.��d�N(�V��t�rk�h�v���8��VP��E8����*%��2W:�����;\��������'_Ue����d:k)h�l�n�=�`o���]����P�K�3��F���ad, z_��h�+��U��v���)n�I����Q�K���$IiU�A���@b����Y������������n@�;:2�����u�/P1.���
�#6w�*��hj�9M���e�R���"u45�]��z���~��8G�N��xrtxzv�������{"��<>;?y�����[F���D��1�r�����zjM��p|x�������pa?~������Mk�����+5�\eg�Gn b�����wI��aU����.�&��d�=��������$�]�yc��;K�f�06.���&�
J���A*��P�������B��i��w-�Q�����mQ@��&��5[�R�:Z�j.&{3P�no#0�Z���$���f�ww^����n@.��]
��
��������Mqe`��Z�~UnC��n���t����.`�]��W��K����eRG�.�]]�v�v�7�p<J��r���;^-R�['2>�h��9"%W���/��&�5��Eu/Lj���t&B~j��#���*�i9g�o��'���w.`�]���V�J3�w�w0?]hyu5Je.{�"({){{�QD��.��]���
5��
��
p�.��U����ko4���
�@��1tx���#y�����mK�Ah�u��/���b��H��A���|����z���P��E��L�.G7�}��,��C*�����GqA��\]�|.���u��v�	v�����/Z-���@o����`S��
����k:�
�j����Q�WBY�����Z�=�F��:\`���E�W������C�j!����q����W=;MK��W��\�=��l�u�S��~�
p�B���w���1u�_2(IW��s��C�k��R���S�������bB�Df���9�l�'�%�b �,��S
��������B�/���?y�H��rG
�%�������.w��Y�����z-��\n;��
-��4�3�8�<�G+��k�����>�BV��q��~���}J�T������FB]��:��j^@XP��:vP���#��`�\�g�&]�����!�j�X`�.��C*���\��w{W���^tC�r��
h�����������?����P������
 �.E��U]����)��;uX��
��U�y���K:E���w���f<(u���U�h�k�/k#�^2�k�O��6,�:�Y�<���Z�x-��/�%����&4��t�#��l@�6�_=|�8���t��b��{�Lyp���]����}:)L���]y���'{�jX����E�w�G���Kn�b
�y C,3o�;`@��5_���`Q��������w}#b����w:y���ND3�mPt��*��pw�
�J�m�����UX@��J���v;��X�E@s�p
.��]k_��WUR��\���u������.�a.:c������A�h��M�H�]�-��[�pt��4y#F�S�&���k5{��!M�kY�
Q�]���E�V|B7�3EW��T3�2����f�Aiv3���SvYNF���T��#�a��f1�����B�Fw���
���:�'�z6k=3P�5o�!=���M������?��x�i�J��@��:J=7�m}Gw�����U=��	���@�s�m.�Y2���F���Kf�d@��!�'�[�OnJ���?�S��I�zD��wu<�:�\��q�j6@�B�c:�Z�n���i���f�xw74:R�����!3��]_�����}{r�����r���bxq����13`�������suB��ff����:|�����d�![�H�PoZ4���w�����*t^�I�p����z!|j�\�|�k����k�zw?�z�8��]
�����i���d��+c����t�ogI��]������.���l5�F<��r�O�\�q�vv�����s:���(@�]Y���nl�m[�W���F����;v���kr^���$v9�X�s���)I�3�������3�����i�u<�Zz�����t�--=/*H����g�^W��qx���2-$�y=�r#��p����j6zI�8 XUt`���V�y!�oe����P����m=�e����g��ZN��M����h
�@��7��j�� r=.�������L�^��Ni3������j6�M���~������&�5c��*=@W=ct���L	H��[�0�Qz�z��f$��K[J�O��?����?=����?��4�c������s��������=�k}�	�����	���l��= D=.��*�����c:���� ��Y���y����YO���i�53�<��v�s��@]eeZH@�\��\�����e���qlfK����`��%�5����%�Z����W��7�(��0I����_������L5	�	w��zpb�'��;0��x���gz\d����g���0������e��0O�6�=Lw�����X��#y���G������<6�YY�z�W}t~?O3�q���~R;9�z�vr��X.�R�@�r�Eg�������/U��t��Us2�~��$
��q�V@�zF��5e���6�����a��;ni1�I=Of���&����R�w������(0SJ�lm����8^C�E=VL���^,L#^��I�)��wg�����%���E�\$��}dCb-�*;����u��[D�+
���*(�r2��R�=H�q��{��!� ��N��]pS������[����9Pg�Z$����[D%)]@�6�nf�4��r����(io��?����v�R�_[`������Fn<�zF��];�@s�>+���J_�j�F���qh��
P 
|�)��'�����N=���������^�5�1`O=�=������L�<N��A��)5��#T�l�l���V�hSOK��Zo$+w�g�M�������y'G����B.��+YLUk��h�T���xw�%��=�2�2��L�Wx��k0�O���}��zgP��H����B5h�Q�m<�uW��r�6^X�p��bd��f�	}6}���h�fI']���I@m�$��*v(��� +b�����`U=-��V2��zZ�Tb�����G����<!W�B��������uTnQ���L������E@R���n��[W�����!�����"V�P�p��].'��E@�����E������P�d���U9��7!��._��B��O�;�!.��U�J@�z��rV ������JdK�n����v.���#�B�H���7�n���6���E��G.����W�/-U�]=}g�w��g��n�;��3b\[�S�n�N�:���O��Y��{Y-��~��������:�~��z�����5v{7�kq�����[5 hF!�[��f>S���E&�T����gU��$[.�\�Q��6���Q�r<H�F_*���gc�z���l�����������F���#��� �H�N�r���$;��v�)9fk�^��LI���mxrs5�(R5��z��Q���}q*�_O�j-�X�'��$�������;�K��;����iH��u�e��W9g`�=��2��MF��Wu����Y��
�L�Ls����&c�_Rk2#?#�]�$���(
Z�(��7�G���xw��<�g����y`���&6���F��%���[
�4�gT	S��E6���T5�h6���[�KfM`��-���+M+��h/�]?m9v�����������Rlk���/�� �i%{4Q�g�,�o����Mfi���+�NSRv�����{F���g_1�q���Z�/�,��EQV��K�o�A7�"<(�����A�h�2����"(��A>@���!i;���1��8i���[mK�>�I�����������Ph��<�k���H]e�Ss��6�U�P��@e�����F��/'KDAk�������k����<^����c����Z�)��������(�X���;"Q��7j��a�f
�����[����6�m	p	�������W3Id���}�\6�yS~���e�,^
��9��NO��C��anw���^��X��L/����d���uZ,�g���=+�Us��<����Q�P������=3y�49#�+���z*����hE'wq�Oh��<#:�:���Ao�u@��1rw�����)�K%Wc�m�Of�;]��������)��[�<�f�I@�8��0�\������h�A�X��\�k>�v�eZ��8.fi�T���[u;�oQBE4�v�R�e�[���}����S�E^IfR�a�k�G�'}��T�E���`\J�U��f�Z�|�F���o��^uf?�*��_�^u|���|3r����P\��'�������������2%��OH��x&�����t��������V���d�����y����1i�Q|	:�c<e�c�F�\P\5�,�Qn=G=��V�B����tt"\�yq
'���u�(Y =�f4�5������p����w7:y���@����9A���H g��v�R3-.��k�����s���
��|��l[��|�ykO����YtT%4�<����������������������������>�^�S� �s�DG����5�DgL���0�g�l�d�S�,N�:��6T�Y31�>3�P����j��[~.��}���m��&��u��o�����G�j�`������#��������j����oO��N������dk�����N���������b�Ix�������Rj-��3<���u�}5;�Z�bb�X���a�s|�[����	��~`����K15N�!O�[�
2���7��e�{�:"�`�oV�����}V����Q_I5�����'Z�I�m�<��}�W���d@�TVF4O� C���ty�cz�z����C��G��D�������W�������+���q:,rDk8[,�t>�RO��t��y���}@R�z�Z��J��F�V���c�,�b������(�s���
�M�^� �~��_�����0�=��=.%wV�r�c�v�	�+��(�0�S���J
t�m��m*a�����s�j+P����"0���A�$�����go���������O��~����.�Vo��R�'��91���dI����4Y0��He?��P�2i�x�7�F|sg�����]��pa�
^��o/��W�f�,*q��|�W�5��z B�)����p�����j�h_<P0�^������KG�3t�X-a��G������<v������*h�8��]��x@+��V.N�W_/�}M,b���)���v�����lV����s���l��������Z��m"��x�9���L�99��'YF��1.���\�3O�� a?6��Q��l2b�������&�sq��l��Gn<\;�M�I��������.�c��"��������t>1-���t70
}�(�5]�P��������rp�n��ui��=t8S�C�$0Bv;]BJ?%��� ��r�h��,�������������J>�����z8���x�l��p!��D�>��U�0�-�~$���k1�u�`��LT�-�'��F8`C)+�U���6��UOE/_>�,���c�E���X�h(9�8��TU�����9`���$��Q���h�5��U�#�CUO��~;�3U���66�R)y���p�%���
Y;��{��[��RmJ]�S���Yx���Q�j���C�TF��Y�Z F>��a1��j��:P;0
�Z�����O���;0
(m��
�-@����d
�q��ZN�HjC�w�e���'��oZ|�Hj7���1r(�^��3������NVW��y�@���dwM�Q�������+�����S��Qu���pV�K�t��Cy���V�Z�c���!:`�3f�{O���<�`r5�M����.��: �
i�?��T=�P���z���Et{�����EH{���J�zD��0r��j3(�a>�|�A�����0���X�v��l����N����(_����f��3���u#��&�a��]��Q�>�UM�1�J���U]7)�>0�M����zj���A]n�n��������*�
r~��]Y�xFK�]��J�au��t��������v��|���]�H������v��5 ��(������@m `_���5�n��t7A�2��y���c
�!�l@�k���n;���Z
[��T���Pli4[���`�C��~P�����-^����6N�6w{�^� x������{��vU��?��%Cg��#���-��^���-Ej����mEjG�v� x�&�2��9�����^�e4�q��#+�v��(U��+�v�J������$40�������h#�[f�M�f��c`��������l����j6@~t���5�$��r����� 4F�z�F�dKh�q�n���4���"�v\�^;Pn���
h����`�C����q�|�GZt�!�,=�.�l�"cT���M��>�k��N��/�l��~���O���6��l�D�.(=�����tq������"�Z����l�t�!t����!��N�p1me�$�L����v������P����m(�������f���N�`����I�F��|�MAl|X�z����!�40���_�k"�vM���0���(�G���
#73Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#71)

Hi Ajin.

I have re-checked the v13 patches for how my remaining review comments
have been addressed.

On Tue, Oct 27, 2020 at 8:55 PM Ajin Cherian <itsajin@gmail.com> wrote:

====================
v12-0002. File: src/backend/replication/logical/reorderbuffer.c
====================

COMMENT
Line 2401
/*
* We are here due to one of the 3 scenarios:
* 1. As part of streaming in-progress transactions
* 2. Prepare of a two-phase commit
* 3. Commit of a transaction.
*
* If we are streaming the in-progress transaction then discard the
* changes that we just streamed, and mark the transactions as
* streamed (if they contained changes), set prepared flag as false.
* If part of a prepare of a two-phase commit set the prepared flag
* as true so that we can discard changes and cleanup tuplecids.
* Otherwise, remove all the
* changes and deallocate the ReorderBufferTXN.
*/
~
The above comment is beyond my understanding. Anything you could do to
simplify it would be good.

For example, when viewing this function in isolation, I have never
understood why the streaming flag and the rbtxn_prepared(txn) flag
cannot both be set at the same time.

Perhaps the code is relying on just internal knowledge of how this
helper function gets called? And if it is just that, then IMO there
really should be some Asserts in the code to give more assurance about
that. (Or maybe use completely different flags to represent those 3
scenarios instead of bending the meanings of the existing flags)

I have left this for now and will probably revisit it in a later review.
But just to explain: this function is what does the main decoding of
the changes of a transaction.
The point at which this decoding happens is what this feature and the
streaming of in-progress transactions feature are about. As of PG13,
this decoding only happens at commit time. With the streaming of
in-progress transactions feature, it also happens (if streaming is
enabled) when the memory limit for decoding transactions is crossed.
This 2PC feature adds support for decoding at the time of a PREPARE
TRANSACTION.
Now, if streaming is enabled and streaming has already started as a
result of crossing the memory threshold, then there is no need to
begin streaming again at PREPARE, because the transaction that is
being prepared has already been streamed. That is why this function
will not be called when a streaming transaction is prepared as part of
a two-phase commit.
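
The three scenarios in the quoted comment, together with the Asserts suggested above, can be modelled in a few lines. This is only a sketch under stated assumptions: every name in it (SketchTxn, ReplayOutcome, finish_replay) is hypothetical and does not come from the patch, and it mirrors the wording of the comment rather than the actual reorderbuffer.c code.

```c
#include <assert.h>
#include <stdbool.h>

/* All names below are hypothetical; they only model the quoted comment. */
typedef enum ReplayOutcome
{
    REPLAY_KEPT_STREAMED,   /* changes discarded, txn kept for further streaming */
    REPLAY_KEPT_PREPARED,   /* changes discarded, txn kept until it is resolved */
    REPLAY_FREED            /* plain commit: changes and txn are released */
} ReplayOutcome;

typedef struct SketchTxn
{
    bool        streamed;   /* set once changes have been streamed */
    bool        prepared;   /* set once the txn was decoded at PREPARE */
} SketchTxn;

/* Decide what happens to a transaction after its changes were replayed. */
static ReplayOutcome
finish_replay(SketchTxn *txn, bool streaming, bool is_prepare, bool had_changes)
{
    /* The caller-side invariant that the review asks to make explicit. */
    assert(!(streaming && is_prepare));

    if (streaming)
    {
        /* Scenario 1: mark as streamed only if there was something to stream. */
        if (had_changes)
            txn->streamed = true;
        txn->prepared = false;
        return REPLAY_KEPT_STREAMED;
    }

    if (is_prepare)
    {
        /* Scenario 2: remember the prepare so later cleanup can rely on it. */
        txn->prepared = true;
        return REPLAY_KEPT_PREPARED;
    }

    /* Scenario 3: ordinary commit, nothing is kept. */
    return REPLAY_FREED;
}
```

With the assertion in place, a caller that passed both streaming and is_prepare would fail immediately in an assert-enabled build instead of silently taking the streaming branch.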

AFAIK the last remaining issue now is only about the complexity of the
aforementioned code/comment. If you want to defer changing that until
we can come up with something better, then that is OK by me.

Apart from that I have no other pending review comments at this time.

Kind Regards,
Peter Smith.
Fujitsu Australia

#74Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#73)

Hi Ajin.

Looking at v13 patches again I found a couple more review comments:

===

(1) COMMENT
File: src/backend/replication/logical/proto.c
Function: logicalrep_write_prepare
+ if (rbtxn_commit_prepared(txn))
+ flags = LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+ else
+ flags = LOGICALREP_IS_PREPARE;
+
+ /* Make sure exactly one of the expected flags is set. */
+ if (!PrepareFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in prepare message", flags);

Since those flags are directly assigned, I think the subsequent if
(!PrepareFlagsAreValid(flags)) check is redundant.
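
To illustrate why the check cannot fire, here is a minimal sketch. The flag values and the body of the validity check are assumptions (the thread does not show their definitions), and choose_prepare_flags and prepare_flags_are_valid are hypothetical helper names standing in for the if/else chain and the PrepareFlagsAreValid macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: the definitions are not shown in this thread. */
#define LOGICALREP_IS_PREPARE            0x01
#define LOGICALREP_IS_COMMIT_PREPARED    0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED  0x04

/* Hypothetical stand-in for PrepareFlagsAreValid: exactly one known flag. */
static bool
prepare_flags_are_valid(uint8_t flags)
{
    return flags == LOGICALREP_IS_PREPARE ||
           flags == LOGICALREP_IS_COMMIT_PREPARED ||
           flags == LOGICALREP_IS_ROLLBACK_PREPARED;
}

/* Mirrors the if / else if / else chain in logicalrep_write_prepare. */
static uint8_t
choose_prepare_flags(bool commit_prepared, bool rollback_prepared)
{
    if (commit_prepared)
        return LOGICALREP_IS_COMMIT_PREPARED;
    else if (rollback_prepared)
        return LOGICALREP_IS_ROLLBACK_PREPARED;
    else
        return LOGICALREP_IS_PREPARE;
}
```

Every path through choose_prepare_flags returns one of the three constants, so a validity check placed right after it can never fail. The check is only meaningful on the reading side, where the byte comes off the wire.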

===

(2) COMMENT
File: src/backend/replication/logical/proto.c
Function: logicalrep_write_stream_prepare
+/*
+ * Write STREAM PREPARE to the output stream.
+ * (For stream PREPARE, stream COMMIT PREPARED, stream ROLLBACK PREPARED)
+ */

I think the function comment is outdated because IIUC the stream
COMMIT PREPARED and stream ROLLBACK PREPARED are not being handled by
the function logicalrep_write_prepare. Since this approach seems
counter-intuitive there needs to be an improved function comment to
explain what is going on.

===

(3) COMMENT
File: src/backend/replication/logical/proto.c
Function: logicalrep_read_stream_prepare
+/*
+ * Read STREAM PREPARE from the output stream.
+ * (For stream PREPARE, stream COMMIT PREPARED, stream ROLLBACK PREPARED)
+ */

This is the same as the previous review comment. The function comment
needs to explain the new handling for stream COMMIT PREPARED and
stream ROLLBACK PREPARED.

===

(4) COMMENT
File: src/backend/replication/logical/proto.c
Function: logicalrep_read_stream_prepare
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData
*prepare_data)
+{
+ TransactionId xid;
+ uint8 flags;
+
+ xid = pq_getmsgint(in, 4);
+
+ /* read flags */
+ flags = pq_getmsgbyte(in);
+
+ if (!PrepareFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in prepare message", flags);

I think the logicalrep_write_stream_prepare now can only assign the
flags = LOGICALREP_IS_PREPARE. So that means the check here for bad
flags should be changed to match.
BEFORE: if (!PrepareFlagsAreValid(flags))
AFTER: if (flags != LOGICALREP_IS_PREPARE)
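
A rough sketch of the stricter reader-side check follows. It assumes the message starts with a 4-byte xid in network byte order followed by one flags byte, which is how the pq_getmsgint(in, 4) and pq_getmsgbyte(in) calls above read it. The helper name read_stream_prepare_header and the value of SKETCH_IS_PREPARE are hypothetical; the rest of the message is ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value for illustration; the real definition is not shown here. */
#define SKETCH_IS_PREPARE 0x01

/*
 * Parse the leading xid and flags of a STREAM PREPARE message.
 * Returns false if the buffer is too short or the flags byte is anything
 * other than a plain PREPARE.
 */
static bool
read_stream_prepare_header(const uint8_t *buf, size_t len,
                           uint32_t *xid, uint8_t *flags)
{
    if (len < 5)
        return false;

    /* 4-byte big-endian integer, as a network-order read would return it */
    *xid = ((uint32_t) buf[0] << 24) |
           ((uint32_t) buf[1] << 16) |
           ((uint32_t) buf[2] << 8) |
           (uint32_t) buf[3];

    *flags = buf[4];

    /* Stricter than "any valid flag": only PREPARE is acceptable here. */
    return *flags == SKETCH_IS_PREPARE;
}
```

A COMMIT PREPARED or ROLLBACK PREPARED flag would pass a generic validity check but is rejected here, which matches the AFTER form suggested above.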

===

(5) COMMENT
General
Since COMMENTs (2), (3) and (4) are all caused by the refactoring
that was done to remove the commit/rollback stream callbacks, I
do wonder if it might have worked out better just to leave the
logicalrep_read/write_stream_prepared as it was instead of mixing up
stream/no-stream handling. A check for stream/no-stream could possibly
have been made higher up.

For example:
static void
pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                             XLogRecPtr prepare_lsn)
{
    OutputPluginUpdateProgress(ctx);

    OutputPluginPrepareWrite(ctx, true);
    if (ctx->streaming)
        logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
    else
        logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
    OutputPluginWrite(ctx, true);
}

===

Kind Regards,
Peter Smith.
Fujitsu Australia

#75Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#74)
1 attachment(s)

FYI - I have cross-checked all the v12 patch code changes against the
v12 code coverage resulting from running the patch tests.

Those v12 code coverage results were posted in this thread previously [1].

The purpose of this study was to identify if / where there are any
gaps in the testing of this patch - e.g. is there some code not
currently getting executed?

I found that in general there seems to be quite high coverage of the
normal (not error) code path, but there are a couple of current gaps
in the test coverage.

For details please find attached the study results. (MS Excel file)

===

[1]: /messages/by-id/CAHut+Pt6zB-YffCrMo7+ZOKn7C2yXkNYnuQTdbStEJJJXZZXaw@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v12-patch-test-coverage-20201029.xlsx (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
#76Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#74)
3 attachment(s)

On Thu, Oct 29, 2020 at 11:48 AM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Ajin.

Looking at v13 patches again I found a couple more review comments:

===

(1) COMMENT
File: src/backend/replication/logical/proto.c
Function: logicalrep_write_prepare
+ if (rbtxn_commit_prepared(txn))
+ flags = LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+ else
+ flags = LOGICALREP_IS_PREPARE;
+
+ /* Make sure exactly one of the expected flags is set. */
+ if (!PrepareFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in prepare message", flags);

Since those flags are directly assigned, I think the subsequent if
(!PrepareFlagsAreValid(flags)) check is redundant.

===

Updated this.

(2) COMMENT
File: src/backend/replication/logical/proto.c
Function: logicalrep_write_stream_prepare
+/*
+ * Write STREAM PREPARE to the output stream.
+ * (For stream PREPARE, stream COMMIT PREPARED, stream ROLLBACK PREPARED)
+ */

I think the function comment is outdated because IIUC the stream
COMMIT PREPARED and stream ROLLBACK PREPARED are not being handled by
the function logicalrep_write_prepare. Since this approach seems
counter-intuitive there needs to be an improved function comment to
explain what is going on.

===

(3) COMMENT
File: src/backend/replication/logical/proto.c
Function: logicalrep_read_stream_prepare
+/*
+ * Read STREAM PREPARE from the output stream.
+ * (For stream PREPARE, stream COMMIT PREPARED, stream ROLLBACK PREPARED)
+ */

This is the same as the previous review comment. The function comment
needs to explain the new handling for stream COMMIT PREPARED and
stream ROLLBACK PREPARED.

===

I think it is more intuitive for these functions to write/read only
STREAM PREPARE, as the name suggests. Maybe it is the usage of flags
that is confusing. More below.

(4) COMMENT
File: src/backend/replication/logical/proto.c
Function: logicalrep_read_stream_prepare
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData
*prepare_data)
+{
+ TransactionId xid;
+ uint8 flags;
+
+ xid = pq_getmsgint(in, 4);
+
+ /* read flags */
+ flags = pq_getmsgbyte(in);
+
+ if (!PrepareFlagsAreValid(flags))
+ elog(ERROR, "unrecognized flags %u in prepare message", flags);

I think the logicalrep_write_stream_prepare now can only assign the
flags = LOGICALREP_IS_PREPARE. So that means the check here for bad
flags should be changed to match.
BEFORE: if (!PrepareFlagsAreValid(flags))
AFTER: if (flags != LOGICALREP_IS_PREPARE)

===

Updated.

(5) COMMENT
General
Since COMMENTs (2), (3) and (4) are all caused by the refactoring
that was done to remove the commit/rollback stream callbacks, I
do wonder if it might have worked out better just to leave the
logicalrep_read/write_stream_prepared as it was instead of mixing up
stream/no-stream handling. A check for stream/no-stream could possibly
have been made higher up.

For example:
static void
pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                             XLogRecPtr prepare_lsn)
{
    OutputPluginUpdateProgress(ctx);

    OutputPluginPrepareWrite(ctx, true);
    if (ctx->streaming)
        logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
    else
        logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
    OutputPluginWrite(ctx, true);
}

===

I think I'll keep this as it is for now. Amit was talking about
considering removing the flags that overload PREPARE with COMMIT
PREPARED and ROLLBACK PREPARED, and having separate functions for each
instead. I will wait to see if Amit thinks that is the way to go.

I've also added a new test case for test_decoding for streaming 2PC.
Thanks to Peter's coverage report, I removed the function
ReorderBufferTxnIsPrepared, as it was never called. I also added
stream_prepare to the list of callbacks that enable two-phase commits.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v14-0001-Support-2PC-txn-base.patch (application/octet-stream)
From 32037da69873728bed286f0b4751936f15dd5c81 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 29 Oct 2020 05:59:39 -0400
Subject: [PATCH v14] Support-2PC-txn-base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, with the corresponding GID.

Includes logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 192 +++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 145 ++++++++++++++++++-
 src/backend/replication/logical/logical.c | 229 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  56 ++++++++
 6 files changed, 666 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 8e33614..7eb13ce 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid_aborted; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -73,6 +78,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -88,6 +96,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -112,10 +132,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool 		enable_2pc = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +162,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +254,40 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_2pc))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				long xid;
+
+				errno = 0;
+				xid = strtoul(strVal(elem->arg), NULL, 0);
+				if (xid == 0 || errno != 0)
+					data->check_xid_aborted = InvalidTransactionId;
+				else
+					data->check_xid_aborted = (TransactionId)xid;
+
+				if (!TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+								strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_2pc;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +359,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +604,25 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* if check_xid_aborted is a valid xid, then it was passed in
+	 * as an option to check if the transaction having this xid would be aborted.
+	 * This is to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			   !TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -646,6 +814,30 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction");
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..ad8991d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. The transaction that is currently being decoded
+     may be aborted concurrently by a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,55 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +656,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or any other uncommitted transaction, this
+      change callback might also error out because of a concurrent rollback of
+      that same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +739,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded
+       at this prepare stage or later, as a regular one-phase transaction, at
+       <command>COMMIT PREPARED</command> time. To signal that
+       decoding at prepare time should be skipped, return <literal>true</literal>;
+       return <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +793,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or any other uncommitted transaction, this message
+      callback might also error out because of a concurrent rollback of
+      that same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +849,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1040,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, then committed using the
+    <function>commit_prepared_cb</function> callback or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d5cfbea..a4f8113 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -221,12 +235,27 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
 		(ctx->callbacks.stream_stop_cb != NULL) ||
 		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
 		(ctx->callbacks.stream_commit_cb != NULL) ||
 		(ctx->callbacks.stream_change_cb != NULL) ||
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require the prepare,
+	 * commit_prepared and rollback_prepared callbacks; filter_prepare is optional.
+	 * However, we enable two-phase decoding when at least one of these callbacks
+	 * is provided, so that missing ones are easy to identify.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +266,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +813,120 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then prepare callback is mandatory */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then commit prepared callback is mandatory */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of rollback record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then rollback prepared callback is mandatory */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1003,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not enabled. In that
+	 * case all two-phase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1246,46 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming and two-phase commits are supported. */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..4c1341f 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until
+ * COMMIT PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index dfdda93..becd20e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -174,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -244,6 +266,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -405,6 +430,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -431,6 +476,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -497,6 +548,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -505,6 +560,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
-- 
1.8.3.1
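
For anyone following the callback wiring in the patch above, here is a minimal standalone sketch (not part of the patch) of the two decisions made in logical.c: when ctx->twophase gets enabled, and what filter_prepare_cb_wrapper() returns. MockCallbacks, mock_twophase_enabled(), mock_filter_prepare(), skip_nodecode_gids() and noop() are hypothetical names used only for illustration, and the "_nodecode" GID prefix is an assumed filtering convention, not something this patch defines. The real code works on LogicalDecodingContext and OutputPluginCallbacks and also sets up the error context stack, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for OutputPluginCallbacks. Only the presence of
 * each pointer matters for this sketch, so the signatures are simplified.
 */
typedef struct MockCallbacks
{
	bool		(*filter_prepare_cb) (const char *gid);
	void		(*prepare_cb) (void);
	void		(*commit_prepared_cb) (void);
	void		(*rollback_prepared_cb) (void);
	void		(*stream_prepare_cb) (void);
} MockCallbacks;

/*
 * Models the ctx->twophase computation in StartupDecodingContext():
 * two-phase decoding is enabled as soon as any one of the two-phase
 * callbacks is registered. Missing required callbacks are only
 * reported later, by the individual wrappers.
 */
static bool
mock_twophase_enabled(const MockCallbacks *cb)
{
	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/*
 * Models the decision in filter_prepare_cb_wrapper().
 * Returns true when the transaction should NOT be decoded at PREPARE
 * time, i.e. it is decoded as a regular transaction at COMMIT PREPARED.
 */
static bool
mock_filter_prepare(const MockCallbacks *cb, const char *gid)
{
	/* Two-phase decoding not enabled: everything is filtered out. */
	if (!mock_twophase_enabled(cb))
		return true;

	/* The filter callback is optional: without it nothing is filtered. */
	if (cb->filter_prepare_cb == NULL)
		return false;

	return cb->filter_prepare_cb(gid);
}

/* Assumed example filter: skip GIDs that start with "_nodecode". */
static bool
skip_nodecode_gids(const char *gid)
{
	return strncmp(gid, "_nodecode", strlen("_nodecode")) == 0;
}

/* Placeholder for callbacks whose body is irrelevant to the sketch. */
static void
noop(void)
{
}
```

One consequence worth noting: a plugin that registers only filter_prepare_cb still turns two-phase decoding on, so the first PREPARE that passes the filter would reach prepare_cb_wrapper() and hit the "did not register prepare_cb" error rather than being silently decoded at commit time.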

v14-0002-Support-2PC-txn-backend-and-tests.patch (application/octet-stream)
From be90ef42b76823b394567a0113f3f7a762853e50 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 29 Oct 2020 06:03:03 -0400
Subject: [PATCH v14] Support 2PC txn backend and tests.

Until now, two-phase transactions were decoded at COMMIT PREPARED, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.

Includes two-phase commit test code (for test_decoding).
---
 contrib/test_decoding/Makefile                     |   4 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 +++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 +++++
 contrib/test_decoding/t/001_twophase.pl            | 121 +++++++++
 src/backend/replication/logical/decode.c           | 250 +++++++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    | 282 ++++++++++++++++++---
 src/include/replication/reorderbuffer.h            |  12 +
 9 files changed, 1204 insertions(+), 52 deletions(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..cde4b83
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# Test logical decoding of two-phase (2PC) transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the XID of the 2PC to the plugin as an option. On
+# receipt of a valid "check-xid-aborted", the change API in the test_decoding
+# plugin waits for that transaction to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while this decode
+# is waiting.
+#
+# The status of the "check-xid-aborted" transaction then changes from
+# in-progress to not-committed (hence aborted), and decoding stops because
+# the subsequent system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing its XID via the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now; the output should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..fd961d4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -68,8 +68,15 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
+static void DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+								 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+								xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -239,7 +246,6 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	switch (info)
 	{
 		case XLOG_XACT_COMMIT:
-		case XLOG_XACT_COMMIT_PREPARED:
 			{
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
@@ -256,8 +262,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeCommit(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_COMMIT_PREPARED:
+			{
+				xl_xact_commit *xlrec;
+				xl_xact_parsed_commit parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_commit *) XLogRecGetData(r);
+				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeCommitPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ABORT:
-		case XLOG_XACT_ABORT_PREPARED:
 			{
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
@@ -274,6 +296,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_ABORT_PREPARED:
+			{
+				xl_xact_abort *xlrec;
+				xl_xact_parsed_abort parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_abort *) XLogRecGetData(r);
+				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeAbortPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ASSIGNMENT:
 
 			/*
@@ -312,17 +351,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *)XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -659,6 +716,131 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Handle a COMMIT PREPARED record. Mirrors DecodeCommit(), except for how the
+ * transaction is finally handed to the reorderbuffer.
+ */
+static void
+DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+					 xl_xact_parsed_commit *parsed, TransactionId xid)
+{
+	XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
+					   parsed->nsubxacts, parsed->subxacts);
+
+	/* ----
+	 * Check whether we are interested in this specific transaction, and tell
+	 * the reorderbuffer to forget the content of the (sub-)transactions
+	 * if not.
+	 *
+	 * There can be several reasons we might not be interested in this
+	 * transaction:
+	 * 1) We might not be interested in decoding transactions up to this
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
+	 * 2) The transaction happened in another database.
+	 * 3) The output plugin is not interested in the origin.
+	 * 4) We are doing fast-forwarding
+	 *
+	 * We can't just use ReorderBufferAbort() here, because we need to execute
+	 * the transaction's invalidations.  This currently won't be needed if
+	 * we're just skipping over the transaction because currently we only do
+	 * so during startup, to get to the first transaction the client needs. As
+	 * we have reset the catalog caches before starting to read WAL, and we
+	 * haven't yet touched any catalogs, there can't be anything to invalidate.
+	 * But if we're "forgetting" this commit because it happened in
+	 * another database, the invalidations might be important, because they
+	 * could be for shared catalogs and we might have loaded data into the
+	 * relevant syscaches.
+	 * ---
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+		}
+		ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+		return;
+	}
+
+	/* tell the reorderbuffer about the surviving subtransactions */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/*
+	 * For COMMIT PREPARED, the changes have already been replayed at
+	 * PREPARE time, so we only need to notify the subscriber that the GID
+	 * finally committed.
+	 * If the filter callback reports that the PREPARE was skipped, do a regular commit.
+	 */
+	if (ctx->callbacks.filter_prepare_cb &&
+			ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+	else
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr  origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		 ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/* replay actions of all transaction + subtransactions in order */
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+}
+
+/*
  * Get the data from the various forms of abort records and pass it on to
  * snapbuild.c and reorderbuffer.c
  */
@@ -681,6 +863,50 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Get the data from the various forms of abort records and pass it on to
+ * snapbuild.c and reorderbuffer.c
+ */
+static void
+DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			xl_xact_parsed_abort *parsed, TransactionId xid)
+{
+	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId	origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it passes through the filters, handle the ROLLBACK via callbacks.
+	 */
+	if (!FilterByOrigin(ctx, origin_id) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		!ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		Assert(TransactionIdIsValid(xid));
+		Assert(parsed->dbId == ctx->slot->data.database);
+
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
+
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+						   buf->record->EndRecPtr);
+	}
+
+	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+}
+
+/*
  * Parse XLOG_HEAP_INSERT (not MULTI_INSERT!) records into tuplebufs.
  *
  * Deletes can contain the new tuple.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c1bd680..1456bc8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -421,6 +422,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1514,12 +1521,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1538,7 +1547,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1572,9 +1581,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for decoding
+		 * catalog snapshot access.
+		 * They are always stored in the toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1768,9 +1801,22 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
-
-	ReorderBufferCleanupTXN(rb, txn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
+		/* This is a PREPARED transaction, part of a two-phase commit.
+		 * The full cleanup will happen as part of the COMMIT PREPARED, so for
+		 * now just truncate the txn by removing its changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1898,8 +1944,12 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/* Discard the changes that we just streamed.
+	 * This can only be called if streaming and not part of a PREPARE in
+	 * a two-phase commit, so set prepared flag as false.
+	 */
+	Assert(!rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1921,6 +1971,11 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 /*
  * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
  *
+ * We are here due to one of the 3 scenarios:
+ * 1. As part of streaming an in-progress transaction
+ * 2. Prepare of a two-phase commit
+ * 3. Commit of a transaction.
+ *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
  * merge) and replay the changes in lsn order.
@@ -2006,7 +2061,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2297,7 +2352,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2331,18 +2395,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
+		 * We are here due to one of the 3 scenarios:
+		 * 1. As part of streaming in-progress transactions
+		 * 2. Prepare of a two-phase commit
+		 * 3. Commit of a transaction.
+		 *
 		 * If we are streaming the in-progress transaction then discard the
 		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
+		 * streamed (if they contained changes), set prepared flag as false.
+		 * If part of a prepare of a two-phase commit set the prepared flag
+		 * as true so that we can discard changes and cleanup tuplecids.
+		 * Otherwise, remove all the
 		 * changes and deallocate the ReorderBufferTXN.
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, false);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (rbtxn_prepared(txn))
+		{
+			ReorderBufferTruncateTXN(rb, txn, true);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2372,17 +2450,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can only occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we are
+			 * sending the data out on a PREPARE during a two-phase commit.
+			 * The two conditions are mutually exclusive; exactly one of them holds.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
+			Assert(!(streaming && rbtxn_prepared(txn)));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2390,10 +2471,21 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream remaining data.
+			 */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2415,23 +2507,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2473,6 +2558,120 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, two-phase transactions do not contain any reorderbuffers. So allow
+	 * it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2515,7 +2714,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index becd20e..03f777c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -635,6 +635,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -658,6 +663,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool 		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void 		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
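
For readers following the decode.c changes in the patch above, the routing of the three two-phase WAL record types can be condensed into a small standalone C sketch. This is only an illustration, not code from the patch: the enums and route_twophase_record() are hypothetical names, and the single flag decode_at_prepare is an assumed stand-in for "the output plugin supports two-phase decoding and filter_prepare_cb did not ask to skip this GID". The snapshot, database, origin and fast-forward filters are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration; none of these exist in PostgreSQL. */
typedef enum
{
	REC_PREPARE,			/* XLOG_XACT_PREPARE */
	REC_COMMIT_PREPARED,	/* XLOG_XACT_COMMIT_PREPARED */
	REC_ABORT_PREPARED		/* XLOG_XACT_ABORT_PREPARED */
} TwoPhaseRecord;

typedef enum
{
	ACT_TRACK_XID_ONLY,		/* ReorderBufferProcessXid(): keep buffering */
	ACT_REPLAY_AT_PREPARE,	/* DecodePrepare() -> ReorderBufferPrepare() */
	ACT_REPLAY_AS_COMMIT,	/* ReorderBufferCommit(): replay everything now */
	ACT_FINISH_PREPARED,	/* ReorderBufferFinishPrepared(): notify only */
	ACT_DISCARD				/* ReorderBufferAbort(): throw changes away */
} DecodeAction;

/*
 * Map a two-phase record to the action the patch takes for it.
 *
 * decode_at_prepare is an assumed simplification: true when the changes
 * are (or were) sent to the plugin at PREPARE time, false when the
 * transaction is treated like an ordinary one.
 */
DecodeAction
route_twophase_record(TwoPhaseRecord rec, bool decode_at_prepare)
{
	switch (rec)
	{
		case REC_PREPARE:
			return decode_at_prepare ? ACT_REPLAY_AT_PREPARE
									 : ACT_TRACK_XID_ONLY;
		case REC_COMMIT_PREPARED:
			return decode_at_prepare ? ACT_FINISH_PREPARED
									 : ACT_REPLAY_AS_COMMIT;
		case REC_ABORT_PREPARED:
			return decode_at_prepare ? ACT_FINISH_PREPARED
									 : ACT_DISCARD;
	}

	/* Not reachable for valid enum values. */
	assert(false);
	return ACT_DISCARD;
}
```

The point of the sketch is that the answer for a given GID has to be the same at PREPARE and at COMMIT/ROLLBACK PREPARED time; otherwise a transaction could be replayed twice or not at all.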

v14-0003-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From f962b2527e54e607ea795c6c009f0d36fe5eb00c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 29 Oct 2020 06:25:46 -0400
Subject: [PATCH v14] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.

Includes two-phase commit test code (streaming and not streaming).
---
 src/backend/access/transam/twophase.c             |  27 ++
 src/backend/replication/logical/proto.c           | 142 +++++-
 src/backend/replication/logical/worker.c          | 288 +++++++++++-
 src/backend/replication/pgoutput/pgoutput.c       |  75 +++-
 src/include/access/twophase.h                     |   1 +
 src/include/replication/logicalproto.h            |  33 ++
 src/test/subscription/t/020_twophase.pl           | 345 ++++++++++++++
 src/test/subscription/t/021_twophase_streaming.pl | 521 ++++++++++++++++++++++
 8 files changed, 1409 insertions(+), 23 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_streaming.pl

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..129afe9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check whether a prepared transaction with the given GID exists
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..9deb1ef 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -99,6 +99,7 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	if (flags != 0)
 		elog(ERROR, "unrecognized flags %u in commit message", flags);
 
+
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
 	commit_data->end_lsn = pq_getmsgint64(in);
@@ -106,6 +107,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in which
+	 * case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8	flags = 0;
+
+	pq_sendbyte(out, 'p');		/* sending STREAM PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case we
+	 * expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK] PREPARED
+	 * uses non-streaming APIs
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
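
For reviewers who want to see the STREAM PREPARE layout in one place, here is a minimal standalone sketch of the field order used by logicalrep_write_stream_prepare() above: action byte 'p', 4-byte xid, 1-byte flags, three 8-byte fields, then the NUL-terminated GID, all integers in network byte order. Everything named Sketch*/sketch_* is hypothetical and only for illustration; the real code works on a StringInfo through the pq_send and pq_getmsg helpers, and the real reader never sees the 'p' byte because apply_dispatch() has already consumed it. The 200-byte GID buffer is an assumption standing in for GIDSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_IS_PREPARE 0x01	/* mirrors LOGICALREP_IS_PREPARE */

typedef struct SketchStreamPrepare
{
	uint32_t	xid;
	uint8_t		flags;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[200];		/* assumed stand-in for GIDSIZE */
} SketchStreamPrepare;

/* Store the low nbytes of v in big-endian (network) order. */
static size_t
put_be(uint8_t *buf, uint64_t v, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return nbytes;
}

/* Load an nbytes-wide big-endian integer. */
static uint64_t
get_be(const uint8_t *buf, size_t nbytes)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Encode the message; returns bytes written, or 0 if buf is too small. */
static size_t
sketch_write_stream_prepare(uint8_t *buf, size_t cap,
							const SketchStreamPrepare *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (cap < 1 + 4 + 1 + 8 + 8 + 8 + gidlen)
		return 0;

	buf[off++] = 'p';
	off += put_be(buf + off, m->xid, 4);
	buf[off++] = m->flags;
	off += put_be(buf + off, m->prepare_lsn, 8);
	off += put_be(buf + off, m->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Decode the message; returns 0 on success, -1 if it is malformed. */
static int
sketch_read_stream_prepare(const uint8_t *buf, size_t len,
						   SketchStreamPrepare *m)
{
	const size_t fixed = 1 + 4 + 1 + 8 + 8 + 8;
	const uint8_t *gid;
	const uint8_t *nul;

	if (len <= fixed || buf[0] != 'p')
		return -1;

	m->xid = (uint32_t) get_be(buf + 1, 4);
	m->flags = buf[5];
	if (m->flags != SKETCH_IS_PREPARE)
		return -1;				/* only PREPARE is valid when streaming */
	m->prepare_lsn = get_be(buf + 6, 8);
	m->end_lsn = get_be(buf + 14, 8);
	m->prepare_time = (int64_t) get_be(buf + 22, 8);

	/* The GID must be NUL-terminated and fit the fixed-size buffer. */
	gid = buf + fixed;
	nul = memchr(gid, '\0', len - fixed);
	if (nul == NULL || (size_t) (nul - gid) >= sizeof(m->gid))
		return -1;
	memcpy(m->gid, gid, (size_t) (nul - gid) + 1);
	return 0;
}
```

A round trip through sketch_write_stream_prepare() and sketch_read_stream_prepare() should return the same xid, LSNs and GID, and a message whose flags byte is anything other than PREPARE is rejected, which matches the elog(ERROR) in the patch's reader.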
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b0f27e0..593af82 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -720,6 +722,7 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_timestamp = commit_data.committime;
 
 		CommitTransactionCommand();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -740,6 +743,225 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * On the publisher, a prepared transaction may be aborted while it is
+	 * still being decoded. Decoding then stops and the transaction is aborted
+	 * immediately, so the PREPARE may never reach the subscriber, yet the
+	 * ROLLBACK PREPARED still does. It is therefore OK for the GID to be
+	 * missing here; there is then nothing to roll back.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK PREPARED
+	 * for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 * ========================================
+	 * 1. Replay all the spooled operations
+	 * - This code is the same as what apply_handle_stream_commit does for a non-two-phase stream commit
+	 * ========================================
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * ========================================
+	 * 2. Mark the transaction as prepared.
+	 * - This code is the same as what apply_handle_prepare_txn does for a two-phase prepare of a non-streamed transaction
+	 * ========================================
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -933,30 +1155,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +1177,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +1192,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1261,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1297,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
@@ -1904,10 +2142,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
@@ -1952,6 +2194,10 @@ apply_dispatch(StringInfo s)
 		case 'c':
 			apply_handle_stream_commit(s);
 			break;
+			/* STREAM PREPARE */
+		case 'p':
+			apply_handle_stream_prepare(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
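
To make the apply-side behaviour of the 'P' message easier to reason about, here is a rough standalone model of what apply_handle_prepare() does with each prepare_type, including the case where ROLLBACK PREPARED arrives for a GID the subscriber never prepared. This is only a sketch: the Sk*/sk_* names and the small in-memory table are hypothetical stand-ins for the real two-phase state, sk_lookup() merely plays the role of LookupGXact(), and treating a COMMIT PREPARED for an unknown GID as an error is my assumption about what FinishPreparedTransaction() would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Values mirror the LOGICALREP_IS_* constants in the patch. */
enum
{
	SK_PREPARE = 0x01,
	SK_COMMIT_PREPARED = 0x02,
	SK_ROLLBACK_PREPARED = 0x04
};

typedef enum
{
	SK_DID_PREPARE,
	SK_DID_COMMIT,
	SK_DID_ROLLBACK,
	SK_SKIPPED_ROLLBACK,		/* GID unknown: nothing to roll back */
	SK_ERROR
} SkResult;

#define SK_MAX_GXACTS 4
#define SK_GIDSIZE 200			/* assumed stand-in for GIDSIZE */

/* Hypothetical model of the subscriber's prepared transactions. */
typedef struct
{
	bool		used[SK_MAX_GXACTS];
	char		gid[SK_MAX_GXACTS][SK_GIDSIZE];
} SkGXacts;

/* Return the slot holding gid, or -1 if it is not prepared. */
static int
sk_lookup(const SkGXacts *g, const char *gid)
{
	for (int i = 0; i < SK_MAX_GXACTS; i++)
		if (g->used[i] && strcmp(g->gid[i], gid) == 0)
			return i;
	return -1;
}

/* Apply one prepare-family message to the model. */
static SkResult
sk_apply_prepare_message(SkGXacts *g, uint8_t type, const char *gid)
{
	int			slot = sk_lookup(g, gid);

	switch (type)
	{
		case SK_PREPARE:
			if (slot >= 0 || strlen(gid) >= SK_GIDSIZE)
				return SK_ERROR;	/* duplicate or oversized GID */
			for (int i = 0; i < SK_MAX_GXACTS; i++)
			{
				if (!g->used[i])
				{
					g->used[i] = true;
					strcpy(g->gid[i], gid);		/* length checked above */
					return SK_DID_PREPARE;
				}
			}
			return SK_ERROR;	/* no free slot */

		case SK_COMMIT_PREPARED:
			if (slot < 0)
				return SK_ERROR;
			g->used[slot] = false;
			return SK_DID_COMMIT;

		case SK_ROLLBACK_PREPARED:
			if (slot < 0)
				return SK_SKIPPED_ROLLBACK;
			g->used[slot] = false;
			return SK_DID_ROLLBACK;

		default:
			return SK_ERROR;	/* unexpected type of prepare message */
	}
}
```

The point of the model is the asymmetry: a missing GID is tolerated only on ROLLBACK PREPARED, while any flags value that is not exactly one of the three types falls through to the error path, as in the switch in apply_handle_prepare().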
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..b4f2c9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,7 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
-
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static bool publications_valid;
 static bool in_streaming;
 
@@ -143,6 +150,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +164,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +391,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +912,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0c2cda2..ee38f89 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -87,6 +87,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -94,6 +95,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -101,6 +124,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData * prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -144,4 +171,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
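
A small standalone sketch of the two checks implied by this header may help: PrepareFlagsAreValid() accepts exactly one of the three type values (so a combination such as 0x03 is rejected even though the values are bit-shaped), and the GID has to fit the fixed gid[GIDSIZE] buffer in LogicalRepPrepareData. The macro and constants below are copied from the patch; SKETCH_GIDSIZE (assumed to be 200) and sketch_copy_gid() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumed stand-in for GIDSIZE */

/* Copied from the patch's logicalproto.h. */
#define LOGICALREP_IS_PREPARE			0x01
#define LOGICALREP_IS_COMMIT_PREPARED	0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04

#define PrepareFlagsAreValid(flags) \
	(((flags) == LOGICALREP_IS_PREPARE) || \
	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))

/*
 * Copy src into dst only if it fits, terminator included. Returns false
 * and leaves dst untouched when the GID is too long.
 */
static bool
sketch_copy_gid(char *dst, size_t dstsize, const char *src)
{
	size_t		len = strlen(src);

	if (len >= dstsize)
		return false;
	memcpy(dst, src, len + 1);
	return true;
}
```

With these definitions, PrepareFlagsAreValid(0x01) holds while PrepareFlagsAreValid(0x03) and PrepareFlagsAreValid(0) do not, and sketch_copy_gid() refuses a GID of SKETCH_GIDSIZE or more characters instead of writing past the buffer.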
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..f489f47
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,345 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 21;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_streaming.pl b/src/test/subscription/t/021_twophase_streaming.pl
new file mode 100644
index 0000000..9a03b83
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_streaming.pl
@@ -0,0 +1,521 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 28;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1

#77Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#71)

On Tue, Oct 27, 2020 at 3:25 PM Ajin Cherian <itsajin@gmail.com> wrote:

[v13 patch set]
A few comments on v13-0001-Support-2PC-txn-base. I haven't checked the
v14 version of the patches, so if you have already fixed any of these,
please ignore them.

1.
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -174,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Does the transaction have catalog changes? */
#define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +237,24 @@ typedef struct ReorderBufferChange
((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
)

+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+

I think the above changes should be moved to the second patch. There
is no use of these macros in this patch and moreover they appear to be
out-of-place.

2.
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
ListCell *option;
TestDecodingData *data;
bool enable_streaming = false;
+ bool enable_2pc = false;

I think it is better to name this variable as enable_two_pc or enable_twopc.

3.
+ xid = strtoul(strVal(elem->arg), NULL, 0);
+ if (xid == 0 || errno != 0)
+ data->check_xid_aborted = InvalidTransactionId;
+ else
+ data->check_xid_aborted = (TransactionId)xid;
+
+ if (!TransactionIdIsValid(data->check_xid_aborted))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+ strVal(elem->arg))));

Can't we write this as below and get rid of xid variable:
data->check_xid_aborted= (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
if (!TransactionIdIsValid(data->check_xid_aborted) || errno)
ereport..

4.
+ /* if check_xid_aborted is a valid xid, then it was passed in
+ * as an option to check if the transaction having this xid would be aborted.
+ * This is to test concurrent aborts.
+ */

Multi-line comments should leave the first line empty.

5.
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a transaction commit prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.

I think the last line "The <parameter>gid</parameter> field, which is
part of the <parameter>txn</parameter> parameter can be used in this
callback." in 'Transaction Commit Prepared Callback' should also be
present in 'Transaction Prepare Callback', as we are using the same
field in the prepare API as well.

6.
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+ else
+ appendStringInfo(ctx->out, "preparing streamed transaction");

I think we should include 'gid' as well in the above messages.

7.
@@ -221,12 +235,26 @@ StartupDecodingContext(List *output_plugin_options,
  ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
  (ctx->callbacks.stream_stop_cb != NULL) ||
  (ctx->callbacks.stream_abort_cb != NULL) ||
+ (ctx->callbacks.stream_prepare_cb != NULL) ||
  (ctx->callbacks.stream_commit_cb != NULL) ||
  (ctx->callbacks.stream_change_cb != NULL) ||
  (ctx->callbacks.stream_message_cb != NULL) ||
  (ctx->callbacks.stream_truncate_cb != NULL);
  /*
+ * To support two-phase logical decoding, we require prepare/commit-prepare/abort-prepare
+ * callbacks. The filter-prepare callback is optional. We however enable two-phase logical
+ * decoding when at least one of the methods is enabled so that we can easily identify
+ * missing methods.
+ *
+ * We decide it here, but only check it later in the wrappers.
+ */
+ ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+ (ctx->callbacks.commit_prepared_cb != NULL) ||
+ (ctx->callbacks.rollback_prepared_cb != NULL) ||
+ (ctx->callbacks.filter_prepare_cb != NULL);
+

I think stream_prepare_cb should be checked for the 'twophase' flag
because we won't use this unless two-phase is enabled. Am I missing
something?

--
With Regards,
Amit Kapila.

#78Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#62)

On Wed, Oct 21, 2020 at 7:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Comment: I don't think a tablesync worker will use streaming. None of
the other stream APIs check this, so it might not be relevant for
stream_prepare either.

Yes, I think this is right. See pgoutput_startup, where we disable
streaming for the init phase. But it is always good to test this once
to confirm.

I have tested this scenario and confirmed that even when the
subscriber is capable of streaming it does NOT do any streaming during
its tablesync phase.

Kind Regards
Peter Smith.
Fujitsu Australia

#79Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#77)
3 attachment(s)

On Thu, Oct 29, 2020 at 11:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Oct 27, 2020 at 3:25 PM Ajin Cherian <itsajin@gmail.com> wrote:

[v13 patch set]
A few comments on v13-0001-Support-2PC-txn-base. I haven't checked the
v14 version of the patches, so if you have already fixed any of these,
please ignore them.

1.
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
#define REORDERBUFFER_H
#include "access/htup_details.h"
+#include "access/twophase.h"
#include "lib/ilist.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
@@ -174,6 +175,9 @@ typedef struct ReorderBufferChange
#define RBTXN_IS_STREAMED         0x0010
#define RBTXN_HAS_TOAST_INSERT    0x0020
#define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Does the transaction have catalog changes? */
#define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +237,24 @@ typedef struct ReorderBufferChange
((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
)

+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+

I think the above changes should be moved to the second patch. There
is no use of these macros in this patch and moreover they appear to be
out-of-place.

Moved to second patch in the patchset.

2.
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
ListCell *option;
TestDecodingData *data;
bool enable_streaming = false;
+ bool enable_2pc = false;

I think it is better to name this variable as enable_two_pc or enable_twopc.

Renamed it to enable_twophase so that it matches the ctx member
ctx->twophase.

3.
+ xid = strtoul(strVal(elem->arg), NULL, 0);
+ if (xid == 0 || errno != 0)
+ data->check_xid_aborted = InvalidTransactionId;
+ else
+ data->check_xid_aborted = (TransactionId)xid;
+
+ if (!TransactionIdIsValid(data->check_xid_aborted))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+ strVal(elem->arg))));

Can't we write this as below and get rid of xid variable:
data->check_xid_aborted= (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
if (!TransactionIdIsValid(data->check_xid_aborted) || errno)
ereport..

Updated. Small change so that errno is checked first.
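
For readers following along, here is a minimal standalone sketch of the
"check errno first" ordering. It is not the patch code: parse_check_xid
is a hypothetical helper, and the TransactionId typedef and macros are
simplified stand-ins for the PostgreSQL definitions so that the snippet
compiles on its own. The one thing it adds is an explicit errno reset
before the call, since strtoul only sets errno on failure.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins for the PostgreSQL definitions (assumption). */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)
#define TransactionIdIsValid(xid) ((xid) != InvalidTransactionId)

/*
 * Hypothetical helper: parse an option string into a TransactionId.
 * Returns true on success, false if strtoul reported an error or the
 * parsed value is not a valid xid (i.e. it is 0).
 */
static bool
parse_check_xid(const char *arg, TransactionId *result)
{
	unsigned long val;

	errno = 0;					/* strtoul only sets errno on failure */
	val = strtoul(arg, NULL, 0);

	/* Look at errno first, so a failed conversion is never trusted. */
	if (errno != 0 || !TransactionIdIsValid((TransactionId) val))
		return false;

	*result = (TransactionId) val;
	return true;
}
```

In the plugin itself the false branch would correspond to the
ereport(ERROR, ...) call quoted above.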

4.
+ /* if check_xid_aborted is a valid xid, then it was passed in
+ * as an option to check if the transaction having this xid would be aborted.
+ * This is to test concurrent aborts.
+ */

Multi-line comments should leave the first line empty.

Updated.

5.
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a transaction commit prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.

I think the last line "The <parameter>gid</parameter> field, which is
part of the <parameter>txn</parameter> parameter can be used in this
callback." in 'Transaction Commit Prepared Callback' should also be
present in 'Transaction Prepare Callback', as we are using the same
field in the prepare API as well.

Updated.

6.
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+ else
+ appendStringInfo(ctx->out, "preparing streamed transaction");

I think we should include 'gid' as well in the above messages.

Updated.

7.
@@ -221,12 +235,26 @@ StartupDecodingContext(List *output_plugin_options,
ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
(ctx->callbacks.stream_stop_cb != NULL) ||
(ctx->callbacks.stream_abort_cb != NULL) ||
+ (ctx->callbacks.stream_prepare_cb != NULL) ||
(ctx->callbacks.stream_commit_cb != NULL) ||
(ctx->callbacks.stream_change_cb != NULL) ||
(ctx->callbacks.stream_message_cb != NULL) ||
(ctx->callbacks.stream_truncate_cb != NULL);
/*
+ * To support two-phase logical decoding, we require prepare/commit-prepare/abort-prepare
+ * callbacks. The filter-prepare callback is optional. We however enable two-phase logical
+ * decoding when at least one of the methods is enabled so that we can easily identify
+ * missing methods.
+ *
+ * We decide it here, but only check it later in the wrappers.
+ */
+ ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+ (ctx->callbacks.commit_prepared_cb != NULL) ||
+ (ctx->callbacks.rollback_prepared_cb != NULL) ||
+ (ctx->callbacks.filter_prepare_cb != NULL);
+

I think stream_prepare_cb should be checked for the 'twophase' flag
because we won't use this unless two-phase is enabled. Am I missing
something?

Was fixed in v14.
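
To make the flag logic in this item easier to follow, here is a minimal
standalone sketch. It assumes simplified stand-in types (SketchCallbacks,
cb_fn) and hypothetical helper names; it is not the logical.c code, only
an illustration of "enable two-phase if any related callback is set, then
report the first required callback that is missing".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; the real callback struct has many more members. */
typedef void (*cb_fn) (void);

typedef struct SketchCallbacks
{
	cb_fn		prepare_cb;
	cb_fn		commit_prepared_cb;
	cb_fn		rollback_prepared_cb;
	cb_fn		stream_prepare_cb;
	cb_fn		filter_prepare_cb;	/* optional */
} SketchCallbacks;

/* Dummy callback used to populate the struct in examples. */
static void
sketch_noop(void)
{
}

/* Two-phase is provisionally enabled if any related callback is set. */
static bool
sketch_twophase_enabled(const SketchCallbacks *cb)
{
	assert(cb != NULL);

	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/*
 * Returns the name of the first missing required callback, or NULL if all
 * three non-streaming required callbacks are present. In the patch this
 * check is deferred to the wrappers, which raise an ERROR instead.
 */
static const char *
sketch_first_missing_required(const SketchCallbacks *cb)
{
	assert(cb != NULL);

	if (cb->prepare_cb == NULL)
		return "prepare_cb";
	if (cb->commit_prepared_cb == NULL)
		return "commit_prepared_cb";
	if (cb->rollback_prepared_cb == NULL)
		return "rollback_prepared_cb";
	return NULL;
}
```

With only filter_prepare_cb set, the flag comes out true while the second
helper still names prepare_cb as missing, which is the "easily identify
missing methods" behaviour the comment describes.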

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v15-0001-Support-2PC-txn-base.patch (application/octet-stream)
From 7ab59781a820d7843e5da427a62045dab2c75358 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 30 Oct 2020 04:35:43 -0400
Subject: [PATCH v15] Support-2PC-txn-base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

Include logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 189 ++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 146 ++++++++++++++++++-
 src/backend/replication/logical/logical.c | 229 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  35 +++++
 6 files changed, 643 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 8e33614..6eb1420 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid_aborted; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -73,6 +78,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -88,6 +96,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -112,10 +132,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool 		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +162,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +254,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+								strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +294,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +354,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +599,26 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/* 
+	 * if check_xid_aborted is a valid xid, then it was passed in
+	 * as an option to check if the transaction having this xid would be aborted.
+	 * This is to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			   !TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -646,6 +810,31 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..f5b617d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     them are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,56 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a transaction commit prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a transaction rollback prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +657,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but yet
+      uncommitted) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +740,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decode
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +794,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, in case of decoding a prepared (but yet uncommitted)
+      transaction or decoding of an uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +850,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1041,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, commit prepared using the
+    <function>commit_prepared_cb</function> callback or aborted using the
+    <function>rollback_prepared_cb</function>.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d5cfbea..a4f8113 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -221,12 +235,27 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
 		(ctx->callbacks.stream_stop_cb != NULL) ||
 		(ctx->callbacks.stream_abort_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
 		(ctx->callbacks.stream_commit_cb != NULL) ||
 		(ctx->callbacks.stream_change_cb != NULL) ||
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require prepare/commit-prepare/abort-prepare
+	 * callbacks. The filter-prepare callback is optional. We however enable two-phase logical
+	 * decoding when at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +266,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +813,120 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then prepare callback is mandatory */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then commit prepared callback is mandatory */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then abort prepared callback is mandatory */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1003,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not enabled. In that
+	 * case all two-phase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1246,46 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming and two-phase commits are supported. */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..4c1341f 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or wait until COMMIT PREPARED
+ * and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index dfdda93..93c79c8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -244,6 +245,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit, the GID must be passed to the output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -405,6 +409,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -431,6 +455,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -497,6 +527,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -505,6 +539,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
-- 
1.8.3.1
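
For readers following along, the decision made by filter_prepare_cb_wrapper() in the patch above can be modelled outside the backend. The following is only a rough sketch, not code from the patchset: MockContext, nodecode_filter() and should_skip_at_prepare() are hypothetical stand-ins for LogicalDecodingContext, the plugin's filter_prepare callback and the wrapper itself. The "_nodecode" rule is an assumption taken from the behaviour the test_decoding regression tests describe for such GIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the plugin's filter_prepare callback type. */
typedef bool (*FilterPrepareFn) (const char *gid);

/* Hypothetical stand-in for the relevant LogicalDecodingContext fields. */
typedef struct MockContext
{
	bool		twophase;		/* is PREPARE-time decoding enabled? */
	FilterPrepareFn filter_prepare; /* optional callback, may be NULL */
} MockContext;

/*
 * Assumed plugin filter: skip PREPARE-time decoding for any GID that
 * contains "_nodecode", as the regression tests describe.
 */
static bool
nodecode_filter(const char *gid)
{
	return strstr(gid, "_nodecode") != NULL;
}

/*
 * Returns true when the transaction should NOT be decoded at PREPARE,
 * i.e. it is replayed as a regular transaction at COMMIT PREPARED.
 */
static bool
should_skip_at_prepare(const MockContext *ctx, const char *gid)
{
	assert(ctx != NULL && gid != NULL);

	/* Two-phase decoding disabled: everything counts as filtered out. */
	if (!ctx->twophase)
		return true;

	/* The callback is optional: without it, every PREPARE goes through. */
	if (ctx->filter_prepare == NULL)
		return false;

	/* Otherwise the plugin decides. */
	return ctx->filter_prepare(gid);
}
```

With two-phase decoding enabled and the filter installed, a GID such as 'test_prepared#1' is decoded at PREPARE, while 'test_prepared_nodecode' is held back until COMMIT PREPARED. With two-phase decoding disabled, both are held back regardless of the callback.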

Attachment: v15-0002-Support-2PC-txn-backend-and-tests.patch (application/octet-stream)
From c01c70853a46a173aae6fd82c3d4bcc52fcdec45 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 30 Oct 2020 05:03:40 -0400
Subject: [PATCH v15] Support 2PC txn backend and tests.

Until now, two-phase transactions were decoded at COMMIT PREPARED, just
as regular transactions are decoded at COMMIT. During replay, two-phase
transactions were translated into regular transactions on the
subscriber, and the GID was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.

Includes two-phase commit test code (for test_decoding).
---
 contrib/test_decoding/Makefile                     |   4 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 +++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 +++++
 contrib/test_decoding/t/001_twophase.pl            | 121 +++++++++
 src/backend/replication/logical/decode.c           | 250 +++++++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    | 282 ++++++++++++++++++---
 src/include/replication/reorderbuffer.h            |  33 +++
 9 files changed, 1225 insertions(+), 52 deletions(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..fd961d4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -68,8 +68,15 @@ static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						 xl_xact_parsed_commit *parsed, TransactionId xid);
+static void DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+								 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 						xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+								xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -239,7 +246,6 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	switch (info)
 	{
 		case XLOG_XACT_COMMIT:
-		case XLOG_XACT_COMMIT_PREPARED:
 			{
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
@@ -256,8 +262,24 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeCommit(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_COMMIT_PREPARED:
+			{
+				xl_xact_commit *xlrec;
+				xl_xact_parsed_commit parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_commit *) XLogRecGetData(r);
+				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeCommitPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ABORT:
-		case XLOG_XACT_ABORT_PREPARED:
 			{
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
@@ -274,6 +296,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				DecodeAbort(ctx, buf, &parsed, xid);
 				break;
 			}
+		case XLOG_XACT_ABORT_PREPARED:
+			{
+				xl_xact_abort *xlrec;
+				xl_xact_parsed_abort parsed;
+				TransactionId xid;
+
+				xlrec = (xl_xact_abort *) XLogRecGetData(r);
+				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+
+				if (!TransactionIdIsValid(parsed.twophase_xid))
+					xid = XLogRecGetXid(r);
+				else
+					xid = parsed.twophase_xid;
+
+				DecodeAbortPrepared(ctx, buf, &parsed, xid);
+				break;
+			}
 		case XLOG_XACT_ASSIGNMENT:
 
 			/*
@@ -312,17 +351,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -659,6 +716,131 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Consolidated commit record handling between the different forms of commit
+ * records.
+ */
+static void
+DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+					 xl_xact_parsed_commit *parsed, TransactionId xid)
+{
+	XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid,
+					   parsed->nsubxacts, parsed->subxacts);
+
+	/* ----
+	 * Check whether we are interested in this specific transaction, and tell
+	 * the reorderbuffer to forget the content of the (sub-)transactions
+	 * if not.
+	 *
+	 * There can be several reasons we might not be interested in this
+	 * transaction:
+	 * 1) We might not be interested in decoding transactions up to this
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
+	 * 2) The transaction happened in another database.
+	 * 3) The output plugin is not interested in the origin.
+	 * 4) We are doing fast-forwarding
+	 *
+	 * We can't just use ReorderBufferAbort() here, because we need to execute
+	 * the transaction's invalidations.  This currently won't be needed if
+	 * we're just skipping over the transaction because currently we only do
+	 * so during startup, to get to the first transaction the client needs. As
+	 * we have reset the catalog caches before starting to read WAL, and we
+	 * haven't yet touched any catalogs, there can't be anything to invalidate.
+	 * But if we're "forgetting" this commit because it happened in
+	 * another database, the invalidations might be important, because they
+	 * could be for shared catalogs and we might have loaded data into the
+	 * relevant syscaches.
+	 * ---
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
+		}
+		ReorderBufferForget(ctx->reorder, xid, buf->origptr);
+
+		return;
+	}
+
+	/* tell the reorderbuffer about the surviving subtransactions */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/*
+	 * For COMMIT PREPARED, the changes have already been replayed at
+	 * PREPARE time, so we only need to notify the subscriber that the GID
+	 * finally committed.
+	 * If a prepare filter is present and skips this GID, do a regular commit.
+	 */
+	if (ctx->callbacks.filter_prepare_cb &&
+			ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+	else
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+
+}
+
+/*
+ * Decode a PREPARE record. The logic is similar to that for COMMIT.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr  origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		 ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/* replay actions of all transaction + subtransactions in order */
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
+}
+
+/*
  * Get the data from the various forms of abort records and pass it on to
  * snapbuild.c and reorderbuffer.c
  */
@@ -681,6 +863,50 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 }
 
 /*
+ * Get the data from the various forms of abort records and pass it on to
+ * snapbuild.c and reorderbuffer.c
+ */
+static void
+DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			xl_xact_parsed_abort *parsed, TransactionId xid)
+{
+	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * If it passes through the filters, handle the ROLLBACK via callbacks.
+	 */
+	if (!FilterByOrigin(ctx, origin_id) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		!ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+	{
+		Assert(TransactionIdIsValid(xid));
+		Assert(parsed->dbId == ctx->slot->data.database);
+
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+		return;
+	}
+
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+						   buf->record->EndRecPtr);
+	}
+
+	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+}
+
+/*
  * Parse XLOG_HEAP_INSERT (not MULTI_INSERT!) records into tuplebufs.
  *
  * Deletes can contain the new tuple.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c1bd680..1456bc8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -421,6 +422,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1514,12 +1521,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1538,7 +1547,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1572,9 +1581,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, clean up the tuplecids we stored for
+		 * decoding catalog snapshot access.
+		 * They are always stored in the toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1768,9 +1801,22 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
-
-	ReorderBufferCleanupTXN(rb, txn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
+		/* This is a PREPARED transaction, part of a two-phase commit.
+		 * The full cleanup will happen as part of the COMMIT PREPARED, so for
+		 * now just truncate the txn by removing its changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1898,8 +1944,12 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/* Discard the changes that we just streamed.
+	 * This can only be called if streaming and not part of a PREPARE in
+	 * a two-phase commit, so set prepared flag as false.
+	 */
+	Assert(!rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1921,6 +1971,11 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 /*
  * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
  *
+ * We get here in one of three scenarios:
+ * 1. As part of streaming an in-progress transaction.
+ * 2. Prepare of a two-phase commit.
+ * 3. Commit of a transaction.
+ *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
  * merge) and replay the changes in lsn order.
@@ -2006,7 +2061,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2297,7 +2352,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2331,18 +2395,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
+		 * We are here due to one of the 3 scenarios:
+		 * 1. As part of streaming in-progress transactions
+		 * 2. Prepare of a two-phase commit
+		 * 3. Commit of a transaction.
+		 *
 		 * If we are streaming the in-progress transaction then discard the
 		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
+		 * streamed (if they contained changes), passing txn_prepared as
+		 * false. For the prepare of a two-phase commit, pass txn_prepared as
+		 * true so that we discard the changes and also clean up the
+		 * tuplecids. In the commit case, remove all the
 		 * changes and deallocate the ReorderBufferTXN.
 		 */
 		if (streaming)
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, false);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (rbtxn_prepared(txn))
+		{
+			ReorderBufferTruncateTXN(rb, txn, true);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2372,17 +2450,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can only occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet, or when we
+			 * are sending the data out on a PREPARE during a two-phase commit.
+			 * The two conditions are mutually exclusive: exactly one holds.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
+			Assert(!(streaming && rbtxn_prepared(txn)));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2390,10 +2471,21 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream remaining data.
+			 */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2415,23 +2507,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2473,6 +2558,120 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (after a restart, for example).
+	 * Either way, two-phase transactions hold no buffered changes at this
+	 * point, so allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2515,7 +2714,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 93c79c8..03f777c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -234,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -614,6 +635,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -637,6 +663,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
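
For readers following the flag handling in the patch above, here is a minimal standalone sketch of how the three new txn_flags bits and their predicate macros interact. The flag values and macro bodies are copied from the reorderbuffer.h hunk. Everything prefixed with Demo or demo_ is a hypothetical stand-in written for illustration and is not part of the patch: DemoTXN models only the flags field of ReorderBufferTXN, demo_finish_prepared() mirrors the flag updates that ReorderBufferFinishPrepared() performs, and demo_finish_callback() mirrors its choice between the commit_prepared and rollback_prepared callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in the reorderbuffer.h hunk above. */
#define RBTXN_PREPARE             0x0080
#define RBTXN_COMMIT_PREPARED     0x0100
#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Hypothetical stand-in for ReorderBufferTXN: only txn_flags is modeled. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
} DemoTXN;

/* Predicate macros, same bodies as in the patch. */
#define rbtxn_prepared(txn) \
	(((txn)->txn_flags & RBTXN_PREPARE) != 0)
#define rbtxn_commit_prepared(txn) \
	(((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0)
#define rbtxn_rollback_prepared(txn) \
	(((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0)

/*
 * Hypothetical helper: applies the same flag updates that
 * ReorderBufferFinishPrepared() applies before invoking a callback.
 */
static void
demo_finish_prepared(DemoTXN *txn, bool is_commit)
{
	/* A transaction being finished was necessarily prepared first. */
	txn->txn_flags |= RBTXN_PREPARE;

	if (is_commit)
		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
	else
		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
}

/*
 * Hypothetical helper: reports which callback would be chosen.
 * 'C' = commit_prepared, 'R' = rollback_prepared, '-' = neither.
 */
static char
demo_finish_callback(const DemoTXN *txn)
{
	if (rbtxn_commit_prepared(txn))
		return 'C';
	else if (rbtxn_rollback_prepared(txn))
		return 'R';
	return '-';
}
```

Because each state has its own bit, a transaction that has only been prepared reports true for rbtxn_prepared() and false for the other two predicates, which is the case logicalrep_write_prepare() in the next patch maps to a plain PREPARE message.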

v15-0003-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From e931beee5d4d769a1868b59b71e512c2ed9acec7 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 30 Oct 2020 05:09:23 -0400
Subject: [PATCH v15] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.

Includes two-phase commit test code (streaming and not streaming).
---
 src/backend/access/transam/twophase.c             |  27 ++
 src/backend/replication/logical/proto.c           | 142 +++++-
 src/backend/replication/logical/worker.c          | 288 +++++++++++-
 src/backend/replication/pgoutput/pgoutput.c       |  75 +++-
 src/include/access/twophase.h                     |   1 +
 src/include/replication/logicalproto.h            |  33 ++
 src/test/subscription/t/020_twophase.pl           | 345 ++++++++++++++
 src/test/subscription/t/021_twophase_streaming.pl | 521 ++++++++++++++++++++++
 8 files changed, 1409 insertions(+), 23 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_streaming.pl

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..129afe9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check whether a prepared transaction with the given GID exists.
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index eb19142..9deb1ef 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, 'C');		/* sending COMMIT */
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -99,6 +99,7 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	if (flags != 0)
 		elog(ERROR, "unrecognized flags %u in commit message", flags);
 
+
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
 	commit_data->end_lsn = pq_getmsgint64(in);
@@ -106,6 +107,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, 'P');		/* sending PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8	flags = 0;
+
+	pq_sendbyte(out, 'p');		/* sending STREAM PREPARE protocol */
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported; [COMMIT|ROLLBACK] PREPARED
+	 * uses the non-streaming APIs.
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b0f27e0..593af82 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -720,6 +722,7 @@ apply_handle_commit(StringInfo s)
 		replorigin_session_origin_timestamp = commit_data.committime;
 
 		CommitTransactionCommand();
+
 		pgstat_report_stat(false);
 
 		store_flush_position(commit_data.end_lsn);
@@ -740,6 +743,225 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * During logical decoding, a prepared transaction may be aborted
+	 * concurrently while it is being decoded. In that case, we stop the
+	 * decoding and abort the transaction immediately. However, the
+	 * ROLLBACK PREPARED still reaches the subscriber, so it is OK for the
+	 * GID to be missing here.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK PREPARED
+	 * for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 * ========================================
+	 * 1. Replay all the spooled operations
+	 * - Same as what apply_handle_stream_commit does for a non-two-phase
+	 * ========================================
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * ========================================
+	 * 2. Mark the transaction as prepared.
+	 * - Same as what apply_handle_prepare_txn does for a non-streamed
+	 * ========================================
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -933,30 +1155,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +1177,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +1192,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1261,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1297,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
@@ -1904,10 +2142,14 @@ apply_dispatch(StringInfo s)
 		case 'B':
 			apply_handle_begin(s);
 			break;
-			/* COMMIT */
+			/* COMMIT/ABORT */
 		case 'C':
 			apply_handle_commit(s);
 			break;
+			/* PREPARE and [COMMIT|ROLLBACK] PREPARED */
+		case 'P':
+			apply_handle_prepare(s);
+			break;
 			/* INSERT */
 		case 'I':
 			apply_handle_insert(s);
@@ -1952,6 +2194,10 @@ apply_dispatch(StringInfo s)
 		case 'c':
 			apply_handle_stream_commit(s);
 			break;
+			/* STREAM PREPARE */
+		case 'p':
+			apply_handle_stream_prepare(s);
+			break;
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..b4f2c9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,7 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
-
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static bool publications_valid;
 static bool in_streaming;
 
@@ -143,6 +150,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +164,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +391,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +912,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 0c2cda2..ee38f89 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -87,6 +87,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -94,6 +95,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -101,6 +124,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData * prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -144,4 +171,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
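
Note on the header hunk above: PrepareFlagsAreValid() compares with equality rather than masking, so a combined value such as PREPARE|COMMIT_PREPARED is rejected even though every bit in it is individually defined. The following is only a standalone sketch of that behaviour. The three flag values and the macro are copied from the hunk; prepare_type_name() is a hypothetical helper made up for illustration and is not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag values and validity macro copied from the logicalproto.h hunk. */
#define LOGICALREP_IS_PREPARE			0x01
#define LOGICALREP_IS_COMMIT_PREPARED	0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04

#define PrepareFlagsAreValid(flags) \
	(((flags) == LOGICALREP_IS_PREPARE) || \
	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))

/*
 * Hypothetical helper (an assumption, not in the patch): map a prepare_type
 * byte to a readable name, returning NULL for anything that is not exactly
 * one of the three message types.
 */
static const char *
prepare_type_name(uint8_t flags)
{
	if (!PrepareFlagsAreValid(flags))
		return NULL;

	switch (flags)
	{
		case LOGICALREP_IS_PREPARE:
			return "PREPARE";
		case LOGICALREP_IS_COMMIT_PREPARED:
			return "COMMIT PREPARED";
		default:
			/* only LOGICALREP_IS_ROLLBACK_PREPARED can reach here */
			return "ROLLBACK PREPARED";
	}
}
```

With this sketch, prepare_type_name(LOGICALREP_IS_PREPARE | LOGICALREP_IS_COMMIT_PREPARED) and prepare_type_name(0) both return NULL, which is the kind of check a reader of the prepare message could apply before trusting prepare_type.
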
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..f489f47
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,345 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 21;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_streaming.pl b/src/test/subscription/t/021_twophase_streaming.pl
new file mode 100644
index 0000000..9a03b83
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_streaming.pl
@@ -0,0 +1,521 @@
+# logical replication of 2PC test (with streaming enabled)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 28;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction.
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1

#80Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#79)

On Fri, Oct 30, 2020 at 2:46 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Oct 29, 2020 at 11:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

6.
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+ else
+ appendStringInfo(ctx->out, "preparing streamed transaction");

I think we should include 'gid' as well in the above messages.

Updated.

gid needs to be included in the case of 'include_xids' as well.
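
For illustration, a minimal standalone sketch of what including the gid
in both branches could look like. format_stream_prepare() is a
hypothetical stand-in for the appendStringInfo() calls in
pg_decode_stream_prepare(), and the exact message wording is an
assumption, not necessarily what the patch ends up using.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of test_decoding): formats the
 * "preparing streamed transaction" message so that the gid is present
 * whether or not include_xids is set.  Returns snprintf()'s result.
 */
static int
format_stream_prepare(char *buf, size_t len, bool include_xids,
					  unsigned int xid, const char *gid)
{
	assert(buf != NULL && gid != NULL);

	if (include_xids)
		return snprintf(buf, len,
						"preparing streamed transaction TXN '%s', txid %u",
						gid, xid);

	return snprintf(buf, len, "preparing streamed transaction '%s'", gid);
}
```

The only point of the sketch is that the gid argument is consumed on
both paths; the quoting and ordering of the fields are placeholders.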

7.
@@ -221,12 +235,26 @@ StartupDecodingContext(List *output_plugin_options,
ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
(ctx->callbacks.stream_stop_cb != NULL) ||
(ctx->callbacks.stream_abort_cb != NULL) ||
+ (ctx->callbacks.stream_prepare_cb != NULL) ||
(ctx->callbacks.stream_commit_cb != NULL) ||
(ctx->callbacks.stream_change_cb != NULL) ||
(ctx->callbacks.stream_message_cb != NULL) ||
(ctx->callbacks.stream_truncate_cb != NULL);
/*
+ * To support two-phase logical decoding, we require prepare/commit-prepare/abort-prepare
+ * callbacks. The filter-prepare callback is optional. We however enable two-phase logical
+ * decoding when at least one of the methods is enabled so that we can easily identify
+ * missing methods.
+ *
+ * We decide it here, but only check it later in the wrappers.
+ */
+ ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+ (ctx->callbacks.commit_prepared_cb != NULL) ||
+ (ctx->callbacks.rollback_prepared_cb != NULL) ||
+ (ctx->callbacks.filter_prepare_cb != NULL);
+

I think stream_prepare_cb should be checked for the 'twophase' flag
because we won't use this unless two-phase is enabled. Am I missing
something?

Was fixed in v14.

But you still have it in the streaming check. I don't think we need
that for the streaming case.
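
A minimal standalone sketch of the flag computation being suggested,
with stream_prepare_cb counted only towards 'twophase'. MockCallbacks
and the two mock_* functions are hypothetical stand-ins for
OutputPluginCallbacks and the code in StartupDecodingContext(); only the
callback names are taken from the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*mock_cb) (void);

/* Hypothetical subset of the output plugin callbacks. */
typedef struct MockCallbacks
{
	mock_cb		stream_start_cb;
	mock_cb		stream_stop_cb;
	mock_cb		stream_abort_cb;
	mock_cb		stream_commit_cb;
	mock_cb		stream_change_cb;
	mock_cb		stream_message_cb;
	mock_cb		stream_truncate_cb;
	mock_cb		stream_prepare_cb;
	mock_cb		prepare_cb;
	mock_cb		commit_prepared_cb;
	mock_cb		rollback_prepared_cb;
	mock_cb		filter_prepare_cb;
} MockCallbacks;

/* Dummy callback used only to make a pointer non-NULL. */
static void
noop_cb(void)
{
}

/* Streaming check: stream_prepare_cb is deliberately not consulted. */
static bool
mock_streaming_enabled(const MockCallbacks *cb)
{
	assert(cb != NULL);
	return cb->stream_start_cb != NULL ||
		cb->stream_stop_cb != NULL ||
		cb->stream_abort_cb != NULL ||
		cb->stream_commit_cb != NULL ||
		cb->stream_change_cb != NULL ||
		cb->stream_message_cb != NULL ||
		cb->stream_truncate_cb != NULL;
}

/* Two-phase check: stream_prepare_cb is counted here instead. */
static bool
mock_twophase_enabled(const MockCallbacks *cb)
{
	assert(cb != NULL);
	return cb->prepare_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}
```

With this split, a plugin that defines only stream_prepare_cb turns on
the two-phase flag but not the streaming flag, so the wrappers can
still report the missing methods.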

Few other comments on v15-0002-Support-2PC-txn-backend-and-tests:
======================================================================
1. The functions DecodeCommitPrepared and DecodeAbortPrepared have a
lot of code similar to DecodeCommit/Abort. Can we merge these
functions?

2.
DecodeCommitPrepared()
{
..
+ * If filter check present and this needs to be skipped, do a regular commit.
+ */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+ else
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
+ }
+
+}

Can we expand the comment here to say why we need to do ReorderBufferCommit?
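
To make the question concrete, here is a small standalone model of the
branch. The rationale in the comments is my reading and an assumption,
not something stated in the patch: if the filter caused the PREPARE to
be skipped, nothing was sent downstream at prepare time, so the changes
still have to be replayed as an ordinary commit. The enum and function
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum MockFinishAction
{
	MOCK_REGULAR_COMMIT,		/* replay the whole txn, like ReorderBufferCommit */
	MOCK_FINISH_PREPARED		/* only finish the already-sent prepared txn */
} MockFinishAction;

/*
 * Decide what to do when COMMIT PREPARED is decoded.
 *
 * prepare_was_skipped models the filter check returning true for this
 * gid.  Without a filter callback nothing can have been skipped, which
 * the assertion documents.
 */
static MockFinishAction
mock_decode_commit_prepared(bool have_filter_cb, bool prepare_was_skipped)
{
	assert(have_filter_cb || !prepare_was_skipped);

	if (have_filter_cb && prepare_was_skipped)
		return MOCK_REGULAR_COMMIT;

	return MOCK_FINISH_PREPARED;
}
```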

3. There are a lot of test cases in this patch which is a good thing
but can we split them into a separate patch for the time being as I
would like to focus on the core logic of the patch first. We can later
see if we need to retain all or part of those tests.

4. Please run pgindent on your patches.

--
With Regards,
Amit Kapila.

#81Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#73)

On Wed, Oct 28, 2020 at 10:50 AM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Ajin.

I have re-checked the v13 patches for how my remaining review comments
have been addressed.

On Tue, Oct 27, 2020 at 8:55 PM Ajin Cherian <itsajin@gmail.com> wrote:

====================
v12-0002. File: src/backend/replication/logical/reorderbuffer.c
====================

COMMENT
Line 2401
/*
* We are here due to one of the 3 scenarios:
* 1. As part of streaming in-progress transactions
* 2. Prepare of a two-phase commit
* 3. Commit of a transaction.
*
* If we are streaming the in-progress transaction then discard the
* changes that we just streamed, and mark the transactions as
* streamed (if they contained changes), set prepared flag as false.
* If part of a prepare of a two-phase commit set the prepared flag
* as true so that we can discard changes and cleanup tuplecids.
* Otherwise, remove all the
* changes and deallocate the ReorderBufferTXN.
*/
~
The above comment is beyond my understanding. Anything you could do to
simplify it would be good.

For example, when viewing this function in isolation, I have never
understood why the streaming flag and the rbtxn_prepared(txn) flag
cannot both be set at the same time.

Perhaps the code is relying on just internal knowledge of how this
helper function gets called? And if it is just that, then IMO there
really should be some Asserts in the code to give more assurance about
that. (Or maybe use completely different flags to represent those 3
scenarios instead of bending the meanings of the existing flags)

Left this for now; I will probably re-look at it in a later review.
But just to explain: this function is what does the main decoding of
the changes of a transaction.
At what point this decoding happens is what this feature and the
streaming in-progress feature are about. As of PG13, this decoding only
happens at commit time. With the streaming of in-progress txn feature,
it also happens (if streaming is enabled) at the time when the
memory limit for decoding transactions is crossed. This 2PC feature
adds support for decoding at the time of a PREPARE transaction.
Now, if streaming is enabled and streaming has started as a result of
crossing the memory threshold, then there is no need to
begin streaming again at a PREPARE transaction, as the transaction that
is being prepared has already been streamed.

I don't think this is true; think of a case where we need to send the
last set of changes along with PREPARE. In that case we need to stream
those changes at the time of PREPARE. If I am correct then, as pointed
out by Peter, you need to change some comments and some of the
assumptions related to this that you have in the patch.
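
A toy standalone model of the case described above. MockTxn and the
mock_* functions are hypothetical, and the change-count limit stands in
for the real memory limit. Changes are streamed whenever the limit is
reached, but whatever arrived after the last stream is still buffered
when PREPARE is seen and has to go out at that point.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for ReorderBufferTXN. */
typedef struct MockTxn
{
	int			pending;		/* changes buffered but not yet sent */
	int			sent;			/* changes already streamed downstream */
	bool		streamed;		/* has any streaming happened so far? */
} MockTxn;

/* Send everything that is buffered; returns how many changes went out. */
static int
mock_stream_pending(MockTxn *txn)
{
	int			n = txn->pending;

	txn->sent += n;
	txn->pending = 0;
	if (n > 0)
		txn->streamed = true;
	return n;
}

/* Buffer one change and stream once the (hypothetical) limit is reached. */
static void
mock_add_change(MockTxn *txn, int limit)
{
	assert(limit > 0);

	txn->pending++;
	if (txn->pending >= limit)
		mock_stream_pending(txn);
}

/* At PREPARE, the tail that arrived after the last stream must be sent. */
static int
mock_prepare(MockTxn *txn)
{
	return mock_stream_pending(txn);
}
```

With a limit of 4 and 10 changes, 8 changes are streamed early and 2
are still pending at PREPARE, so "already streamed" does not imply
"nothing left to stream".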

Few more comments on the latest patch
(v15-0002-Support-2PC-txn-backend-and-tests)
=========================================================================
1.
@@ -274,6 +296,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeAbort(ctx, buf, &parsed, xid);
break;
}
+ case XLOG_XACT_ABORT_PREPARED:
+ {

..
+
+ if (!TransactionIdIsValid(parsed.twophase_xid))
+ xid = XLogRecGetXid(r);
+ else
+ xid = parsed.twophase_xid;

I think we don't need this 'if' check here because you must have a
valid value of parsed.twophase_xid. But I think this will be moot if
you address the review comment in my previous email such that the
handling of XLOG_XACT_ABORT_PREPARED and XLOG_XACT_ABORT will be
combined, as it is without the patch.

2.
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+   xl_xact_parsed_prepare * parsed)
+{
..
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ return;
+

I think this check is the same as the check in DecodeCommit, so you
can write some comments to indicate the same and also why we don't
need to call ReorderBufferForget here. One more thing to note is that
even if we don't need to call ReorderBufferForget here, we still
need to execute invalidations (which are present in the top-level txn) for
the reasons mentioned in ReorderBufferForget. Also, if we do this,
don't forget to update the comment atop
ReorderBufferImmediateInvalidation.

3.
+ /* This is a PREPARED transaction, part of a two-phase commit.
+ * The full cleanup will happen as part of the COMMIT PREPAREDs, so now
+ * just truncate txn by removing changes and tuple_cids
+ */
+ ReorderBufferTruncateTXN(rb, txn, true);

The first line in the multi-line comment should be empty.

--
With Regards,
Amit Kapila.

#82Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#81)

On Mon, Nov 2, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few Comments on v15-0003-Support-2PC-txn-pgoutput
===============================================
1. This patch needs to be rebased after commit 644f0d7cc9 and requires
some adjustments accordingly.

2.
if (flags != 0)
elog(ERROR, "unrecognized flags %u in commit message", flags);

+
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);

Spurious line.

3.
@@ -720,6 +722,7 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_timestamp = commit_data.committime;

CommitTransactionCommand();
+
pgstat_report_stat(false);

Spurious line

4.
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+ Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+ /* The synchronization worker runs in single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();

There is no explanation as to why you want to end the previous
transaction and start a new one. Even if we have to do so, we first
need to call BeginTransactionBlock before CommitTransactionCommand.

5.
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
- TransactionId xid;

Can we have a separate patch for this as this can be committed before
main patch. This is a refactoring required for the main patch.

6.
@@ -57,7 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
     ReorderBufferTXN *txn,
     XLogRecPtr commit_lsn);
-
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);

Spurious line removal.

--
With Regards,
Amit Kapila.

#83Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#82)
3 attachment(s)

Hi Amit

I have rebased, split, and addressed (most of) the review comments of
the v15-0003 patch.

So the previous v15-0003 patch is now split into three as follows:
- v16-0001-Support-2PC-txn-spoolfile.patch
- v16-0002-Support-2PC-txn-pgoutput.patch
- v16-0003-Support-2PC-txn-subscriber-tests.patch

PSA.

Of course the previous v15-0001 and v15-0002 are still required before
applying these v16 patches. Later (v17?) we will combine these again
with what Ajin is currently working on to give the full suite of
patches which will have a consistent version number.

On Tue, Nov 3, 2020 at 4:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few Comments on v15-0003-Support-2PC-txn-pgoutput
===============================================
1. This patch needs to be rebased after commit 644f0d7cc9 and requires
some adjustments accordingly.

Done.

2.
if (flags != 0)
elog(ERROR, "unrecognized flags %u in commit message", flags);

+
/* read fields */
commit_data->commit_lsn = pq_getmsgint64(in);

Spurious line.

Fixed.

3.
@@ -720,6 +722,7 @@ apply_handle_commit(StringInfo s)
replorigin_session_origin_timestamp = commit_data.committime;

CommitTransactionCommand();
+
pgstat_report_stat(false);

Spurious line

Fixed.

4.
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+ Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+ /* The synchronization worker runs in single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();

There is no explanation as to why you want to end the previous
transaction and start a new one. Even if we have to do so, we first
need to call BeginTransactionBlock before CommitTransactionCommand.

TODO

5.
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
*/
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
{
- TransactionId xid;

Can we have a separate patch for this as this can be committed before
main patch. This is a refactoring required for the main patch.

Done.

6.
@@ -57,7 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
XLogRecPtr commit_lsn);
-
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);

Spurious line removal.

Fixed.

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v16-0002-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From d7a66453369bc4c05a0d561ac9f0f85246fe119a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 4 Nov 2020 18:05:48 +1100
Subject: [PATCH v16] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.
---
 src/backend/access/transam/twophase.c       |  27 ++++
 src/backend/replication/logical/proto.c     | 141 ++++++++++++++++-
 src/backend/replication/logical/worker.c    | 227 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c |  74 +++++++++
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++-
 6 files changed, 505 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..129afe9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID is	around
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..26e43f7 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In which case we
+	 * expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8	flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case we
+	 * expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK] PREPARED
+	 * uses non-streaming APIs
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId	xid;
+	uint8			flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index d282336..e99cf74 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -742,6 +742,225 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/* End the earlier transaction and start a new one */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+		StartTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * During logical decoding, on the apply side, it's possible that a
+	 * prepared transaction got aborted while decoding. In that case, we stop
+	 * the decoding and abort the transaction immediately. However the
+	 * ROLLBACK prepared processing still reaches the subscriber. In that case
+	 * it's ok to have a missing gid
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK PREPARED
+	 * for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 * ========================================
+	 * 1. Replay all the spooled operations
+	 * - This code is same as what apply_handle_stream_commit does for NON two-phase stream commit
+	 * ========================================
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * ========================================
+	 * 2. Mark the transaction as prepared.
+	 * - This code is same as what apply_handle_prepare_txn does for two-phase prepare of the non-streamed tx
+	 * ========================================
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare_txn: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1969,6 +2188,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..9f27234 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +392,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +913,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index cca13da..d28292d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -54,10 +54,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +116,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +124,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* a prepare message is exactly one of PREPARE, COMMIT PREPARED or ROLLBACK PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +153,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData * prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +200,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
-- 
1.8.3.1
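
A note on the PrepareFlagsAreValid() macro in the logicalproto.h hunk above: the three LOGICALREP_IS_* values look like bit flags, but the macro compares with equality, so any combination of them (or zero) is rejected. The sketch below copies the macro and values from the patch to show that behaviour. The two helper functions are hypothetical and exist only for this illustration; they are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values and macro copied from the logicalproto.h hunk in this patch. */
#define LOGICALREP_IS_PREPARE			0x01
#define LOGICALREP_IS_COMMIT_PREPARED	0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04

#define PrepareFlagsAreValid(flags) \
	(((flags) == LOGICALREP_IS_PREPARE) || \
	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))

/* Hypothetical wrapper (not in the patch): lets the check be called as a function. */
static bool
prepare_flags_are_valid(uint8_t flags)
{
	return PrepareFlagsAreValid(flags);
}

/* Hypothetical helper (not in the patch): name the message kind, e.g. for a debug log. */
static const char *
prepare_type_name(uint8_t flags)
{
	switch (flags)
	{
		case LOGICALREP_IS_PREPARE:
			return "PREPARE";
		case LOGICALREP_IS_COMMIT_PREPARED:
			return "COMMIT PREPARED";
		case LOGICALREP_IS_ROLLBACK_PREPARED:
			return "ROLLBACK PREPARED";
		default:
			/* zero, or more than one bit set */
			return "invalid";
	}
}
```

With these definitions, a value such as 0x03 (PREPARE and COMMIT PREPARED together) fails the check, which is presumably what a reader of the message would want to assert before acting on prepare_type.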

Attachment: v16-0001-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From db3f5b04d89c30d545512ddceddbca40ea0a2795 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 4 Nov 2020 17:43:04 +1100
Subject: [PATCH v16] Support 2PC txn - spoolfile.

This patch only refactors the code, moving the streaming spool-file processing into a separate function.
Later, the two-phase commit logic will need to call this common processing from multiple places.
---
 src/backend/replication/logical/worker.c | 58 +++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0468491..d282336 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -933,30 +935,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +957,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +972,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1041,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1077,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
-- 
1.8.3.1
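
The shape of the refactoring in this patch can be sketched in isolation: one replay routine records the final LSN, applies every spooled change and returns the count, and each message handler is a thin caller. Everything below is a toy model under stated assumptions. The struct and function names are hypothetical, the "spool" is an in-memory array rather than a BufFile, and the prepare handler only anticipates the later two-phase patches; it is not code from this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one spooled change (the real worker reads a BufFile). */
typedef struct SpooledChange
{
	int			payload;
} SpooledChange;

/* Hypothetical apply state; remote_final_lsn mirrors the worker.c variable name. */
typedef struct ApplyState
{
	uint64_t	remote_final_lsn;
	long		applied_sum;	/* toy side effect of "applying" a change */
} ApplyState;

/* Common processing: replay all spooled changes and return how many were applied. */
static int
replay_spooled_changes(ApplyState *state, const SpooledChange *changes,
					   size_t count, uint64_t lsn)
{
	int			nchanges = 0;

	state->remote_final_lsn = lsn;
	for (size_t i = 0; i < count; i++)
	{
		state->applied_sum += changes[i].payload;
		nchanges++;
	}
	return nchanges;
}

/* STREAM COMMIT handler: replays using the commit LSN. */
static int
handle_stream_commit(ApplyState *state, const SpooledChange *changes,
					 size_t count, uint64_t commit_lsn)
{
	return replay_spooled_changes(state, changes, count, commit_lsn);
}

/* STREAM PREPARE handler (assumed future caller): replays using the prepare LSN. */
static int
handle_stream_prepare(ApplyState *state, const SpooledChange *changes,
					  size_t count, uint64_t prepare_lsn)
{
	return replay_spooled_changes(state, changes, count, prepare_lsn);
}
```

Passing the LSN as a parameter, as the patch does, is what lets the same routine serve both callers: the commit path passes commit_lsn and a prepare path can pass prepare_lsn.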

Attachment: v16-0003-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From cf9a73649e99dbaf9b8c8747c19ca3dac3396a40 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 4 Nov 2020 18:11:58 +1100
Subject: [PATCH v16] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber tests (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl           | 345 ++++++++++++++
 src/test/subscription/t/021_twophase_streaming.pl | 521 ++++++++++++++++++++++
 2 files changed, 866 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_streaming.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..f489f47
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,345 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 21;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are rolled back on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_streaming.pl b/src/test/subscription/t/021_twophase_streaming.pl
new file mode 100644
index 0000000..9a03b83
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_streaming.pl
@@ -0,0 +1,521 @@
+# logical replication of 2PC test (streaming)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 28;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then one server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1

#84Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#80)

On Fri, Oct 30, 2020 at 9:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Oct 30, 2020 at 2:46 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Oct 29, 2020 at 11:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

6.
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+ else
+ appendStringInfo(ctx->out, "preparing streamed transaction");

I think we should include 'gid' as well in the above messages.

Updated.

gid needs to be included in the case of 'include_xids' as well.

Updated.

7.
@@ -221,12 +235,26 @@ StartupDecodingContext(List *output_plugin_options,
ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
(ctx->callbacks.stream_stop_cb != NULL) ||
(ctx->callbacks.stream_abort_cb != NULL) ||
+ (ctx->callbacks.stream_prepare_cb != NULL) ||
(ctx->callbacks.stream_commit_cb != NULL) ||
(ctx->callbacks.stream_change_cb != NULL) ||
(ctx->callbacks.stream_message_cb != NULL) ||
(ctx->callbacks.stream_truncate_cb != NULL);
/*
+ * To support two-phase logical decoding, we require prepare/commit-prepare/abort-prepare
+ * callbacks. The filter-prepare callback is optional. We however enable two-phase logical
+ * decoding when at least one of the methods is enabled so that we can easily identify
+ * missing methods.
+ *
+ * We decide it here, but only check it later in the wrappers.
+ */
+ ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+ (ctx->callbacks.commit_prepared_cb != NULL) ||
+ (ctx->callbacks.rollback_prepared_cb != NULL) ||
+ (ctx->callbacks.filter_prepare_cb != NULL);
+

I think stream_prepare_cb should be checked for the 'twophase' flag
because we won't use this unless two-phase is enabled. Am I missing
something?

Was fixed in v14.

But you still have it in the streaming check. I don't think we need
that for the streaming case.

Updated.
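
For readers following the thread, the flag derivation discussed in this item can be sketched roughly as below. This is only an illustration, not the patch code: DemoCallbacks, demo_cb, demo_noop and the two helper functions are hypothetical names, the real callback table has more members, and placing stream_prepare_cb under the two-phase check simply follows the review suggestion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the output plugin callback table. */
typedef void (*demo_cb) (void);

typedef struct DemoCallbacks
{
	demo_cb		stream_start_cb;
	demo_cb		stream_commit_cb;
	demo_cb		stream_prepare_cb;
	demo_cb		prepare_cb;
	demo_cb		commit_prepared_cb;
	demo_cb		rollback_prepared_cb;
	demo_cb		filter_prepare_cb;
} DemoCallbacks;

/* Streaming is considered enabled if any plain streaming callback is set. */
static bool
demo_streaming_enabled(const DemoCallbacks *cb)
{
	return cb->stream_start_cb != NULL ||
		cb->stream_commit_cb != NULL;
}

/*
 * Two-phase is considered enabled if any two-phase callback is set, so that
 * a missing required callback can be reported later. Assumption: per the
 * review, stream_prepare_cb is counted here and not in the streaming check.
 */
static bool
demo_twophase_enabled(const DemoCallbacks *cb)
{
	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/* Dummy callback used only to populate a slot in the table. */
static void
demo_noop(void)
{
}
```

With this arrangement a plugin that defines only stream_prepare_cb is flagged as two-phase capable but not as streaming capable, so the later wrapper checks can complain about whichever callbacks are missing.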

Few other comments on v15-0002-Support-2PC-txn-backend-and-tests:
======================================================================
1. The functions DecodeCommitPrepared and DecodeAbortPrepared have a
lot of code similar to DecodeCommit/Abort. Can we merge these
functions?

Merged the two functions into DecodeCommit and DecodeAbort.

2.
DecodeCommitPrepared()
{
..
+ * If filter check present and this needs to be skipped, do a regular commit.
+ */
+ if (ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn);
+ }
+ else
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, true);
+ }
+
+}

Can we expand the comment here to say why we need to do ReorderBufferCommit?

Updated.
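
As a rough illustration of why the filtered case falls back to ReorderBufferCommit: a transaction that was filtered out at PREPARE was never decoded then, so its changes still have to be replayed when the COMMIT PREPARED record arrives. The sketch below models only that decision. DemoCommitAction and demo_commit_action are hypothetical names, and the three booleans stand in for the record type, ctx->twophase and the result of the filter-prepare check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the two possible paths at commit time. */
typedef enum DemoCommitAction
{
	DEMO_REPLAY_AT_COMMIT,		/* replay all changes, as for a regular COMMIT */
	DEMO_FINISH_PREPARED		/* changes were replayed at PREPARE already */
} DemoCommitAction;

static DemoCommitAction
demo_commit_action(bool is_commit_prepared, bool plugin_twophase,
				   bool filtered_at_prepare)
{
	/* Two-phase handling applies only if the plugin supports it. */
	bool		prepared = is_commit_prepared && plugin_twophase;

	/*
	 * A regular COMMIT, or a COMMIT PREPARED whose PREPARE was filtered out,
	 * has nothing decoded yet, so everything is replayed now.
	 */
	if (!prepared || filtered_at_prepare)
		return DEMO_REPLAY_AT_COMMIT;

	return DEMO_FINISH_PREPARED;
}
```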

3. There are a lot of test cases in this patch which is a good thing
but can we split them into a separate patch for the time being as I
would like to focus on the core logic of the patch first. We can later
see if we need to retain all or part of those tests.

Split the patch and created a new patch for test_decoding tests.

4. Please run pgindent on your patches.

Have not done this yet. Will do it after unifying the patchset.

regards,
Ajin Cherian
Fujitsu Australia

#85Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#81)
3 attachment(s)

On Mon, Nov 2, 2020 at 9:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 28, 2020 at 10:50 AM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Ajin.

I have re-checked the v13 patches for how my remaining review comments
have been addressed.

On Tue, Oct 27, 2020 at 8:55 PM Ajin Cherian <itsajin@gmail.com> wrote:

====================
v12-0002. File: src/backend/replication/logical/reorderbuffer.c
====================

COMMENT
Line 2401
/*
* We are here due to one of the 3 scenarios:
* 1. As part of streaming in-progress transactions
* 2. Prepare of a two-phase commit
* 3. Commit of a transaction.
*
* If we are streaming the in-progress transaction then discard the
* changes that we just streamed, and mark the transactions as
* streamed (if they contained changes), set prepared flag as false.
* If part of a prepare of a two-phase commit set the prepared flag
* as true so that we can discard changes and cleanup tuplecids.
* Otherwise, remove all the
* changes and deallocate the ReorderBufferTXN.
*/
~
The above comment is beyond my understanding. Anything you could do to
simplify it would be good.

For example, when viewing this function in isolation I have never
understood why the streaming flag and the rbtxn_prepared(txn) flag
cannot both be set at the same time.

Perhaps the code is relying on just internal knowledge of how this
helper function gets called? And if it is just that, then IMO there
really should be some Asserts in the code to give more assurance about
that. (Or maybe use completely different flags to represent those 3
scenarios instead of bending the meanings of the existing flags)

Left this for now, probably re-look at this at a later review.
But just to explain; this function is what does the main decoding of
changes of a transaction.
At what point this decoding happens is what this feature and the
streaming in-progress feature is about. As of PG13, this decoding only
happens at commit time. With the streaming of in-progress txn feature,
this began to happen (if streaming enabled) at the time when the
memory limit for decoding transactions was crossed. This 2PC feature
is supporting decoding at the time of a PREPARE transaction.
Now, if streaming is enabled and streaming has started as a result of
crossing the memory threshold, then there is no need to
again begin streaming at a PREPARE transaction as the transaction that
is being prepared has already been streamed.

I don't think this is true; think of a case where we need to send the
last set of changes along with PREPARE. In that case we need to stream
those changes at the time of PREPARE. If I am correct then, as pointed
out by Peter, you need to change some comments and some of the
assumptions related to this that you have in the patch.

I have changed the asserts and the comments to reflect this.
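
To illustrate the case described above, here is a toy model (not reorderbuffer code) of a transaction that crossed the memory limit once and then received more changes before PREPARE. DemoTxn, demo_add_change and demo_prepare_needs_flush are hypothetical names, and byte counts stand in for buffered changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoTxn
{
	size_t		pending_bytes;	/* changes buffered but not yet sent */
	bool		streamed;		/* at least one chunk was already streamed */
} DemoTxn;

/*
 * Buffer a change. Once the buffered size exceeds the limit, everything
 * buffered so far is streamed. Returns true if streaming happened.
 */
static bool
demo_add_change(DemoTxn *txn, size_t bytes, size_t limit)
{
	txn->pending_bytes += bytes;
	if (txn->pending_bytes > limit)
	{
		txn->pending_bytes = 0;
		txn->streamed = true;
		return true;
	}
	return false;
}

/* At PREPARE: true if changes remain that must be sent with the prepare. */
static bool
demo_prepare_needs_flush(const DemoTxn *txn)
{
	return txn->pending_bytes > 0;
}
```

In this model a transaction can be marked as streamed and still have unsent changes at PREPARE time, which is why the prepare path cannot assume that an already-streamed transaction has nothing left to send.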

Few more comments on the latest patch
(v15-0002-Support-2PC-txn-backend-and-tests)
=========================================================================
1.
@@ -274,6 +296,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
DecodeAbort(ctx, buf, &parsed, xid);
break;
}
+ case XLOG_XACT_ABORT_PREPARED:
+ {

..
+
+ if (!TransactionIdIsValid(parsed.twophase_xid))
+ xid = XLogRecGetXid(r);
+ else
+ xid = parsed.twophase_xid;

I think we don't need this 'if' check here because you must have a
valid value of parsed.twophase_xid. But, I think this will be moot if
you address the review comment in my previous email such that the
handling of XLOG_XACT_ABORT_PREPARED and XLOG_XACT_ABORT will be
combined as it is there without the patch.

2.
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+   xl_xact_parsed_prepare * parsed)
+{
..
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ return;
+

I think this check is the same as the check in DecodeCommit, so you
can write some comments to indicate the same and also why we don't
need to call ReorderBufferForget here. One more thing to note is that
even though we don't need to call ReorderBufferForget here, we still
need to execute invalidations (which are present in the top-level txn)
for the reasons mentioned in ReorderBufferForget. Also, if we do this,
don't forget to update the comment atop
ReorderBufferImmediateInvalidation.

I have updated the comments. I wasn't sure when to execute
invalidations. Should I only execute invalidations if this was for a
database other than the one being decoded, or should I execute
invalidations every time we skip? I will also have to create a new
function in reorderbuffer.c similar to ReorderBufferForget, as the txn
is not available in decode.c.

3.
+ /* This is a PREPARED transaction, part of a two-phase commit.
+ * The full cleanup will happen as part of the COMMIT PREPAREDs, so now
+ * just truncate txn by removing changes and tuple_cids
+ */
+ ReorderBufferTruncateTXN(rb, txn, true);

The first line in the multi-line comment should be empty.

Updated.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v16-0002-Support-2PC-txn-backend.patch (application/octet-stream)
From 4bb24a77200ae43bada3c8df3f9f8085cc4acf6c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 4 Nov 2020 04:16:49 -0500
Subject: [PATCH v16] Support 2PC txn backend

Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.
---
 src/backend/replication/logical/decode.c        | 213 ++++++++++++++---
 src/backend/replication/logical/reorderbuffer.c | 290 ++++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  33 +++
 3 files changed, 463 insertions(+), 73 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..d21d92a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,9 +67,12 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid, bool prepared);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare * parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -244,16 +247,22 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool prepared;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+				/*
+				 * If this is COMMIT_PREPARED and the output plugin supports two-phase commits
+				 * then set the prepared flag to true.
+				 */
+				prepared = ((info == XLOG_XACT_COMMIT_PREPARED) && ctx->twophase)? true : false;
 
 				if (!TransactionIdIsValid(parsed.twophase_xid))
 					xid = XLogRecGetXid(r);
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				DecodeCommit(ctx, buf, &parsed, xid, prepared);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +271,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool prepared;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -270,8 +280,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 					xid = XLogRecGetXid(r);
 				else
 					xid = parsed.twophase_xid;
+				/*
+				 * If this is ABORT_PREPARED and the output plugin supports two-phase commits
+				 * then set the prepared flag to true.
+				 */
+				prepared = ((info == XLOG_XACT_ABORT_PREPARED) && ctx->twophase) ? true : false;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				DecodeAbort(ctx, buf, &parsed, xid, prepared);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +327,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *)XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+									xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -581,16 +614,17 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 /*
  * Consolidated commit record handling between the different form of commit
- * records.
+ * records. Handles both XLOG_XACT_COMMIT and XLOG_XACT_COMMIT_PREPARED.
+ * prepared is set to true if XLOG_XACT_COMMIT_PREPARED.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+					 xl_xact_parsed_commit *parsed, TransactionId xid, bool prepared)
 {
-	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
 	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
-	int			i;
+	int         i;
 
 	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
@@ -609,8 +643,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * There can be several reasons we might not be interested in this
 	 * transaction:
 	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
 	 * 2) The transaction happened in another database.
 	 * 3) The output plugin is not interested in the origin.
 	 * 4) We are doing fast-forwarding
@@ -647,34 +681,153 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
-	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	/*
+	 * This function could be called for COMMIT or COMMIT PREPARED (part of a
+	 * two-phase commit) determined by the flag 'prepared'.
+	 * If it is a regular COMMIT we need to replay all actions of the transaction
+	 * in order by calling ReorderBufferCommit.
+	 *
+	 * If it is a COMMIT PREPARED, we check if this has been
+	 * asked to be filtered using the filter prepare callback. If yes, then
+	 * this transaction has not been decoded at PREPARE and needs to be
+	 * handled like a regular COMMIT.
+	 *
+	 * If COMMIT PREPARED and not filtered we only need to call the corresponding
+	 * callbacks as actions of the transaction were already replayed at PREPARE.
+	 */
+	if (!prepared || (ctx->callbacks.filter_prepare_cb &&
+			ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid)))
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+		/*
+		 * Update the decoding stats at transaction commit/abort. It is not clear
+		 * that sending more or less frequently than this would be better.
+		 */
+		UpdateDecodingStats(ctx);
+	}
+	else
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+
+}
 
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare * parsed)
+{
+	XLogRecPtr  origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int         i;
+	TransactionId xid = parsed->twophase_xid;
+
+    /* ----
+     * Check whether we are interested in this specific transaction, and tell
+     * the reorderbuffer to forget the content of the (sub-)transactions
+     * if not.
+     *
+     * There can be several reasons we might not be interested in this
+     * transaction:
+     * 1) We might not be interested in decoding transactions up to this
+     *    LSN. This can happen because we previously decoded it and now just
+     *    are restarting or if we haven't assembled a consistent snapshot yet.
+     * 2) The transaction happened in another database.
+     * 3) The output plugin is not interested in the origin.
+     * 4) We are doing fast-forwarding
+	 *
+	 * We can't call ReorderBufferForget like we did in DecodeCommit as the
+	 * txn hasn't yet been committed, removing the reorderbuffers before a 
+	 * commit might result in the computation of an incorrect restart_lsn.
+	 * But we need to process cache invalidation if there are any.
+	 * Even if we're not interested in the transaction's contents, it could
+	 * have manipulated the catalog and we need to update the caches accordingly.
+	 */
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		 ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+		return;
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is setup correctly for the main transaction in case all changes
+	 * happened in subtransactions
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
+	/* replay actions of all transaction + subtransactions in order */
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
+	 * Update the decoding stats at transaction commit/two-phase prepare/abort. It is not clear
 	 * that sending more or less frequently than this would be better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c. This can be called for either an
+ * XLOG_XACT_ABORT or an XLOG_XACT_ABORT_PREPARED record. The prepared flag
+ * is set when called for an XLOG_XACT_ABORT_PREPARED record.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
 	}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	/*
+	 * If this is a regular ABORT, or the transaction is to be skipped or
+	 * filtered, just clean up by calling ReorderBufferAbort. Otherwise this
+	 * is a ROLLBACK PREPARED of a transaction we decoded at PREPARE time.
+	 */
+	if (!prepared ||
+	   (ctx->callbacks.filter_prepare_cb &&
+			ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid)) ||
+	   FilterByOrigin(ctx, origin_id) ||
+	   SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+	   parsed->dbId != ctx->slot->data.database)
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
+		
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+		
+	}
+   	else
+	{
+		/* ROLLBACK PREPARED of a previously prepared txn; call the callbacks. */
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c1bd680..dff4cdd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -421,6 +422,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1514,12 +1521,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1538,7 +1547,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1572,9 +1581,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, clean up the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1768,9 +1801,23 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
-
-	ReorderBufferCleanupTXN(rb, txn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPARED, so for
+		 * now just truncate the txn by removing changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1898,8 +1945,12 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/*
+	 * Discard the changes that we just streamed. This is never reached for a
+	 * PREPARE of a two-phase commit, so pass txn_prepared as false.
+	 */
+	Assert(!rbtxn_prepared(txn));
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1921,6 +1972,11 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 /*
  * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
  *
+ * We are here due to one of three scenarios:
+ * 1. Streaming of an in-progress transaction.
+ * 2. Prepare of a two-phase commit.
+ * 3. Commit of a transaction.
+ *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
  * merge) and replay the changes in lsn order.
@@ -2006,7 +2062,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2297,7 +2353,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT
+			 * (for regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2331,18 +2396,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of four scenarios:
+		 * 1. Prepare of a two-phase commit.
+		 * 2. Prepare of a two-phase commit while streaming an in-progress txn.
+		 * 3. Streaming of an in-progress txn.
+		 * 4. Commit of a transaction.
+		 *
+		 * Scenarios 1 and 2 are handled the same way: pass prepared as true to
+		 * ReorderBufferTruncateTXN, which allows more elaborate truncation of
+		 * txn data because the entire transaction has been decoded and only
+		 * the commit is pending. In scenario 3, pass prepared as false, since
+		 * the txn is not yet completely decoded. In scenario 4, the whole txn
+		 * has been decoded and we can fully clean up the ReorderBufferTXN.
 		 */
-		if (streaming)
+		if (rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, true);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn, false);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2372,17 +2451,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
+		 * abort of the (sub)transaction we are streaming or preparing. We need to do the
 		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we are
+			 * sending the data out on a PREPARE during a two-phase commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2390,10 +2470,22 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it can stream the remaining data.
+			 * Streaming can also happen for a prepared txn; handle it the same way.
+			 */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+						txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2415,23 +2507,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2473,6 +2558,120 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+   ReorderBufferTXN *txn;
+
+   txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+   return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Either way, two-phase transactions do not contain any reorderbuffers,
+	 * so allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+										  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2515,7 +2714,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 93c79c8..03f777c 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -234,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -614,6 +635,11 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -637,6 +663,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool 		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void 		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
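
For anyone reading the reorderbuffer.h hunk above without a tree at hand, here is a minimal standalone sketch of how the three new txn_flags bits combine over the life of a prepared transaction. Only the flag values and the shape of the rbtxn_* macros come from the patch; MiniTXN, mini_prepare() and mini_finish_prepared() are hypothetical stand-ins and not PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the reorderbuffer.h hunk in the patch. */
#define RBTXN_PREPARE             0x0080
#define RBTXN_COMMIT_PREPARED     0x0100
#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Hypothetical stand-in for ReorderBufferTXN; only the flags field is kept. */
typedef struct MiniTXN
{
	uint32_t	txn_flags;
} MiniTXN;

#define rbtxn_prepared(txn)          (((txn)->txn_flags & RBTXN_PREPARE) != 0)
#define rbtxn_commit_prepared(txn)   (((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0)
#define rbtxn_rollback_prepared(txn) (((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0)

/* Mirrors the flag update done in ReorderBufferPrepare(). */
static void
mini_prepare(MiniTXN *txn)
{
	txn->txn_flags |= RBTXN_PREPARE;
}

/* Mirrors the flag updates done in ReorderBufferFinishPrepared(). */
static void
mini_finish_prepared(MiniTXN *txn, bool is_commit)
{
	txn->txn_flags |= RBTXN_PREPARE;
	txn->txn_flags |= is_commit ? RBTXN_COMMIT_PREPARED : RBTXN_ROLLBACK_PREPARED;

	/* Sanity check added for this sketch: exactly one outcome is recorded. */
	assert(rbtxn_commit_prepared(txn) != rbtxn_rollback_prepared(txn));
}
```

The sanity check at the end is an addition for illustration only; the patch itself sets the bits without asserting that they are mutually exclusive.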

v16-0001-Support-2PC-txn-base.patch (application/octet-stream)
From c44cc520b0c15777de67b6bf463b55eda89ec26a Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 4 Nov 2020 03:39:12 -0500
Subject: [PATCH v16] Support 2PC txn base.

Until now, two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

Includes logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
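
As a rough illustration of the routing described in the commit message, the sketch below shows how a plugin-side GID filter could decide whether a transaction is decoded at PREPARE time or later as a regular commit. The "_nodecode" substring check is the one test_decoding uses in this patch; gid_is_filtered(), PrepareRoute and route_prepare() are hypothetical names, and treating "two-phase disabled" as "decode at commit" is an assumption drawn from the description of the pre-patch behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical standalone version of the GID check performed by
 * pg_decode_filter_prepare() in this patch: returning true means
 * "do not decode this transaction at PREPARE time".
 */
static bool
gid_is_filtered(const char *gid)
{
	assert(gid != NULL);
	return strstr(gid, "_nodecode") != NULL;
}

/* Hypothetical labels for the two paths a PREPARE can take. */
typedef enum
{
	DECODE_AT_PREPARE,			/* prepare_cb now, commit_prepared_cb later */
	DECODE_AT_COMMIT			/* sent as a regular commit later */
} PrepareRoute;

/* Assumed decision: filtered or non-two-phase txns wait for the commit. */
static PrepareRoute
route_prepare(bool twophase_enabled, const char *gid)
{
	if (!twophase_enabled || gid_is_filtered(gid))
		return DECODE_AT_COMMIT;
	return DECODE_AT_PREPARE;
}
```

A real plugin would put whatever policy it needs into its filter_prepare_cb; the substring match is only the example policy chosen for the tests.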
 contrib/test_decoding/test_decoding.c     | 190 +++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 146 ++++++++++++++++++-
 src/backend/replication/logical/logical.c | 228 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  35 +++++
 6 files changed, 643 insertions(+), 7 deletions(-)
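
A note on the check-xid-aborted option that this patch adds to test_decoding.c: the value is parsed with strtoul() and a NULL end pointer, so trailing characters after the number would go unnoticed. Below is a minimal sketch of a stricter parse, offered as a possible variant and not as what the patch does. parse_xid_option() is a hypothetical helper, and the TransactionId typedef is redeclared locally so that the snippet compiles outside the PostgreSQL tree.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Local redeclarations so the sketch builds standalone. */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/*
 * Hypothetical stricter parser for an option value such as
 * 'check-xid-aborted'. Returns true and stores the xid on success.
 */
static bool
parse_xid_option(const char *value, TransactionId *xid)
{
	char	   *end;
	unsigned long parsed;

	assert(value != NULL && xid != NULL);

	errno = 0;
	parsed = strtoul(value, &end, 0);

	/* Reject overflow, empty input, trailing garbage and too-wide values. */
	if (errno != 0 || end == value || *end != '\0' || parsed > UINT32_MAX)
		return false;

	/* Same validity rule as TransactionIdIsValid(): zero is not a valid xid. */
	if ((TransactionId) parsed == InvalidTransactionId)
		return false;

	*xid = (TransactionId) parsed;
	return true;
}
```

Since this option exists only to drive the concurrent-abort TAP test, the looser parsing in the patch may well be good enough; the stricter form mainly gives a clearer error when a test passes a malformed xid.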

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 8e33614..4be2ca5 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId	check_xid_aborted; /* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -73,6 +78,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									ReorderBufferTXN *txn,
+									XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -88,6 +96,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -112,10 +132,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool 		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +162,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +254,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+
+				errno = 0;
+				data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+								strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +294,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +354,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we
+ * demonstrate simple filtering based on the GID: if the GID
+ * contains the "_nodecode" substring, the transaction is
+ * filtered out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +599,26 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/*
+	 * If check_xid_aborted is a valid xid, it was passed in as an option
+	 * so that we wait here until the transaction with this xid is aborted.
+	 * This is used to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			   !TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -646,6 +810,32 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						ReorderBufferTXN *txn,
+						XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u, %s", txn->xid,
+						 quote_literal_cstr(txn->gid));
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..f5b617d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,56 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a transaction <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a transaction <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +657,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +740,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +794,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or an uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +850,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1041,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, committed using the
+    <function>commit_prepared_cb</function> callback, or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d5cfbea..76c85a9 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -227,6 +241,20 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require the prepare, commit-prepared and rollback-prepared
+	 * callbacks. The filter-prepare callback is optional. We however enable two-phase logical
+	 * decoding when at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +265,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +812,120 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then prepare callback is mandatory */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then commit prepared callback is mandatory */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/* If the plugin supports two-phase commits then rollback prepared callback is mandatory */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1002,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not enabled. In that
+	 * case all two-phase transactions are considered filtered out and will be
+	 * applied as regular transactions at COMMIT PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1245,46 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when streaming and two-phase commits are supported. */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..4c1341f 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until
+ * COMMIT PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for a PREPARE record unless it was filtered out by the filter_prepare
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn,
+											 XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index dfdda93..93c79c8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -244,6 +245,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char         *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -405,6 +409,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -431,6 +455,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -497,6 +527,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -505,6 +539,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
-- 
1.8.3.1
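
For anyone reading along, here is a rough standalone model of the enablement rule this patch adds to StartupDecodingContext() and the wrappers. It is only a sketch: MockCallbacks, MockContext, MockResult and all the mock_* functions are hypothetical stand-ins invented for illustration (with simplified callback signatures), not the real PostgreSQL types or API. It shows that two-phase decoding is switched on as soon as any of the five callbacks is registered, that the filter wrapper skips PREPARE-time decoding when two-phase is off, and that a missing required callback is only detected when its wrapper is invoked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for OutputPluginCallbacks. */
typedef struct MockCallbacks
{
	bool		(*filter_prepare_cb) (unsigned int xid, const char *gid);
	void		(*prepare_cb) (const char *gid);
	void		(*commit_prepared_cb) (const char *gid);
	void		(*rollback_prepared_cb) (const char *gid);
	void		(*stream_prepare_cb) (const char *gid);
} MockCallbacks;

/* Hypothetical stand-in for the relevant part of LogicalDecodingContext. */
typedef struct MockContext
{
	MockCallbacks callbacks;
	bool		twophase;
} MockContext;

/* Stands in for ereport(ERROR, ...) so the sketch can report the failure. */
typedef enum
{
	MOCK_OK,
	MOCK_MISSING_CALLBACK
} MockResult;

/* Mirrors the patch: any one registered callback enables two-phase. */
void
mock_startup(MockContext *ctx)
{
	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
		(ctx->callbacks.commit_prepared_cb != NULL) ||
		(ctx->callbacks.rollback_prepared_cb != NULL) ||
		(ctx->callbacks.stream_prepare_cb != NULL) ||
		(ctx->callbacks.filter_prepare_cb != NULL);
}

/* Returns true when decoding at PREPARE time should be skipped. */
bool
mock_filter_prepare(MockContext *ctx, unsigned int xid, const char *gid)
{
	if (!ctx->twophase)
		return true;			/* decode at COMMIT PREPARED instead */
	if (ctx->callbacks.filter_prepare_cb == NULL)
		return false;			/* optional callback: nothing is filtered */
	return ctx->callbacks.filter_prepare_cb(xid, gid);
}

/* The required callback is only checked when the wrapper is reached. */
MockResult
mock_prepare(MockContext *ctx, const char *gid)
{
	assert(ctx->twophase);
	if (ctx->callbacks.prepare_cb == NULL)
		return MOCK_MISSING_CALLBACK;
	ctx->callbacks.prepare_cb(gid);
	return MOCK_OK;
}

/* Hypothetical sample callbacks used to exercise the model. */
int			mock_prepare_calls = 0;

void
mock_count_prepare(const char *gid)
{
	(void) gid;
	mock_prepare_calls++;
}

bool
mock_skip_all(unsigned int xid, const char *gid)
{
	(void) xid;
	(void) gid;
	return true;
}
```

In this model a plugin that registers only filter_prepare_cb still gets twophase set, and then fails with the missing-callback result on the first PREPARE it does not filter out. That matches the "so that we can easily identify missing methods" reasoning in the comment above the ctx->twophase assignment.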

v16-0003-Support-2PC-test-cases-for-test_decoding.patch (application/octet-stream)
From 87f23a7950aea9c442a54064b0fea73f5163a5ea Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 4 Nov 2020 04:20:37 -0500
Subject: [PATCH v16] Support 2PC test cases for test_decoding

Add SQL and TAP tests to test_decoding for 2PC
---
 contrib/test_decoding/Makefile                     |   4 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 ++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 contrib/test_decoding/t/001_twophase.pl            | 121 +++++++++++
 6 files changed, 711 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in progress.
+# We pass the XID of the 2PC to the plugin via the "check-xid-aborted" option.
+# On receipt of a valid "check-xid-aborted", the change API in the test_decoding
+# plugin waits for that transaction to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while this decode
+# is waiting.
+#
+# The status of the "check-xid-aborted" XID changes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the XID via "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
-- 
1.8.3.1

#86Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#85)

On Wed, Nov 4, 2020 at 3:01 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Nov 2, 2020 at 9:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 28, 2020 at 10:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
2.
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+   xl_xact_parsed_prepare * parsed)
+{
..
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ return;
+

I think this check is the same as the check in DecodeCommit, so you
can add a comment saying so, and also explaining why we don't need to
call ReorderBufferForget here. One more thing to note: even though we
don't need to call ReorderBufferForget here, we still need to execute
invalidations (which are present in the top-level txn) for the reasons
mentioned in ReorderBufferForget. Also, if we do this, don't forget to
update the comment atop ReorderBufferImmediateInvalidation.

I have updated the comments. I wasn't sure when to execute
invalidations. Should I only execute invalidations if this was for
another database than what was being decoded, or should I execute
invalidations every time we skip?

I think so. Is there any such special condition in DecodeCommit, or do
you have some other reason in mind for not doing it every time we skip?
We probably don't need to execute them when the database is different
(at least I can't think of a reason to), but I guess this doesn't make
much difference and it will keep the code consistent with what we do in
DecodeCommit.
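
As an illustration only, the four skip conditions in the quoted check can be modeled as a standalone C predicate. The struct and function names below (PrepareSkipInput, prepare_needs_skip) are hypothetical stand-ins invented for this sketch; they are not PostgreSQL's actual types or the patch's code, and each field merely models the result of the real call named in its comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the decoding state; not PostgreSQL's real types. */
typedef uint32_t Oid;
#define InvalidOid ((Oid) 0)

typedef struct PrepareSkipInput
{
    bool before_start_point; /* models SnapBuildXactNeedsSkip() returning true */
    Oid  record_db;          /* models parsed->dbId */
    Oid  slot_db;            /* models ctx->slot->data.database */
    bool fast_forward;       /* models ctx->fast_forward */
    bool origin_filtered;    /* models FilterByOrigin() returning true */
} PrepareSkipInput;

/*
 * Returns true when the prepared transaction would be skipped, using the
 * same four conditions as the quoted check. A record with InvalidOid as
 * its database is never skipped on the database condition alone.
 */
static bool
prepare_needs_skip(const PrepareSkipInput *in)
{
    if (in->before_start_point)
        return true;
    if (in->record_db != InvalidOid && in->record_db != in->slot_db)
        return true;
    if (in->fast_forward)
        return true;
    return in->origin_filtered;
}
```

In this model, a caller following the suggestion above would execute the transaction's invalidations whenever prepare_needs_skip() returns true, regardless of which condition triggered the skip.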

--
With Regards,
Amit Kapila.

#87Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#86)

On Wed, Nov 4, 2020 at 9:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 4, 2020 at 3:01 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Nov 2, 2020 at 9:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 28, 2020 at 10:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
2.
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+   xl_xact_parsed_prepare * parsed)
+{
..
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ return;
+

I think this check is the same as the check in DecodeCommit, so you
can add a comment saying so, and also explaining why we don't need to
call ReorderBufferForget here. One more thing to note: even though we
don't need to call ReorderBufferForget here, we still need to execute
invalidations (which are present in the top-level txn) for the reasons
mentioned in ReorderBufferForget. Also, if we do this, don't forget to
update the comment atop ReorderBufferImmediateInvalidation.

I have updated the comments. I wasn't sure when to execute
invalidations. Should I execute invalidations only if this was for a
different database than the one being decoded, or should I execute
invalidations every time we skip?

I think so. Did there exist any such special condition in DecodeCommit
or do you have any other reason in your mind for not doing it every
time we skip? We probably might not need to execute when the database
is different (at least I can't think of a reason for the same) but I
guess this doesn't make much difference and it will keep the code
consistent with what we do in DecodeCommit.

I was just basing it on this comment in DecodeCommit:

* We can't just use ReorderBufferAbort() here, because we need to execute
* the transaction's invalidations. This currently won't be needed if
* we're just skipping over the transaction because currently we only do
* so during startup, to get to the first transaction the client needs. As
* we have reset the catalog caches before starting to read WAL, and we
* haven't yet touched any catalogs, there can't be anything to invalidate.
* But if we're "forgetting" this commit because it happened in
* another database, the invalidations might be important, because they
* could be for shared catalogs and we might have loaded data into the
* relevant syscaches.
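
The distinction drawn in that comment can be modelled in a few lines of C. This is only an illustrative sketch: `SkipReason` and `skip_needs_invalidations` are hypothetical names invented here and are not part of PostgreSQL or of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for why a transaction is skipped; not PostgreSQL names. */
typedef enum SkipReason
{
	SKIP_NONE,				/* transaction is not skipped at all */
	SKIP_BEFORE_START,		/* startup: before the first txn the client needs */
	SKIP_OTHER_DATABASE		/* transaction happened in another database */
} SkipReason;

/*
 * Model of the reasoning in the DecodeCommit comment: does a skipped
 * transaction still need its invalidations executed?
 */
static bool
skip_needs_invalidations(SkipReason reason)
{
	switch (reason)
	{
		case SKIP_BEFORE_START:
			/* caches were reset and no catalog has been touched yet */
			return false;
		case SKIP_OTHER_DATABASE:
			/* shared catalogs may already be loaded into syscaches */
			return true;
		case SKIP_NONE:
			break;
	}
	/* not a skip, so the question does not arise in this model */
	return false;
}
```

In this model only the "other database" case strictly needs the invalidations, which is why I asked whether to restrict them to that case.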

regards,
Ajin Cherian
Fujitsu Australia

#88Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#87)

On Wed, Nov 4, 2020 at 3:46 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Wed, Nov 4, 2020 at 9:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 4, 2020 at 3:01 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Nov 2, 2020 at 9:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 28, 2020 at 10:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
2.
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+   xl_xact_parsed_prepare * parsed)
+{
..
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ return;
+

I think this check is the same as the check in DecodeCommit, so you
can write some comments to indicate the same and also why we don't
need to call ReorderBufferForget here. One more thing to note is that
even though we don't need to call ReorderBufferForget here, we still
need to execute invalidations (which are present in the top-level txn) for
the reasons mentioned in ReorderBufferForget. Also, if we do this,
don't forget to update the comment atop
ReorderBufferImmediateInvalidation.

I have updated the comments. I wasn't sure when to execute
invalidations. Should I execute invalidations only if this was for a
different database than the one being decoded, or should I execute
invalidations every time we skip?

I think so. Did there exist any such special condition in DecodeCommit
or do you have any other reason in your mind for not doing it every
time we skip? We probably might not need to execute when the database
is different (at least I can't think of a reason for the same) but I
guess this doesn't make much difference and it will keep the code
consistent with what we do in DecodeCommit.

I was just basing it on this comment in DecodeCommit:

Okay, so the comment explains why we need to execute invalidations
even when the database is not the same. So, we are probably good here
if we execute the invalidations whenever we skip decoding the
prepared xact.
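
To make the agreed behaviour concrete, here is a rough, self-contained model of the skip path. It is a sketch under assumptions, not the patch code: `SkipInputs`, `execute_invalidations` and `decode_prepare_model` are hypothetical stand-ins, and the boolean fields merely model the results of the real checks quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified inputs to the skip check quoted from DecodePrepare. */
typedef struct SkipInputs
{
	bool		before_start_point; /* models SnapBuildXactNeedsSkip() */
	unsigned	txn_db;				/* models parsed->dbId, 0 = InvalidOid */
	unsigned	slot_db;			/* models ctx->slot->data.database */
	bool		fast_forward;		/* models ctx->fast_forward */
	bool		origin_filtered;	/* models FilterByOrigin() */
} SkipInputs;

/* Counts how often the stand-in invalidation routine has run. */
static int	invalidations_executed = 0;

/* Stand-in for whatever routine executes the transaction's invalidations. */
static void
execute_invalidations(void)
{
	invalidations_executed++;
}

/* Returns true if the prepared transaction is decoded, false if skipped. */
static bool
decode_prepare_model(const SkipInputs *in)
{
	if (in->before_start_point ||
		(in->txn_db != 0 && in->txn_db != in->slot_db) ||
		in->fast_forward || in->origin_filtered)
	{
		/* run invalidations on every skip, whatever the reason */
		execute_invalidations();
		return false;
	}
	return true;
}
```

The point of the model is that the invalidation call sits inside the skip branch without any further condition, so a startup skip and a different-database skip are treated alike.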

--
With Regards,
Amit Kapila.

#89Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#88)
3 attachment(s)

On Wed, Nov 4, 2020 at 9:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, so the comment explains why we need to execute invalidations
even when the database is not the same. So, we are probably good here
if we execute the invalidations whenever we skip decoding the
prepared xact.

Updated to execute invalidations while skipping prepared transactions.
Also ran pgindent on the source files with updated typedefs.
Attaching v17 with patches 1, 2 and 3.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v17-0001-Support-2PC-txn-base.patch (application/octet-stream)
From 806112f115f06782f2177a11d376b37862a885e6 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 5 Nov 2020 04:08:22 -0500
Subject: [PATCH v17] Support 2PC txn base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

Include logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 190 +++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 146 +++++++++++++++++-
 src/backend/replication/logical/logical.c | 242 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  35 +++++
 src/tools/pgindent/typedefs.list          |  11 ++
 7 files changed, 668 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 8e33614..80b7b51 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -73,6 +78,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -88,6 +96,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -112,10 +132,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +162,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +254,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+
+				errno = 0;
+				data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +294,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +354,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +599,26 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/*
+	 * if check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -646,6 +810,32 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u, %s", txn->xid,
+						 quote_literal_cstr(txn->gid));
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..f5b617d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     them are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,56 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a transaction commit prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a transaction rollback prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +657,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but yet
+      uncommitted) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +740,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decode
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +794,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, in case of decoding a prepared (but yet uncommitted)
+      transaction or decoding of an uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +850,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1041,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, then committed using the
+    <function>commit_prepared_cb</function> callback or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d5cfbea..e9107cd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -227,6 +241,21 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require
+	 * prepare/commit-prepare/abort-prepare callbacks. The filter-prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +266,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +813,129 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then rollback prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1012,52 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1256,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming commits requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..032e35a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/rollback_prepared callbacks, or wait until COMMIT PREPARED
+ * and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index dfdda93..66c89d1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -244,6 +245,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -405,6 +409,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -431,6 +455,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -497,6 +527,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -505,6 +539,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f2ba92b..1086e51 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1314,9 +1314,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1

Attachment: v17-0002-Support-2PC-txn-backend.patch (application/octet-stream)
From 40bed2c0b4f75e5f5e87263b6f1f9c8e8387561e Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 5 Nov 2020 22:01:08 -0500
Subject: [PATCH v17] Support 2PC txn backend.

Until now two-phase transactions were decoded at COMMIT PREPARED, just
like regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.
---
 src/backend/replication/logical/decode.c        | 222 ++++++++++++++---
 src/backend/replication/logical/reorderbuffer.c | 318 ++++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  34 +++
 3 files changed, 503 insertions(+), 71 deletions(-)
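
The routing that this patch adds to DecodeCommit() can be summarised with a
small standalone model. This is only a sketch, not PostgreSQL code: the names
route_commit, CommitAction, filter_prepare_fn and skip_prefix_filter are
hypothetical, and the two enum values merely stand in for the
ReorderBufferCommit() and ReorderBufferFinishPrepared() paths taken in the
hunks below. It assumes, as the patch does, that a filter callback returning
true means "do not decode this transaction at PREPARE time".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the output plugin's filter_prepare callback. */
typedef bool (*filter_prepare_fn) (const char *gid);

typedef enum
{
	REPLAY_AS_REGULAR_COMMIT,	/* models the ReorderBufferCommit() path */
	FINISH_PREPARED				/* models the ReorderBufferFinishPrepared() path */
} CommitAction;

/*
 * Decide how a commit record is handled.
 *
 * is_commit_prepared: the WAL record is a COMMIT PREPARED
 * twophase:           the plugin supports two-phase decoding and it is enabled
 * filter:             optional filter callback, may be NULL
 */
static CommitAction
route_commit(bool is_commit_prepared, bool twophase,
			 filter_prepare_fn filter, const char *gid)
{
	bool		prepared = is_commit_prepared && twophase;

	assert(gid != NULL);

	/*
	 * A regular COMMIT, or a COMMIT PREPARED whose transaction was filtered
	 * out at PREPARE time, still has all of its changes to replay.
	 */
	if (!prepared || (filter != NULL && filter(gid)))
		return REPLAY_AS_REGULAR_COMMIT;

	/* The changes were already replayed at PREPARE; only finish the txn. */
	return FINISH_PREPARED;
}

/* Example filter: skip PREPARE-time decoding for GIDs starting with "skip_". */
static bool
skip_prefix_filter(const char *gid)
{
	return strncmp(gid, "skip_", 5) == 0;
}
```

The same condition reappears in DecodeAbort(), with extra skip checks, to
choose between ReorderBufferAbort() and the ROLLBACK PREPARED path.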

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..06f5970 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,9 +67,12 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid, bool prepared);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -244,16 +247,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		prepared;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
 
+				/*
+				 * If this is COMMIT_PREPARED and the output plugin supports
+				 * two-phase commits then set the prepared flag to true.
+				 */
+				prepared = ((info == XLOG_XACT_COMMIT_PREPARED) && ctx->twophase) ? true : false;
+
 				if (!TransactionIdIsValid(parsed.twophase_xid))
 					xid = XLogRecGetXid(r);
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				DecodeCommit(ctx, buf, &parsed, xid, prepared);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +272,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		prepared;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +282,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If this is ABORT_PREPARED and the output plugin supports
+				 * two-phase commits then set the prepared flag to true.
+				 */
+				prepared = ((info == XLOG_XACT_ABORT_PREPARED) && ctx->twophase) ? true : false;
+
+				DecodeAbort(ctx, buf, &parsed, xid, prepared);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +329,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -581,11 +616,12 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 /*
  * Consolidated commit record handling between the different form of commit
- * records.
+ * records. Handles both XLOG_XACT_COMMIT and XLOG_XACT_COMMIT_PREPARED.
+ * prepared is set to true if XLOG_XACT_COMMIT_PREPARED.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid, bool prepared)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -609,8 +645,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * There can be several reasons we might not be interested in this
 	 * transaction:
 	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
 	 * 2) The transaction happened in another database.
 	 * 3) The output plugin is not interested in the origin.
 	 * 4) We are doing fast-forwarding
@@ -647,34 +683,164 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * This function could be called for COMMIT or COMMIT PREPARED (part of a
+	 * two-phase commit) determined by the flag 'prepared'. If it is a regular
+	 * COMMIT we need to replay all actions of the transaction in order by
+	 * calling ReorderBufferCommit.
+	 *
+	 * If it is a COMMIT PREPARED, we check if this has been asked to be
+	 * filtered using the filter prepare callback. If yes, then this
+	 * transaction has not been decoded at PREPARE and needs to be handled
+	 * like a regular COMMIT.
+	 *
+	 * If COMMIT PREPARED and not filtered we only need to call the
+	 * corresponding callbacks as actions of the transaction were already
+	 * replayed at PREPARE.
+	 */
+	if (!prepared || (ctx->callbacks.filter_prepare_cb &&
+					  ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid)))
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+
+		/*
+		 * Update the decoding stats at transaction commit/abort. It is not
+		 * clear that sending more or less frequently than this would be
+		 * better.
+		 */
+		UpdateDecodingStats(ctx);
+	}
+	else
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	/* ----
+	 * Check whether we are interested in this specific transaction, and tell
+	 * the reorderbuffer to forget the content of the (sub-)transactions
+	 * if not.
+	 *
+	 * There can be several reasons we might not be interested in this
+	 * transaction:
+	 * 1) We might not be interested in decoding transactions up to this
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
+	 * 2) The transaction happened in another database.
+	 * 3) The output plugin is not interested in the origin.
+	 * 4) We are doing fast-forwarding
+	 *
+	 * We can't call ReorderBufferForget like we did in DecodeCommit as the
+	 * txn hasn't yet been committed; removing the reorderbuffers before a
+	 * commit might result in the computation of an incorrect restart_lsn.
+	 * But we need to process cache invalidations if there are any.
+	 * Even if we're not interested in the transaction's contents, it could
+	 * have manipulated the catalog and we need to update the caches accordingly.
+	 */
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction commit/two-phase
+	 * prepare/abort. It is not clear that sending more or less frequently
+	 * than this would be better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c. This could be called either on
+ * an XLOG_XACT_ABORT or on an XLOG_XACT_ABORT_PREPARED. The prepared flag
+ * is set if called on an XLOG_XACT_ABORT_PREPARED.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
 	}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	/*
+	 * If this is a regular ABORT, or the transaction is to be skipped or
+	 * filtered, just clean up by calling ReorderBufferAbort. Otherwise this
+	 * is a ROLLBACK PREPARED of a transaction decoded at PREPARE.
+	 */
+	if (!prepared ||
+		(ctx->callbacks.filter_prepare_cb &&
+		 ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid)) ||
+		FilterByOrigin(ctx, origin_id) ||
+		SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		parsed->dbId != ctx->slot->data.database)
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
+
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+
+	}
+	else
+	{
+		/*
+		 * ROLLBACK PREPARED of a previously prepared txn, need to call the
+		 * callbacks.
+		 */
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c1bd680..feb305a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -421,6 +422,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1514,12 +1521,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or after a PREPARE. The flag txn_prepared indicates if this is
+ * called after a PREPARE. If streaming, keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots. If after a PREPARE,
+ * keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1538,7 +1547,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1572,9 +1581,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1768,9 +1801,24 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPARED, so for now
+		 * just truncate txn by removing changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1898,8 +1946,10 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/*
+	 * Discard the changes that we just streamed.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2006,7 +2056,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2297,7 +2347,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2331,18 +2390,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the 4 scenarios: 1. Prepare of a
+		 * two-phase commit. 2. Prepare of a two-phase commit and a part of
+		 * streaming in-progress txn. 3. Streaming of an in-progress txn. 4.
+		 * Commit of a transaction.
+		 *
+		 * Scenarios 1 and 2 are handled the same way: pass prepared as true
+		 * to ReorderBufferTruncateTXN and allow more elaborate truncation of
+		 * txn data, as the entire transaction has been decoded and only
+		 * commit is pending. In scenario 3, pass prepared as false to
+		 * ReorderBufferTruncateTXN as the txn is not yet completely decoded.
+		 * In scenario 4, the whole txn has been decoded and we can fully
+		 * clean up the TXN reorder buffer.
 		 */
-		if (streaming)
+		if (rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, true);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn, false);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2372,17 +2445,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2390,10 +2466,23 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream
+			 * remaining data. Streaming can also be on a prepared txn, handle
+			 * it the same way.
+			 */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2415,23 +2504,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2473,6 +2555,120 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, two-phase transactions do not contain any reorderbuffers. So
+	 * allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2515,7 +2711,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
@@ -2604,6 +2805,37 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ * Note that this is only allowed to be called when a transaction prepare
+ * has just been read, not otherwise.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 66c89d1..13c802b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -234,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -614,12 +635,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -637,6 +664,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
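
For readers skimming the reorderbuffer.c hunks above, here is a minimal standalone sketch of the two decisions ReorderBufferProcessTXN now makes at the end of a transaction: which output plugin callback to invoke, and how much of the transaction to throw away afterwards. The flag values are copied from the reorderbuffer.h hunk; everything prefixed with Sketch/sketch_ is a hypothetical stand-in invented for illustration and is not part of the patch. Only the non-streaming callback branch visible in the hunk is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as added to reorderbuffer.h by the patch. */
#define RBTXN_PREPARE             0x0080
#define RBTXN_COMMIT_PREPARED     0x0100
#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Hypothetical stand-in for ReorderBufferTXN: only the flag word is modelled. */
typedef struct SketchTXN
{
	uint32_t	txn_flags;
} SketchTXN;

/* Same shape as the rbtxn_prepared() macro in the patch. */
#define rbtxn_prepared(txn) (((txn)->txn_flags & RBTXN_PREPARE) != 0)

/* Hypothetical labels for the callback invoked at end of decoding. */
typedef enum SketchCallback
{
	SKETCH_CB_COMMIT,			/* stands for rb->commit() */
	SKETCH_CB_PREPARE			/* stands for rb->prepare() */
} SketchCallback;

/* Hypothetical labels for the cleanup path taken afterwards. */
typedef enum SketchCleanup
{
	SKETCH_TRUNCATE_PREPARED,	/* ReorderBufferTruncateTXN(rb, txn, true) */
	SKETCH_TRUNCATE_STREAMED,	/* ReorderBufferTruncateTXN(rb, txn, false) */
	SKETCH_CLEANUP_FULL			/* ReorderBufferCleanupTXN(rb, txn) */
} SketchCleanup;

/* Non-streaming branch: PREPARE for two-phase txns, COMMIT otherwise. */
SketchCallback
sketch_end_of_txn_callback(const SketchTXN *txn)
{
	assert(txn != NULL);
	return rbtxn_prepared(txn) ? SKETCH_CB_PREPARE : SKETCH_CB_COMMIT;
}

/*
 * Scenarios 1 and 2 (prepared, streamed or not) share one path, scenario 3
 * (streaming only) truncates without the prepared flag, and scenario 4
 * (plain commit) frees the whole transaction.
 */
SketchCleanup
sketch_cleanup_path(const SketchTXN *txn, bool streaming)
{
	assert(txn != NULL);
	if (rbtxn_prepared(txn))
		return SKETCH_TRUNCATE_PREPARED;
	if (streaming)
		return SKETCH_TRUNCATE_STREAMED;
	return SKETCH_CLEANUP_FULL;
}
```

Note that the prepared check comes first, so a prepared transaction that was also being streamed ends up on the prepared truncation path, which matches the ordering of the if/else chain in the hunk.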

Attachment: v17-0003-Support-2PC-test-cases-for-test_decoding.patch (application/octet-stream)
From 540c3c289caa9d9b40f6555dccdbc081bf4859da Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 5 Nov 2020 22:03:33 -0500
Subject: [PATCH v17] Support 2PC test cases for test_decoding.

Add sql and tap tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                     |   4 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 ++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 contrib/test_decoding/t/001_twophase.pl            | 121 +++++++++++
 6 files changed, 711 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
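
The tests in this patch rely on GIDs containing "_nodecode" being skipped at PREPARE time and decoded as an ordinary transaction at COMMIT PREPARED. The plugin's filter callback is not part of this patch, so the following is only a sketch of the rule as the test comments describe it, assuming a plain substring match; sketch_filter_prepare is a hypothetical name, not the plugin's actual function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical filter: return true when the PREPARE should be skipped, so
 * that the transaction is sent as a regular commit later. Assumes the rule
 * is "GID contains the substring _nodecode", as the test comments suggest.
 */
bool
sketch_filter_prepare(const char *gid)
{
	assert(gid != NULL);
	return strstr(gid, "_nodecode") != NULL;
}
```

Under that assumption, 'test_prepared_nodecode' and 'test1_nodecode' are filtered, while 'test_prepared#1' and 'test1' are decoded at PREPARE, which lines up with the expected output files below.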

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check that `CLUSTER` (an operation that holds an exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+-- should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts followed by a plain COMMIT, not a COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+#Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
-- 
1.8.3.1
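
The poll_output_until helper in the TAP test above is a bounded polling
loop: check a condition, sleep briefly, and give up after a fixed number of
attempts. A minimal standalone C sketch of the same pattern follows. It is
only an illustration and not part of the patch; the names poll_until,
poll_predicate, countdown and countdown_ready are hypothetical, and
nanosleep() is assumed to be available (POSIX).

```c
#define _POSIX_C_SOURCE 199309L

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical predicate type: returns true once the awaited condition holds. */
typedef bool (*poll_predicate) (void *arg);

/*
 * Call pred(arg) up to max_attempts times, sleeping delay_ms milliseconds
 * after each failed attempt. Returns true as soon as the predicate
 * succeeds, false if every attempt fails.
 */
bool
poll_until(poll_predicate pred, void *arg, int max_attempts, long delay_ms)
{
	int			attempts;

	assert(pred != NULL);

	for (attempts = 0; attempts < max_attempts; attempts++)
	{
		if (pred(arg))
			return true;

		if (delay_ms > 0)
		{
			struct timespec ts;

			ts.tv_sec = delay_ms / 1000;
			ts.tv_nsec = (delay_ms % 1000) * 1000000L;
			nanosleep(&ts, NULL);
		}
	}

	/* The condition never became true within the attempt budget. */
	return false;
}

/* Demo state for a predicate that succeeds on call number succeed_on. */
typedef struct
{
	int			calls;
	int			succeed_on;
} countdown;

/* Demo predicate: counts its calls and reports success once the target is hit. */
bool
countdown_ready(void *arg)
{
	countdown  *c = (countdown *) arg;

	c->calls++;
	return c->calls >= c->succeed_on;
}
```

With max_attempts = 1800 and delay_ms = 100 this gives the same 180 second
budget as the Perl helper. A real predicate would re-read the server log and
match the expected message instead of counting calls.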

#90Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#83)
6 attachment(s)
4.
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+ Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+ /* The synchronization worker runs in single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();

There is no explanation as to why you want to end the previous
transaction and start a new one. Even if we have to do so, we first
need to call BeginTransactionBlock before CommitTransactionCommand.

Done

---

Also...

pgindent has been run for all patches now.

The latest versions of all six patches are again unified under a common v18
version number.

PSA

Kind Regards,
Peter Smith.
Fujitsu Australia.

Attachments:

v18-0001-Support-2PC-txn-base.patch (application/octet-stream)
From fbb0fad329045544e2e25aed148f8eb5b33d309c Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 9 Nov 2020 12:00:39 +1100
Subject: [PATCH v18] Support 2PC txn base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, together with the corresponding GID.

Includes logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 190 +++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 146 +++++++++++++++++-
 src/backend/replication/logical/logical.c | 242 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  35 +++++
 src/tools/pgindent/typedefs.list          |  11 ++
 7 files changed, 668 insertions(+), 7 deletions(-)
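
As a reading aid for the test_decoding changes in this patch, the GID-based
prepare filter can be sketched in standalone C. This is only an illustration
of the rule used by pg_decode_filter_prepare (a GID containing "_nodecode" is
filtered, so the transaction is decoded as a regular one at COMMIT PREPARED
time); gid_is_filtered is a hypothetical name and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the test_decoding filter rule: return true
 * (meaning "do not decode at PREPARE time") when the GID contains the
 * "_nodecode" substring, false otherwise.
 */
bool
gid_is_filtered(const char *gid)
{
	assert(gid != NULL);

	return strstr(gid, "_nodecode") != NULL;
}
```

Under this rule the GIDs 'test_prepared_nodecode' and 'test1_nodecode' used
in the regression tests are filtered, while 'test_prepared#1' and 'test1'
are decoded at PREPARE time.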

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 8e33614..80b7b51 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -73,6 +78,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -88,6 +96,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -112,10 +132,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +162,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +254,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+
+				errno = 0;
+				data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +294,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +354,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, " %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +599,26 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/*
+	 * if check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -646,6 +810,32 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %u, %s", txn->xid,
+						 quote_literal_cstr(txn->gid));
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..f5b617d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     them are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,56 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a transaction commit prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a transaction rollback prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +657,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but not yet
+      committed) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +740,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decoding
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +794,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or an uncommitted transaction, this message
+      callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +850,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1041,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, committed using the
+    <function>commit_prepared_cb</function> callback, or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d5cfbea..e9107cd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -227,6 +241,21 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require
+	 * prepare/commit-prepare/abort-prepare callbacks. The filter-prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +266,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +813,129 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of rollback prepared record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then rollback prepared callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1012,52 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1256,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..032e35a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/rollback_prepared callbacks or wait till COMMIT PREPARED
+ * and be sent as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index dfdda93..66c89d1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -244,6 +245,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -405,6 +409,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -431,6 +455,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -497,6 +527,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -505,6 +539,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f2ba92b..1086e51 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1314,9 +1314,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1
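
For readers skimming the patch above, the way ctx->twophase is derived in
StartupDecodingContext() and then checked later in the wrappers can be
sketched in isolation. This is only an illustration, not code from the
patch: SketchCB, SketchCallbacks, sketch_twophase_enabled() and
sketch_first_missing() are hypothetical stand-ins, and the real callbacks
take a LogicalDecodingContext and a ReorderBufferTXN. The sketch leaves
stream_prepare_cb out of the "missing" check because the patch requires it
only in the streaming wrapper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the callback pointers; real ones take arguments. */
typedef void (*SketchCB) (void);

/* Hypothetical stand-in for the two-phase members of OutputPluginCallbacks. */
typedef struct SketchCallbacks
{
	SketchCB	filter_prepare_cb;	/* optional */
	SketchCB	prepare_cb;
	SketchCB	commit_prepared_cb;
	SketchCB	rollback_prepared_cb;
	SketchCB	stream_prepare_cb;	/* needed only when streaming */
} SketchCallbacks;

/* Mirrors the ctx->twophase expression: any one 2PC callback enables it. */
static bool
sketch_twophase_enabled(const SketchCallbacks *cb)
{
	assert(cb != NULL);
	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/*
 * Name of the first required non-streaming callback that is missing, or
 * NULL if two-phase decoding is disabled or nothing is missing. In the
 * patch each wrapper raises an ERROR for its own callback instead.
 */
static const char *
sketch_first_missing(const SketchCallbacks *cb)
{
	if (!sketch_twophase_enabled(cb))
		return NULL;
	if (cb->prepare_cb == NULL)
		return "prepare_cb";
	if (cb->commit_prepared_cb == NULL)
		return "commit_prepared_cb";
	if (cb->rollback_prepared_cb == NULL)
		return "rollback_prepared_cb";
	return NULL;
}

/* Placeholder callback body used only to obtain a non-NULL pointer. */
static void
sketch_noop(void)
{
}
```

The point of enabling the flag on any single callback is that a plugin which
registers, say, only filter_prepare_cb is reported as incomplete when the
first PREPARE is decoded, rather than silently falling back to decoding at
COMMIT PREPARED.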

Attachment: v18-0003-Support-2PC-test-cases-for-test_decoding.patch (application/octet-stream)
From 99ea1ff6c83a75d116d4ced5da9677806b6db916 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 9 Nov 2020 12:22:31 +1100
Subject: [PATCH v18] Support 2PC test cases for test_decoding.

Add sql and tap tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                     |   4 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 ++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 contrib/test_decoding/t/001_twophase.pl            | 121 +++++++++++
 6 files changed, 711 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
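
Test 7 in two_phase.sql relies on a filter_prepare callback that skips
PREPARE-time decoding for GIDs containing "_nodecode". The test_decoding
side of that is not part of this patch, so the following is only a guess at
the shape of such a filter, with hypothetical names (SketchTransactionId,
sketch_filter_prepare) and without the ctx and txn arguments of the real
LogicalDecodeFilterPrepareCB signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for TransactionId; not the PostgreSQL definition. */
typedef unsigned int SketchTransactionId;

/*
 * Return true to skip decoding at PREPARE time, so the transaction is
 * decoded as a regular one at COMMIT PREPARED; return false to decode it
 * at PREPARE. The answer depends only on the GID, so it is the same every
 * time the callback is invoked for a given xid/gid pair.
 */
static bool
sketch_filter_prepare(SketchTransactionId xid, const char *gid)
{
	(void) xid;					/* unused in this sketch */
	assert(gid != NULL);
	return strstr(gid, "_nodecode") != NULL;
}
```

With a filter like this, 'test_prepared_nodecode' is held back until COMMIT
PREPARED, while GIDs such as 'test_prepared#1' are decoded at PREPARE, which
matches the empty result expected right after the PREPARE in Test 7.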

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the xid of the 2PC to the plugin as an option.
+# On receipt of a valid "check-xid-aborted" value, the change callback in the
+# test_decoding plugin waits for that transaction to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while this decode
+# is waiting.
+#
+# The status of the "check-xid-aborted" xid then changes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
-- 
1.8.3.1
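
For readers who do not follow the Perl, the bounded polling that `poll_output_until` performs in the TAP test above (re-check a condition, sleep briefly, give up after a fixed number of attempts) can be sketched in C. This is only an illustration and is not part of the patch: `poll_until`, `FakeLog` and `fake_log_contains` are hypothetical names, and the "log" is a simple counter rather than a real server log file.

```c
#define _POSIX_C_SOURCE 199309L

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns true once the awaited condition holds. */
typedef bool (*poll_check_fn) (void *arg);

/*
 * Call check(arg) up to max_attempts times, sleeping sleep_usec microseconds
 * after each failed attempt. Returns true as soon as the check succeeds and
 * false if every attempt failed.
 */
static bool
poll_until(poll_check_fn check, void *arg, int max_attempts, long sleep_usec)
{
	for (int attempt = 0; attempt < max_attempts; attempt++)
	{
		struct timespec ts;

		if (check(arg))
			return true;

		ts.tv_sec = sleep_usec / 1000000;
		ts.tv_nsec = (sleep_usec % 1000000) * 1000;
		nanosleep(&ts, NULL);
	}

	return false;
}

/* Stand-in for "does the log contain the expected line yet?" */
typedef struct FakeLog
{
	int			calls;			/* how many times we were polled */
	int			ready_after;	/* poll number at which the line "appears" */
} FakeLog;

static bool
fake_log_contains(void *arg)
{
	FakeLog    *log = (FakeLog *) arg;

	log->calls++;
	return log->calls >= log->ready_after;
}
```

With the values used in the TAP test (1800 attempts, 100000 microseconds between attempts) the loop gives up after roughly 180 seconds, which matches the comment in `poll_output_until`.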

v18-0004-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From f7a8909795f313ca3cdc5d2b3b5b9a7ed588906d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 9 Nov 2020 14:10:07 +1100
Subject: [PATCH v18] Support 2PC txn - spoolfile.

This patch is a pure refactoring: it moves the streaming spool-file processing into a separate function.
The later two-phase commit logic needs to call this common processing from multiple places.
---
 src/backend/replication/logical/worker.c | 58 +++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0468491..9fa816c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -933,30 +935,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +957,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +972,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1041,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1077,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
-- 
1.8.3.1
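
The shape of the refactoring in the patch above can be sketched as a small standalone C program fragment. This is a simplified model under stated assumptions, not the worker.c code: the spool is an in-memory table instead of a BufFile, `handle_stream_commit` and `handle_stream_prepare` are hypothetical stand-ins for the apply handlers (the prepare caller is the one the commit message anticipates), and only the `apply_spooled_messages(xid, lsn)` signature and the `remote_final_lsn` assignment are taken from the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL typedefs of the same names. */
typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;

/* Mirrors the worker's global that records the final LSN being applied. */
static XLogRecPtr remote_final_lsn = 0;

/* Hypothetical in-memory "spool": one entry per streamed transaction. */
typedef struct Spool
{
	TransactionId xid;
	int			nchanges;
} Spool;

static const Spool spools[] = {{700, 3}, {701, 5}};

/*
 * Common spool processing: replay every spooled change for xid and
 * return how many changes were applied (0 if no spool exists).
 */
static int
apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
{
	remote_final_lsn = lsn;

	for (size_t i = 0; i < sizeof(spools) / sizeof(spools[0]); i++)
	{
		if (spools[i].xid == xid)
		{
			int			applied = 0;

			/* "Apply" each change; here that only means counting it. */
			for (int c = 0; c < spools[i].nchanges; c++)
				applied++;
			return applied;
		}
	}

	return 0;
}

/* Existing caller: replay at stream commit. */
static int
handle_stream_commit(TransactionId xid, XLogRecPtr commit_lsn)
{
	return apply_spooled_messages(xid, commit_lsn);
}

/* Hypothetical second caller: replay at prepare time. */
static int
handle_stream_prepare(TransactionId xid, XLogRecPtr prepare_lsn)
{
	return apply_spooled_messages(xid, prepare_lsn);
}
```

Passing the LSN in as a parameter, rather than reading it from the commit message inside the helper, is what lets a caller other than the commit handler reuse the function.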

v18-0005-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 6d360e67a130738462f55e94f61fda8fe0a9d856 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 9 Nov 2020 16:53:00 +1100
Subject: [PATCH v18] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.
---
 src/backend/access/transam/twophase.c       |  33 +++-
 src/backend/replication/logical/proto.c     | 141 ++++++++++++++++-
 src/backend/replication/logical/worker.c    | 236 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c |  74 +++++++++
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++-
 src/tools/pgindent/typedefs.list            |   1 +
 7 files changed, 518 insertions(+), 5 deletions(-)
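
As a reading aid for the proto.c changes in this patch, the byte layout that `logicalrep_write_prepare` appears to produce (message type, flags byte, three 64-bit fields in network byte order, then the NUL-terminated GID) can be sketched in standalone C. This is an illustration, not the patch's code: `sketch_write_prepare` and `put_u64_be` are hypothetical helpers, and the numeric values of `SKETCH_MSG_PREPARE` and `SKETCH_IS_PREPARE` are assumptions made for the example; the real constants are defined in logicalproto.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values, for illustration only. */
#define SKETCH_MSG_PREPARE	'P'
#define SKETCH_IS_PREPARE	0x01

/* Store a 64-bit value most significant byte first (network byte order). */
static size_t
put_u64_be(uint8_t *buf, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t) (v >> (56 - 8 * i));
	return 8;
}

/*
 * Encode one PREPARE message into buf. Returns the number of bytes
 * written, or 0 if the buffer is too small.
 */
static size_t
sketch_write_prepare(uint8_t *buf, size_t cap, uint8_t flags,
					 uint64_t prepare_lsn, uint64_t end_lsn,
					 int64_t commit_time, const char *gid)
{
	size_t		gidlen = strlen(gid) + 1;	/* include the trailing NUL */
	size_t		need = 1 + 1 + 3 * 8 + gidlen;
	size_t		off = 0;

	if (need > cap)
		return 0;

	buf[off++] = SKETCH_MSG_PREPARE;	/* message type */
	buf[off++] = flags;					/* PREPARE / COMMIT / ROLLBACK PREPARED */
	off += put_u64_be(buf + off, prepare_lsn);
	off += put_u64_be(buf + off, end_lsn);
	off += put_u64_be(buf + off, (uint64_t) commit_time);
	memcpy(buf + off, gid, gidlen);		/* GID as a C string */
	off += gidlen;

	return off;
}
```

For a GID of "test1" this yields 32 bytes: 2 header bytes, 24 bytes of fixed-width fields and 6 bytes for the string including its terminator.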

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..847f85d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check whether a prepared transaction with the given GID exists
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..cfb94d1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK]
+	 * PREPARED uses non-streaming APIs
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9fa816c..f1e94ad 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -742,6 +742,234 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * While a prepared transaction is being decoded, it may be aborted
+	 * concurrently. In that case we stop decoding and abort the transaction
+	 * immediately. However, the ROLLBACK PREPARED still reaches the
+	 * subscriber, so it is ok for the gid to be missing on the apply side
+	 * here.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare_txn (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1969,6 +2197,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..71ac431 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +392,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +913,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index cca13da..7c6686c 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -54,10 +54,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +116,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +124,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +153,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +200,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1086e51..f9df33c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1340,6 +1340,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

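For anyone reading along, the PREPARE message that logicalrep_write_prepare()
and logicalrep_read_prepare() exchange in the patch above is laid out as: type
byte 'P', one flags byte, prepare_lsn, end_lsn and the timestamp as 64-bit
values, then the NUL-terminated gid. Below is a rough standalone sketch of that
layout, not the patch's code. The sketch_* names and SKETCH_GIDSIZE are made up
for illustration (the real code uses StringInfo and the pq_* routines, and
GIDSIZE comes from the PostgreSQL headers), and unlike the patch it reads the
type byte itself rather than leaving that to apply_dispatch().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flag values copied from the patch's logicalproto.h hunk. */
#define LOGICALREP_IS_PREPARE           0x01
#define LOGICALREP_IS_COMMIT_PREPARED   0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED 0x04

/* Hypothetical stand-in for GIDSIZE; an assumption of this sketch. */
#define SKETCH_GIDSIZE 200

/* Mirrors the fields of LogicalRepPrepareData using plain C types. */
typedef struct SketchPrepareData
{
	uint8_t		prepare_type;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		preparetime;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepareData;

/* Append a 64-bit value in network byte order; returns the new offset. */
static size_t
sketch_put_u64(uint8_t *buf, size_t off, uint64_t v)
{
	for (int i = 7; i >= 0; i--)
		buf[off++] = (uint8_t) (v >> (8 * i));
	return off;
}

/* Read a 64-bit network-order value and advance the offset. */
static uint64_t
sketch_get_u64(const uint8_t *buf, size_t *off)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[(*off)++];
	return v;
}

/* Encode one message; returns its length, or 0 if it does not fit in cap. */
size_t
sketch_write_prepare(uint8_t *buf, size_t cap, const SketchPrepareData *d)
{
	size_t		gidlen = strlen(d->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (cap < 2 + 3 * 8 + gidlen)
		return 0;
	buf[off++] = 'P';			/* message type */
	buf[off++] = d->prepare_type;	/* flags */
	off = sketch_put_u64(buf, off, d->prepare_lsn);
	off = sketch_put_u64(buf, off, d->end_lsn);
	off = sketch_put_u64(buf, off, (uint64_t) d->preparetime);
	memcpy(buf + off, d->gid, gidlen);
	return off + gidlen;
}

/* Decode one message; returns 0 on success, -1 on malformed input. */
int
sketch_read_prepare(const uint8_t *buf, size_t len, SketchPrepareData *d)
{
	size_t		off = 0;
	size_t		gidlen;
	const uint8_t *nul;

	if (len < 2 + 3 * 8 + 1 || buf[off++] != 'P')
		return -1;

	/* exactly one of the three flag values is acceptable */
	d->prepare_type = buf[off++];
	if (d->prepare_type != LOGICALREP_IS_PREPARE &&
		d->prepare_type != LOGICALREP_IS_COMMIT_PREPARED &&
		d->prepare_type != LOGICALREP_IS_ROLLBACK_PREPARED)
		return -1;

	d->prepare_lsn = sketch_get_u64(buf, &off);
	d->end_lsn = sketch_get_u64(buf, &off);
	d->preparetime = (int64_t) sketch_get_u64(buf, &off);

	/* the gid must be terminated inside the message and fit the buffer */
	nul = memchr(buf + off, '\0', len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off)) + 1;
	if (gidlen > sizeof(d->gid))
		return -1;
	memcpy(d->gid, buf + off, gidlen);
	return 0;
}
```

The reader rejects a gid that is unterminated or longer than the destination
buffer instead of copying it blindly; that is a choice made for this sketch,
and whether the real reader should error out or truncate is a separate
question for the patch.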
v18-0002-Support-2PC-txn-backend.patch (application/octet-stream)
From df74b8490cc27080de8cdac3b0b81113d10dc558 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 9 Nov 2020 12:16:41 +1100
Subject: [PATCH v18] Support 2PC txn backend.

Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.
---
 src/backend/replication/logical/decode.c        | 222 ++++++++++++++---
 src/backend/replication/logical/reorderbuffer.c | 318 ++++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  34 +++
 3 files changed, 503 insertions(+), 71 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..06f5970 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,9 +67,12 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid, bool prepared);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -244,16 +247,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		prepared;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
 
+				/*
+				 * If this is COMMIT_PREPARED and the output plugin supports
+				 * two-phase commits then set the prepared flag to true.
+				 */
+				prepared = (info == XLOG_XACT_COMMIT_PREPARED) && ctx->twophase;
+
 				if (!TransactionIdIsValid(parsed.twophase_xid))
 					xid = XLogRecGetXid(r);
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				DecodeCommit(ctx, buf, &parsed, xid, prepared);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +272,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		prepared;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +282,13 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If this is ABORT_PREPARED and the output plugin supports
+				 * two-phase commits then set the prepared flag to true.
+				 */
+				prepared = (info == XLOG_XACT_ABORT_PREPARED) && ctx->twophase;
+
+				DecodeAbort(ctx, buf, &parsed, xid, prepared);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +329,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -581,11 +616,12 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 /*
  * Consolidated commit record handling between the different form of commit
- * records.
+ * records. Handles both XLOG_XACT_COMMIT and XLOG_XACT_COMMIT_PREPARED.
+ * prepared is set to true if XLOG_XACT_COMMIT_PREPARED.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid, bool prepared)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -609,8 +645,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * There can be several reasons we might not be interested in this
 	 * transaction:
 	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
 	 * 2) The transaction happened in another database.
 	 * 3) The output plugin is not interested in the origin.
 	 * 4) We are doing fast-forwarding
@@ -647,34 +683,164 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * This function could be called for COMMIT or COMMIT PREPARED (part of a
+	 * two-phase commit) determined by the flag 'prepared'. If it is a regular
+	 * COMMIT we need to replay all actions of the transaction in order by
+	 * calling ReorderBufferCommit.
+	 *
+	 * If it is a COMMIT PREPARED, we check if this has been asked to be
+	 * filtered using the filter prepare callback. If yes, then this
+	 * transaction has not been decoded at PREPARE and needs to be handled
+	 * like a regular COMMIT.
+	 *
+	 * If COMMIT PREPARED and not filtered we only need to call the
+	 * corresponding callbacks as actions of the transaction were already
+	 * replayed at PREPARE.
+	 */
+	if (!prepared || (ctx->callbacks.filter_prepare_cb &&
+					  ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid)))
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+
+		/*
+		 * Update the decoding stats at transaction commit/abort. It is not
+		 * clear that sending more or less frequently than this would be
+		 * better.
+		 */
+		UpdateDecodingStats(ctx);
+	}
+	else
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in COMMIT
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->origin_timestamp;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	/* ----
+	 * Check whether we are interested in this specific transaction, and tell
+	 * the reorderbuffer to forget the content of the (sub-)transactions
+	 * if not.
+	 *
+	 * There can be several reasons we might not be interested in this
+	 * transaction:
+	 * 1) We might not be interested in decoding transactions up to this
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
+	 * 2) The transaction happened in another database.
+	 * 3) The output plugin is not interested in the origin.
+	 * 4) We are doing fast-forwarding
+	 *
+	 * We can't call ReorderBufferForget like we did in DecodeCommit as the
+	 * txn hasn't yet been committed; removing the reorderbuffers before a
+	 * commit might result in the computation of an incorrect restart_lsn.
+	 * But we need to process cache invalidations if there are any.
+	 * Even if we're not interested in the transaction's contents, it could
+	 * have manipulated the catalog and we need to update the caches accordingly.
+	 */
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction commit/two-phase
+	 * prepare/abort. It is not clear that sending more or less frequently
+	 * than this would be better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c. This could be called either on
+ * an XLOG_XACT_ABORT or on an XLOG_XACT_ABORT_PREPARED. The prepared flag
+ * is set if called on an XLOG_XACT_ABORT_PREPARED.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = 0;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
 	}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	/*
+	 * If this is a regular ABORT, or the transaction is to be skipped or
+	 * filtered, just clean up by calling ReorderBufferAbort. Otherwise it
+	 * is the ROLLBACK PREPARED of a previously decoded prepared transaction.
+	 */
+	if (!prepared ||
+		(ctx->callbacks.filter_prepare_cb &&
+		 ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid)) ||
+		FilterByOrigin(ctx, origin_id) ||
+		SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		parsed->dbId != ctx->slot->data.database)
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
+
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+
+	}
+	else
+	{
+		/*
+		 * ROLLBACK PREPARED of a previously prepared txn, need to call the
+		 * callbacks.
+		 */
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c1bd680..feb305a 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -421,6 +422,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1514,12 +1521,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1538,7 +1547,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1572,9 +1581,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1768,9 +1801,24 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPARED, so for
+		 * now just truncate the txn by removing its changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1898,8 +1946,10 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/*
+	 * Discard the changes that we just streamed.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2006,7 +2056,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2297,7 +2347,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2331,18 +2390,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of 4 scenarios: 1. Prepare of a two-phase
+		 * commit. 2. Prepare of a two-phase commit that is also part of a
+		 * streamed in-progress txn. 3. Streaming of an in-progress txn. 4.
+		 * Commit of a transaction.
+		 *
+		 * Scenarios 1 and 2 are handled the same way: pass prepared as true
+		 * to ReorderBufferTruncateTXN, which allows more elaborate truncation
+		 * of txn data because the entire transaction has been decoded and only
+		 * the commit is pending. In scenario 3, pass prepared as false to
+		 * ReorderBufferTruncateTXN as the txn is not yet completely decoded.
+		 * In scenario 4, the whole txn has been decoded and we can fully clean
+		 * up the ReorderBufferTXN.
 		 */
-		if (streaming)
+		if (rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, true);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn, false);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2372,17 +2445,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2390,10 +2466,23 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream
+			 * remaining data. Streaming can also be on a prepared txn, handle
+			 * it the same way.
+			 */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2415,23 +2504,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2473,6 +2555,120 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, a prepared transaction has no changes left in the reorder
+	 * buffer at this point, so allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2515,7 +2711,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
@@ -2604,6 +2805,37 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ * Note that this is only allowed to be called when a transaction prepare
+ * has just been read, not otherwise.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 66c89d1..13c802b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -234,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -614,12 +635,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -637,6 +664,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1

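The end-of-transaction handling that the reorderbuffer.c hunks above add to ReorderBufferProcessTXN and ReorderBufferTruncateTXN can be summarised in a small standalone sketch. This is only an illustration, not part of the patch: the flag values and the rbtxn_prepared() macro are copied from the reorderbuffer.h hunk, while MiniTXN, CleanupAction, choose_cleanup() and marks_streamed() are hypothetical names invented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as added to reorderbuffer.h by the patch above. */
#define RBTXN_PREPARE             0x0080
#define RBTXN_COMMIT_PREPARED     0x0100
#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Hypothetical stand-in for ReorderBufferTXN; only the flags matter here. */
typedef struct MiniTXN
{
    uint32_t    txn_flags;
} MiniTXN;

/* Same test as the macro added by the patch. */
#define rbtxn_prepared(txn) (((txn)->txn_flags & RBTXN_PREPARE) != 0)

/* Hypothetical labels for the three cleanup paths. */
typedef enum CleanupAction
{
    TRUNCATE_PREPARED,      /* ReorderBufferTruncateTXN(rb, txn, true) */
    TRUNCATE_STREAMED,      /* ReorderBufferTruncateTXN(rb, txn, false) */
    CLEANUP_FULL            /* ReorderBufferCleanupTXN(rb, txn) */
} CleanupAction;

/*
 * Mirror the if / else if / else chain at the end of the PG_TRY block:
 * a prepared txn wins over streaming, and a plain commit gets full cleanup.
 */
static inline CleanupAction
choose_cleanup(const MiniTXN *txn, bool streaming)
{
    assert(txn != NULL);

    if (rbtxn_prepared(txn))
        return TRUNCATE_PREPARED;
    else if (streaming)
        return TRUNCATE_STREAMED;
    return CLEANUP_FULL;
}

/*
 * Mirror the RBTXN_IS_STREAMED condition in ReorderBufferTruncateTXN:
 * never mark a txn as streamed when truncating after a PREPARE; otherwise
 * mark toplevel txns always and subxacts only if they had changes.
 */
static inline bool
marks_streamed(bool txn_prepared, bool is_toplevel, uint64_t nentries_mem)
{
    return !txn_prepared && (is_toplevel || nentries_mem != 0);
}
```

For example, a txn carrying RBTXN_PREPARE selects TRUNCATE_PREPARED whether or not streaming is on, which matches scenarios 1 and 2 in the patch comment.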
v18-0006-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From b4fd33329e2e5677ab58132012638432d3f64e61 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 9 Nov 2020 16:57:10 +1100
Subject: [PATCH v18] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber tests (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl           | 345 ++++++++++++++
 src/test/subscription/t/021_twophase_streaming.pl | 521 ++++++++++++++++++++++
 2 files changed, 866 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_streaming.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..f489f47
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,345 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 21;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_streaming.pl b/src/test/subscription/t/021_twophase_streaming.pl
new file mode 100644
index 0000000..9a03b83
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_streaming.pl
@@ -0,0 +1,521 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 28;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1
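
The value 3334 expected by the streaming tests above follows from the SQL in each prepared transaction: after the INSERT the table holds keys 1..5000, the UPDATE does not change the row count, and the DELETE removes every key divisible by 3. A minimal sketch of that arithmetic, where demo_expected_rows is a hypothetical helper and not part of the patch:

```c
/*
 * Hypothetical helper (not part of the patch): counts the rows that
 * remain when a table holding keys 1..max_key has every key with
 * mod(a, 3) = 0 deleted, mirroring the DELETE in the test transaction.
 */
static int
demo_expected_rows(int max_key)
{
	int			rows = 0;

	for (int a = 1; a <= max_key; a++)
	{
		if (a % 3 != 0)
			rows++;
	}
	return rows;
}
```

With max_key = 5000 this gives 3334, and the one extra row inserted outside the prepared transaction (key 99999) accounts for the 3335 expected in the INSERT-before-COMMIT PREPARED case.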

#91Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#90)

On Mon, Nov 9, 2020 at 3:23 PM Peter Smith <smithpb2250@gmail.com> wrote:

4.
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data)
+{
+ Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+ /* The synchronization worker runs in single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+ /* End the earlier transaction and start a new one */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+ StartTransactionCommand();

There is no explanation as to why you want to end the previous
transaction and start a new one. Even if we have to do so, we first
need to call BeginTransactionBlock before CommitTransactionCommand.

Done

---

Also...

pgindent has been run for all patches now.

The latest of all six patches are again reunited with a common v18
version number.

I've looked at the patches and done some tests. Here are the comments
and a question that came up during testing and reviewing.

+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+             xl_xact_parsed_prepare *parsed)
+{
+   XLogRecPtr  origin_lsn = parsed->origin_lsn;
+   TimestampTz commit_time = parsed->origin_timestamp;
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-           xl_xact_parsed_abort *parsed, TransactionId xid)
+           xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared)
 {
    int         i;
+   XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+   TimestampTz commit_time = 0;
+   XLogRecPtr  origin_id = XLogRecGetOrigin(buf->record);
-   for (i = 0; i < parsed->nsubxacts; i++)
+   if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
    {
-       ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-                          buf->record->EndRecPtr);
+       origin_lsn = parsed->origin_lsn;
+       commit_time = parsed->origin_timestamp;
    }

In the above two changes, parsed->origin_timestamp is used as
commit_time. But in DecodeCommit() we use parsed->xact_time instead.
Therefore, if a transaction didn't have replorigin_session_origin set,
the timestamp in the logical decoding output generated by test_decoding
with the 'include-timestamp' option is invalid. Is this intentional?

---
+   if (is_commit)
+       txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+   else
+       txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+   if (rbtxn_commit_prepared(txn))
+       rb->commit_prepared(rb, txn, commit_lsn);
+   else if (rbtxn_rollback_prepared(txn))
+       rb->rollback_prepared(rb, txn, commit_lsn);

RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED are used only here,
and it seems to me that they are not really necessary.
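
To illustrate the point, here is a minimal sketch of dispatching directly on is_commit without setting any flag first. All Demo* types and demo_* functions are simplified stand-ins invented for this example; they are not the real ReorderBuffer definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types, assumed for illustration only. */
typedef uint64_t DemoLsn;
typedef struct DemoTxn
{
	uint32_t	txn_flags;
} DemoTxn;

typedef struct DemoReorderBuffer DemoReorderBuffer;
struct DemoReorderBuffer
{
	void		(*commit_prepared) (DemoReorderBuffer *rb, DemoTxn *txn, DemoLsn lsn);
	void		(*rollback_prepared) (DemoReorderBuffer *rb, DemoTxn *txn, DemoLsn lsn);
	int			commits_seen;
	int			rollbacks_seen;
};

/* Demo callbacks that only count how often they were invoked. */
static void
demo_commit_prepared(DemoReorderBuffer *rb, DemoTxn *txn, DemoLsn lsn)
{
	(void) txn;
	(void) lsn;
	rb->commits_seen++;
}

static void
demo_rollback_prepared(DemoReorderBuffer *rb, DemoTxn *txn, DemoLsn lsn)
{
	(void) txn;
	(void) lsn;
	rb->rollbacks_seen++;
}

static void
demo_rb_init(DemoReorderBuffer *rb)
{
	rb->commit_prepared = demo_commit_prepared;
	rb->rollback_prepared = demo_rollback_prepared;
	rb->commits_seen = 0;
	rb->rollbacks_seen = 0;
}

/* Pick the callback from is_commit; txn_flags is left untouched. */
static void
demo_finish_prepared(DemoReorderBuffer *rb, DemoTxn *txn,
					 DemoLsn commit_lsn, bool is_commit)
{
	assert(rb != NULL && txn != NULL);

	if (is_commit)
		rb->commit_prepared(rb, txn, commit_lsn);
	else
		rb->rollback_prepared(rb, txn, commit_lsn);
}
```

Whether the flags are needed elsewhere (for example by an output plugin) is not something this sketch can answer; it only shows that the branch quoted above does not depend on them.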

---
+               /*
+                * If this is COMMIT_PREPARED and the output plugin supports
+                * two-phase commits then set the prepared flag to true.
+                */
+               prepared = ((info == XLOG_XACT_COMMIT_PREPARED) &&
ctx->twophase) ? true : false;

We can write instead:

prepared = ((info == XLOG_XACT_COMMIT_PREPARED) && ctx->twophase);

+               /*
+                * If this is ABORT_PREPARED and the output plugin supports
+                * two-phase commits then set the prepared flag to true.
+                */
+               prepared = ((info == XLOG_XACT_ABORT_PREPARED) &&
ctx->twophase) ? true : false;

The same is true here.
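
As a quick check that the two spellings are equivalent, here is a small sketch. The DEMO_* constants are arbitrary placeholder values chosen for this example and the function names are hypothetical; only the shape of the expression is taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder record-type values, assumed for illustration only. */
enum
{
	DEMO_XACT_COMMIT = 1,
	DEMO_XACT_COMMIT_PREPARED = 2
};

/* The form used in the patch: a redundant "? true : false". */
static bool
demo_prepared_with_ternary(uint8_t info, bool twophase)
{
	return ((info == DEMO_XACT_COMMIT_PREPARED) && twophase) ? true : false;
}

/* The suggested form: the && expression already yields 0 or 1. */
static bool
demo_prepared_direct(uint8_t info, bool twophase)
{
	return (info == DEMO_XACT_COMMIT_PREPARED) && twophase;
}
```

In C the result of && is always 0 or 1, so the ternary adds nothing.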

---
'git show --check' of v18-0002 reports some warnings.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#92Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#91)

On Mon, Nov 9, 2020 at 1:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 9, 2020 at 3:23 PM Peter Smith <smithpb2250@gmail.com> wrote:

I've looked at the patches and done some tests. Here are the comments
and a question that came up during testing and reviewing.

+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+             xl_xact_parsed_prepare *parsed)
+{
+   XLogRecPtr  origin_lsn = parsed->origin_lsn;
+   TimestampTz commit_time = parsed->origin_timestamp;
static void
DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-           xl_xact_parsed_abort *parsed, TransactionId xid)
+           xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared)
{
int         i;
+   XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+   TimestampTz commit_time = 0;
+   XLogRecPtr  origin_id = XLogRecGetOrigin(buf->record);
-   for (i = 0; i < parsed->nsubxacts; i++)
+   if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
{
-       ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-                          buf->record->EndRecPtr);
+       origin_lsn = parsed->origin_lsn;
+       commit_time = parsed->origin_timestamp;
}

In the above two changes, parsed->origin_timestamp is used as
commit_time. But in DecodeCommit() we use parsed->xact_time instead.
Therefore, if a transaction didn't have replorigin_session_origin set,
the timestamp in the logical decoding output generated by test_decoding
with the 'include-timestamp' option is invalid. Is this intentional?

I think all three of DecodePrepare/DecodeAbort/DecodeCommit should have
the same handling for this, with one exception: at DecodePrepare time
we can't rely on XACT_XINFO_HAS_ORIGIN, so instead we need to check
whether origin_timestamp is non-zero and, if so, overwrite commit_time
with it. Does that make sense to you?
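
A minimal sketch of that rule for the prepare case, assuming commit_time starts out as the record's xact_time (as DecodeCommit() is described above as doing). DemoTimestampTz and demo_choose_commit_time are stand-ins invented for this example, not names from the patch.

```c
#include <stdint.h>

/* Stand-in for TimestampTz, assumed here to be a 64-bit integer. */
typedef int64_t DemoTimestampTz;

/*
 * Hypothetical helper: default to the transaction's own timestamp and
 * let a non-zero origin timestamp override it.
 */
static DemoTimestampTz
demo_choose_commit_time(DemoTimestampTz xact_time,
						DemoTimestampTz origin_timestamp)
{
	DemoTimestampTz commit_time = xact_time;

	if (origin_timestamp != 0)
		commit_time = origin_timestamp;

	return commit_time;
}
```

With this shape, a transaction without a replication origin still reports a valid timestamp, which is the case raised in the question.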

---
+   if (is_commit)
+       txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+   else
+       txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+   if (rbtxn_commit_prepared(txn))
+       rb->commit_prepared(rb, txn, commit_lsn);
+   else if (rbtxn_rollback_prepared(txn))
+       rb->rollback_prepared(rb, txn, commit_lsn);

RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED are used only here
and it seems to me that it's not necessarily necessary.

+1.

---
+               /*
+                * If this is COMMIT_PREPARED and the output plugin supports
+                * two-phase commits then set the prepared flag to true.
+                */
+               prepared = ((info == XLOG_XACT_COMMIT_PREPARED) &&
ctx->twophase) ? true : false;

We can write instead:

prepared = ((info == XLOG_XACT_COMMIT_PREPARED) && ctx->twophase);

+               /*
+                * If this is ABORT_PREPARED and the output plugin supports
+                * two-phase commits then set the prepared flag to true.
+                */
+               prepared = ((info == XLOG_XACT_ABORT_PREPARED) &&
ctx->twophase) ? true : false;

The same is true here.

+1.

---
'git show --check' of v18-0002 reports some warnings.

I have also noticed this. Actually, I have already started making some
changes to these patches apart from what you have reported so I'll
take care of these things as well.

--
With Regards,
Amit Kapila.

#93Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#92)
2 attachment(s)

Hi.

I have re-generated new coverage reports using the current (v18) source. PSA

Note: This is the coverage reported after running only the following tests:

1. make check

2. cd contrib/test_decoding; make check

3. cd src/test/subscription; make check

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v18_coverage_test_decoding.tar.gzapplication/gzip; name=v18_coverage_test_decoding.tar.gzDownload
v18_coverage_replication.tar.gz (application/gzip)
�$�������6�q�It9&,���4/��[b�k�7<��y��4����?99:�C��G�}3s����T�o0w����t��V�7��<��y���F/^�qx+���.�8��i]W���Aff6@�8R��(yI�x���fY4������8���b�x�{���(rU[ek72P(��od�?\%��yoJy(!$Y��9�3W���8����5��<�9y��,�D�[��uP��&��������!��0sZ�N�N�f
����h�������	U��1P,�y�fN@�8������������V�\����'J�,@e�3�p���M����w�|�.q�����(_����5Q:���������z�AM�~��=�����t~������qt���8���^���?�"B��v���8���o*���/�9|*�=�=9�����(UH�*��A�B�N��g�j)���#�3;VQ@A9XJ������ Kc;�q��.�l�c�`��?�q�����B�a�/ �<���e���P�K\�qp�������	�o��-���!W:�����t,H��!U:�����|-�����S:���E���H�� )�����'����-�<���Pem��ai*��<�z*z����������Vv,���aNeu������{?������N�AyU�G��*oK+��q�������Si]T�w-�x���M:�����-����1O:��v-�T��QM:_iG�'��=�k�'��<�^*Z�����dW����j�8�Ig��Y�}68=�=���KQ����R���
��0L�0�B���f%Ki���p����p�}_�����X�X`i%�U�����c9$�Os	("���j��%�8���� G�u�����#K�q,-��[lg_��-9d�le[���w������;S��adf4�����
1�<��q�������!B�D%�Tj�P�D�i`�}Udd�!�����
/��|��
�"����t��cQ������*�(�i�Lf�oG��T�c��l��r�P����Y�*u�5M4�C���P����)�0��-T��$��6�����sPQ���W}:J�
��/�)@y}df��C��9�wv����`F>����@#���l���s
���b�����n��9���'��N����k]����sR]�8k��c�����6>@�|-2�qk��!Af6n��9��8d({�}��n��9>���5>G���!�&Y-f���o�L�d���5f6@�8����
�To���0��������oF������t����iu�^�Z3��t���5�o�?��,�R�X�F�����R2u��0>����Eb9���P&�|���k�t�c`����5���4���&�1>K������cc��w�\�"S�/�R���������c�l���Jg��@�8����PW����a|��S�\�jE����o�����9��	^������9*U��k[5�Ef�:�/�������-;P�u��3>���h����8X��2>���a�'j�@4>��u	@��I��fO���"���u@���\����9jf�����Gpfft��$tt��f�@�8��i�������A5OE<����d`NY��n��W��(!G���l��5/ m|��!N���7���g#Y������oU<�������r|��P�J��q9Oku8>G�����1K��:"�c�}��xSk�Z�sh��
�J�tH������?�e�n�m��,���o�������2Kk�&@��%��zM���H������z�V��,���H�O�������d�\J�?����g���5����u2���xF�����?QW����d�����-��k����VT{��T���"��
�1G�T/?�Y�f��U��a��u3T[���9���,����3���s�e�������=?��Y��ek6�w�>��
b���)PQr#�Fn�Wc)������/o����^{qkN}������`���Y#"�^�CR,k������h�W�d�K������_1e*�=+����x�)K�f����@fl������sd�����Ng������1>�6`>�:��0������0�:1 7�f��P��O]{����gr�6L�YL�gG��W8Pa��1�Z��=�~~(�/������Om��"�*�O�y<��t���S��J���y����$���{�j�+�u<���y���w;�n����G"�35V�n�7 ?>�Vh����SU\UDO�z�)�E�g�-L���: ��
bcR|h�*Pr��P��@ �����T�|��L��1A�
@~|��WA�����M(���x����4��\h�����{e�dr���~�xR!n���~]���J��iWweZ���fan�p�P�K������G�B�{�,���K
��?DP��*f�T��u��AA3 (@P�AOWl�������������X4���8�����D*�D�e�p&FIt=Is��]����)�eLN�R��Y�����Z�q�����S���Mi���-�*����w����n���a��U�+��bp�E��Q���ao�Z�y^8���K��
KM�i�_�����T{�i!�%�$����Ju�_���L���a�5���wv3���zF:e�X-������
X�Zc���=8�o�����Y0m�Ukn�80�r����Y"�<<����E�+�]5��f����~����^��Z4Q�o��m����+���T�|Y�(<���P�f���w�ef�����2!�����W�KNK�����ff4�#����2c2��~���3!E%�N_�����e��"XU��a����e��#��������\���ey0u\��
�b����,j3�Xf�~T.�����`@�g�d�[�,�)G�Q�+��`��|��l�PZ����i��N���6�,��7�Wy�ic0����d�����f�@�Y�����
,���z���pj���mU��}`[�-`�?U��0{��`����D�C��l����_/5�,��Z�:�n�c�-����g��������)0�G)�����8�|pV�w?EIw��+��C)�%�	-�)H)`������t[�	����Eq@w�>��~;f���������f�d���1��wpt���������v7���-����Y��f�@���(�]4�P���������L��/����r��6�/�5���l+�l�8�e�b��TwR_�R��q,6v�
��J�T,�J��0�Y
PG���b�tZD���GY4�����N���#�!w��p��_��Yz[@�g�����#@^�ri%��t�D|*���1�����p����3�����QO.���*�Q���\a	Q,������&���*V0y�'a�42D"SU��S�R����e������%�������H\�wpj*�a�]
������(L�$EKg�5�^�v��6��vzE1��<���a��:�wv�5rS��,#����/p�~�3����Y0Z<Xg�
�f�0�l�:�	���tw�U6�(0���v!,D(����{����\�a�r�%�j���*��z�e�n1p��5rT�P��"�E���T��t��\<�1��F������H8"��t��n8p�z�v�����A.�R�M�`�3�KT(�mMT�`��
c}�s_{q
���#�`�0M����h�Zd��R��(.�V�V�@
��W:�2�v4�`��%����^�K���E��T��w=�7K��Lo�b�1�Lo�b������ec���r:������������%�C��Vm���N�b+Thm�8�����ae�dy�Z�f>��P�A��}p������&�b�����8�K�h�oJ����\����p�Y���\9rQ������C�!���3�:�a(V���_�\�A�6p+m���b�����3��[�C�64��)9������!�������$��[qM�e<������*���s�������j{�[�CK4��v���[IN)��J������1�0D3��=h����(��{1����v��I:������mR����D
oj�/�*P����1E%���
7r���Vo/�������Y�u(W�2��������x!�����)X����E]_4����f!	C���,��l������a����h|/��=3������ZU��	pz+
]/�@=!����m�#������@�9�O&*f,J?/���E+Z�~M1cI���!�����J��;��~B,4���.

q(6	�/*dQI�\r0�@�T���wG���0.
�
t{���&"�0`��
�VD�, C>0�a�N�t<O�(�(=�o	�����iXRrH!�Q�(�a��V��
u�����3�W,��G���������������uj-q?�e��Ajw�-���qi�V\K,`O9��L�H�F�b5�5K,`c9��L�F����P��=_�����c_�3���K�d��o���Z
��7Y}
p3��M����Q����\�������9����J���

�6��L3d1��3�	��NC��Y0�C�Y���;:?<�����r��&Q�'{4X�;=��%���L�4���,��{�����q�2F@{H�hZ7Zj3����^2U��+�D\y�V]h�t3t��+���w�L�+���df�W�rx�NK6�6���R)p�(�b�7�k,g�8m& �f��7A��$�"����N�����r�a� �25��Y����6)Uf\	?Z���--$��>�3���T������O^���+W�2Q���E��,^���e���r�r�(��t�7�h�
b��
 h��8���@ks���@�?���5R44l���

94TJxq�~�p�@lo�l~/w��zg���������!�-qy��d~���7G��;� �����|���/��qq��#��Pi�]��*R�V`���c@�l�r����C���j�p���}k����Y���(����7�T����rxf��$��_�2P._��~��h0GD��89?�o������CK�,,�|]�h����s��~	�~>3�B����*�����(
]�S��I3������d���@
������)��.`��U�����"&��r�_#`Z���
��!�%CG���{rY�R�:"���S����Q.�7��S=��2d���G��w��R����y��P�����s��`2d�a�f6@�Y�q!��l��u���&s��������Q�r�)�8����*�����h�����~�C��I� C��4����
5SC�����&����)�����5T�����u��pt%��d(_u��P��4�\a��2��K��le���E��R?��0��C-��p9q�E���?J`�8l��6��)��l�nMYAo��v3q�a�p��R;��:�i��_E���6�#N�(q����?H��� ��WXe�����2����V��j�(q�Ua���8�*�8�_]`�A���.�������A��8�*s��#N��-D���q[��0�Q���r�G��q�V�����8n���X�?J�Unq��%��*�8�{mq����#��q[��}���-�	z]!�u�^Ixz��iq<-����lq�V��a�:yZO��iq<-�����8�����Z��m�[N^������Z_��kq-N��	�8�'
iP���r�C��fq�[�oq��S�\�]���3G`�x�[mn���������@
e!�n�,Vg�������0��YjF��1�������Q&��R��-�$����/2^$�=o�G[�P����i����xRu�s-��y�C�U���x~QrW�P����	/�?ZZ�����tK`_:�����j�*��z��������?�o�}m�|i{Lq�}q`����6P/���������Jb��������g�h���F��T������P���
�;�G�����c
��Ho9�t�U@e��euu!�����PK��
����B�cJ���[�@�-��m
k�y�#��l������)5s�G;fI�>qH�>������^||{t�<��etJA�/�7����_�R�� ��'1m�c��_'[d��	����[�����n�4*8����CY�3��nqD�9����������dg.���q����G�O��y����������������k�v��;>;)>3�	������
�vt�n���owO��?{?�������/��g��������H
l�x�e�g�'g������������Y��I��_���������4_������+l�����}��
~�?=;:�0������}8���2�����K����}����v�����>�����DV���#0���Xq\�A�,����G�xS����������S������T�'7��B�����C%�eO��(���'�Wjq)}���������b�������?�������R�k����6��2�����E����������k�g���4?�0?�4?�z������!%��l��+�����Ng�<�M��8��>����96%�x2K���}>�o��Z���R��&�[-�s�
(��O/�<\F�O�)�`@o8N�#*���%5���,d�t��k�r�X�%z�dcw�A���*�(K�@R5��~����	�R�8����k�+�F��)�����4������>�<1�x���X��p�Y�TV��?(�^}2���	��|���v����j���n�a�]]�?���M����y�)=�������h�������F���af�5�.���||�xf�
�K���zyx��\��R.7�_��K.�6�,�OX������+Z��
�6�IYj���G���dx#n�r�vY8"#Z����!�����f8w!eqR���!�����n��CpC����(��7k���VG���lp�������@��6��M�w�C%_oT��im-N[��w�����F����������N����y-NG�����-�r�����jq��8��p�E�8z����Z��Go�	�fB���Pz���#4�9�u�w������H��=R{��.���K$f6����!r�v68�?|/�}s�\iq�C�l�@�>����p6�L��I>�G��l��^]p�n���u�wD�����X�������b��m�[ -���
P\�Y��p��w�|��;�C����
P �
���f�ig�|jkNVr���&�����1�����n��M��c���f��+3���}�2\��,g���~���R�*LQ�Z:]/d����-��F�����Iukf�rG���a����)��>����K�?�>���s�`�l�U�����.e����4�a�Lw�CK���WQ��( 
�C`q�)q��X���6����3[`c��:i�}�{Q-���zjR����v�U�w�Z;��^�����C��� 	��vp����Q�x��J�����FI�y�dU�b���� 9���C�](�f��*H��
bf� �C�N��������f�,��jD*��Iq�2���3���qgWiv���Y���4����|�HGyz�������G���#�RS`�W8n�h���N��+�R���A��hq74\W����q��hq�/�q\�h��;��YA�y��)�g��o�S`|������+��+��
#Xb��.N��'Jf�	[+�����r	���.mq��ivq�`W���4�e��5�l�8�S�B,���2F�T��o���B�����\!*�I`����dc�a����������i(�*��nezT���'=��D�}q��I���+5����kB����Jgk�Li����MX�(����n<�f�7��M���ptv���Y���Fk��6�}��n��X��,��6�Y��n����mo��+�Y����`\	h;�,�3������M�y%�)����Y����9����6����of���6�����&z���6G�;�F���p|���M�l|�c����U�,�XTw�_�	�/��Mq�Apzb/'t��Dtv����[���BD�m�gv}f#4�+����������� |Ks`��{3��<_��(ie������o��@�(�[�	�e������y2K>�["�I�f�,RC���$���s������
������`}�	���T�hK�J�X�Q�1���Of�v%�@}����@h����Z�������{����_�W&��S��
�n�*8�6���T�J��h��V&���Z�����KS`q\.��}�M�p�Z+�PUt��8��=���+�����X@�G��Y���S`����eQ1�/WZ�bg�]���.�i���;�m������1���k:�Y0z�o
��os���
09Q�29��Q���}���M�H����83�(t��5KK����qv�_�:�Z���W0K���2�&9�fo�4�J�\���-
�)09<�����P1�E�W*�W�	�&��q��y��mP�.��U1@��y�����6���<W^���4��o|�`r8(�29��ns���
09<)����gG���9�_�k��l7q�yn�j�����g�
6��y�<�n^����.�������������<p���U�Y����{	�����z9E	k�����r������\o��p�����R1��USoAS`D�~���[�����e���v�g/f���1X5��������ni�Mq��.�[+���L�_���`�.S`�����+'be�)�I
��Y�5�0��r�����{��S���+U�6�	�+�4�Rr��5T���a�f6�r<q���[�sX��i���}�/A.FqF����$�9$�5�d^�`�6�C���NJ�8t+a[�G��a����v��:��W~X��h�p���qV��a���,�������j,@+�9ZY'�`^�v�cG�c�{����Q�:���;U+]��kLfa+�-S`�8J�2���ns���
�L<�����0���u�E��Tf���L�Pt�<���|��u���m�v�J��V�������t�C��T�M��N����8�8qv��e���������5��\*��m�s�J!���~��������u_+Nm�1$6���	���	�d4�&��-[	��6��������f6��p�����A}Z<�P���S�L3`Sp�m��^�7�Q�L��)�/�������*X ��^�TQ�._p���V���+E��<1���^���j3���swo�|V>!�b�+o���,�/�=W�{{t�(mQ�e�������+m@����kMi@����m�>�9����0�,/�������~��TW�*Zl�v�$�9$���%�/����6���~K�U����WS�����0��cn�sr���K3�`9��p���	�m���\^�
�{�b##9�aq�����������r��F�/.���U�j�+�5j���,7�|�/��
y.��OQ�������k�-�	��Tr�E%?&��%���wu��������W��Cw�������]
h�N�����uxg����d�\T���#w��*UC��*_���o�ua��*��:.�N��,��c�����i���b.��p9s�
�o6���C�;n�pp[-P��[����[���W5������\��z�{�!�}���|pzt~�g�`^��.�����w����u�Wk�S���2i��R)��!'{�������V�9��X�h4���\Y��U$�g��<�SIC�u88M�B��1�i�O+��*#n�	�����Q�
���V�E64
$ �:�f�7����B`u�Jb���e�,3^_� +�*�}3~��ux�����jx��p
&G3�����V�����a
�d&[��8�.�<o1:������[��7�O3��uyp�nD���������E�f�b�����b����"�ef�nE�rTUi��E?��*�M1����D\DW���:"�N��Gw���O&1E"��{S��uyo�Nc��7R�5r��7q�U�c�dF��DE�QS�������]�tu�n�y��4��<��Q:�en����R^���1�|z.��3t�u�|������FMS�}:���DD>Gn�-Q�!Q�%6R)CF"�mb]���E���o�{UPD.��_t�2�u�$7%6���m2�����6��������|G�r2b,%�J�Mo�����d2����H�D�cq��Q�L���.���W�]��uy��0��HF����]�uBd]�uy��N��l�G��������+^�%+0�`����^��k��]�����w?���Yf��.��e��.J���k�����9.���d6����������9���v@ou���j�(���������v���e�[v�@���X��(Hs;P�����t��w�.@���*��W�6�����*);���^�j� Gu]p��������[���}4�p5Y:����]��c�����l�	���oe�����:��U�Ie���������j���P��B&/^X�u�Se������1��f�b�����b��u�������������:��f��
�0k�O]w�X��osk�@�.�|�L�"�O]���s���D:���f	���l������Q�������-�<\��r�*��5c���Q��qX+r�-�b=��U�vfYf�r0����I��LR�����df�����}�d"���O?l�����&jF1>��&E�h7@;���-�ks��J]�V�)�L�z�<��_�����E��By�P�z���
�,�7���y�r��PK����f~@g9p��h�#jjql�CF@����hkJE�������.Kf3�3P|��|��#�����B�$�6��h?�&�T��6d&z'��]��y*���(S��2U�'��w]�s�d��L��.�g*bQ)~��r���`��2�NU�w���,���	/����Tl�H563�����-U�����������9qO�[����l�,�$�|2S�M��W�
dc�P���]Cuy�e�]��;j�d�vq18<?88>;Yq�C�b������l5���5$�U0F�S�X�s�����|8]�oD�����U�����q_�'&���l �[i^�����8�.�i���E9P�lB����uFw������	HY\��H��AZ��4G�������V�����.�b��}�c{.6�I����l9(��w��y��3"��z�����������0�Ofu�����r���
0�U^��b��F�{1M�\;�O�������L2�3s���.O���M'5������D�����O�9"��oQW1��J��Q2����.��Q�u5����+�|��U����*�)�9��[�/��:k5J'��6���J~���� �N�����^��h�+���-����0k]g|Y~��T�)����8M?E7q4S������L��v��ULi�������!=S��ss2��.��i��6E�;]����4Q�^H���U2LHF���5	eNqu����E����LU�l��j������C��e6�V���������������'�tW��R�?��-0��W��wJ���{Q1��J�(�j���Lq�U����B�wT����]�'������E���9I�d@M��XQ�q�.�5�v�#	�T���v��:T�z�t�t��%��X�N���0��:�A�0GW����[�gCc��FdcM��HJW5����l$S`r9�p�W�L�]�"=�T�D<L�"�����Ma���h�"�7�3<*�.te�v�A��
t��7��H��{!H�.�cO�����#B�b)T\����T�Ji��>,�y�6���.D�����`�����~��dS�V*u���M���J�2�&���O�J?���cz�3��U�M�E������t�����������v���
�-����w�x,���[0L����xD����UZ��^�@�#KU�;}�$N�(�,K�;�eX���ANk���.�����;.�|��=gTd3/���#~���TV�/z��\����Cq�i�\��^��m������
t���;@q^���p|���R�����$n�s"�i����
S���Y$�d?�x���b��:R��IG�����d2��������m'z����tq��c�'=���]���xJ���]%Y^H&U�1S���9X�e�L+�{��Z���]������+\t1�9_.c9`��<�7�w���m��k�����;���S���@�{)j���.e���iit[B�������"�|yU�8�U�z��g��9]2���=W������)U���-���g?,X���HW�2��Y� F{1�L�%�/��t?��}_J'jT���Q_Lc��G�V�	US�B���x�eX���_�P���������R�����������0x���W]&s��j�M�Zu��L�8O������^0��F
����,`^{n��,�y��(l���T�{�
��4l��i��
�b�*rL��X�9@��01�����yh���8����L8��b?�]�k��P3K�4�y%�Y&�#k{3\1��Zt��.P,k	��^~�L������^3z�������n�	�K�TI���0�[�+0!��������Zyk���f6@�y����P�/��������L����V������{�pK�2~5�z��p�{��
�e<����
#p�y����fu��k�t����M�C�����5X����x��z\�[�s��sV
��(����o(�V��J�����pG;.e
$m;$�����,|�d@>`�P�d-_���)�y������[)��+0�<!mV�~�5�z��s������=|�����.�um����t���)9����gp���?�t�A@��$��{�G�{5o���P��2}hk��D�m5�K�l���{�X���_</D6[
p�=>�1���E���vE>^��2Y��T���C�Q���Z5
���L��bK���sEP6u
 ��f�{���0����'W.���DT��H&W����`�=S����s��h<OQp;�|Y�X���	qL��(R�~�,������z�-��l� 6&UG�'P���_������,/��d�9L��`�{<{lZ%�x�x��i*J#avf@�x�x������B�+��
����p���������T9������A���������E���5?�{P\&�#[���e��|�r�A~���yM^��wU�sR�E��:\mM�'���eP��s�����@v{��Z`�='�[��uO����k+�T�7Pf��Zm����Z9�i�bp��O{}jI���	�L`���C`A{jf4�w��X�=�m�8n�b��H��6�M�{~�aip�`9T�*;Ta`�=�\7T��b��qH��
�G�d#��[���������JI;���Ud6�L��t7��0��T��x�eI��@�9J���p�������T�
.�}���%������3�R�	X�M��q@���%�N�m���^3b�
u���jW���w����1m��'W�.�y���sZz�x���G'}��x��RH�����������?�
���~���e��r�Rk;��L^m��L����d�8�~���+�Z5j�����)�[1���md��^�bqQ��S�V��-|�\����5����b���-�%�{��M��%:_+�:-d\:��|�Y�
�X����u�-�����W
8]N��>������<���Yp����������oxr�F?���6-Gm�e?�o���lss}����a��|[P��6���}[���6����m�]w���~��w��p��&j��k�)m��7|���������n�v%�tu�����V���@���6�+�?����v5���M;-�)Y�+
�"�X�����{�N�\����f��a|�������{M�oV�Wm�@V��p�7�g<��g��f,^8���-	���C��p>�x}~v��������7'g�?��|�����������_sr��w,`��T��9��8�hf�:��Xxki'Y,h��DJ������{z��3���Y��i.���,�����]&z��(��Z����/����Vwh�w<7.�u��p=��8�]J/I��HS�T�s����Y���:�UZA����N���v���I�$D���=�������3b����F����^%�u^�c��m]���~��������o#1�E��z�C1���3@�w����z��5/����
2�{h���=�+�t��=�s��A���T��1�xQN�X�8�����jN�>��s=�wB'D�@�N�����X���<�m|��,�6[��;�<;���{��1�c#����\XW������v��e�oa�	_wl�5��Y������w�t'l��!��T������jx�����!�������P�x����� �����r����9��{�'�su=�~����<�c���mJK\���;QK��UY�d����f����X�N�8/��>���������e��W��t6@�K�� �Ng�b�W�|����;Q����;q�z����r=U]�
<:}<|}xu8<:��x�������S�J��E��})_������AF-���� �5��b����6
�P��:��^�-m�l��F"����z��2:�z����j:�X�m����������l�_���K20E	����[���T�XH�G�G���d���>$f��1�Fb�����z
P��
�(|��n3��3��w�[z���w8�^���������;]nb��: �X�����������3��@�����cHw�aq�]���L�4��{J�lFA���8�
�6��s�� M�e.���u~���w+%�i���I���v�%��}o�w8(��s�0|����J��Zx���@�w8r^���|�
����wX�goS�WJ��3};��p���
0X��u��T-+��wz��H������%cr���>>���_
O���.�/�������
	xul�[���Lo����P~�@X�I2������"��������;�L��
@�;}k�2�_-2�R^�{�S
�Uk8H@�w�-�#���j�i���[�6����'�Rz�S�M{q{��5c���A��������������~U?yJ�jAS�������T�q��j.�e:7�����6�d�@����qS���������k��7�[L����$~���]K��5<3|{����G�o��cU���
�N���o*%�X7
P�'��g���M�X�d�3�G��>��_-��'��h���3���{�t����������YF'�y����4B������K������\�M��Q���~��>$c*�]��8/��[1��y�O����q���	p��������K���D,�&2>a6�<�����x�����dM�x��.?�y��������3��@����M:���4����\]�R���:��k^�.O���n���M��iQ�K�]6��g�t�N��3���c����-]@�w�vc����A�i�.4�����i�/���k��������t����T���"�u8�e�.�A����o���E���v��=hI6R�S�Nr1����������`�:]�|��8��0���n7�G�����l#���h���
��X�P7bW/����x������{���e��.'��Jr�g�g��N��c�HoT$=�X�r��O[F�I>�����@$�x,��B��!�oQ�E"�LM$������_��2��
�����(�Tg���=������w�G���r�+�%����Z+��������&N'I���1���"�os\�v���������a���bc��.
|�U�n�L��������S�H����M?��D��L>-DG�V=]a�{������A�RS".��s�J��y"=S�r�������(~f�HF�b�G�����ul���c{�p.C�nn�] �]N8��������6�K��]�����E�s��/]0w����]�m��t��
,�)�.�k��M��.���AK|p���	�7����n����X�$��j��Q��t�pi<��mT�9��n���N��]�v�����&�&���2�.0�n��%�f�^iN��� ����X������'�JI�_<�Z���4�^��P���s1�]�Zwt�n�����)������S�1@��"3�?���I3W�T�����bS�Pw"��r���
p"�#��^�k�v���\=`�������l��R�H��m
����y�z�_-/��[�������]�v� ��Zqr!���A����Vk�
���XZcZ	���E�vK���K���4y��N���j3��I���S�=���m7��v;��X��D����|*�NO�,��v����&�
�7�$��d��5
�2�/�x\=��8�����.����v�7�R9����PyVV�L���^�;n�Ps;��7��EH�*��8A'Z�ho�����]9S�G�e�tNIX�x��r1�h����:^[�	��7N��gF���A�Z�n�����:A	�]����j�4o������z7��7��l�6�j�ZUY�pl�<�fu6wr�W/�:��NY9H�y�$�ME��E2y�+����Y��yiOY
�����E6�U��g\x�������*����ZE��u�������[Qy��	�JL����KZ�
>n9��|��F�IUs����-�8�`2]0wY�Y�R�y��$�6{�M;��II /���������y�@��M�V!y" :�Mj���j�r�R���}������3^��^{�~��+I����~/�PF�6
���Y:�x2����"��^%�fm��Z�|�SH[�g�G�����~G1����Zh5�u��E^�Q�-���]���}��n�<���w!N�rcb�����\T��y��n���c}/����U�c6����)�WI���w�
�]O.�����Z�����Y�6�\����rL��9_.iT��yix��<Y|L��if���.������ �X�f��k�W�Pgn|�
��N���Q2D���F��a��I��q"�4��K]5u
9��LM�j]*.��~s��y��rwH�~�^4p�6�w��=e��Y�,`���V�"��<ohI�u!��]j�w��������9S���x�&�����-.��x���{5����k������6B������,����F���\�y�m#�����<��]^��K@�w�$����Z8��I������f����UJ������k�����z6�#����%%'VijFv'���|��r�(������G��~K��)7�xw��.�\k0=��_��j��h[�t��](�����{&^%}�����&(�W;e�s�d(��Q���iK�.m:����g#�����4�-�7����i���Um�=��&���q"��0�-�Y�����N����F�S�'�tJ��Xa�w�G��G�$^x*���A�8Ki_K���s�8y�Z�D}Fw��41y���\Q,��[a�n�W�U�����kdw�=��xY#���g�=�i�@?{X/�VMC�5�����fZ���d�w�w�����{�&�/����2v��r�����]=�n���������������r�������'�8T|��K=G����Q�T�%��HvH������G�WG�����v���s=��,��?������"���V�J!_��+�2��#�k�N�"��='���x��xY�J�i3"S
�������!�z��\�m%e��{�o���Z�A���=������{�b�Y)v�a���A�����3���`���,U������X��jy^=�;��)y��=+��*t��,�� ���3N���������=@��8q����{AK�DS����NyP�=D���g�����v�{�_���u=m�;N��]�Z� �>'~+O�4]~��r��w:{����*zo<M�-���{�b���1+|���U`m8�6�1h�CoV�a����
���w=?��8H_��2��M��Y���X[�����~��~�.�����������l��B�=m��,E����i\j\���R����0�x��X��<��f���o9{���(�i��"������k�o�`��������V�j��d����m|�k����V���a6�H��-���Mz�N�����=��������� �������������g
i��cb��{�����=�����i��|*�s��Bhm���������Kz6�D-�G������Z����f���������o��=��{�E��"
�Oi������k��[v��������_���?�����i<�Z�X�i���[m!��t~�M(�x�^[�l7� �{m��5��j`g_�Rz�g|�
�7��'��r��r����7�,��-��~����E'�$/�=I��*U���X�Y)2K>�[A���YY�]��U0�=+�n�	����p	�����)�����Q��V��{�.�UH�u�����s����������?�^@��A�Z,�K���X|g��!�W�����/�������(7"�E)�D<�H|],��AFF����Z,�����Xt���8Dt_g���z��L���v�c��l�A��r>��"�7;�n�
�T���*\���^������[����)i�]�u��&��\�f�������s���k�l�S��#��\�7;��iV$�)���@�^[����T��w���������r�z���qd��
pI��6�#���A6z�;#!����$�w7Z�sB�bNLC���z�gU��}���W�D�L�
f���]o�~ ,%_.�_�IQ���0_�)����B�x�J�w_E$��$�w	��O�I%��F�{5#1�[�r�P�NF�icJ@�^[q�r8]��U"l�������3�=��-��VE�~�7�t":���X<�����v�b!��N�=�hh�<,&�u�du���v�N����'���~y��Y0�s��~n��
��V�����S����tu��������n=�<�r�Kb��y�P~i���o2Z�eJj-�{m�o5�6+	��(}��Y��|�;4,�y� �J��.���_�E��7�sS������.��?K���F�&�����������&�����Z����>U�H�w���QU��%su��
�l�gUW����R�Eh�g�r��-�8���������S�'�����0r=�z�0��D����>p\6�[������&��w��qW`����%���XQ�F��H������� Q�*%���\^]�=9���Mn~9}n���
o�(v�C��l�������J���>�=����\��/�LnR����`�$�L��([�\��Q&7�I&B�Q5��Z��������N:�]q_���
�,�l4�����N6a
9-Khgx]���1^�D5����&C��2���@�> �����nI�S(2����yq>I����e� ���6�NVu~��/2�M�\�p-,7�+7�)6��������+�Q�������Ls���.���}@'�-tr���AD�4!i\����6�r��^k��b�"^�;���h�/Ujr�\D�>�����6����o��f��h��O�\�pM�`���|�3�iu��[e��A���Z���ezUeKC����`vCCF��K�G��}�������#��vff\����
f�~������,���������x4�Q��Mn���S?�����h��,���t��t	��b�[(f�j���Z�����;�����H}8+�	W7�t~	�P�r�Q���W��8NY��(m,��S���1I���U���q:�7���}�8���^��>N�7�B|��a����r�c��l������,%��y
p�NW�$/�z�Yu��nIBf���ff�::G�������Y^�`�iW�
-s}%V��FQ��>�6�ssC��Y��N�)|1�\���~?��;�������G6��U���x%�\������9��T�����+i��"�%��X|������Y��������r��(o,�����W]��f��Z�)V�^hv��������}�.�m"���3��6�����\=�G��8
�}������W0
���XL���d��`57 ���v^��rI�/����\��7��z)�q��j�i�����L��������t��G����VI����@���� �D�F�G9)�si��,w��$_���%�z��Cr�B��D-B�$I2N��
���:%y�S�����d�C������^�i��5��F5��������
 �}�&���	��~��Wv���B�w���h��)}���M"w1�:��X�t��KJ&���uz��/��^��^�����)��]���B�T��i���o�)��(r7�-�p�4��+���"=�&�	�L������t��+���Fu���*����w����h$L4W�I�"�*���"UW�e�d���g�8���	���P���h�E>V*�62��Th5c���}�����b��8���#���<�U&�����"7��6�����F��\�G������WS���= �w�*W��\�����m=���[����R���Xw�9r���}���N�c�
?(�>KY���:,��O�"~L��@��%]�=���w�'G�W'�g����+�>��g!l�6h��@��,���9k�:��|����<Q������s�h;���n�M���h:ff����;)lo&�h��h����z�>�|v�IC{3]\^��y~G��%R�����B�>ytw����9x��Y�8a�(���;�y�>���A��~�z3e��g�,�z��@�~�i#��Q����8��Sx�>���6l�Z[����[�; t��_�h�p�)G�7����d��(�g6�E��Q�Fw���4'�wJ����I"/�����dB'.�w�{�U(�����a�z6�)��&���{W�b4������L|E�V$i���h��U���Wdbm� �G���f�����k�NW�����MM���K���W��|r�zxz�����k����UrP	���-MJA�l�P�G�����?�]�0�u��d���#E��G�������N�K
r
9^����X����J3};�9�Y�����������$���;:?�2�lE�C8�o��1@&�L9�����Lgx���s���]~w~�}�\]�N���,*��\+T��z�s�s��%����y�����'���m�1���6�kJ�3�r�:�
������6
�����J��T����3�t�&���>�3v�t�8���u�g���������f�L�3&�������dw����w=?���e�ru��5���r������^pp���=���t�������w�@�~{'8p�jjki+'8�
h��������?���;�A[�mJM[ ��������
<P���N���W>`����@����������4�j�8�S��Y��t�&{�W����P�p����P��r&h4*�J������;6�8p\�����%���ny�����`=:iA��������v�P��
gO[mi51��a�����r}���^��L^�W$V3A*M���������{X��.n�i��~z{���������g���P��(u��]������������
�@�������1Q�T���h�p/6��J
�=��D;iJ
�=��Ac=��6@�AK*2��j��#��?p-��)���p�S�C��������f��p��z6�1����'��f������G��'��.����pX��
���~�#O����"�@�1
@� ����M��1g�j8�[.z�&�<�� r�P�������W��0z`�����Q��}b'��G������(tsZ�M(~[�x�������[1�^�(2��;��<��:>��@�6�Z_�$z`C���q�g��wc�-���;Uw�_����'��sB�@2�H� �6�����4��J3SZ�U�<x#���'�\8c���H�!���Z���E��H��$��J�}XYfU�Y:J&���������E����\�Q ��������dB|�^������x���D2��m�E<���9]�����������7`\�*�dXIfU'1f�{W+WB�)���I������F���R3j�8����gVu� 	���D��wG5�O&}��+�,��')�b�Q���plhC�k�<p����5'�����w�o�/�,�qyj�4�����N�Z�x;��c�K���[(R�<@����v�tu��Q�I_I>z��j��Jk�$o�BvE-��= �1�g�
��6Y@K�
����1��$^�.���BJ���.�_'�SJ��R1^��%����� oD��\�JS:@2�/�h�x�����B������o�`G}S�qW�1qTl@5%��F=�`�N��'��z��
]��y�=�8~��
jz�Q�zz��BM�&q�-�|K-�.l��y��������iQTl��)�p�L���J,�O0�'�=���QR�q�����i��l�{���z6��Y�n�;�'��2��H[��{��^c4��N1.P��-)����w38��-� ��2{���+�����������������<5vtu;a� �n"��p\�f�A4�I��&�	�c?]�3�g?�����5c/���U������{� �*}F{�`�*MR&l/��}y��D�����f$�	���s�3`�n��{,���!��_�Y�e��{�V8�JNF��+��9��J���2�r��_������zm2���y��?�����a���zQ��1�q�_.�k�0n&���N��77Q�����h�eZ= }�[,8e��k4z�����>�'�������fb���y[3�r�)���K�|�����������Nx��m�����*Z�Hj;��M��dy���������F�L�����:p�NfR�i-m��!��?�	�:�p:�8urv���V������&'��"���TC��Z	���$�a���!�m+7�d9�Bt)b�g��Q<p�[��I5�;���EJ���.}����:w�A�����|,���"�]$yn:m;SO�;:$�)m�������U���e��.��Y<Y������B�+)3F�)������.���T"b���(�������6	���W�y\V�!��F���nD])b�Q�I9�Zm��09���q�>$��N��I��J��y������l�LXZZ�hT�U���U�f�������E���3��>�k�s���CYfT	x�0y����|�����*_c�i����6����bH��aH]L{�s�,������������#����d\1F+��/���q�V�������Y���71�(�1����C����������D��:
��*�jb-�����JW1�?(2��]Ve'��sT#G`��_{��o�N���1t��c�(V�(������Jv�~l���l1�'F������kZ"�B�J��.�E)E^��pX��,����?}���D�����)������9��o��>&�+;��5��(���}��.���p2��y��u������9���d��8��}�v��>�����U/��f�q~��m����y���v�K�Dr�>i'����U�H��JR��\��j���} �P�f��!at�Z��~#�{�<��;ik=�m��P������y��Hc�2]�3���[S�!���x�{��\��^}�VSL���)���8!J��v���v��>�~�\������?����a�]od���!�������B��z,����Y�z8|������#��^�����D�uLvt�>��Tn���&r���/i��:���-�i<���l��'D)�Y������'�3��y�I���/��	�r��>}��#�����cO����_k����"�7;$�����~��y^��|���/X��57p3�Vk.�v��za���1��+����IQ)]L�x�B3y����������f�����C����F>�Yu��4�@��!\]
bG}3������q�@���J}�m��&�������}��W�h�|e�����|����/w_���/�H={��3�i�$u1��v>Z��B���q]h�}�-b>YNg�WF���#������o2�6�h��������2�TT.f�b^V�uzy���M��D^�&R'G��4��
���k��bU����_-I������%YA}��NT�fR@�Sg�FY���$�=[/~�g��1g%v��>��C�Z/+�E�b'�o���.������fU*�b�v�0��Y�x�)�����Br�,�{~5*����#��������^�pu~�z�{!�s`���~���n��j����h.������\����>W����U��o���>rk�v��>����m�����R�u��+h+�m9���m|q�q�j��s�����m"������&j+�"�R��1GD����v�J@Lx}�j0%��W�����y�w.7l��.���_�Ee�;�O�;�3p�6��d��!��A�cs�c_!Mj�����ba[��o�
�3���])�v�3���jbr�mR\%�
)����=_'��8v#��lp�x����HO,�������l����s�(�	����c��||����^Eu�E��@
�*n���R|�������j������'�	���o����3`��^I������m<�)%~���|�F�e#FI"�G�p�>@�}U7��{��WNCd�);tO�;����FQU�?���?hE.��3�z�f��i�;>�s��~[�r�AC�A��s(���)6Mr)9�H��`�����	�o��P�m0f�i�������dO�����[;�fu���_�8n#�W��w����I��A��k����4���\�K��l#`��m@�>��}�����������b����7�~��V����H�:�v�g�i�8��s�3��.����l������l���7��.�.
f�A����Pg�uVO�`w�p
�p��R"�Y����o�j��$�}�6�����7�$�nVQ|�;��1�D|L�e>y(�m��//7�/7��+h����}�c�U����������,�?7l �>�<���e�C�-��e?t��I>����Y���y�����s��a��J�mT�����n2�K�]/o�m9�~hY+�]N?��TU�(���������i���fD���i�\$j����EM;�#�B�o��u=����w�TdZ[Q_1e��>���u�a�L�[���2�S�Xb#�	��<�r8����p$�P��yy����tc�g~!r�o|�	�����Y�����I^����.����N�'��l��ZR.r	>@�}�Jx=������GW��������������?�<�:^^^�i}@)��R�=x=|[<#ay5Z�$Wc���JN���Y.�<�m�*����0�]�9v��a�I���m�V�]��U���r������O�$���4��N��
�Kbg�p�3�s��F>������)`�}�����&[����8�G�l�������F����p��������p|V5�]/��K�����b`\��>-Fw��/�J<�y,�����QEj?7&��m��R���}5/�}L��F�}�h����l���Os�p�IJAx��lN����h�*cnNV��w\�J6�x:��(�w?\�6�Q���(_����?�j�����;�a�/�e*wiN����@M�������M�,��L�~_��|��D�z�M���fk��&`j�-L}�b�;����[�qHj�k�m:����hh#��l����8�����2B"�L��G{�|�^���|��P�A�G�%����>�J^�i��)u-�U\�����-��U�\�����jZ�I{G�������H�y6#})���HaM��0�|�U�|W�'���_��mpc��@n�{����+w_��T�q(u����s��/mU8w��R���[M3K�zk�$Q�A,"���6��"c������.ik�"`����9���d�Xi[�s�Z�����	�;�����{�g���M�z��q�j����op�@c����0l��^=]��U����O/�^�u�����������@a�}��`�}�r���#q��Gg	j���5Mi�4G�J�����i��%�XVdG.��7{������
��E������z����A�5���u��e}���zD
�W�W�0g��^y����p���K.~87��I�
�32����4��&:)���a�Vp��E��i<����l�BTk8an"`f�ff�>��������H/��^�����pG���L�t|��*H+dV���K`�ORM��U�����4������$�j`_o�����(���X�@�q��7������������_le{p:����p:6In�t'ba��s;u#�^P��G����2����4�f���@)[���y:;�F��/9t�}P�&����X�qf�9p��W5�(��/��[M��	,�3h��l����~,��f#��������������9p��@7[O�.f#�m���9�[vz�98p�@-6�l�)���
L�lNX�zk���=Rp�5����i1-<J��m4@-���8��#T��M:KM+����p��^UiK$6�s�	\��)����CX�h�2�������:��U���9�
�X�f�	��Oi����5�P���-���W�~F����}��t�_*j��������Q�
�93�x6.���y��+�t�N~V�)�Q#��0��h�;z�����!����=[{�c������N��o�>�m[�-��
�pj�F>�5q ���I��i��O��t����L�'!�q}AL:������������[�$-E��Y��o���Ym�8^����#Xx��R�q���]qG;2"��;�RY�w�,&'�J��d=g��)H|-����t����M����"bh�ad����9lON�����P�!���C�KL�=o�zbI����IO���t��Br��V��
���\Y���������1u��XLtr���	���,s��VG�rm/��R@y�1YP���oz��o8IH�^%)���������r�	�x�bh���*�}�m��������k��R��v��v�m����������d�{v�����"�YDS5����6�!7��T���jw�l��^��9q���K�����R)������Ad�QFn��XE�w�74��<�����H�j�'�����i�����3xw���� ��RW<�X�������o�9?0f@cV�t����T�{�Xex�k����>�����ZU���:p�@m�j?,��O������7���r��������C9�'��wk�`�\��s��lY49`��]�fIq�->�{W2M.���3I�K��Z����	��[�0�����3��������*���P�������v��`9e�UJ9���-�x@���\�;�g�����cTCs����
Gk@6
��*��������k�D%�� 6����=,����[G����FpS)?@��f�3�����Z������{�N]���=����<(�����F�G��$��K<!����+93�[� ��p�6�s�X\aZ�r�T�P7��;�lx)@'Nb������,�s��9p��7j!�O,?�M�#�?�,�?`hs��6�s�Z;�Q��z�-o�c�LG�U��sud����[�W-+�\��g��^9���&mT�L����z{�u�xET6������z�������4�EN�dHk8�]}�t��4H�+z.�2zw+���,��d�;������^a�t��@WP���v6�����
�����?��_O���/.������������
���.;����n���r?��YE&�����h�����$VH�0��S�X��;�I����Y�q�J{^]��<��[��~fy��5q��F>�������M�{]7bF$�����	����6��y����"[Q����� ��2L�����b4�����>�}'��[1�K��}�3i����,����j�%u��qX�1~��k�W�
/�w�PaT���v�\�hX�2{U6@s`�����'8���U�J��X.X��������;���GN8�pT�O(\�:�����zQ���Yn>�#��4X@6��$�v���X�28���fO���$�f����*������
���M����Q*S>I�9��Y
0cG�7�o`�|��#��+Lg4x�����;To�J��a�f��m���vrs8�����
�vzC��� �2�]���c�9�GwM�M!�h��	�f��	�|�40�/�P�!�o��IG ���hk-w��^6��v� �tRFY�f{�����mj�?W�9�9K�7�['�����
��
o>�<\�b\�:��>���kE����Ln���������\��&2�w�"��>�i��a;
�tCNq��xwa����Fz���)]:^_y����?�\
�����_^}���=%�2��-qQ���������_�\~sz�jQ��������W'o���_�#mH��Qr8\n����;�2beU����p��x.L3'������8b:>@������7^
�}��i\����m���}�������
�����X������������E}����V�7lM�
�8��������w��>�P}v������28�D��|��c���D��I/��Hy��b�|�}9o�B���������h��V]����
q���a�,�����v^@����|�b��u_xW�9��@���X&�R����	Mke$���bVLn��	�VN��2�B�E2���;�G��@�C.���1-�dw1��N	%�-�/P*;����t�����hxy��E{"tq����~'.dr���U�7wB�%����\p�q�Uz�~{h�0t$���{�!��C'�X&=����{x���'&����`sI�9B�!����~���#��#
X�yX���>���\�ef�v����p�^fZ-/]G{��3�1�P����s��0b�����
9���[��y)����!�wC��5�4���z
������X��^m�z����N�$�k�	���nC����<$D7tBtE���
����z$)�yn���>�1��n�q������l��u��h��`y���bE�6%�����	rH���6c��7�Wd��Tg�7�Tx����Ft���������'G��������P,���/�Go/M�7�ohW"�b2?1e���m��7��Kf����m����u��P�������T�r�����M���������h3� 7�W�uzs�S%�R�<���SBe�� �CW�����\����p��E�x��z���0�!���u�ZV��N��+e6�B�9,�`H�������B��+t��\��8���-�8��c#�;8N�����
Y�wWn�z2�)�I�f��.��7�?N��4�:�\����d����h} ��w�o����7d��I��+����6kv�z4����I�|�4R����M��#:j���7�����+�St�"^������X�v�Tj���v1�-�qNBl��lf�!�7�q��T����(hI���@:��&��C�.��0�a���d�~��
9�r������9ms�t�uu�D��l!������7Iln����$���M���u	rH,%�m�]�7dtJ��n�V�9�xC���K�����
md,]7YA�$i!���T�[����3"�	��yB�Y������Q%`�6Q^Q%��%�ly{'��c�Nh�nue/��\N�5P�IH�N
[�#{f�Q	;��q]n����w����.6��X���o�	�vH�e=��z���;�9C@��n���`���1id�!��f���l���H�8}��U�m���vC���@l���b�~Ug\�~{&-�t�:���z��mb���g
�tkU�d�s�8�;���
@��K�!�x��p�/f��KM��w>������'��cf�|���q�{\6����\@��g�bC���Pl��������a�����`�b#G(6Pl�&>�?�VXD@��Qn6�k�������������Do�������@Dt�6:�'��6�p��	�6�����aq�m���������hq,l���^�8����)ns{!tl���F>��pl������@��W�DzF�����\A����W,��[�X3�C��:���dF4��(�k���E�J��>���g�J��A�.�b@Oe-��:I�F�A���z<���'K[�P��{�����D^�Z�{�n������@��A���
��Fnb��*�;��Y#GU�� �47)C�����l9���|��n�/�S#�O���k�@_FaL�IL��5HB��js{���?���6��m��@\��u�s����k����e��@_#�
G�G�U#��Fk�|+����z��f�X��*�1�v���a,%���|�4�����,��w�
���]���xD��s��O:��=�:��1F�d���C���9��$�'�1*\G��*�����"K�����i��8�RH�
�ZN�:���XQ'&�W	��M<W�kxR�Fn�j����#��0<�n�E�^�������^�d�@�F�|m��|/��:����-�|TA
s������R�o���nB5��Zu�5��+�XJ7�-�Y����[R2p-�Q�,����]k�ZM�I��>O����*���w�����6��o���X@g#��UJ9o^��.�.�Y����M�i�F����<p#�]�g��H��dO����fy�^�}b����bYq�V����u�Z�rT��8^��U��U+���4odU�����Q����^O����J�I�T���\�(r���}�:�F����U~�t�k��/�y(vH�4\�qZ�!�����oc]���;�x!@�#G��`�'�[%��S���Dr<j�C�K��0�Q����
:��[��F��:�X��X�Y�/�8����&+�^���ql��+:�2%R�>�B�J�*-���JH�(^����6��u�������!~?����RqH���9�(�����A�W��F�0��o�x@�
��/p�Q��R����]���Y`_���x�.�o��z7��3�x�v��~5m��Y���$g�8�)X����
�V�`U�l�����$W���0�2��%��s%�y�P�"���B(��Q�5�j��:+u`�!p����\2`uBw��S�<�����:�	0��I3X�b��n���^������S9o�7.K?�z4�����7�����b9)7V�:;pq���F>�qdq��T��p�����1���yE�6�8����%l\�\+5� �Q�m����[�yE�E�Y�����EVMi����G����������r�%7�P9����J���3�lZ�U�lg��:�D^����9�3F@"GVYU��vF�MS�����3��M��Y��}�N6v��*OdE�U���IyF~���E\$c�vv��H���*� w�����'t�a�
�+�PpY����08�<�-��^���!Fq����&^9r��u�����"�4G�Ls�����
�=�gRz��
.nn���9�|�|�8.�6[�\�9O����������&E=���9rD�#�6G�VS�� Z@:���h5cS+����������O���d]��h��<�����q���O����������<v��mQ�t��������p����:O�����I���;�j��_w8�bc2�u���w��O�hIQ�f^:+g��s��
���o;��p:����t���A^���M�jq7�����T.�G��,�O�V��������P�F����9��q�j����^�w����i��t�p�3�3�6�;o��K":���{�e+�����.^x���L>:�������w:�i�������"O���>u�����g�8) �d&�7�mc��F/�[3{`�6��m2���=�H}m�,��'o��M,���d=@w8!\J�kq�*�+��i�]b5�gu�$$C6+(|�����������#&r�)�R�����1�e����2���([�2b"O����O�2��xW
�������A������O6�$��;�2��c���uY������Sy�Ea�pG	]^^����}Z���b#�/��h�
�����r��{�bAmnH���Z�`M�/�����g�U�~V�P��([Ntl�P��%�]m�I�U;rR]��R�g���d��I��X$3Q�+�V�kP��E��r9r�M�/�������D��UMb�#`�;V���Gw��&�=n�L����+��4�r'l����'��;���J�`�����#�`�����_�Y��(�����������.YRHhC3������� �1�wZcE�����]SG�8~wzrtxur~6�<=������R�M�%�����Y<��D��s�j�3�\�)�u*��/��$��w�gt3L���?/�77�S�x� w�MC�J:��0����n��/E�~�L'cy��X6u2�,�=�����F>�5E[�����w���c#�Ml��iU���Xy�]�-�����AHo���6�K�q��[�-i�W��d�����\�-���G/���������&��/�i�W>���:k ���m�2���]_
_�9|z5�<�?����p��#�[{����x�/�R.�����[f-�G������o�	��-9
��b���b����>�;�p�6�>�h�z���p��&���X@���;���5�� �;������vt�Z�z<�:�a(n<@�;V��zR��������r����W=cQn����c�f��1�0U��i^����0��a����%w8(��x+G�`m$��I���&M���`
�<G���;]k(Z#7�p84X&�+'1�������Q�\a���:]��*��quR��b����;V�x��LU�������\J�D�Pp�+�lN�H��n=#7~��Tp��h:�����;6x�8>p��z���_�}��:.��8ON��cQ�4��q:�X�ad8�U�����
��_Q3)�7Q�� �o^F��{"~���N�������x�fWh���f��X��N���_���$+��1`�;=�}!@w8r�J�*������r�W�k�9	.��!p�����;=�xL�
�1p�	\�M�`@�q�; �l%f��p�_Hw���L����"����L�|2y���IZ��w����0�	/sy4f\�x����m>����I�a���1 �;v�b����w�t1E2N���C��a���[p���5���M���@�[:2�w7�hUZ<������gg'g��y+i�v�nA�;���pt.ce��m��*�WS���$�m�<�fw2#�Z�T"�*0�cswUp9���,f��R�fS���{}���o}o�|L&r�@�����q���G��"}���k�#m��vl��

�����e��������D��jOnj���������&_jt���/�r��n�
�?��v8	c#�yl���Y|�c�gw����O&�}�X#����<?���x�$�o9��p��\g��i�A���z�ZL�o�y�-����Z������m��=����wr6|sxrz�z(&2�����-�m�.�m��>�������m�{	�_��������wu9��Ke]V��v�-H_M���?�&�
�]&v��uDz����I*�-`���.��l�c�Q]���<^�]�����N��Pw�Kz�
����{�!�n������f�r�R�F}�~�����>� ���*�;9�_��&#bq"���~JN{��>�]��]�w�z��+�F��;����b����C���&������a� l��q�2���R�.��)-������|%��q�Z����o�nJft��*���a�i0u������k��]$�,]
O�b�Jo.�"�M�_��������8�@k�0V��������7?/��:~�~�v��|���Y�<)��|(VM$���*�vO@@w�t�����f�qP�]G�����5B����8�������IPt���h#�n�>+����V/�M&��n[�YV`���9�<~sx�������n�o�:miR��=F�����Of�i��E�y�j���PA���n����������e�v�'�W���>� ��
����l#���t�
Wk��IV49@Xw-���~�!f������`wC�mr�v�9�VwC���Y������?'�w�Mz����lxWG-����-xNG-�. ��H������l�`���:��Zv{|}q�nKkHw7r��u���uD�����vv����%x��FU7j�����\#���Z"����n����.-���E��������4�1�:
G{��I?�����ba�;����/������s=��w]���\sl�W�Wu�fwj��v\u���cz��@g��������.����z��IgR�A����t&���(������w��xI7��m3��w�b����$��o{��\Q�� D]��w�����z�����@�'��Q����.R��	}.��
�1��!�q�$����}���J��a�����4�R)����	]����n�<�a�'�b����������g]�<z�"p�P�d��]+�^&������F>�� i�rEZ���/�K�����	����|�
G��;=9;~wryu~��}�z���
��F{�w-�o��x��cxv�u�:�������W{�ZD���G@�w-r���|�Q���.�.=�_�EJ��]vYL�?(��A��d,����=�D���$/.��P�c��pBj������f��e:�O,�xI��Ta�5x�3Z�?�kA�w9���x��#�����j���������G�:J�
��r���pDPn�Lb��?�#��'I)�1G�=t��Np%���"�Mhw<��?B���\`�F��Y�����w���I�����vfr5Je�y�4�j������%����G�`�N5j5������dL����?b^h��E�.w@���]$�=Nn������<b���%]����z9�i�6]p�7�E����r��������Z�z2q+Q�w`!h�������v�]���D��������:�`{���mZ_��Z�;p��b��~��`�.��o�iY�w����}��h�"�J�>�\����=4�� �����U������x)����� �o�T�g��}����*�Xd����,����'�p���}������_��
W������k��\�����E����,�WZe�1y����vB�J��d�S�=�����a�=���7����x��6�������h��}N�`k�S���H����a��?�����F>v����x�/����]Wq�eP��&�*
�R0\�����@RW3�4�relT	8N7[��X�"e�qy�p6K���2���|��M�z���&���<
Iad	��w0Z��u{2jB���;�Ong*����������{6�|9��[�!���7�"/�+}��<z������r���_%Rw_�W]�|e�����(���x��|��p�H��k�s�=��{�����d�P6�I���k=�TP�=���a����\�A����i�M��4��a�v�l��-�/��U�M���"�����8���y�fe�|��w��������BL��z����$��_1pV��2���r�|��Kx%�-�w|��a��u��\k�q�������}k�>��u��y�%�
����������_N�O���l���_�i��l�=P�S��"���j3
��j��=G��wC�vH�xwX�(w/�x�y+��{.��4� 0������u����a������$�t��X�+'�,(�����H}���s��i��v�>�'pt/t����mk�6�C������{�b�q����p!\.3�u�����y���r�
V���M��<iw�������W�a(fe�
U�&��<@�{6ylJ����=�,�y��r���j�s��c�=�g)
����������l�!Um�;���q�NG�qb�U2�zT�����'2rp<���a�~���l�@Q���E6�Q|A�p����Q'��8��V������i����� ���}��cRm
��k�<"*vt��Gcg������?|�����l���*���e�W����-�J��up���D�u����=oz���)�U��
�RCf��I)#�J��AyWJ���\��6�^�:N.����'5���i��'����"X������=����8�������������R�1
@�=BnL�U
N����q���_�r��J�,�)�z�B+;i�C�����g��z��a�`s��6��Q�b��:��G�����T�������]	@��B:+�>�4
p����=[�`+��u�r�\���I(��{]���9���g���H����3a�LP*�D����E<���V������8���g�u]����������`�xT�o���ga��}��7�C(�'0��]��k#v^
����v�L���v�	X����6W	����{~l��^L�4<���
�*��g���d(�����Lr)�y*4E�T��p3=�U�To&�'���1����x5�.��[9�5���M${�>�+^����'�
F�+(p��;\�y��7&�t����,p�*�]K����^�Q��(����V�p��
���{��� �X�g��+���K�p����7�o�Q�@����vs�1��f�+�]/�6V�q,Xo���>�&`��=G����{�d�e��Y���i�V��aL�tX�e�"s���(����q	���3�k�����*���s��no�*��m�e.����9�i�����iW�!�{���Yh]��N���YU�R�-@��l�oc���HU�"�=����> ��Vr��l�}��.�>����?p\N
V�����U��}��5��{�>�uM�p�}���9�V�]�2��6x�*����R��b_�$���r�~]��}@��9
lU#�����?��t]�	�1�>z�6�l}2�To��
�	5Z��D���������=f4�>���-��:��)���d�!�3����l�'���6����^�O�o�E�z�e8��>����Q����}���3��\^��e�j`Y9yX;��&��2��������<x��o�Mt���j�����}@��9���'����[�W�I�'h�`����;��}@��m�������q��G76'J�#m��x�]�}�v�����wv77^���Bv��� [�B}����}���-"���U�:�Cr���
�5��i���la,!���[������/�����~��P��e�m�S?6�����-�5�%`�$�A������|�$c���_pu�G]]����q�F>�5�����p���oA_?� l����f����>�6&��p�c�F���?�r�m������,�c���#b[zOn���>���������0���Y�������
	���dQ)-����LL��RB�|.	�I>��?������������������?X?�G[B_����G��J!��iV*m�p����+�d�D	0�}�a5&Ib�[4uYc������-�@��k�"��VC�vs��+����	[)�et�V�[dt��;TB�dU�_��_��yJ��F/�r��D��p8�e�.�5�Ta�.��_��:-T�$�YO������9,��fX���V���;�@���rmT���>�U�mqU�ie�r��j���6Q��n���	��o��]���R#�r,j�Vs({��c��1�F�dh�B�Z���������E����$����D�j�e���b�H�7>�<��	8gj�p0Zj�\�
5����[�]�|�T�/���F)��]^&���t7N��,�W7$����8���{v!#Q\2K�b�+Y��F��7N2��tv�-���(f�yw	������y����=�|��"��n+�_9���l6�>�����}�l2Q3�yFw�x2y�G�f]����>�N[�jD�D�����E�HAFH7�$1g�D��j������{[�n=��>G��g���
6<�q�z�v9���q�\���SI�������\���K���M��>�?�,�i�V{���C�,����o�kj���$���)�C.1�}�������E"�26�T�w�*o&����`�\��x.�^`�Z�������@�����6�m<%=��������i|k60��#%�Dc�n9��M��&�d�Inl-�i�B e��R������0���������}��-���,4�@����I)��S6��RL��M6>3 �tmOG+��#���c�8��U�rb��,Vyiv����s��+��6:���	����*�Z?�MgE7���@���V���'?�Z�k���z�*O��j
98���ve�����;
�>���A�9�dU�|�>lp���4���d��i:�t,IE���������DkP��
��'[@j����A���z��9es��F>vO6��Q��R�[	�(��?�EF���Fk����������[ sp�rO�����E�����Y���o>�M����|���38���R�]u�J�].��?��������U_�{k���2�j��9�������1�{;���6S����������@�V�������BLt�a�L=V�S��s���q��F�F�/;���c}��y���<��V,�j���`�:$u�!�F>��-o��oI��?C�e�[�p����G��`Q6U�IR�=�����W��r6~Q,��7�o�����F�Y������a���`�	�X�m�~;�� �����5������Q�s�����~xr5�����xo���|5�g
��hU���oN�_~�����j��"���Dp����w�?��]�����n���q����
�-�����,  ���R�].����2m����za@��UK�<%-�!�q�����*/�&��H�ft�b���IaT	8B+���������b����P��OH'���Y�1�=���r��b�N�X�we3}Lf�q�)	Ld���(oz���8���Pc���N�OQ��>�X�x�qS������_�j+H����P���e^L��U�j,�JUW��J�Q�e��zc�z���F,>��B�A��iX���56�(@\Dw7���&`��%XXb��z���J��L�Ux8�V���`��E|��JY������.�G~�E�W�j�������;�!BG����`�pm<���#3U�
lj�
���8��v��n%�%�x���&<�)��9��<�PS}���z�,i�p�x`��x��u&guFD-9�����l��$��1 ����N��<�8d#�C8Y�E#7������{u�3��JZQ|��`b�B�#�L���{��u,Xge'��43�k�����b���<�J��}P����1@x�i%�������}���$�����
���~k�l�[�V��I�� �R�Z�j���jm�V�����ZE�.�������'���|E���:;�\��,@��"4L��E~��{�{���9]��
��n�~�
����_�&C�%�@�JX�d�@�
�-8���z\�1���~����m �C-x3h��V��
<i�M
���j�T5(��>�#;����^?���'�����O;��������6����G~z��Awk�((��p���T��Br���e�]z���_>�F��|Y���.r~�p���6���G��h5�|8��F��L@���4Mg�]�H�S������D����?��#�\�=���{X�gK�3��mh�W���������M����vz�s���!��YOr�4tO�l.0�0����DML��pwF���[%�OH��i�@��/�W�d��{`��5F	�i�4x��`��m�|�j��]���{99(;��x�����P�q��LoP��r������d���f�g���)w��,�e\���wS�{�G{^�gGqO�[��;2|�����������l'>���^��+���8������#����&=��`��~����"����7�8[#��	� �6B����|���2�zo��7/}���O�qF������<&�t�;���+�~�N��p^V��]/��������}L��M�)�F���9����s��{�4�e���&�^l\��}0��U��"�D��"~F�d�n��yy������#��L|hL��>�J��C����zm���p���]��&.@���������t���<��[}��������v����;b)�I��`{�*3����=��y� ~�����v��"� r�::��]�f���]k\�BV_I�?�B��r���[���3w��VOG���~a���C����>�M�C}�j������������^�L��v��>��k�9x����v��>�^���5;��5[?�S��>���<������y���Yw����`{�Qp�n`��=�+���PzFd��
�7m�+����Vz&8�����g��.=#�/l��(�3N����<}��c6o��|���4��|�'�T�����M�8L|�`���Y����L��������m�y�����=Y��,�$��=��8��O+�Mzw{x �PS��������f��5�!��n��q�X�����{�^��[�yUJ�)D�:����j��zY/�a��#8WcLy^\�]��G��������I�GL�rG��q{�,�nf-�.��Ne������`��]�]F��G��7��D<&{�VF,@QX�|����h�s6nDF��8s�0P�����]��N|��l��h��r�:k��dzU,kZ��T�d�/��Wu����1��h���s�l���33��"�,�>{ ~�B��B� �H���>��}�9-����iUks�"�<����J�5�vv�>w���������l�?U[�����w�Zz
�]z�*�����tyQ�R��;��j�����������N���5m�z��d#��>����wJ$��]Y���t��Cja�y�����g��������C���c���}x!+��B�>8Y������y;�M�����'R��/���"�H@v�Op�W�	�u6�>%n����6}����@�l������6}����w������7}���s;�M�o���(�NZ������U����o����"��d����������U���:��t_u��7R��y���FH���-���	;��a����ms�v��>��(�v^�>w�;�L������2}��S����r57i$�5��lZ��0}�%e#�T�m���)9�5�����k4*v��>w5h�W�m@U�9(w��M��v��>��a6g�����i�L�{��^���:l~]<E
�ns����6x�T;�K�[a���$_��9%�J���t3@4"2`�t�*�N��~0����'h�����=�����1��v��#j�����)�	��V;��1�*��'��
q�����.
��Gvj�>�>��0j���v�E|t@q,��0�v��)�����0��b�@��;�]���"�Q�F:@>�;�x�`3)n���1mt\��0}ng�m&EB@�8Gi# z�8�g���� 0}����g��A��<��t�.�B��c3yK���(Gm
�j���Zx�F�wJ52�0��6�}�@�l��q)��-�>Y�d��6���+e���y$ LV|��lf�"	 Il�$i.�.g������n�#~�.���`���mX�1��W��wW�"%������<�Y��g�
�[���v���o�]�.�pW���LJ���u�?r��>@P�]�O�)	|����:;��U_�(�����l�w:c�n6���>g�l���F���pP�j},�oN{eV*������%�V]���C:[���8���e(B}m"���--d�0V[����\�j����.�P`�7"9^�Y\_��U��S7�b`��wX��0��T�b$jW����~��� �����hjE�[�����)��]��l���W������^�d������y�<��ys�r_!as�����a!��+Q��z�	���k�t�f�l�j��c�/��o��i�.���������<��h.���=���@�~��P����\�7S���;��</�/
������~����E`�>���q���u+��pA�����P��J�@|�i�l��b�����t���`_S������=��U���� (������c��vGV�}#* ~6�eO��'���g3��:}O�s2z��]dz�"�/���#�����=�h��}��]N?����M����������#�Md��V\{S�yeq_Q[eDt��m^gO���_����/n}C�D+�qw� �/�L���J;;�H������N�0�Q���T�_9�lN�N�J��L���\���+��4����a�	�_��D}[,on�����6":�77�m.������E�"������B������W�a��-�)�ZT�T������>�-D�R�WOn�\����/i(I����U6+��I{P�>��^d����������p'�`3�������U�N��D�hd�H+i����,U��^/�����m��<�W&�������u�-
�jF|@>�v�r)�R��qJNi+�	��E��h�[d&g��Z���r���!�t��h-���)�Xr\U�;Ul���4����������<��q��f�T4�����:PY��T1�=43P�tY���:/��/4�?y�z �t�FH@Gm�4�B���p�xZ}o+�<*F�N�5r�c�.��}�V1������C�_D�<D����l�����Lh�W�#Q>D��P��<��'����.��T
GK��.���)�UQ9D
�o�X.�� ���:����.�U7���v���J��is:��wF'�=
#�� ��*[��>�G��#������v�-�>����J�3�=��hH~)Q����L5Sra����&�o���U�����^�Y�������WMn?�B��f6�|���
�-�������<���/�����=u3e�k]I>��T��X��b�M1�*q��Sw�����'o_�=s"������:�Ojh]6=b#��,���i���U�HE�7���m�Q.��	��5M��C�D{R+��
(&��7Q���z^L�BSS�p����XU�H
����Z����?~����]~s[���x����a�,�����6����H��-��!��}�N���/Pw@�W��.#K������W�Y��VE���j��6@�gZ������:����K@M�#��r1��O�����L�f��?>@�����%��:��}���Q���os�6���v\��nV��9�b3n���Y�w��N�YW��v����\�s7o���[��4U�8u��Z������X��?t�����l��t��]7���n�9���$����{�B�i������bQ�v1�,�F�u�c�����}�"��������I��&D??�S4���}J����E}��B.6��������b�����l�J
�
��(���v�I��c3h:u�*���>> �}�|����63�����������a�
�N%OO���V�8�g(�{TX^S�/�$S��u7�����&���t��P�o��{�.Yi��j�K��$�od��?�*��r���Z�����A�G��`�5���t�}�R=^C���6���Q 66�|�@+f�-�$O��|��d�#�|��d���d�y+�
V/jzwBl��.����# Z: �YN���/8�|u�{�����x�4�_������d[��������@�+�.�����������:uD(]����-�W�l��c���>e`��b��V�����`���7��kFp4��B��!��u1�����d�o��<W�{����������P�O���Za[�tG`��l��{v�:�mi���Xv��������������\
�����H3Z9X�����V}k9�Z-��#;��P��Z�����l���/�OsZS_�����}.���X��7'��p�_BUX���������������t�JY�z����F�eF�e|PE�[��|���dn�0�5�1��H�^���������t����,�<��KY^o���{�T=o(�����|`�m�fv���a�������x�r�����j���������VQ�(f�*t}l��'t|���`=��L5X�w�fZ7�(�#�`=��u��`�����@��!X���u`��i�����������''���Jm3_6[���n�5mej�s�{'�e3��0���U�e����f���"�����*�<��B��T�T�8�Y��7i>��N6 ������L��@����P-��,W!m]F#"���EQ�r'��2DQ�S������hv�@�f�����=������n������������i��i��^���u`�P-S����g�����iUV�kUC�]��Y�^��	����=������	������<k���7���3��E*V-O���j���7�4��x���`V:���S5q
r2?������2"
��,��RU�N��<�I:5�ey��L���5�����x�uc��i�uh1k{m�$�p�;��=��K��b�_j���	����em3	�px�>�<=����Mf�/�y���F:@3l|�4`*�r����������v�����:Z�}���\�<��N���S�T��a�����������e���<��s#P�-F�L�k�����#����2a������t�~��+�����t�8�����n@6���<�mVl���-�u`�����.�����D'���f��wY:��������V�L������[�i8X+�O��+��\u~�N-# KVz[��������L��s�Q��t�-���;y�������sNe5X����'��]u���~��nQ�V���Vh��/�����~���!RHm�t(��T���f���X:����l`��u6r�A�F:@8�Xt�^��$�H}�2#�I�E
�~8~�b���[���I]@yl���8H���=Z�o��/���5�H<;m�<�,1_��1�2��+�{l4U�l���Lt8��h���<��5�M
y3)��=���(����[���	��i[�$7$�L[ap�y��Sn�����BFWo7�m*>Q�|V}W^�X�`&*v��xN������H������m��$�:U��
��}��Z�^���N�j����N��FP@���
\"�W� �����=0����E{������,xm���t"	��cl���'gYZ�&�v���x���D��}c����*�S����gUV�L������/[��8:I��
��#7�l�?r������ �������6�������/�8a=�t��������*��d��;:����V?x�>�]��#�E������V��gAMG[O�����'���e�L��}�x����������xwi������.��4��M��v��1^��to[��}���+AL�������x8��V����`�g��'���Z���~@��?�t��Y�������ss�����������q��6�5������
gP�����u.�p������j`�L�|�,\��#��d@m\{��#�������*���I���7p���n8O�2��sQ��!�jX��`@8lWKP��3��
����=g_`t���[,����E@��m����������,�����D�B��y�mq/��}�qv�[ ��,�k�@�
8���tl�n����6�`����-@Ye��ivM��j\D�y��9�b|����������3�n � r����in�"��No_����g����cO�[�q�����LE�k`^��H��5�m���B"Uu���{�o��u�������|h7/��w��2+����q����q�b6m0�@l\kg3�7Y�����L���t%�����#u ��'j�
4}�����?+�h���C�:�GDj�D��U)3UP���Uk��n���6�x�����Yu���`��3��U�����vl�Xa��eX�w*������o.����}s~y�\�@�����^��UV?��m��xR���(�k����I:���t"�B�C�9��`�A���x�}�)�r�K6}%r�R����w�����f���p�E�����"�t�����(�3��i���s��N�-�(hz�]-k�wl6�6/���$����# ���'���
i�v����fz����BH��y)�e;u,��
�h�n����
��5t	@
t(f���m��/}���)R�z&��9M��6����`����<@�ow7!�E}2	�:,"o������/�<4��m@�os4~ �
���������j<�M�se��P���~C�(uOG�@����V����|!y'������o+��GW�B���R����A:hE7��B���^�jx��/y��$�����%�j�2���e`�E��=`v#?�\+��B���*o
o��;B�sG�S���T�'�4(��������^C�F�O�N����
J�:�`�;����m��a���w��p��r`��"?T�C3<��/W���6.z)]���������;�^l�	�'��{B/���)�M�T�/����&�*���i����^
�c��Q*t7fk.���c@le�K7r�[���0pi���=�~G���V�����
\B��K�^
J�����Z���
�1�e�\��H�mM�]�����.I�#>!n�8J?�1�.0���.RVg������v5��[so�2��)c]$����Y����Z������ �JM$��e�
�[���y�W�a��R��X�&(��"*�:�^�����u�8If��#�M�#�B{���`v�������\,�D�5t��z5vX���|�+��C�,���q���t^���yH�P��0,N>���Z�Yr�?eF�X�z?����sXJvUv.r�[�)��m
����KfB��_+�����=�����	����z/UDu�Cam�.-����KN+��z�BZ��?���
��4��,'b��>rG���r�Q+w��%�W��;�\^���Z���br��;���	b���0��8��A�*����}1��S2�������
���C��w�L���Q&����%�Ca��kB.Q�;M�<��I�� X
P��Y�8�H95nv:���sX�2�/���X�)t�g�����:����������4ZhBG�NmG�]���g&���#����\:�]���:M����J�^p��1I�	��	�m��Gb�i l9��LP]}r��
�_�f�p6<���fxq>�|7|urt���'}��lxur��������\y�8@���9}wr��F�svts�����r�u�8Yt�R�U���6���s~tIo?�<������h`(�d�f�p$���_N�o����$8>9��������~t���&�
��L[�C�`fm*T��������@�'�^���8,E&`�8a�_�S����BQ6d�u��B���Bx.t�/��]�!����(J� nz����Pc�0���P.��7t��7,��s�(5�)[�H�l�����,f$��+xBt�zB���v�4yV����rO�������aF;�m��r2T���A-:�0��d{�;y�O2!4���D�u0�Z�;�p�bsJ14��dPJ�F^��^+It[xHt�/tk\��v�A�]t�&����|i�o��\f#�$0���Ru~�����;�w�S�w�1�	;�+�S�+�������(�4���,9�#��;E_n��Egc���M�7�0_E��������L������"���0��'!����Qn��U�Z\/	��D�aMu�m�mA�Y
�q+s:�s�x{�EUJe��3���/E%^��!�����A}�.��������������70�h'Uy�G���)�2��}�O
{�ht��h�2�gyz�r/#d�$�t�9�]��`�F�L]&|0���Lb��/��i�3C���/�����%\/����]sUc����B��Zt8_��b�+���l���)�m��J�N�k%��I\������QJ�����#��� �9�&Q�OS�`���9T7�OS�;��|��	�6�yx_t6�����|��|�1*K���?h��)��1:�ch��C�uW����W������DzY���/������R�^���]D�B����U�Og���G�l��T��q�	�>:���^
���W�����,�2���4�CU����A:�A��N#.���aeu3<^�{���8?�p"�pN$z5�`��vp����zA�=�=z�;'��"���3>��,�m��o���6I�<�Y�����i^w���r��! ��5��_�����>�/��Ae��}��xkt6���_�MC��7A�|t����&�p��4q�0&@�mUTm?���;�G�����sT��s6��6q��OQ�SE����[�Ht���F�
9��z_����\�<?��f+��N
�/�����^ ���k�
1�����j��ps�p�.�j��$N�-�:���}p��(O�P���b������}1I���I	x5tjy5�RbZ'����"It+�)tj�)��4�����}�D3A�ab��R�p^
z53`��b��N-���R��y�$|�2u�����ky=�}������|4�V^)E�U�6���B��Q�C'���hy�
]&$�ve&����3��/4� �6_:wd�+t9w�����g��.�[0��w��B�p��>�]���������/�;���i*&��X%��dZ���B�w��B���B����]����|�j���E:�v�G@z�:���Z��UV�c�MX^�������v�4����\�������t���T:���,�����]HBx�j��u���8�?�����@���'�wT����~xM��N���k`�w�ZK������M'��?�u�8�!Ahb���Z���M���+ �r��'{��#W�!�aeeP�Mv����i9�w,A������&Y�b��<:?=f@@��nJ��Ep$vBC�^C����h$#�����w���lY�[av��8b�|����vR���}�.�^/�\��R~�c�M�`�W������~�,3}��a�H�����m,3^����X�B��v>Z���8��:��P��L�3=����P�?��|�S�XQG��m	B�G������������9��xu2<:�p��k����m	K%�^c���+��p���uH�����n�^����I�=�7,�.b���6�u��
/����K������zy����m�W�X�^�|�����'R/_��8��Z�m�J�\l���B:���-:�
������F~�B�d���$�H ����A��>���{�o`�U�%w��k�������� ��]��#���c���������H�]���+=x��R@#���l�A[9�>3F_��������.G*'�s4r�9\�`F{C����JF����������k^�6�x�#���[@���(�z�WUzW�K�p���@�������34��n3fw�cv���nB���E��d��h��E4[y4����d���9���n��=";p7N���.0���P����>d�*a�A������/������w��o�,���%���xw���/F�*�� �p����4����_�:�������a��J3�XS�C�'l!:x�yX��mX��xT�r�q1
?�����H9
v���6�n��w������YN,
� ��v��p�K�Y-/���� ���E��
P��M���T*o�!'��&����$���c�w���
.���,���2���}�8�H�].!�^
@(w�����k
�/�>���IR�bS��t
��f�lf�d�:Z�v�J����/�;��b�Q��;@��ZI������M��>��Y�s�����u�F���5���no���M��o�����^��/@c��)����r�W�j�����V��D�6e"�����9qur}stE���2"�6��[6R�u��$C�6����n�p�Z4!4B�^�������2w��p7�`�l����n�'�h�be�C�L)�}4-XZC�6F���I����������8��f����,��(����*��.
��ZL��U�
��Q �3t�zF7��[+��a�rw������������+�>w�[�S�4�����t��[\��}��4��
����-�}��2vw���% up��(�~xH����Q��:�]N"�����v�I�XR��[�h�!����m�}��������<4���{�Y*@]	�`M�����X����!s�d(V�|<����BP�����]@��nC��z�v�;���*����=D��Y��/��������Q=���Wc���a��d�.#wV���(�k`Ul`=���sX���Jv�v���p�r������=���6��V3z������z5v�60�hMS���9�(�������=1[[�z���C�����`�E���i����9�I�=@��q�k��$Mc��S���fw��k�9�N�>�).i��\6�)h��nI�6@��mZ�����X��Hl��d%�8P\�^S.�v�_QNN�h���(��M)��v�ca����(���,����u����Hyep�P�)�b�HA��E�4{.�$������0�8z��MG�"����r_o�������L�8�4Ke��so������[����d�!bV8@��qdi}u���-�xIR'�#}5��<\�(������ABd
�H���/����.������o���?` ��`We�y.��j/_r	��<���kJV&[������U��Pq����4K55���0
(�=��\��\�5��K��������{~�#�r#��N���=���	�]�����h`C�h�e��`�@����8>�^
���_���,��&������?���+���W���<�z����(w^�z�Z�f�qlf����f��`\tf�������r��5)��6�M�gA�t���W@�Nl�4��^�@�E�`���)5D��;�����:����"���,��z�����L��,�i
�}�F,e����'
�b���Ze�<u<�K0�V��]'���T��n(�t���,�x��&��������s�^k
���p�h��&4��(��|4��rA�����-���2�b�H�=@��m1�Y�S����xyuqsq|q6�����������E����F�������B�O�X���z�R���*�������%w0�G��{�m�3V�z�u��X�z5;5��{@���
���|�GY��y�q|?K�m�9��Bc�7�C`?����M����=|KZ����t�[+�qp�{u�Z�x�=���
��`�q��|M7�Ik���EI_]#WEp�{�zL!����P�����r���" ,�8��z{�T��^w���������2b��������:��M���>��[FS5}��o	��Y�U�������n�����6a�U�9���B����@��qt����l��f�1%�W����k�X?�����7_3������Wf�C�%$6�����c��\�q���7	�3���3�{Ga����x�s^�b"��{�3��1���S�����l����G!��j�P0�<;���A�Q�F�|���cb�:w�KQ��Q���s����70��b���Q��qL`���\���?��/c���{����f��o���������&t���~K:/�����^����z�����F����� �w���9������O�����Wz���4����r�I�wO��u�9L����a��4+8�l�����{�0�4��XsC��a��{.� ����KJ�����'��&D�m�}@0�sq��j�������"y�t���NC������Qx����}c��Jq�	�Xz��/:r����=��:�?lH��������!�@;���X�z5v���������Y�*}	V�����,�H��G�0#�)��9�v��7��s���p������F��"=�����Q���]��y���2\�]����/S�<���f�L���'�~
���I�Q�AT�>G%f7�}������V�����[�_9o�n8?:]�a���p��B<��� �	��>��j �	����������j������m��x����z5@�]f������J�Ag��g�*��h�U��nh9����������2Bo�#�����}��)�Z���S8�3M�b��t�>G�����5Wn��g�n@��sa��j��z
6/dP��V������������"�\w��G��jo�	����������M�3��}���W��g��A}pj�,�v�9���"qM��y���.���*�Fa��}����\c4�>G�����o�wu�������bJ� ��"�fO�pO��{���s����B�L&#W_[����k��������
����E��R�����,(b;�7T���-7 ��'��^;`��m,�rR�4���w�~��l$_Qn�$.>��OWE���\�_������U���:�G����]��c��(�xi�Z��0]��U�({���*a��"�f�X�6������GT)���s��@��;-y+�_��j���vM���}���������&��*�r���rJE�#Y#�IV`g������;3���-�Q����r��Y�����}��+�����N#fl�I�������HE\�ZL���s��#�i�P�����l��;�����}4B ��'���'7�frtW������y�8�a\����o�\���������a&����S���o�����-���~����R�i��	��Ih�rY�]O��wJ������K����g����&q��E������}E����n����H���{�����P��0)��[�>;�.���ZB����_����^'�Rk�b�0s?V����XM?U\��n�c���n��n��X�|#�u���u�lj��q7y���4�rQ
���v0���m#�8�,\�&��X�M,\@7�stck�[��������~�h�E��v.�P�9
�^
�����k���5����G@��7
�\�/d?�v��%L����}��
e0�E���(��e�3�O�����jt�
��H}<�[��_%O���wr�0M&�8
�06��}@�����F�m�<��eP���D�.�����sEbIV��cK�����6�����h�#��k��
��}���w�"���WPk�`�u$@����u/K�0��5����D�Q�(������4�o��^�}@��� ��"/9L�����@���0�)1����>�Y�x����������^���9A�z����y�q���dh������gy�)gY?I�/��d��{�{a���e��%kh�."�Oz=v�p�d*Z��y���}��oW����\�_v�>���� !k���x�)����1s[jte�W.������1�v��#J�<�{���X�������+H�/�W.������������N������=��\�~���Vo(�xN�!11�k<���E�Gp�n����X��P:4c�
�x���Q�x�q��E_u�Q<���]�]�.}��Z;�6�e��|N�xI��
��y�<�Y�|�;/�,���P
 �O��z��]NB�2#!�C�)Td�dR�(]���/@A�(J�\�@�Z^�G_Dsy�)s�\�����r8��6q���H�6��V�a$?��=!9��A=����MX����k����b�c1�y������2��r����7g�v��^@�X��
G^�����<�j��
�R�Z�'yo�L�6\z �>I��jI_y�T�����X� ��I2��%'����?E��L��V�*-�Q0z�!��(�qY�����J�Q$C:�%�a{,�>���xLk�X��I��n���V�"�����\g�Uv���H����_�V��������f���:!��R���1��#��-��fx��:��+^��w$��i��&����2����vRy`3��A���\Y��yh3t�A�p����x��b+�{jGR�u�vf��?�C�"Z�(�%��D`\�I�o�#��#��X�����w�u�m(���Xbce}�@��s|�M�[|�M e*�F���	]/�Q�Vz�X%��>�z���nH��&$u�K�������g�NA�-n�������{'�]�]� U�A�R�vL�G�~H�2�������MR����Y@[lC[/�X@��>���^�_��y:��V����zY�=��k����hW��	�)��*�'�8�o��P��� 1|�� ����W�9����|�u��$	%40�*t�Vq����k�hR��6�VAn\�e�nM��*8{���w�g��V� �������*���4����K�`�P��
�0��-����������>q!��j����B�/�Lv��2�x�����)]�p6g�{
{��wbh��nC�)��!�~8�=�<M�y����sZb��������M�.T�I�rvq��=@,n�80�u��e��Wc�W��h	`XY���-�rQdw���hL���*K����o1	)���*�UX��==�4�Z���#'R~��������������cFz��x�� ��b�V7����-Xy��� I�!I��v2$��6$�:G���<�"QW��G���<�x���9��m�D;��S �d�BX��j(Xn\�]HW�����6^�����<����JIm
A
��'q2��Y�'��,=@di}�	���z!���<����"��2{�$���Ej�k��������+5FW�����Z5�s<4�|��������)C9wU�#@!���W�c��PH,{���*Lq��H�/����XWR�p�a����qPIAe��y�yR�E��p�M�6��W�%�h4��0\D��l�D{?��R9p����MC;W����<�������v�2=������?=g����S+0djQ�U��9��Q�U��9+��].8RY5'�8�g���hzOY,���b*�o�"	���TY�5<����s�<RYU���R��VS2/�"��h&}��]�rZR�Q��\+6���Tu�U+>��
���O�0@����LK�Vo0EV��������V����#��u�R:%y��������L�k����&4�>o��F4���3��h;Y�a�.�!��C��]�
.����#�Q��,�P�mx�R=9_V�L�%����,����7�aD�;��[�mr��V5eC��+!��LdIU�������|��e���D{.�z�0����E���>�F�N���_D�����(����6����*�^��5�h�!��.��F��2����"Z�X��-��)J}t?
������Hl��8�+��6X�!��sy�o����M��u�o��@n��(�Gq�<�(�W+�������������
�[������y��f�j�s8���9x96�s�����p�S���)21��N����v!-�a>��Y����m�g��
7;���ovm*3F��$���s�����q<j�@�6W^��<�fw����!�J������	�����<��B�Q���+di-������SG��RH��+(3�B��nrb��)��|�M6��EZ��g�@�����`'}��zjb�f�s��Gq�8q�����8M�LX�D�HN�L���j�b�������N���b��dM�����C�H�C{L�(�mX��B��~���`���)�H����Z�F�6*������8u�l��4��4iz^���Na����1�~6\
q����0���,+a��/���p�e��{N��N��
�d�u>��Wnd��Q��f�Y����������L�m���)|�D�$*�%go-y��<��NN�M���dq��r��-�!P~6�6yj�)yRE�Q��?��C)e�*��`� 95��z>T�Q��>���@�9Zs~]bi��F,��mL���,�)x��U������{r�����D���df����L���,;1��[gVzN.r���j����p����uR���B�-f��	`����I��2K�O�}y,C�-h���!��md!�)E��'b���5����)�Dyn�>%=��{�E,���{��fz������L��^m�ly�%�iO�P)O+������y�/;G���}�q��Q���~Plp�)�6���<���S��9�T97�P���R�?�r�����TeI�~	�J���0������z��B4�f��1L�-��2B����u@���`�\o�tN% Q�Z�p���t1
�p�\(�J�9���?�D�����bsq������M!��az��z�	_��O��nWX��'�R��a��3�4���
���4�����V���/��Ry�����
�wPt���D8V,��?x�sy�y�nb���bg��s��~��-�o��,�+�*ET%�W.<�,I#��*�C�����@;��[%����"l('>��A{��������r�M����8�,2LM;���[ERM�C����GuV���*L�$TG>|�Ji�L��
;�=4D���cP/���8U���vk�Q��$h�H�UO�w�2��!�"k�]?�����5�y*����;�S>�I�|p��11�(�5
��)�"�������������Z��b.�N�.�s�+�8��S���tO��sc�g'�s�Hb�qP���?��5L�7Z��az���@��%L���g��viw3�N4����W�%�Vv:Ca�D��i��
��Y�o����P�x����(�i����hQ��m�hz���,�jD�J�/]Ik2���\��dg<���[-m�����ep�]�#m�c1�eH;*��0��b���U���rS��4���)�}�&�=p���xX��oBYr��=l���]Y�jgOW�j�:v�Y�. ]�����d�a �L,A�����&q�g���R�����<�?�����{��ryJ�g4�
}��������*4��
��]��]<���!^*��Mx�P��M*/"�\���I_>�u��q��� �p����T�K�(�X���mB>��������o�%�'w�Q�m���7��g�o�?�Z�?.`��0�����\^��<�/������+��*#C���,���a��22�M��("��Lj���J�P;��������O�]@\w9��Q@�&|uE0��"��(�N��5d���&�g|F=!?.���p���j�`���8�s���X�����D^
�;w9���n1���y0�9�|��[�MU�V��S~�&O�1!���~0�Z�����>����.ZS ]�c)�]D��R�fa;�tn���
�-�X���}(�`�I~$*	+I�� j���D���o�ui��eu�R�w�<���pD7?���Y2�n���?������s�_Ar��aCtKa���Yi�d�Nk�1K��E�<R����}�y��%�0UX������T��3�e����3�S���fL�@v��x�P�w}��`�9���r�o�D����oh:�A���j��p�������].X6p'���k���z�k�E&V�O��fKkl!��V�Z�i� [YU���a8KF���n�x�rp]5e�
�y�+m�5b�	PcD��z����"Uo��p����F=@mm��)��4�� ����E�_��c���X�9�����Z�e$�m[�?�C�U�y5|�a�6�A[E1���P/>���{��4B����o�~����.����j�Vqo��x��Mc����$�3]����S��&��yY��N��Ye���4�����n�l���z�z�������z��Tkg#&��n��	��L�]�8����n�0����jY*�k��lX����$���
h�.Kc.J~`�\)ZP"W��c����]�zvQ���Y��n���r���(w���B@�-Mn�]��v���bGH@�v���wcx����m�	���1��z~������k�L+��m�9���`�:�DK�Q8�'��y08���hb�q	 Z��h���d�9�	v�s���
�,����?�����?�M���V|3�3$d�l�F�����?(dgn������(Vr�#GH���S����O�:lM��}1���N�>�5�d�vJ�#���^�85�T`�uR�<*-;W�-'=;kr��`Ng�8�'���]�b0�����n<92��c�1�������I9���1�`j�S[�m,�����\��ES@H� G����C\��v�p���W!@�vkR�]@�v7�n���������"l$�.�]"�}�Y��b����L�L4�Q8���0�]��)�����
��bK�csz������
��I(e���{cn�4����� @�M�)��Z)���d�C����!�+�����^0�]#�0���q���9��@�S�
��M���1`����9�����
E�a��l���� ��b.�x
��.�XuJ#/��A0��E*����� �z�#�=(9H��i�N��q�s���!e�M��0�t��[����M��'�k�	.]|�Y{%��������Hy��5!����}-�JS����hv���p�W�^�K���B>;y�
����r�]�U�b� �	����b$e�P�!��^���x�x�����&�b���,N;JzM�����������F=���}-T�J��\��]�0-~X[��������Q��=.����+_��I�W!�AM���I��D��;K9V-���W\�x�#�5�7�t��s �tx`=<@����_�'����$�JD������s�k��D+��P�=o+WF�:{O���&���$C=`�I��/N����k@�,�5[�4�m%[r<�qP�tm�9p�=��>St��KK�v*�,<��������������X����M�����6�8����
�(������y�Y��@��J!���Q!��q0�������g�&;��s/'��s���d}��A�/��MB�F,�"{��7`�'-;%��&lc�2���M0�=.�5O��#��7T�K�[EM��S�*�(�c�WI�����=�����z���5�H�y���=������^����q_���4ra4�rf�&�*��=�m����|�4%�x�^����U:���&!��T4
$?H�T�*JS9��j�����)�|R���I�j��^�=��P�I��5Y�r^��;�_�������!+�v���V)d<@,�8b�Q�NC�mJh���#i��V�x����g#��)��{l|����i������������hd��au;����y��bE�����8x2�U���z�g�gS������[��l��1������.�Y��_��	8�^7�bR$�	���Y��s�������4�i��OvM��r�	�|}R*3�D�8Z>�m9@1�8��Q@ �.��P�������S��]1;���n�����/�.���r�uQano���6]����k��rG���v����H�������o������w���m�Bj�5�v�6���f#Q���FY����G�T����e��(�j�Gv�f�u�Hj�e&��Y�&`�{����&��{6>|�	h���&_.+2��g%������=��n����p�9���a����H�^����l�4=x	.�(���7 �{�� g�y���qt��6�8��y���|Wf�t3�ZlsW'gDS'N�$)��{}����	(0Gb�G�����r�R
bI+Oa�����&�J�c=���K8��G�VQ8�n�	���2��t��U�I��������@KZ�%�1�+V'�mY�������-
~$��������w�72,�!
����P~��y2M�p�L&�G���$�������e���2`�{\����:�� �L�5EzV�[�Q��,u��mq*�g���Q�(��nL?k$��{����D�����U&l�����f�x|to`�=���Hy��{�����NX+4�VO�a�%�h$!��.(�>��������5a�W����~�H�^�����^����p�2�48��<l��+y�������
,J�j6[vE�I��[g��<y���$f�[����F=v���&f���?���+y()�$��#01G2�f[q��!�9���r���?�Ma����b�;"��J�{����,���+@k�PKnui�kx�����s��M����?dl���W��:47����\r�o�a*zE�L[yl�L�������?d�}��;�o�O�Y( �e}@	�9J8�-/�w����������/�������W0��$�sQ�m�2�x2��c������sd��7>��9=c�2�w��[����3
����������������t�P�5�����N��I�u�;��H���P���_������&<������:���}oS�|kcM8���0������~��&������N���|~����)��l�@".��Q@�1l���2�[6_��"����J�������������EtD&� �%���2��!@K�����*KI��C�|�?�L���
��>OW6W@6�9��� [L��������^�������u�f�������������5g������
;�}�P,��b���t$~����I?�����X�T�P�v?�"L���Bs1�����y��EW����L�����}��m��K��DODG���"�~�o��l����k���D.'g�5`(��a��~}���m���"s&�~�Jo)�#�%��6~p��(�HN��(�-<r�
�)$W�t���������\4R��K��a��\\�5�h&�'eZJo��\E��U�.���B�f�_��-<r�������$�~<?�;������������4�XZ��\r���tW.��&��B&��v�dT���G�N^����W�4@��9;F���G��������/*Z��e�z>�����
�c2u�1-C�9r���~Q�(8�m��~�P�[]���G�w�����(-1yw�W)���@O��E�n���TX���jB�U��5qW�+	��&�[���;I>����X�f�7���
/���)+@�n��V+�����Cu�:��4\������,KXT^����d�w^��b�0���[6�*GslI��MX��f�~�������I�j�L��;����w����,;��|����`�I�j����v@�&���l�fO����sZ�:pmQ\=M�������c�(�~J�R�����[�B!��3^j@6�{
��u����������`u(z��f��p�4���K����`���h�B����z���w�������w;���9��J��*%�E����������=��I��.:�����0�����J��cw{@e�7�2S!���9hts�;�|g���Jt�fxur���|H��zg��'�s<i��.,KZNkgf�|"��"�-����f.Y>�G�MBn����<�����!5��)�������;�7/���^������������	7�����:�I��������u��S�{R\"=:��Q~�J�[-&�&��m�@���LT�@[1J����/%��`������H>
�,�_`T���(oL���~C��q�l�r#;���D�-:�B
jm��^�������3�{frt	X���m�#l��4M'��IH�<���p��aK���4�S�=������g7������G��Ggg�*����������"����������b�`����e���R2��,�S~=E�T�Hq��B�?`��R*�?�0�u$��'>�s��j*d���J���+����B���H)�B��(#i@6�!![��\���������L�"���2���&�k*���u�)�\b��������r�P`j�U����[
@��m�uzIP���P�^�m@�n���*�|�v���Tr�v	���V`�M�9��JZ�"�r����6��6���N�Ihq*������4O#�2�9�n�v�IP��(��f��ijw��v�r��A��4>xi���:�>_f��ol��v��%�|�H1��y��@�nT����#|v�Fy�����&iMR�Co#z�$���6��|*��IZ�n�]��[D����~
�����@N���^9�mCI��@G���f��g���%��,B�������lv�S�~��9d'#��&���'cMF@qo{ XH�d,�v� �����*���������vS��r���Y�������^�}8&z`+[9�y[����[�@y�i`�i�nn�Z�ie)@�o{[X�jN�����ms�����sb�/RZ�s���m��e�;Y�_e[��������4���m6�e����o�r��������'�A���nA����������z#��pX�y!�3`���N�e�l�����5�}����>�Hk����(��ms��F8;�������*G�Dda_�^{�p��b�C�nZ]|>�@��;���FC/~�YU���aH�$$z1��)=��h7����;<��m��;���-����)��}��1p�hs��7%�=���g��9^�\h��N��s�4�;4�������i�7+F������Zo�`&�*�j� >�%��P�����|H�����/S��G�s���`6��o��?dc���iN?�)N���V|B��O��h�=���}����
�/�Q(��#1���f�x,�+��1��q8�G�{'��^>��?P���/�������|�!��^�� �^>E��U
�)���������p��;b���<�{��!�f88���~�,��P�}�?
��Q"*���X��?{�=s�@<-g�4y������8y;�����K��2���������!�����}-���l1(��X+J��Ja'�G��i��o�6E�v����������������������;������9��^�\t��8��	��~��������o^]��y{����|����������������+������M� >889FjM���t����$$��qfo�����>�|v�L3�/n�f�3����^>��b-Q�;�4�
�y����D$�F���8/�K5��u&~~
c�a��+��A&�IVJ��E��X#�����)��0��I�`.�(M�����h�Fz �L����Q���^(�����%crBTJ�gV��_j�Y�],�����/_��j�����f���D��C��O�P��y�-M��5�������2���Fb��Y���?e���i���&�����������$T���h/������h>p���~'�����u���-oL�������^�(,} ����a���p�n��Oo�l���I� �2�j8�uC�d���V1�n�4c�V�t2���o�^�h�L��i�n8�f��gj7��~w�?:��~���Ku���t�w��p��������������J5�s���4��������,��U������m��]����L�V��W| ��hU�o�&������(Wu�����u�
�����m�/��[6N?�B���'���J�][���ja^3�D{�
��\��%D�V0\��sNWkc�����]�u�P��'���
�[
�_���P@�md,�O��Z�Z�^��~%��	�k!�
��T��$��T|��(_�+�{1'��z/�PHc�t�I����+��B��2������N������W�O�4+��Y�.O`Co�g�V���X&�L�[_^������/���&$:M���\j�`Ko�4K�O��8�`�jb��z�����2v��7(l�]v���"#h{L�b��-
K4�\����b���Kw�J�A��	+#���GTl;c��'M���oR�C�.�k�P@�����}"����W�w�o\w]X�����u�����(�!E�Y����/k%��y�����
���o��nW����������F��ml�6����v��i{����F�z�������������_�����}��������������/���MY�+���/xM�����������{�/*��{�_V���h���_��~9����}�G�z������G���:t��_/���f���iH�/fe���
��_�������w�w������������z��������[�w��O������5�5�J�������}��[�P�����t��/������{��}����������	�~������j9_���C����lo��,�]�@yN��p��0�D�VY/y���g�x>�`>�0��v_|��_��I.�� �ZI.�����V�5��(\h�]�W��w�����b��@6���d��4�����E%y	(��t|��.�5��(\�r�����D�BL����\"�s�����0Nk��x���Z ����7qr��+Z����L�7�d1[��M.u��N�j�p1�v��W'�7��O��nN/��k�FQ���"���i-��5.�PH..��"	h6�����l{�!��Fb P���MG�B���PW-�?���8g��zi���r�V���w�Sz@g�B�{CL��.��k
,�Q��+���2��V=��������V�O��Om6#:.?�'�����A\�
�G������d6��[�����bc�W�G�D d9m�	I�$�G�m1:���L&���8����i����oEcbYK�q���1�g��/Z���qt+F�(��i
�:�lj���_��R;��SG	
UMX� ,�J'Q:
��&aZ��At��N�i��D�1-��������hUG�z82�Q��=���I�!D2
8���-�� x �B�!�F�P�Z��aT�$��.�7I�
*Ew!7���E��@B��Q1��\�������l}
2fY���-�C"�����n&�e��������C8��.�%�f^4VF����dMf1�������������z�R4�FL;\����hgF���5{0x�T=XYdQ����Y4e�{�w�T=8QM�R�Ypm�b�W�5�Q f;���v=�(�J��)S����z`�����!L[,���V�c��Z ;|��A�b�iueo3���U�Ew[P�Q2M3y�����mk�g��.���>�t��'���W��:�\���<$���������.72M����'�\�<��b��Z��d\D�\A}2	>KI�(b��5�,>Y�m��F�`F{B���K�g��&3�R����cI� ��- 7��6�?�[j�a�����q�R�~t�jwk���%�2JL�R4S3����.��[
2fA����G�����������&P�.�QUU�SDf��-~��{B��?^E�VAUw��rd��8��N��E)���L����l9Ji��R;�N��"�����b���G�#&�1:��$���|a��Y�y�	���uA��{�@�V%�/�|�������&'�I����~�|�s.��3�%u�d���^��8��2����'Q�r�>D�CN8[�����hZ��V�@
,�l�-=���:X;R2���;�]Bz�8�z���5]�D.`�����:����*��2�[)�&���&����'������n�CuN�RqDHEw��Eg�N����e{v��R����R;�yj�?F�N{��[��s����9�Oi����Gf�r�1��<�^�f������U�v�"��@Q��"s�U�.������e�=��K=}��Y��n�x��\�'�%'���!�	���<�����8�cg7(`k���!���Be�d�$��@Z8�h{5�I`4����c���l,vV�9Qlw[J�a�F��vD�����$�t��nA\#i�W,!�D��'A���R"���q�Q�������)#35�������
�x���<-��UU����Y$��+$����g�S��a���]�C������#E�����(��Cg��tL��2���p��L�q���t c������a���!-O�e�/�;�E��`_��>�j�k�@�L��?����}NNf�<�K����8�A��F�
R
�H�[���y(�T�����3���Di��0ps�Q�$<��Z�<�r���8}$C��9K�t&��h������C���kRP��6}�*���qxG�������������{��^���Z>�u�<�9`��=�r���6yt����
i���Ev�V�a����"��Lg�(�������T��t���{h�^��.������9��W77g��o���N������_�}��:���H�v}�Y���+�z��t1�*"z<_H3���S=>DB�&��3
��8g�=gWi��
ce����N1�`23���C�����:����9�qG�Gt�d>�+t0~Z���|x�(I2�)���{�_5��z-������LF
�F��_<��iv����m"l��b��U�CQ_��}5!+f���W5�8��\H;���,���n��W3.`���U�(<�A��N�
Y��J/�q�i��U=�����zH����c5��$��A@�k�Ua���yI��(�?�]�qfyL���J����`�QY��*m��b�l��s���W����R��]	������x!L}g����U8�k�Uq"Z����\F2�n�
I���V�����E�(���^%��.$�{Zv].���QU�[�i�Z�L�l}\@Uu9��^
�`�����.�����)�g�\[�M�\b�� s�X����
O]�x*�~�:�0����LY����R9��	����Rs$��Y��#�����Y*��0��d���Lu���t�r*�na@[�u4�c�Vu�����R�#��������:���1?��~E������I��Ky*E��Z���x!c��y���X��C*��K�c�
��k�� ����n^^q��l�`��c*��X�i7�������h���U&���s�H�w-=�F��\����t�m_�ON���T��3���hQ^���#���`*�&2����M+T�G%�o�r%�~H���:0�(�?����(�T�'�T���P���R)���s�h@G���@����z��8
M��h:���/��\�N����XI��D�Kc�a�@`;U��l��8ie��:rLN!����T3>���>�Iw��|�b$}YV�'�4�?)�u���r�P!��p�I'��"��,��kb��Np��]{J����:f���H
�O�$uY&��s�d�DW	L%�p��d�����):8}���)H`���Llg��tSA��y�M��uT������fPl�N,��@Y@.u9riq�NkI���*�������9/�VVp�S|����K
�\R��%�������F]8�o��7G7��w��W�W�d!����gB{�O�T	 ��v*Dz|�R�tH�
W7Ab��	�m�NV/�\_oN~��>=;��A�<S����*�)0R��<4�4������5 �����6N{z�@�96�^
Pa������r��|`N3�{2����"���"&$�wL��(F��E] ��VZ��T'Rw$�$����D���N�hBP1��Z#����}�:\ ���X
���(���gBm��,J-��0�#���'4������	lD��IsHnqi�d�cb����z5@�9^���N�j�����~�=�Y����p��Tl���'b�l�T�B�k#T
x������N�Ua`I�KR�`���X����f�~���S�#��$��qm^	5����g`'�;Q��.�:T]�WT�#���\�p��d��I0.�/a�R��qx�������li�/���wu���s��\�)y��N2�`�����Q���7��c�����#�M�T�[=����(~z5@u9v��HX�^�q��Gq|�������4���D&3iT� ����o���fR����sl=����N���������/�����::���-�6>�������,���q. ��A.��Pl���%��f/��oT���W�����@���rl6���3��z@>s9��@TI� �o�O�n�L��-zk@�8��9����r4��}9wZ|"!�b{�i.�N������r��\��:X��x2-�iD*�P�\��F�'a����8�8-7��G�Nr���&�(�e��:`z�����j����R�\Za���~���"t�]c=��F�vv�f�������������u��������Xnz5v��8��)�2��z�M������+�3�`y����<�J��<�����4�%�k��1�r���\G)�<���j7��]�=��&���������������R�����9w���q�5���z%�z��U���i�H��������)��eGF�,��b�+�r�f�x�L��aH���iHQ����xJ�G'0b"���w��[��J9�w��fW�����t����Q�P2�()KA
r.���:�zJ*��A���"g�2a3*�$�;�����t8�#���u,7�c�Q�����?���}r&��f@��8�l�������������#�����nz58FC���xm�<���������T"�b=d�.�������
���[��*5�J(�&`�y�-D��J�\"�R����'=4���<�H�Ke�������T���R�������,L���f��s������r�Vb������1��K��l.�-�<��^pd9��S�
��{� �q9�%��/�]�0�<�1�o�=���8�^
P7��f�6 �y�
l�+hb:bf��3�����Ym��i�)J�����y�Z�}1:	�O<�5��<�OV�d����<
�1<�C�8����t_0Ob;RN��q�������8��0hh����jo9�h�E��8�g�l|����7���\]�^��W��8��pw'?��v>9������^�����a�9Q��`���
 �P�9��^�v����@n�8r���F��1��j��r�2c��2�B*;�N���'Z�e}�}LKQGt42�g����ic�i��R�b��I�3	����
d	
����N�������`��1�5�V����F7[�8��
�����Y�S����<��
�ERI[��l:��<6��Z>����J�8����Hi��TV�U;���M��kE�8i��
�d��LV
���Y���e%�:2�M�:�<��,=E�3e3���QY�Rry:V���3��>+7����������6^Ze�2&h��f��o&��������<�e��{E ���T�vD�r�)
�qJ�4�TP�<�Z�W4y�)��:�`�o��5��i����u
+�Z��U��e����<O�����xsur}=<�x������-�O����s:�����J`@e�8*�^
���F���y��F�+NN�����:��`�
9�R����@��a�y>��y6���R�=J��dJ�h����:�0[���9A�FfY�%
�����\�}�����9T�T����Lk���������i}#�����P���7��x��	@f�z
���g/\NC��@
V�Xmr��Pe�&���������jY����t�J����c �y6�����9�c������qTv����<t��	�\���������6M��iH�Q�U�2Q^P^S��[Q�����4
�Y+Od�.�K��[r���7�u��X��c�$�R'
���r��>`e�V������o`������"~hQ��W��}#k�O��S�:Y��]�?�N/�P��+��W�1����b����I�����6�V|��bh��
����#��TFe��]�B���wd��&��Vw��
������V������;z�Q�����9�����m	�����Fo�������|+~�&�4W�ZXA��No��oO��|yqz~��#0)���~�a��������*&��������(�fqNV��> f�1�%�����;�B���?l����a����0���=�s�Q�;�6���7����3<���(��n�C�:�aH$�Xn��i� WL�c� $9>�x��.�}QT�P$��d��1���=N����8>��K�U�p�y���t����s��\6a���r�+�e��[Le Q�JfO���G��
�����1�d��B�1��U�PE������S ���8T����Wr�J��];��v�����K��X�z5��&�����������������������eKu��}yo�9�'I�&��	�������n����?�u���Kg�^�L<2����M���]%���
0M}����J�?�u|�R�_���R�o#�R1 S}+1���s��E,L,���
��GS]"�V^��(���9��y�����~�+���.W��<�"s�@�U�JQ�2��(���T���sZQ*$M���z��Vn��,�tG�I�?�s�V}o3I��t4�U�����{���'V�����{
��+q���g�Ll���V��u���YS�>������;.���m�{
���19
����x�����K���������/@b/��G�J��^��<����{Q��X����4L��b��`���KF�<l])H���S*����w�du�0����L������
��<e�o��Y�6�t�������
�y�����Z��,b��/���������M� K�.�n��7�tU��L�r&�`��rx���ay�+yV#��
��%D/�6��������z��>`F�6f4}c��~�!�P��g�>�M�6�t���A�\���;�9)�y)m�q	x�>��������n�
�`+�����i�s��8�E1f���c#<�	�n��8�]4O3�����Q��j��0��r��T�!�����y+`���6�8��]�<�
��Ju��8����H�C3/��
[Z���k�6�����Ra,���ih;�)%�.����webn9��X�>�j��T���Ie�~���k���`��,��T~��Xi��ZY@4.��^
@4kxT�Q)�f�B�\��<N�c���F:�{"uR��'�a�^�o
����������2*�=N�F�q������X����(��b����dc��5��z�Pl�x����g����]�Y�%� �y���|@���
��,g���S[�k�U���0���:9z���1�o�
��R��=��;���}����-`����������Dy�	Q�(��l���(���{
���<xU�n�o�*�N��N���>���(q=��
���e�13�X���~��YR������-]���`T/�X���_���n:2|EG�hB�/:r-���t���Ppty�
���E�eb�w�2�~���i�=&���<�{�!���iWSy�gF��w�o��/eM��_����;@��1^7����o��2V�d^
QKN�����=��|�zm��2A-3�)�����z��v5��J�������t�{N&(UW1��He<~}�h�>G{�$*�����#��E�]���= ���-N�����4}}r|=�<�����G�P�2�����sT�r�{2���)b�B��oB�/�������x�^�wP�}6��z���=[�X���+�%�[�VnP`��~�5�?���?�bC���Vt��0�}f�Iq9��X�>�R>/����-x/��F�M�mz���+ ��\�]mR�R��@���
U*w��4��-`o�@���]/$op������@\+���,�kw��E��I��L����
�:�P�}�m���S�
���-n�L�f���6`t���������Z��.���[��F�^��6`S��ljm�'���92��i*�0M����iG���6(����4o\9`���s��}�
�^��m���>�2TK��|�z��8&��jX)!��&�nMB�/q���B�i�	c����FJ&��G�Dy\�`���)#/���w�m@	o��EGP@�ns�n�����R������ny�^�r]�$�?�HT0��&	�;�@�W���"��T���a@{�Y:������g�4������������w�C�=��jZ1��>mr��^��r�
��mk�bP�<n���6 �����z5kl1�����t���W��n��z�Z5�_�|Zh:�1�p�9�\t�T����}�����q��#�d8�EZ�[����}B�������`��n#�vQ� ^����z5����z)���"�X�N��#U�x���<��0U��piL/mP��
h�m6�4S�x	U��6�\�9��^
�)��^
4
']��B;�X��#{�[J:��tj���F������_��\]v�O���o�
��mK[�J8����t1��@3.������|��C�7��s����v���r!j7���Iz�*b �#2�����l�����i���o'����*��E�F��=;;�bx��%W�X���2���|���-�ve��t�p��Ry��].�z���:�p���
��m�:�5*�_��5�i���f����m�os���E�9m�w����Vd�HuMmi��pt���RPa���L�VK�l����E����O�upM��H�#J�H��^�J������y�b�rJ?.M"��n�bx�J	>IS���Qu�R�H��M�W���;�����Km@
o��M��FE4���e,�f�.�~���2�E�6�@�noC�.�(���� .yw;�O���]����2������T��x~������?��6G/}�|�6��S�X'����r�9��p��h
�SqB�|�\������W��XeW��"�k��w���������gf�����\l�"�s����cx6y�.��-Z�����\�"@)os���j�4	��h-��m@os��kiN5������1<<��4�n���JvP
6�
Uq��,#=O�� *"����jz���WNE�lM��2��_"
TP������C�]����v��!\����4�����l�m@-o[C�;�@2o#��0w?��?�N����E�y���5�(N'U$��h��hjo`F�oJz����,�o��Ybf�l�y��]5����|�F���������|�������o.�n��MGI@,o������\����C���i�7�mD_k@R+[�n����"�Tu�h�E�qKH��~��V9������������,*m�o7	���#���,m�����<�����������8�m�/7H��P���N��w�����
��m.NzQ��KGt1�3���1����
����6"��3J�������3���$�9�/j��x�Q�-��~sut~�����v�Z��6�\��bx6�x���c���y@9��]�c2��a2���P��<���MJ�����,0N>=��������W�M��gL��e<�����sy,�\m}��7M�c��s2����X�.s�]��&���D�F���_f����H�`��]1J>��G�"-��JW�'3Y
�
�)���7����J� V�<^��.&�Z�����y�mq�O�2�d�9'*)%��]�KdG��h���KV��������=�V���4B{������<8�<N*:�����_��<��d_o��~.����F;vuZI?�}���7�0Lubi��9$����X�����UwU���*Of3o���VaT�����z�Pd;�vR!�N�����c��-��
$�cq����Z������
���t�c���������/Ka��*������BkV���w�R�h�i���L�H���b�V�&93��T��;.����R���\�Z�3'N���?Rv.Z��)�.�#������&�}�����4]S�q~�:��aD���w���S�#�L�&��{����m�o;������0�ix,o�9��������x������E����{�Lt�x�uwf�n��1��;���S���Q��e����`�Pg�G����`��4�9{�:��Plf���N�}���6A���Go����l:�B���>�3A��F}��F[=�|Yf��6]�����@��pQ��e��>$'\��OT~����Y{��n���S�>������\���w W��|�M�<�$��H@��p��U:R�~�iH�c�B��^�I���Y7���h����Q0�;�YoL=@���P�b���4S��U��;�1�i&������L*���<��|��<��a*�����_+~����H#�������<T����Q�uyh��u~�R��Z,���E����k�����wQ��4����M�z%�Y0?��hI�vu������j������z�z�/��>b�'�T����t�MP�$�;��������b�������[�2��g�X�� �����HF7f#��w�
7bdN����6�1q��
q���J��;	���������	�`yx]�T��/@f�pdv�4��n�!��=�T0����C����<��|��*��������L��;�}����t�!���frq�/^t����:���N���<�k�r/����)%RY��*��(`�w:�^P���qT����r��l�TR:1\��TzE�w��6��C/s��Q�6zC0�;�\�@�6L�����kY/4�1������*��w����Y���S79��V,f�+��$����P�;�j��j�x��Bk���|T��(:D�y�)��J	:
����Z� �Vp�;�+/SG��\��%��R�mM���<�&�&%c�d@Y/�y��;��77���|��`��z}@$����CZ��i�����<e�O�l�Q�p�r��;(�zQ��;�'^�S��1��j�����[[��of����e�;�}�@�{U��< �w��Z/J��0�;�������V���I�ALz*�z��������N�!L�m1
�D�<�A�y�Q����T�$������z��;���i�x���=�K}	f��w���a`�wjEW��vpP�jz}��
������j� ����s�Q�Kx>��C�����_%IY����:�_�4��62����h�E����;M��T��FP�;u��*7Y��c}�b}�#�3oJ���9���djU���8�T���k������v��>�����t�l�� �w82�^
��'Z�4O<�{�9KW@%���x�r��f3;qlm���v��q��j�8�=�"���X���Y8��l#�H��Z���������_��#����GWe_����-��LL!+9~`����l���t�nS����Jx��5u(,y�"��:�z�C��j����.��~��d�.!�(���$�$�0~r���&�<OW*�VN;�*$����O�Dv���	6�K��(�A�.�fID��4�iPJ�@rC�6��p��dG��a�c�]�,��<�v��|��[76�n�u���Mhx*&%J�/��x�]���W�dv9�B�W*����m��5v\]@���_.��^v��[7f|����{���,���.`�w�:9C�:d���������$e��n\?�kp������3����i��s���\x�|�0�]@���
.�^s���Myy,���~�,u�������zub���$��w��#�`�<����L-eP��Z�@:��_z��2.���Y8�J�}�a{�X�����]�ew�U���\�y�j��\vW�ewEY�w��	8�]m���(�\���0�[j��U0`N���[/^|����G����hJ.B�A��_h%�G.�l �.��w�
#��h�"�J�Z��pB��4��E_^�t�g ��w����(�����'�����.�K��N�y}�\�n{KS�����]���K9!��a1�������J���t]�����t�#�G��]F�?WL��C������[^*u�����	���{>���)��4�*��(�rZ���]���n�>�T�"&�5;� �9�,�n{K���;~�2�p�|���-�8�}�}	0�s	��,�ngK��V���������-����
�#���'�['z�n�&�c���jJ�/T�m( �w!���*R~���oCa����������b�P���n���%������6��B�#���T�������.��.���/�,�dg�,���t�������� /�������B�\"���c�����'���s�oX��I���I��[������Fr(���n-�E����-,��8I�����1�>W���@o�rq���P+�H��a��q0�� ��_��C$�='���QZ�@��.6���z8�. �wk�w�|�T�$Y�LI�
 �y���DS�S�_�.�[$�� �t>
#��)}�;����#����(��%��\�@�@���KD��q�����;�����t7p0���0��� [����A��:+\����c����r]�w���0���.�]7��6��{1�u����]����(�T���E�u�8(�
�����S4����?����"�����R@�<�Y-�!�I��?�$����-��>�8R�.�,�_�)Kd�����GE�
�J_�)������S37C��}xf����3��eN}�r�e�G	�N�P�����L���1v�Phy
�+c��	�db;Z���F�_���RG��d��0|4\���R����]��E�S����K���1uix������{�x����-C	] ��>/������~BQ���e��f(`����}���xQ��k���#����h�����w[��y������b{���z�/����8o��V�%�[p����������+C�.K��������9��.�����'B����a���m���w5����_������-�����s�r2�UdP��=����@��R#Gs�{\�z�;�����v��TX\����2���jn��=���m���6a��c��W�{MC�����~�{����?���������j���CL��Z}�E��GW�z=m�9����k1�K�Z=8�s��q���,��]�����vp4�����.�I��{�w��mi�����.U�l��PP�{n�[� ����i��^�}%3�^� ���B�W�!����e6�
�E��j��W�mpxNvh�dA<���t1�Q���i������-$��_��{�)���V���2/��=o�
\aV8�:?=��P������{�3��A�[���1�$����|
��]4
��|�S�<�������E�L����
�n)��2$V���������z��{��kJ������{�.���YRc��N���n���WW�/oN^��� ��j��������sE�[��r�����lTw�~�^=�{P�{6��0jGq(�t1SIm�4Y�G��<N��L3<��j����&D�A|=_�B�d�YK�����{���=@d�5!���*��2%a��[�������3��u�W0����7Y=������U��<�z$=��i�}�bw��x���A_~�]���$��K}w�`�����<O�������p���
�D�"��k4"��B���R�+b��F�T����ow
����-c6fv�cf��0�M�����{���8�/�k7�F��s[!�-+�z��{�=�k�����b>-�yQ7��c�)}�Bu8���u���W���on�)�0�X
��s���fvum���Rg&j��XeB�l�?���`#�H(����9�}J��^#����:.;��	�wVKf_���s�o���"t��N�#`/��p�T�2�QuS��{Mb��&I\|���6�D��F�C��ip��.Ea�A���$��� w���f`H������1����
�t���D1@BMb���0	>���� ���|w]�,��C���P�4C��k�\�!r���<)����	�Z)u��h�$�y� ���l���=���5�_�.�\�YK�{�����:�E���_��$�hO&�,�������'�7���������'��^���YfO3���'��~�nx���������s��x�z5
{
�p������4�58�=G[�h��kx�*�;��	��;����*>,1������^�S_@��5�Z��3y�c��*�j�FE�W~��R�W��(���|�m*�%����^���E�s���=H���q�~������R����{u��k�\:}/f1���9N��}���8�t�^2d����C��_���.����~@��m3}���pN�����k�u������z����x��#���!^��G�u��o�(���x9�� 6%i/�$zv����$2wA�x�_
����M|��J�L��6L����Q�u����^����^xox�:�wo����� z1�en�]v~��Q���������Hg�/�?}���o����+%�3�-��M���1��};
���g�>;���2�k#�:t��<�>��(?���1������ �j���!���_��x�v���l�Bv�����z�_
����n��.Phy�`�~@K�o��t�a����(:��y�i������8�&{��,�j��������{|M(��Ryo�_'x{�l�<��������D����/�'��V������61��kO[��P�xz��54�jC�$�<�[����}D������u��q2�/~������}���Wk��
-s���h�+�k��z�i��z�	����b�Q�4�>Gc��h�oaj��b�Y����}�T_�T2��;1��nN�gN���s)��rZ>~yyr��
�������V@��s$z�js��������/Gg�O��o��E���91�0�����;kJ���b�$�,�T�G��;s
L���A(B��_NK������x������}��@Ew
�)�V�.��������O�
x]��|���+*E�����)��� ���AX�}Y1�@�g��b�;��n��
�u�Y
sx!�9/����pB�o^8������/����������9�/���qg�/g����6��t���m��Fb���v`��M��
.�M����CQt�~����78&���5`
�]�9����0�/��--�*�-�j1;��m	<8��W6�*�����^{�_
���G�����l2���6A���M��(���S������Q�LKQ�k��*����::�~���}�B|
6�����H�N��2�~I��u}�!}.z�^
��&�E)����=L��k��Q.u@xp��Z���<p��s������<p��o�_��R��b���j1��#&�V8��������
���7���_z����
xz��xz���'�_O����Z!���U�k�����9/�>p��	�����M���M��b�m�CZ��4��~�p�4H�^�S���<��o�	������:>���r#��Wi�&(@��/O6��>��+�-a���rT��Z�%�?����FQ����[�t	�n��R�5K%pI���"������&}�:@G%�#�o�#���c��X�G�%�MK�MK�l��$}��D�@��$�Pr���Wp%�W��/�J��$���+I����q%�y���Y��j�@n�������A�����-��ozgG�	@��W���)��/z
�M������9�a>1���y�=L��
����i����fp\�o����o�����\X4b�R�)�4��yd���H'����2��j�-�RE��w��K:`��������Hb��M�H�%����Rej`��X��it�i#����ZI��d�����3+[��P��*;p
��T~���c���!&mf�/��i*5��p�r�e��r\B\�����L.��jA����k2|�v�c�Y�/2h�����h�/�����@p���@�D�&<��q;�#�z��d��
�2���C�8n�-c7�
��*3Dk�y!s�����~Z��}�a�l��
�����C��P��<�����3S����)��	Q����	xp��c��s�`�
o�7�%��:M�:�D�	T�SR��(�|:E��dq�@A��a0v�$FY�$�2�D`���	)�P���s�Da;� �����p������?������9a���������)�������M���	�+\H�n�bWT�5�r���s�����5��`����J��Z	0�t��0���0�&*��u�Q�����������
q��?:���K����s���j�7�( Ks1���  �p�EY��
�C���P�+p ��������S�=	������t��G���*�7dO-Su�^cw���N����]%��K{bS9�E���FJ�����
�)K?
��,��tOk=�h �ld}��D��_��A���`����Dpa!������{�Z���_��G�g���}�i~�����>Q�r���?���U���L�(��m�?q�@mp��u||�A���el��}��T�1ps�@-4���9�F��8=�������f��wF#a|�i`3��>T`��^K�1;�p*�����|��h��q������z��,%6e�3(���L�WP�c��(���M����o�V1���V��`n���:*������ML#{��\���%�5����K���Zv�cJ\3_�2'���!m���?�?	���Xe!3v��=h=_(�mxM��r��
�����w$�U��*LG�h��j����I\�����h���R����x��rl���@�t-�
!�k��D����H`E��h�>����Xub���X�L�z�$�������4�����*`��5�L��`A�M^*bMb�3�uq�J�O��<����\�P���j���i&�^*��M�u&��l�X�
@m��
e�Y
����E8����2����\-S���z�-NF��8��&�k�j�8vh��Az5?��6����5F��M��L?���p#A���g���Lo�p9!�z�$�����������\��_�@��.}@�{��5�?Bc�z��A
0F�a�����^�Y�	�z�Y���������\�z��}���]rh��&��Mg�]_��e�6h�oB}/�J�mP�z�$��2��h���;��<�E��xwy�4���-�����"��2y��"_�Y� b(���LCgd�<yt�Yf.E�i<h�4���z�v @�A��4����-�?�
��J{�����=�q��ez���E*GF�=���+�s������.��9���{2h�����=���@���9H�n+���_DVN+���������F�7e����"=�t�X1�����k�������6��D������U�y��B�E/���\��h��;{����+$��8�M���[��8�N�D�����hv�}�o��Xl��v��l��>��K@6Ye�(�����|��[����
Y4P������e�.%<�xm��^���T-@v�4=����1agM��z0c�:����k�M�h��-z��1v4=��7f���T��L�'�I0,nK#���$����"���Y1����k�'g'�7��5�*��MC�W�KD=`�1���p��ug��Ns���@>������ ]�Y���4�)�B�g�f��M��y������pt���O�|L!����������\1���gz�!�G�	�fC��������+94�R�����sM�3Q��+'�#nM����y-���"�|[�rH�T��6N��2�����~x}~�N ���7������NX�����N7�����<��&���s$�t)���h�Q��@D�~O�Z�P=Hb^Z�@v.3=��wTef�U������;��2��M9��	W��#���$x8�,/:
�|D�������\����_r��vr8=��o�D#4�0�F=��`u3�!��vv7=��B;!���� ��`N�kN��Qfn&��pzn9��%���3	��4'��Z���<���ta�����d�9p��2�B�z���P����w��d!�b�s:���'�J/��AZ����YL��~��
Y��h�v���N��<�P�~�����X?�Twz�+�vF;=��;}����4	�@-��n�P�
\�]�����
������u�h	������H��������h����|�7�o
���;r?f��N�K�h
`����H�p��E��e����c�;
y�.>)=���b'���/r�k!;���7C���j���J7����r�&g1G���9�E~A����pn�=�|�t�j������_1��qz^�U�pz�e�y�|oz��q�c����/s�c����������m����mz����N�'$?S�I����>�hST�t?��A��a��T�
���a01�Av7=�}c'b���/c��jz��,!{-�����c�U-����{�]�Wj{�)�w*g����������JZ�S��RV �i�%\�k�F�N����;���4��?UWBv>5=��k���Y0�3�*ncJ�K�-��9M����������9e���2�#F���41J/M�L�����[��������2`�U������	%z���c�".`<�\�f��F�6��������Z�'��6(�[�������C	X�I��y(4��):'���Hv$pma����b9��1��]+��X�\�.vm�ba�}����1�#��3�!��2��i4
gm��X.��ADv,p9*��Gw����pz�{F�v\pml�����D��tu�r����Z��\�vw�3�� ���`�6!]t/s�S�
�+v�����$)��R��(��N��;�6��M���w�d�D&]�����Z������?�KZ	��#a]g���`JqI �Y}�[1�L��X|�H��P���y��y��%N�����v���N�#��
���v�9X�������gO����j�����N)�n��?�+������J�6��QAK1]g��~.dkYE0W�<����JV�z�aM�HNq������'��8}�G������85 tm�(r��t�v������	��k�]�$���+�e�FQ���.��\������~�9�~vq������{b�]�,v�h�F=^���	���u��H|�{4?���I�f9'M����d�e�o��`�8lX*�cR�]@Uv�Ei���$LsyG���jO./����N�U���M��T�����&QA����T���Z����a�8�.�]U(�+���l�05s����zlc�(@2���0��j�#V��x��`I�KZf���8�W�"�@)n�,S7�k�����|�����D�o����w�#r�9�s��[�i���XD��k��9H/�Q�1%�pY��%�l����l17N=���EK(��W���� ���"�w0}���y�O�}Z��!�	��%P�:���E��P�Y|��(�[���'9]�o^��/�]a�R�F�8�F�@�����YvMK]@(7P��
�(��$��b�rh	X�.7Z+��J����rQ�
0�].�4U[�&����X�m��'
����~�|syt���<_����Rv|�����i������J���8��[�\Rm�(�*���fR�1�f��H�.�L8�n.��)���RS�+@Z%�
�dz/�]����k��V������0K*L>��t���%&=�6-X1��_�J�-����g'��G�'���\]S6�W�W'�7W5�1��u9n�^ ��\��j},\�g�|<�\#�c5��5�f_q�!�`����>T�f�|�4aSC�]���<���)��n�z�cLq��u�0e��#&(]�������
���Y�2�X�y�m�x�p>�&�*'�'�7��i��-J�<�����g�U���]2W�������$��Zc���e�-���-u2���VO����IeBe�|\��a*u*��)Eu�z5j	����������G�5��X�$���)y�����y*:��!�W�6���
KE���]e���W���8��W�W����������=��8�U��b��GR���;�
%�T������� ��������~��-jN~�[u��=�����r��*���E������V�S��T�:EY�1�
����k���,�c8�n���{X6e�R�A%��Vw���j`
`�����`
�i5�d�Guy>��@��F)%[�*�X�k;�jz�Xw�p�V��nw�X>�3����^�"f��2[�aDU�#��!�$��:O�����8����%�<��W[�I���L�������l���76�0vE�R�NeN4�e��f�`�������nc��M?�kSt�C���I�&�%C�,=vw-+���%-$ZJ"�<'������>��1KZ�����+}R_{=�s1�q�N���� ��\������j��u7e�K~� �����M�����iG������<���8.mQT���{b�[/q=;��?�b4a�"�c�����z6N�0��v��������X��]p���;6�=���l<[�!��p#A�����-�o��x0l���)�EFN���;�x6
ny��?���W|��#�X�%=��T���KX���m��^���R����n���C$����*���A,g+�!�'�X.9�����{n=*�����	��,]�c��":%��$x���$7.�����?�1=�K�&�4h���
�k'�/GC_T=@��\���Q@�~�s-��8w�2������r�8�#a��saNd����b���wWn�q4���k�c(4 �zM���PC�e��P��an�%�\��*y�m�qx�zS����<%���'��"`�R���V�/L[�����(Kw����W��������Q�������l�5��+��t6]��{�-n]�ow��� �����R��;�U�H��-��aR����-������|��
E�24,��j|[�ky|�����B�I����_H�����#����H��~6j ��]�]�]0���-`�|-�;��a$>�i �z\���@��:J8�R$s:22�+W���������I	���02��P��h2k����}��+�Ai�J�8t��Of����o�����8*�}+gKX�
B�����4�5!
js��V icN}�h�~�9�aZ�O�Ts]d��&M���l����wY:9t�5���j1�����t�ALL���8����n��3�d����q��2f��53e�"�<hD!�b��A}Eo���Y�v�y+�/����*y���bZ[��S���}^
c$�!��Zy�b����N�0�x=|R:go��$b�b�����Xa���6R��&M�M�`�f?�������g<p,,6k�(����v/���xW��������'��=��$j�@.tQ�0�9�ZcN�W}����/���j�\���+�K��/
��x�|�����;)�_,���f���g��m[@��=�#05�Y����?��6���6��O<V
��O$��b[C^�������j0�?>��z}���a�c���C��kS����!�N�}x���_�Z���9{u����/N�x�	�����`����1�������	c�p#��7	���_<��805���^����:?���&��R'2����k����a��p�q��CG��H��kM���WqsSfUqS����Q�����:z�7�`_��#�������G-���s��t��0�ncuK�1�m����k����\����S(o@���=�OJPB_#~n���ns����[c�:8��I��:�>���J^�'�n	�}���g��j��z��v�	 ��������������7ay�M{|-����\�B�#m0R��I6�$R�T�S�C���H8f�e��N�W�S�j���4k���Io�0���a��w$G(���U�'G�ZQ*��kvvw�����%���������^{�������-�y�6j'QN8DY/��<�����5$�Z�2��H�Ho�!����=���h�&T�c^IY���	�&���	�����~�J�v�u�z���z��U����X�D�))�M��#(E����#e_Z�"�D���)[Y�F�b�t�Q3ZB,W��P��2SR�pL�A5d�ZaYvS,;�@�&������������'��������}���;pn���R�.�	���{\��o��M�����e�z',�y������EPEg��=|C���C`1���J����n�,������t�\&-3�&��&,���0�>���K�������^��~{vry��@�J84X/x�qg8n�����M���T�������h�n�����6~(8ob�0�n��M,<�}(�����u
��������p��^p��8���r�	�~��O�,��^
-	�t�L0<Zl���7��^�"@���^���&�k���&,Z[����F���Ku� -����^-&Mfu%k���/+����6�j��Q�f��������'��	��������y6���e�[�*�}YQ��F�������:�6>��RH)�Lo��.�H=��rI���X�n��0��:�����8]Q�C:���x��}�U�b��I}V�����'���	V,x_m8W5 mRA�*����+l�M8`��{�h7�@����p�G���h�HI0��g\����v6	�U��'��}���*����g��$��j
���B�2*Pw>��-t�<��k���lb�g���PvVm�6���)�J�	�-WD6A��1>F6	�r��m4�#�u&�h�t=�lA<���7+�2t�������I49��`�Z_����|~K�X}�:s��t�����-/�W�U��\���Lj�T�B3��'�RBg�7�����0�	��5G	�&����b�����Cr�K���Mll��!,it��&����P�[�:�OtC'�$	 t�p�.Ns�!��7p^0�I��>[�m�O�6�m�Un���p���w�)�U��Y��������VK�r�c#��rQ�=�S�,���5y^�����v���)�
��ERct����P�	K�nL��D�}|Mk<J6�!d�����y$��^�s�g���&H"WTUj:'FL��=dRy�He���$��x�2�V*��,[8�jI����z7m��-z���������I	�:������
��	������+'�@g�#nn������p��z1�;
��U�l���
e9�0�$��7��P8���vLJ��V'�"N8�[��k8����7��8?^�\_\����z����?���^���n
���S/��]�\}i'���&��s��M�p�;��	�o���V^��$XK&;������&C�h��9Lm������^>�t����'WM8\U/t�d����I�jM�P��z�����Xp�H(�6�pZ�����C�+NF��0Z���6�tl�K�(����+}c����	�+����|��V9�����bXd�J�9�Hn;�h��haAi�������k�'V�D�_
�y�Z�v��
O���q��G���$Q�����}����������c3V���>�����=�����__+��z��W����L+}�/$l���O����4
���b�r��
pX�[��QD��+q��f=���������(
8d5�.��W������������a��,j�8�f��]KT�f5������W5J@�5J��`Q��EM��U��'^ �
�p?U|ta�C5>�wA�t�h��|:����U��}�=�*�?��?���Z#_R����~5�(VK�'���;�J��Z"_����
W?;]��OG{���Pk��|���G�������)�G������9�pm}m�y��2�Z����������l��()���7�3��y�nf'R�s4�������l��7~����z��-�l.���2+�h�rif^��c(�����/-+�L���'n��74���)�rY,�f�-��/N��(��Z���d��Y����]�[E��Y�n{�j9N��ly���j���J�gOKW��}!���A��c����k��&E�*�F�Lyg�x�~���������X/h����to�R�0x=���c,^�?��"�������J��a�I��������NhY����+���w�K�����p������U���������dU�O��N�W������*���|[W6-3�\�-�����{�:Y:7����8e��P�I�����d�|'�G�p�G��c��������m���WPEZqs��8����g����O��D���x�k�J�-�t�}5y�����]f�R7�&���
p��e�i\�e�}�3���l��S����}��/��=�pJ[Q�ht�������4#hn�dr 3,�uD�vN�>\�\�V�;��lw>-x�����%)0��R��
�/���U}��A;�L���=��8����?�f����=�Q��������Km�\Y8���57�.j
�����t?�h���QpdV�b�k'��6��p��}��j�qe?��uN����L��Zt6yLOi����(��8@�e���u���M��fb�*��MVr��,�=��KS��7Kz�L6�n����`��8��A�!���5�&�}4��+���"�����u���6}�J���XV��9-olZ��7q�5�{�O��e�5'@�������lS3��c&������D!��a�U�85}��V���������rv��>�u�]VB��l��v��>����l2}��#��3����y�=S;�L���;�L�w�c��=�"Y�Q����vn�yje5��~�����,3}�s���,���6 -�Qw��bu����g�|���rQ
�������@_������@�h��<p������O^I�{��z<5��
�x���|t��y5�s������2}�{_."�1%P�2=?�;s�����>��zw����������������<b����D���������v��>n�%��(x������/~0_pI��!�8r�� �a����2�a��w����+8~���O�+�(y��&z	�"�6�s*(�^�`�#�MO��TL���yt�/r\7�_�o�eU��j�����2�M�*�{��MbX�{G�YP��Qp;�y�Xf��w���=�&�g���~�%�C�x-~C�����Iu'����t�%2���g��(|,&����d"�U,��]5iL\���R<\>�u��������������/�[y�]F�_���z���+2���(ZL\I�t~S|����K�""K��ev�����Z|����Z*VK��o��yV=���6{>?.�_:U*>w��l^}�����t��8���OU�e�Q�{~]L���������o���u��������%���=��C��f���H\a��GQp��/������(p� �����4�7�h�3�t�1r�[eu��J�{�������xu���OoO�._�9o�����������{/�?u�J}�
�{�L��<��N�??y�%�Cj�4=��l�~�����n��9u�g�_W��_��%�y���a�}�����2�l���	i)�����=��ty%��"����5\���3�'��l*~	������=IHEOIlv�M_|YV�����JtOa]m��,������P?�d������S��S{5i�����������K�����Sy�u���h?a�����m�y����*����������U&F����[���p:J���?�V�x�T��|��,�������3�F�XT����/1M���uT��ld/������|��G_}#���u�w+�Y}9��2�o�f�iyv���/��7� b�Kz�?�}����QYT��h+�e�'�B6����:��eVVL��jF�O��[��;�&����M5������������s�]��L�A������SU�+Q�P���{�l4z6:1��n�u?���k:��Z
����Uz�������[��zW����[s�J���=k�#����<��������3S�=j�e���U%�{�bzP��	��n�w~������8#����G(���0��>6��V��{�\t�����z�q�MGr��)���]�B�-�*Bt��&F��
�[:&���rj+B6
��O���1b���~$1���k%B
���#8Y���
�q=�w<�j1)�������x��'��M��|�_v�������L~���p����E�����ON��{~Z#����0����e1>�y[��{�����@���>�D�Nr)s[�;Z���D�m~S��Q��J��4k~e
"�s����_��1��X[7��uF��Z�(�
�>��A0�2��Z��hS����R�������M�R�Ys�R��U��8�5�~e��tv�����{��w��A���U��#��������Q��M������Q��;W��t��_YsU��T�V��u��L|"�v�cb����	���_��m�g��i9��5��Lk�oQ�,�n�2��O��c�nf�y��d����k�v��m/g����
�>�XY��Tm5�z�~��X�����������h1���@��y]�u�k,��x>i��t	����������X�^�\,v�r>zYq18V�F�.����U��b]���1y?%#���3��.lY%��[�m�Y���e�J�}�|��6��U�+���Uf|� \],.,���\���Ri�'y�Z�Mh���Q]4�,������I����C������O��i�n{���b!i�l���aG����v�!UX'm����\8u��S�N]
p��N[�@Q]�x�.A���N���~�N��M������@���>5�~��o�)_�t�W�.A���U��>dJ���!@�*gC��X%_5���y��9kh�^5�M�2��>Ey�h�3�SmY��t�@Z]i��!CQ3� H]� �����b�\��A��e��*�Z�`M�p3��<!@�c�yT�Z-���P]4���U��b��Y	x-�q����bT�������t��O]S�n�{����Y��]��]��T]�R�M=Suuqo��1\�Fuu�k��b�����R1��PU�@U]U��oWuQ\�?.�a9�Q
��:�"�������.v���K�"VeXE����qM`�.������M ��9�,'m��7�W������������/.�i/rp	���Z�����q	x�����%�@`]
���
.������>��.���d��"a���#f���|h�|�����<%S�lV��@���j]��P������j��X��k�V�S�2����G��?����4_���j��� 'Z#h!����h]
F�����*��/����~�u5p�y��;U���J����.������UT�u�5w�M��t]���R��Dg=K�Ya�e���e�"�^�{��*4��(u�_�|i�����a?|�jQ�@���
�y��
o��9����K�y�Y	����bM*��v9J�-w��������	]� v���_m�P�?�+�X�Y�0`��.���%�d�4���Z�E��f����L�\��n6^7B(Mc���f $��bP��
!���p��>�?�W���@�!�6+}����3MP�.F%+P�Kv���p_D2���@�0�s�.����p>2�b7-��(������b�h�u �D\�x��E���'�_k�xUO�>�6�%\0v�(���F��nhxLQ<Ye��"�����Zx�	^��xayt��k����b��:��a�����x����|C��f�a]fZ����&��r]�����r]�]]��a9s���w	��GV�B����x�OU�����4��W��Xo"%p�K��.����L�h�U��)�������(��~������V��[n\vQx�%����h�e�gg���*G��7���yw�K#�A�=����Z�o�8��G�I�����C'����Z F�
fE;._h���!��u���Q|<��29��W�5X�P����:�X�S������|W�K��.)w�Y��_�(��4��2�������l��#!D=���v��J��KB�+�%��{]*���D���������q���u1rW�P	t��b�B��GA���)?�7�(�z��u���.)�b���
1�M����U)�{��dy��i��
~u���m��q�Cy������h��0�i]��u54-��p�p�+���%��������z]��nY��Q��$����G@���E�J����_-��� ����g�G�{�Q�E���g���K�.�>���#�Y�40�]��/7��\������������h�K<������Jjj-���(�g��#@^��Z����
����g�V����a��e���z��0,X�F�U�i�Zw?���O�}������g�t���)������������e����E=��#�a�4��I5�k��A���r6��a 1$y��8�gk���F���<��8>���cC��+�i���������.W��+]�ZV%�����g����#�B���R�9���=�0��j#�{-B�zL�Q!mk���_��>�<-�����-�q��r9J�8���g'y<W��x�����z�i���:�	� ~=,.��
1�5��C,��<����m={�x�����s���Q��.;�����L1���3c}=����V�^��V�����6!;�i;G��M����x����,�{�Am�����#�Y:}����a`��
!kn�e�RL�W�Q����t�aOJH�)��\E���|��c6��P]m����e����!��5������w��c���`M,����$��#������$�;�q��~A]{u�K�<����>�Z�.�����0����'��}
{���k!q�T��up��.���Z��GV�z��"�q�[��$��E�����j�G]O�a�u�a����=�X�-mP>���Z%���������=������r=��I<���Pq
���(�F}�c����r6���nv��`���U2��2�	�3�����Hf1��{��Y����,.����UL��:Z��X-�X~���h^y�g����E�����?��e�5>A�W��=MK��i_�;<>:9��.�}�]��0�hRG}���Y��/���I�@�=,��
�g:*}WA��(	���[a�Zgc�XE�#�q/hm��	�,����!�����;���w����O���;9@��y*��bW��e��4h�O�x��kJQ�^�Mh�.��,����w*7�kNh����>x}z�3�K�����^B(L�FH�^B���:!�y#,���CIs)=�����=V���l���-L�Q�4A��d���1���&�0{�2	�1�=X���u��=S`�#�uO���^}g����4��/��g�TN���7m[�t_�G ���[B��uu_���a+����`'�y�$���	����w��#�u����v�{=S�|��
��)����|y��E���3A�{(InQ �G0����������`�=����F��i���W�@�Gy=�-b>������	 ����r�y���������#?B!��$��e<��l�J����{ZD\���M/�����?[�������e',��}i�-�������r�*�/��f[�!*�P����#pt�4p3j��H`��^�X�Se,��Q��
7�t��y~��F���Y��P����g���3v�#�u�o�t������d��	��A��]�������N�Iv�����[��8.�ULGY�l���o[������	h��j�%5�[�����R��3�9_�VNt�1��V>��p���-]��Qz'�>���X�f����^C|-	�eq��6�q������t4������lWW���/���+��
��JE��iS�e	�~�f�!���J������>�$�B��v���������s���oL�;�h����&���0iq��'Pq�k���tk����G��	���j4J2�}��0���X����^&8������u����aK��/�U"4��+UR��.`m���L�u+_��������A����^uG5 �c�	�4��N��2{>���ZN\Ju�o1]������e�oy��r�y['�5�&�������	.����D�m�	���0v)�R�)J}w�8�|���u�f�t<N���G�r�qJ������g:#�w��!�!�Eh�1�����^I��D��c�;$yz$x��wZ�
�i�1�]����H����i� T{;!��z��D`�
��p6���k��G����G#Y7��,�W���m2��:�[#�����a`�����}4�&=���������!t��/M�U��~!�j�'�p�	w_�	`��E�����}���!T�m��3���>m�h����>�_���u�B����q������"�.��ku7���g�8�6/Ad��-gCh���fz�k<����0Mpw��`����\�+�����D!0h���	�Mq�'�g����	��_/���Qi�qB��F�~�'������
��}, ��
!G����&�GcgoYo�7�,���&Y����k��F�g�8yV�D,;��G!dEp��d�A�	
�oK!/�&����6�&kc�'d�5�\�
������?� -b?�1��LE�pt~N~���s���`����&�d_[�`�]���RW�	�������=�f�O@��"�����0��O@�>
��Zh�.�F���x�	�x ��}P�!�w��O�N��APK��U� �}�*'��6}���=�4����Y��8�_?/OZ�-?]\����:p���b�k,��#@kF��:��C���	,�G#`[������pMg��B����p��Mlg!�uq5Q�� Fv��O��>2K����	6��&1����Q��d_A.��������q��M���1�2�	�
��O���O��~o�S�_u����1�r6����T0�$Hj_CR?X<	>�GvK�����	������$_usX�Bl��s�.�o��.��)��)��py�M�U6>An�h4o<�W�4�(z�:�[�Q���{kXk����W����c���
p��5�
��\����/o����7�����A��z�nh�h<N����j@h��Fk0�'��}g��o�	h����>
����@o	������O"�k��-gC�Z+�d������y��b�d&�p+�]��|��`2^YA[Qv?���i����M�!=���s�����pCD:��YY��b6���y.�F������R;������!�&@?�����b�,��\O�T�-F�g'c�X�>�����%����:�5���E�)���������e/3Kn��8� �Y��|6�r��'lpg���rZ. ����Y�$�Qz'������V^�~�(���!��m�	����N�oey�I���t|o%��2��(WE�
dt��^3�k/�v�9V6������F��X���h�(
�W������I����]t_r���mk��VnYn���4��64sq���,z	����-  �������sYn$��k�����;{;�{5���Dx��\7�S��0��r>����N�O���g�\l [b��yF�:�j9B$tq���6 ���x������	�(���|��=���r9����
%��!<v@`��><6\���
����<F9*%2b��?�V���`��:��\)BHl�������M�\�d�l��a������d��j�}}�F��y�L �A�!a���r�R���Bo�J���p`8�s�_�0��"1���e:.n,T^�F�X����s���r6�`��n+c��X���j��.�G��e]pk�|�|B���K�
&��Ac�B��8�c`�S���cu:��=#6(��van-+�7�>����z�1�:j�C��)���m������8��n(�;�_�b)���,�`ef��:!��t��W�:������9p�<PQ�b���@����i@�����P��?�K�>t��l�p@���lP�(��,S6S������,P��`dp�$6�}��\!4,gC�hC���A�R#xG�v�����Z����l8p
u�v��C�bO�������	�)2�o�Tr���������`���l=p	�N�J	<8 ��-�p�`+�����n�V3�.D����I^v���x*�S�R�h�_�N�����h���Wn9�j����*�EcjD��Oy�T���
���H��r���-���q*���zk)���R���o`��:��A5*�������	�Dp`�	��$M���#�h�I��NJ8�n��A�@�~��`nPvf��U���������-���{��C`��gp�BL-R�:�L�<o��B@���Y���@��p�
)��UA��`��)�C��P!�(��!T�G>�dL ���nYg�|�|�����>0�V�����|�
���i���	�7@�_e�#���w)�(�����]����U{�.RD8�`y%D��F�BR�����M�/Q<����������=3����VO������w�
t����kyZ���:X��h�W;U��;E�-3(���"�z0w�:�W�P�0z�JV����E���W���1�*=��b[t���m�:s�8�CM��x����1��p.�����������t>��p[*�Qn�r�+��}��[�U!D����U�4,X���j�U%����ym#^q|�*�F�0��/8�yw��o���
�2�E�p<8���r�,��w���=������>����^Z���Ch������x�r��������	I������TR)���I]�B��H���a����_O.�G��~��w|~�x������B���
��r6�������e���c�`
���
8��:%2u,l�#�G�>	�5����b�=��`�j
z�wu6O�q�������+�:h]���Uy
F���������8lq9��b����(����/3s�OQ2������4��B8zfk��Qp��P��h:[�����)  �@��R��}��.�j�.X{�P�S��N����h����7�-
�������F]|`%/B���� C�-��Iz��`l�m��,��7�^/1�
c���{��*�n�E���*o�� Z�&$���C�5PH�����<H����@� ���H��"��!p��9��_����Y�� 8��a��_c�F��j>����!q���9���w-���t����"�JH���i<�U-`�[V`?�^\��������_9���R��z��aFV�Q�4�y(��E���!�|��)gC�!W8�����pd��e^F��l���O�Z���C��Q�S�?�(��U�J�R��P�T�Fh��b�]��������YDOG�`s��l]-��������y*W��(R	U���a�����t<*��e���M#�����y|~"Q�1�Q��$<�K;i9�lq�U�u��!g�WK��!
b�v����j�$�I<J��i|�z�]cm�i���h�j�H��	(�	�*�b�8��Y."Wo�ez��F���U|
���=���&Jd�9$@�9�f&T�/��^����tT���ms�������RiB?�@����aaN�s�F�$!)htX�6$��c"��)��!1�5�\W�����|!R	�e���$d����eh�uvSz,���\.����
R�u0em�)�c��r���4)C��@���Uoo��dh�M:�n����^%����5����7!AF�fddH��������.E���%7����JQ�!��V�������������RK���C��=�%�������`'C;)�AM�X��2�Y8R���R�S~�"!�v�
��+�i��mQB�L�Z��5���e����B&qC���V�����B��5�B�($��*gCHF�p��stw�3~�����	t�m���Q�0�3
�
c�W;�����)C#�R����|e���e�e�-C4�k%����D#GU�T�z���O�r%U�]
�H���o����0��!���W+���@9C�H�*Vl* w��b�g����#TL�U�����/�LL��5��/���.�\P���dr.&$���C!)��P)]�WH� �!�x��)gC�C`i�}��B�
5b5u3�dG�BH���m�)3���]��?($�7+gC�6Dl�L��7�Xjb�M�H]�M Y�~F���RLkS�%��0\3|b��K��!F���BCR�ER6��44��*�S�k���d�����4[^Y��O��x>��:�7��,��D��q$�-0�Gp���>��G{^��b1��E�R�
��[K��zff��h,�[&��&����ZUd��S�u"�V�����c����YH��a���@�T��/ll4"�q�R6����L&~	p
)�UN�{JH��a�P��d���MY�
)L��� P�+gC���u%�
�` ���%�dk�[S�Z\\�&�Y	��pY�2!��P����2m��k�N�V�nC �!������^�>��=�,��N����&`��m��o�������9$��P���q]�R�1Z5�a.p��
�0�ERF!4hV5�����&>b����
�1~��$���s{�5��0��=��������������j���G��=,����^�zh�V��f���i��G���Dp���u���{�l�7�s���{F8p�
���BK�+M����^azX0W)������n{v���]���\k��r{�+Sd=���i�\�"��m�o�qB��=��EO���g��n��
F�
Q�����8E5������<P��������!�m�a
l�F���2[{���O���WlD	�L�g���)��/��rNiy��9k.SLf}�����V6����|�g2c2��oO��!=l�'���c��]���=
�aP��
!]��L-�������5s@����<N���g�Tv��������9A����#�g����L�Yv���$��'y%�+�O�^�����0~��gv{�-�P���4��27�o�~��I�r���wM��{5�y�����d����
!"E��=*k-]�����	�*������E�]���M�{�^V(�6�a�qc��8�a�s��Vg�%�=I���4��2A�\��C�?�P��j�<����:�L�p2Rr'T��!U���*���?1�1>X���F��U#H�4�W���z
i��x{XlT9B�Vr)���	��$������{���������=�;vq|zi����iV|Q�[j��X`4��
1�
C�V�(3c�������3�xm�j���@|��	�����������-zXb���r6�p�A�ZX�qyBW0�,���3�HZy�6/�-���9�f7�"g�$��Be��*u����fN��	����>�#���B�+=��������d�������rY��h���i��u����N��=
�kT�2nmc���a�T_��)�zi�
��-_���(�$^!���6WZ�rD��6�����{!$�7�H�!��ZC��2iD����8q���$��	�����E��.r����#��������i�]���v���������������{��.F��=��m|1��`���
!��nozI�v�]��[��;�(D���crwY�"�
U.!�+eC@�=
���v�X��J�����1���{�������A���/v�y�`{{�����G0�=S�s�?��KU��?�z{m#�.]u���oO����G��=,���
�0�V3D�W�����ed��i����8?�8�$p7���*��h<�!�{gv3`�<H��My$��d��t��ag�H�b�e���t����������r6�������w�)�l�`
_>��m	$n����-7F^Y36@G�YS�:�1�%h�F�Bj�	��gL������L����q�r6��hp^��<���a���
!@�aik�'r*����\n?+�>���0�V���Z�l�@mj�L�����m����W�?��`���"��������"���������,
r9Kbx�q�F��-�L�E�X�m�k�������K�����t�Qj��c]����A��0VS}�~�c�f�l�C������T��R�x�X���:En���8��j��O�<Un���ieN�q�'x�>���4Y�@G�]��uj����\�6���3�n�M ����{q���]4�O�b��M��v!W)�6&���3���!�WaSZ���\
�h�w�F{�@H�B
����89tY��J�V*�=L�J�k�7
�*��\��:�����(�����(T� V�(����q4��S4�8V���	Sei/�����<���V���t��U� t%XYz�f��=S��z�����R)�P�dU�H�#�BC0�U���hXY�S���������w�u����Q��br�KbE�R��	X�oZ�����he����Ji�-�>A��5��C�N}\���e�Tw:%���'gLa���-y����G����Ua��Z2.jqXv�����o����v����'8��)�ZI�������O��}W��!���3��g1As�qe�0�^}3��o����Ik'�/�m�:bK�m�.�L�	&���7�U���G��Z�^��]%��0���+9��tD�AM���n����c�/�|Ofp56���'���`MN��c�w�����������������p��.����( ���3o������mr4M��i>��4����a������E���lD0	�o�W�jC:����f3T�9|��/��_�������4����F���aoz��o��f�4�2ZGBl���r6�v��^����L�Q
���&���D�ln[��EB��Ev<j����%Xl�ME��m�	�'`��Q�a^��}����7Ih0����c�+R^��Va��>�������&s������9���,-t6j����E�"���_��f���nD�����Bl��IT.�����)��&[���]hX���������������������sx��P��7c������f����n)�����=�����B>�s@L��+>��!>CgdY�{�<M�����]e��Y>���.��g����E����Jo $��.�tq��.��5����}3������q� ��A��HUt�MEr�:<!��p�%W�	��kz�a�����`�y	�����r6�rP���&(�>�w�z�@X^9"Z��`�m�L#~����g�%af�b2����I�'�������>A���b���4�kob5B��}*�����7drm����*\��Z��X1`�Hh�Q�^Hn�*D��P���}���B��IE�q�7$nm���OWA��G��r�-��>���1.W��P�iw�M6�`�p�9�%M�hd%b��6���TU `�~`�|������z�H�#n>�����:����x���#$�4�oC��+m���w�2��-~��]f����"�BW������K��)\o��ye�-Fh�)��
����!���.���	#���������8W��Z�����i&��E�U���,x��
!��XwR�G�Z��\\��ec��:�����������}
��$��(�P�������Q�Vybhc�,{���$�����b~j$�{LA"$�K��La.����d�"��>��`.T�A��a�������{v�?a*�;w7�������[�l�e+�������

���|z��\�=CkaU�������,�{�^w���H�>���{fFA��{�n���������h�]��m���vP\��t��SAo����tCa��������p�rprz98:����������ppq�wy���<n�g���m���.\��ig�m�@���\�]_l��������]9B�����������ol\0p���l������t�Kx��aE�}��T���1F�L���8�vqt���P�O{)CE�R����R�6���P�*�?q������e�(*�u<B�4L1Q)��,\�Y����@�����GX���d���������d���������_*EhU>_o>��XQ#�
����_1���J|n���2��{y���`,���f�[=�o��X��6��|�n���N�L������|����>��Yywz�>��#�������������������y�X����AB)l����}�Jh#�c��P��w�a����Xx�j���'���\Np�\�5�GE`f��r&��&�W�^s������/��W�
�.�p��|e��<=&
�?J�X��

��^�zE�$�xB��/%Hzj>7��zl>7�Z'�'o�.���Uz��y���K�C������y�w0`����\��,��*���
B�4�s���"xZ_B�4�G�thDaK�r��z�>7�z&>o��-�����|	)� b%BJL��U�������'t���GB9�H^�{�2`(��K�~��h��-�m�	�����,������6Mr+����l�2+�g��(��0E����Xp�9l�1Kg�'�	�c�����(�(��j������~g���	�9��9�1�?������q�-I0����Z���i�����O�7�d�<8��*��Q
#�3����������Ma����3�|��������h����M�y�������G�V1�����pB<D��dDH�.�.�t?�-i4�A���:$V
;�5gNb�8��G�q��t��W������l+|n&�z�>o�$M��uS�����
����0�u��.W��)������p�"���z����R�/��4L�S�����h��{�_�qg��{J������9����=��a4��p�,B���J>��7
([oA���m��B����eJ��~{Kz>��%��eC�C`��#/b������,�b�z����z�>7�&z*>_��T:8����+S�#�Z�k\V�������O������U���i=~
��;��nz>_a�C������J TFC��^���C,{B��v���? d'0��z��6�#��y�g�3���:U;�^�$�t��@H�.����zj>7�=�
���*��.�����&�Z���l����?��#�5�j_���1����O�_��.��|���f��G��U�2��5d��n�F��"c�������|�y��������WOvvv�������Mo����m������h��z��j\VV���I:����������TW���E��<�~��6�g/ww��l'K�a����xg����e���c����v���#�!+y0��0�����O��r,{���Uk�R�����ttM�O�?���6�PZv��w��~�������^BA�*?��\/����etY
��C�[��l�������������������~�Xd9�
�4K�\���Q�����N�/?;�~�|wl��}|�o=y�������=�<x;]���G�,�n�wwO��<�{����'�T�$��.���z�b�����'���4��q�w��	�ipu����2&Y�/��`v
�$gz����u&^��/�����O����}���uy:c����B�����e�WO��~g�q�3=e�.�;��'�&�&�^��&o���Q�cOF��O����*G�%�b�t����2�rff�T�����Og����ojY7��x$h�x~����(`X������Th����~����	��7�LG�g�vO~���5�N�({	�?���i�,�P��0�x?|���
�����z��?0���,�A�y,���^������]�b
����
�51;�:����'?������4��du9��G�d5�4���^�Y�t����*[��"��P�13�3�
���Z=����������t������XO����������}a�/��e�/����{`������v��������������}�V?�o���;��'��������.��&
�K�[P
�YFS�L����_����&���H���r�U|[[�~�e��|�z�!��S%��[�����j
�Y?�
��Q��^E���$���:�&q�
/��:�m��z��'Z��-��"wT0K�+����6L���r���Mu�R[0�D�D�Y��"c��bXL���������Y���R��I��#���r����f���������&�M2��/�d?'lj�z
���T��{p�v�������N�����MV��.��U�{���i�v�����<fu��m<�8���dz���v�d�|��������e��:��W����������e_�5��_��!��ZVy�dk6��|��_��}������|=}�j�Z9����_���c^mXM}�N��}����������R�!�I<����[z8w?��3��j������o�i>��v��z����J3���$�����d�_'��_���NX�<���m�����k��]=�m�����z������/�]��]��M,n��+`�+_w���6���5��#V]����`�
9�!��G������2���������o���F���o��7��������f������_��7����������n��m���	4l��Z�a��C�����_g��5�������6�u�I��[������^�W�����|cl�:|�j~�\�����-�O������-������l�o������-�2���nh����w��f��k������f��a�-z�f7}����M����l�uv��?��<�(�������[}��I�����������S�-�_�CtQ��Glq�����k�����f<+
�-9���iQ�td�������"��YUn�^�S�VD��^��ZL�Jy��l�����A����0h���-��Y���lib�����T������z�1��8P���Sx_�W.�,���2�6��% �����������>5�5��vV�U��~|:}���X#���vK�������	W��'g�:�?a��L����+�e����se���:�����j��e�wE������l�������|)��f�ms��������U6�����O�mf�Z��df6��4p����n�,��-��'��[7�Z��yY���U��,���u���N�m��|����m���������o�D�o���I��������=��������Hf^�5�d�<��m����8�6����o������+����_��go����6����o�����6�?�����h���o��3)x�����6����oC~�+���n���������W��
�r���2���>���������ul'p�d�y0��o��3~���0�dUU7(C���=��C����z���H�����}w��x��f�w�������HDv���Vv]���`+�D?m��o���_�1��G���<��N���M�_�U��}�����FWm��_���N��1�q_M��[s�����rfT���P��
8��y��������q��uS`hE����=e�C�����m��	�.����z�X	��U}�>Ns}��z6.���e#�\="W��r>���V�������3�u:��e��Q�:��o�DS6YO��K��5	������U,4B+g���,6?��K��B�����������g�Og�sP
�3|n9]�y�vuA�v�^��9�$���;O�z ���l�5z��62����\�9���<:=��%F���F��0�G�M\�_H�lb������2"F���z)b������d:/�r����H�J����^*��62P�\�qj#�tU�q��b�d��U�F�$1~md�J�
1dJ�
!����h�77�CP�H9l
p�O�1R 1td��.�d�t��2j���>D������FHY��t�)�u=��#�A�\��h� e��1��+h�;��!�#�AF�\N�
��$�"�#�!G:�t�l�lw}����\1�d�KeU�#n,!��%��5����)��	�L'��M�f�R!��L�-1��q=B�!�5.A��|�#������(�j���d�tD��	�L&DQ���#�p��B�3���H�b����L.DY�x��2\B3\3�E1g�8��`�f�!
��7p�BsI���V�%e��4Ft�#d�CdB���2�w���<��������<d�J��E���GH��HX������b���X��.��"��Y�R9�x��C����HgI���$���g+_����[$���-��2�l��f���s(�Nf[n�t����'W�4��R��r�(���8K�~
��'�8_���h0O�2�b{�����Hy�����!+�B�Fu�I������gl���"�{����xq�L���%_�OG]hK��'����7������+#�_�<������k�a�c�kKt���>@{t�O�N���?���0��m��>=��g��b3��������hd}NF����Z���kFts�#����[|.�G�q�����0�{m�����8���k���g���MGs���4��=0�1����=��x�F��i����0%S�4�b���> �}�v������#��/7�b��^?�b�f#= Fz����|1)~�	v(����&����U��F�t17d8���P}���fr������!6�Yb�5�����!L����u�1��'�(�q<��owvv��#<t��#H��Up�/+����E����tx��=�^X����|p~��7����C�VlO���[F�8������o��<M��bx[J$F�����^G�X�2���F����0h�&f�+�����[�7�~,Dd>(j�c�9�CBSBc��|�I<a�C���������J���c�������(�Q�\�����/N��9�6A u�����Y�P����asVB�zFb�#���3)/B�z��i����x�W{��dlHs��U��V<-6��n�@������tM�l��`�F�"���=Q#1�$�G�GO�;P��G�H�����7�-X[�7Kv���^��|H�Mvf7p�'��=���yGtj�*�7�y�ly���i,M��9�Y��O���-U�P�������h��\�j�!3}�uL���>
X�JI���+6���g�����.�o`�����N��������7����dI��}Bh������x��|�_�&Vk��A{6�	��#"'�BS7���X�	l����(����{�(n-�-�q�y	���Z�A��9��)���~+ylv�		��I}B���`��>!9}��?=����x�q���$���\������ ���j���%�}�� ��-[��X������h2������6�XgWvGa�y�~�h�y�{X�.���*	�be�"����d�����&�N<��7�zp88:�x�������������������������>7��X����S��,����-Yl��2���V�Z�3�-,�m��hw�2�>7�K��vW/7�s#4��W������E�}������W��a��oc&��$]d�1��cN:+`��>��k�W%��)0G��6G���y9KBo�Vz���]�57q>�F��8�,��Sr�g�np�z�����k�N�����N��� ������x����,�@�"Av��#HI%%Y�x_����dl���O�o���L?�)~����I)����9aU'�sn�\�9�[�z�q����0}��'�i��c���!�����t~_�������*���Js��1�j1�x4_*e����r����Az=���r	�ic�'$���R!.Nkc�<6N��&�X�@�a��� ���Am����1f
���C��������H�Q2Q�&�1Y��8a��H[�L��xe�c
Gh��B2�o�	��HP9B/M�������	����O9B���� ��#H�N���i������O9B\SAX>�n�����]��?>�t�Ev�.�h���,[���hO9)�x���{����������������@m�URm�&/��8'�1�A�$#j���$�x��,��^�n08�{w�z��	u��Q9B]\Su�;��vdB]\���Z��J#������heJ`mI����4�$L�&hQ�������d�^r|q�Zy0&��`R��Im��M��$�I���L���;I�Rh�S��T��*��*B����������������P�[��!T�{���h�!T��U���#���*�u�Gz�1�UI���4����)�`gm����*P��{��{�CT���H�&�[Co�lB�Z�o�q�<�YCg�[�e��wf-����G�+:
�M	
�M�N��Uz/����5�����;���o�}��#�^[��>x�|�m��*C������s1�.�����_6�<B�0\W?��8���F����sF�F^+��r��`P���������DA�J���,���*����wp�Ol�Xo��
����<as��c�"��������0�G�Elc���MVmnM 0�\��Gb�?��
�<��A��o�����#s�1�Y���V\�����8LZ$]P��{E9	��Z+'P{3h�a�S�l
�!'�	0�#����W��2D�I����
�a�r6�6�0f�^/����@- �	3��|,]�n$������BHO��Q��7�X'���	�N�_���P�����������'����wg�G���f�����T�[a���+�{#k���i�[�@�m3�&l;l������^���B�o>���/O�>�������Z�6ZXk{0wI>��L;�A_!�!����!�#���
:9-�~�W��$hk����5%�I�Fd������+m-B����������!Ma�Bys��m
@�`������5�R��Hi%�L��R�oL��}�����E�%���u��r^���Z���x��7���i6�� !HO%@k[Z�=����\B<����KH��{k-!%�M�V��wTB{t��Q	������U��+TB	���`���I �����L����xO@�����F U�m����V�R�����H��	%��Yvla�D'�J��8�s����;����A�<�����������	����������T��#m[/�mq�`�{�GoO`A�e�1�O|m����%�	�����7���N������\yx������n��8�hh�d�`Bs����RM�Mj�9;��s����m�)��O1Lk�+���-xA@�`s�C�F@�����1e�{��oE'a+��V"�,&
����,�qz�r0��L����+��5T����/D�+��*��K������8�&:+��K�WD��v���l}-Y%*i}o�xE���`��
�O;������"���lOG��'�D�;��Y/�@=���	8�������`�����������-�r~����l��"�
c��le����kG�������o���6�ff�������
}�p��	���Lx	a��UV���*Dh�1��T�,�<�����jC(����S�
���~(�
!o�-gC�����O�8x���kg{2[���ml
��i���H�_�E�"�y"����Iv�L�cV����\BO���EE�c~O����l�B������K�c���W��n�Sk6O')��RB�t,���t��qZ;x�&m�#�i;��v
�Z�����I��x��
�
�v�vt���]9��8����L��	�l6�����k�z�B��,���
�N��~qx�:_�������o)w�(3:��c�������|�b�u�`�,���
�u;��I�z��������.I���leD����-����R�s+��Y�gYy-
�a��C����Y)6���1O�hMG�,�u+Ay;��wC*�Y�HBa���&Idr�����*�v<8?}ytr���E7����lv���:�����bQ���L;Y|�w����v��c��)mI�qx�j
�iK��?S�Y���r*��/�J<��E/n]	Q:��)�Ew�Pt�@�,�q�C���,x�2mY{�li�M�,�&���=y��Fki+[��,W�?b^T���Y�$����X����8.x�V��������/0�r�L�b'�N<�*F��g#�E�	��u4����Ur���������e5�
2��\T�*����(��e��`�]�eQ�<����!�%�������(+&���{�T[�����l
@���&9���M��1�n5��O_����K��>:������0������5�!�v��j�l��:�8e}|^��5[n*�2���4+��U�K�%vZ����D���8��&�9�E+��I<�m�Y��-d�r<�Q-�j@h���xd}�;���#�[W1�!1��tG��,���
��FQ�-yx7�^+$[�\��q�r6�
R|�2���S6��8��$*�����Y����$�s������2��XU���.f�u3){�,wH�\�FX�	����@�F��o%�Rs�"�o'01�Z
 -c����N��S���b���������o�n�F������)2���.v��� ��bx�`�����j�`�yc�M]J��(�g*���C60�^&~]�EX�rQ�*�<���j4��<�4��"p�2��������;!��K >�p_M<b��N%���x��t����������!$��w��	�v[&���E��!d�����t���B����8`�m��t�U�{Y*���lO�d���|7�C��NH/z�B��8z�N&I����J��		f����%xt��+�����d1�N��0��%�LT������!�uG���"���`�SF�>`A�6x��h��_:�W�o[�K;l��d"�!8uG��C�x�K�����/�d���S<�.<O�T�67!=F!���>ZV��>���>\���q����v9Bq(D�8L1���c�@�~S���}454!A�;=c����w��8�����<����^�gZm�!&A�;������B`��6x�
kPkK4�ra&�,�}���c{n�c����Qm�oz���r6�H��o~�|�!D�F#������</:
	�����k!^2R0�A����AR�=-�����A��1�!���C��c���
r�i��d&�?���n_��0D�'0��9?.W�sO����r������N����<v
n���S��&�������/�;�;&=>,����� d���c���
���og������"�)A�!qf1��wL��e�~P��M���I`p��.L}����m�jh{#9\���W���h��.�����!]��|�ZG/|�I�q��->���u�U���1�����xg
���W��D���uRF�J���	�����>������9�a3����lu�,�K�����u��"����u�E����.��)�
����}U.���f��K0�.r�L�%� �K���b8�=��t1��,�qzu��Qc4`k��V��Dh���/���.��&�4���h���4�WN�9.�w�2q���eu���L�S����u����3��h5�D���t���L����Z��yrs����s������J��9aJ������U-<4"<��F�=���]]�qH�(!�Bf�x�'<�U'N����6���bk_���z�����	���7����:�$�{�B���(��BW@�0SsG�g�j���~F���i��z?�^����!����L��jAxK���
�������O�����a�'���qZ\!��1��HP��cD������2)���],$y-mYG���R~z����:`��-7�^O�q�L�!9�.�i�?�Df��y~o]���E�����-fp|~zD���R�.�Za����,o"���v�s	M�Z�xs,[���!���b�{��,�|��D�Ku��2���Zs����M���XU������Ru����K��X�$���&�>K���4���6�;9�%6��r������*5����p��8�-��8�9cn����!����������y ��$_6��\ ��FQ�e�  |�����h\�kF�R+�*�+�0�s��.�S�D#B��?�Yg�#^�?K��@��(���B�`��%���KP�.�]��U���S�Q4���3���g�U��w�"�����N��1��B ��i�vu
��0�1�}]L�K1������]���!T����e]������=�Lp����c���@�]���u�
��|~�)���a�r6��x�G��"u�����,F;���������`����NH�i8vk/���`�r�g���^x���\�Aw����Ugo|�m-��%�s�7<Lh��+��k�b���
�����ga���pp�����I���r&�l�4��jH�'8<9��t	�������������"k��!ynu���c���'�2�j�����U,�R�(J��o�\2����|!+v*r�����WL�l)������*�6�8Ecv+�C��.N��:L���]S�6���a����������@|B�b�<�O�^}4�W~��+QV0� dC���G���YLm��}]�UG1����#�{�fq������#�H�K�.��#�`~]��my���F��l�a�A��z��b��<�N�

'a�d��"��v�h������#�^74�Q!([�l��Gp�.�����dZ��0���
Mo�v	���0���G��n�p����y����$`[�,�A�uM�>��[g�����@�u1�V�J����o��u1V��������G��.�Y��X�!c�i���o��5���
B�hu1�U�������u��P7	����Vy|�k��%�T�o6�	���b1+���>]����3���v�S�,�$���&���cy���R��&�P���0�=��L'[�d���gm2�����p�z��u���Gp��	��z4HWi:.X��x��]��OO���|=��40&R��i��<�N ,��y:f"Z��zp�zpv�A�z��i�L����U��^�_�A���J=����
�
�b���r�z��0�R�F�0�i��U���Y�p��J����_���.�{'�CT��9�����9�
�����iT�y���j�B�;=�����'�V�������O����	H�`B=�	�C���&��!�	�7���.=���`
�w
�`0=���� /=[���K�EZ���,����z#�Gp��iP�U�T
��IQXyh�%t�4�~������?�q7���Ws���a���W�'�/��u����WO�n��6����k
�����*|S|7�F#�]�1���|T�u��sV�drce���';;���������7O��\�6��}��WO�B���*���p	�R�����4��3�W�l�E�-��WOn�|�rww��vg�,��xg��<6.{
�c����v���#�!+y��������O��r,{���Uk�R�����ttM�O�?���6��J�~�z�.`�t�[	��[���_Xd;��w��A]�������V��/'��d����~h;���N��:��_��YZkdY��YD�{���h�����}�1������u�����������ww���@|���u�%�;�0wvwO��@'��l!��I<?�#.�?Nbf��(��{y?�z�y�i���~?@��)LX\@~��}�T�����b���Lr�8����k�u���b�^X��.��r��1A��C9l���&���'Y~?���8����?���I��5iE����g���,>��+��l�������Dq�Zf_�����C�?_�@���U4�M-��X#�t�	��<fFBq�����w/W@C�/��+��aO5��d:�?s�{�S���1�;��KX�A���Y�i�[
��w�WO���x�.��z��?0���,HRb���b���s�]�����i��_���P�
3/�����{���I����Lsh:������"g�/�����.�,G�YS�R���o�Vn(�N��T�l��i>�������#����Y����������-	�M�t���~�����	^�����rX����u^U���9auw������
2�Vg����>�����E��D!|iyB�%��o��8:Y�K����$��iW�Q.��ok�/��\�_�?�&��]5NVv���	V�
&�j�*����M<a�
/���5�m�z��'Z�M-�5����:0��O�O�0wV��i~*6���gf�h��������b^o~���y���-k�<x���CG��7��%�#��������������(P<���^�����Z�[]��Tv����,�O�,��v�m����|�P�?lY�����M��$�K��sN4���_gr4Y�����:���H�W1������Uo�U�k�>��+�hZm���m���������l6l6��C��U#UW��|��:�p���i�����7���������7��
l��T�o�����<��_�-47��{
��G�'x��G[���w����o�o��\a�]�otW��t�C{\��.�7���w�7������L���J�{flW��YZ�����kK��e�l��_!�[��+}w���8���9����������]W`��hE?m<����������?r�����q7�b��k�
��t7.�������������Ws����;������D#f/�`V@�g@��A���1��US�W�xR	�FaYH0�������
�Q-,�
c�f��Yb������lS�k��"/�d�lw�C�!�[�:���i�w�;R8�2�=R<*}X���ZlY���^�}�>�����}��>��qz���&�t�ZXo��bV��>��ea1��G����,<�����������='R�U������,�Z�TK�G��,4�e="���^?�e��^^H�c�{������i��e�L3�d�@ft��_�jC�p��A^��1(I#b�X���; F*z`�j��u�%�c�ng�>e�����h
oEsnK�!F<��W'�Z��(+��Xqw����F&�#����q&��g�h��2�B�1��KE����D�W��'�-�	�sO+mU�fy�M"n�*�_�%��T'BL�	y� �ImUi �{��^Q(w/B*����5��e1,�`@D\��h���B�^��`���_J�.�7�M��2l~��u�]f-2��
:�x���u'z~$�0��X�tw��!����A�p���,	���d�����vWZ�U�e;T�����4w��[�
�X�x�9��|!Mb�U���5]`��;�D��Gc����
o�$��=B~��F��������*\V�j�2���o�%��"t����5w�-��T	B���J(:�����&���~�jw�
�i�*�b������������F�/����pS����k������^��5\�x��b�,G&�=��k%$�*���
�l1
,g�!@�3x���8Ofc��x�I�G�Qh�w�/r�:����-�,g�m� YN����p��v��=���
��PX�La���r�3G'G�r$YB��K�B'?���2@u�|��Z@m ����'� �citWZ��������v���P.�rB�i�����2-Yco5������|Q9�c����i.��A�DQ����f�R(��u�Y<O����g�)�Rg���:�X�f��7U��
���h�
3�hs�����/{&���l�Ig/�����	Is��E��>��3{�P�jm���jYv����C��`*]J��LTH��,��Y��zqu��]]��
�I�:`������W������K���E��h�� ��	W-WW����^�PB��=�|�i�f�?�M�!��^-xR�0�+�������`��#y���S��������������R�o�\?L��*����e�AAE���.��/q���QF`u\��fe��+}kYC~�Aq��T#BI=JI5�=�ZE�V�1��a\�����*�-��dI��X�-�(q9��Q��'�Y����h��_*��M���]����C('zscY~{���/��������x��l������%:o��KU"��r�N�\�!���=��-T)B��E�/�J+��.\L�z�����7���/��r���J�"���Ja9�W���%���WnNB��+ ��$4�'=wL��qza���U6�'|�:��chU�T���!��|"�U��z#q-��]Y8�E�WP^�N�d���.�I,��'���`����
^;�~I��JZ��c<�a�d�|�>�'��~�\���d�>��WO\:ia�N�{�^
����D������T:�.��������	�C������a~9iYJ�[ii���T��	�/���r���WQ�_Cox��r%4��r��UH������CRi�d�q$���{��%$��z���J�P��0NG���X�H�#�fRz�JSC�um�R|�C*�����
��9�������\X��M���~�dS���K\Kia�R��,m�y,�A����E�R��8�.�$;�3����n�WGZ���K���������v�$1^�+&-����������CY�0�,C\'i��I��Q��'-��I�����bv���f+�<�fJ�g69�/��X��.�\S���'����V���7��������G����\G���&-��������{1�5wRJ��Sq����3���2��J��R�������#: �B�l�Z��L���b[���E��P����K�����s)b�bWBJU����*�)���X���l.�s4��r���]*)�S����R.1��[(�r�@
R1���*�����-��]��BJ"F:v�T�$���h�L�R����T)�85������Xc�6q���E�\�
�?�2�2`c7TJe4���h1��s�`�4�]D���V����5S�j�wP�0N>�O�W�k�,�B�N� ���4BF�M�7ic�MJ�y��$MfC� ������RA�<g�Wf8$SdV��+*��������=���b�����.�l,�����,���]3�e39��z$����+gC���$%Cq
-���I;�����J\�T�
���!$cs/��p�^ W�kV�����JE���Q�r6���h[��h��dm
${��y���T��Mx�����#��rN��P���}J���#1�0��<Q����"��T\��y.����r���_|Y >h��h�<Q�}�6�����9b���e����Gm�'E&�$��
����N�%��zw���D�/�E4f�]�i~1UR� Hm� �;�w�hoyU�]A�|�!�I�d�e�C��>m���W�� %�V�(�\���q�9�������,)}��+���a �^Sy�}��� ���j;�W:��!���8�������Z��R�,�!�E�p��Z�B���xo:��������w�'�Q�4�\�����������@���q���@��vc�M@�6i���}V�>���-�3]�nk��E/���m���&�J�*+������#����V%��h���
��[d�4+��.�^�y�5/~U~<B�P>�!������{3��HB�P�!A$2�8@��w�E���x��7������7���i{��H���4m���!�%+-�go���G�'�������
G0���Z��.�Ya����b��^62���u4���S����&d,�lD�H����p3����ho�E	��Z���6O���
�r�(���hc���
!h���Gh���!b� 9ah�.8������d�Up���Y<�cy�F���HL���X����NQ	�Y����t��m2�3"9�Z�����}�I^A�+4ru���(����0�\bk��Q�	�-fb��G	`���~�w�4%1�q�P���H��5���MK\��~5��=#_�?��+C�z
#X�*/�A(N��H��6�V�rR�h�����<4`��rD���H�63Y���Q��� ~Y�$�������`m�d���H���E��6�
���%-q��������l����5=��(�W$�x���XUZ��(�W$X�a����T)���'gC����� �O��>�����������������j@���r6�8[~m���I�;���E�b��(�6�"��z���y<.1�e�T��(Vh�
���h2�������
m�7��!T"4tl���E���.��>?<\\�]�������	"��h��������jG��G��.���:>5Mi4o���Tk[=���iLk�1�Q��%�T,[m��Y������t��T�|1��!,��"8�v
�
K}�p�{ki����t�qo��E2�;��g��w?��`�I��v��Q�Z�T�\��tD8$��!�%�%,
��_/����Z:'^ �K�(k~��[#��g�t�X|W�BF�%�n�j>�������r6���������q'��o��H�%xM�5���xp+������O�������������.�'��w������e�2A~���&���o[@�=<���=��cS8[�����{'o�aF��!j���o�Y�t���+�F'A��+r���?��Y��Q9BM����b�/����R�!����>m���!�#:e��Cp��i�~��[j�fy�:��������o#:YmV���>Yp����z���M%�Ac:��j��y��N
S	�<�~�:��>�~d:i��}k}�������`��b�:Z�`h%�7t~����r<��7,l�����U���������!(J��jA0�|����9gbc`��
1P�0��'tz�A#�Z���!8I�nZ$C?�C��R����65�y_.S�Cp��U����Z$e$�J
$Z�-��&�)��Ck�|�F�2�D��,[LD�+�G��I"EAv�A����O���-�2�&����&I6��-6�@IV�� @_*�e�����;��!M�+�������8���Z���!����/9��8k��H��C���z�����7���B��s�5_�*]���oc�)�T
����r0S��tt4gQ�"�����4H�2g��:c�iV��c8���t\cD�Z�9���"�F�4L+�h����L���V��@����{!�?E�F�)�?u���e���,��E{�^��[A��k8!�ju��V��i����:�,+��a��t���hV9B)jT�;&�!�P�w����*��8g���\�|�O�B�W'���x���\���Me|1�a���6��8`�bsY�-+�y�# 
�	�u�P��z���:��TU�a>d�n�Avn�Aco6����
u������c��K/��!��=E
	d��f1	U�9+u<��e2�c;��x���4��Zu0hU��@V���ZM��!�T�R���!�CN�lu��Se^!�QG��G��j�(OT3/��~��D����DUO�:����c�g��������&�������(_L�h~�������f^�����p���
>D0Z��hQ�	G���o��
I{���Q��1�KE�!�P�C�l�a����x#PPG5��h:��I�a���rw~�Ui��7�����l<��0����3"�&\�Q#����:�
0��==�:����b����K_��������Q���	���������M��w\����&��c�����p6����\FB������H�R�h�<u0�=���c
��i����;u���F��S����d��f�!�T�N)�2��:��K�+ty���y�|�Qfu��Q��FSlt�,�dZu�q;wYOy�}HBd��r6�>���,8V��
��5b�T��yhB;L���3�2<,��+���n �������F3u���E�L&�1���f��y%J�������a�����d�w�rf@�
��:��h�:��������z���r6�^�����,5���>D� ���v���e�w�A�A��`��H�'�uzkE�$��x���V����UG>`�j0��o}��a�GU�Jh*�����_KS!���� ���������|�N��A6I�F(�z�jC�(��Hb�N!�^,�Hu0"�M���fJ���_Fx	���*��!�� W��qu��1e\]�qu5�k{_��*,��[]o�3����#���h��J?A����m����f
�"'�Q�)-�e���(���^u.�	�jHX�^�������;�R��/�-(��v�H��@a]����<��s�W�D��F}�E����r�Z�����^E\�-�F�����	w7W��<�l�U?���;����
CkY��c��@U�Y��'V'���z�6�����O��������.�}�s�[)������y^����������C��^�\�eU�Mnn����������������bo�ly��e^TW\G�q=~�}C;|����[e���u*6��"�����@G�#��]��-������`����<����#�)�C��] ��$�
�(6�e�Y���Vu������|'��"W�;�<�*�W�z |h�SDN���������+Dh�(�\~��r���L��F	m�c�����x����~��
m�BK�|��x�/[�V��Ce
#�f�����E���;�*�>&�`y�70i�	\iQB�P�Y�����b�2�'w	��u������������r����bQf�!�CH�.�,[�-�*��p���0��gR�@�]]�X������b��S&�����&8c������Zd2��;�9+��E���"�C���:p��	����[���H���D��h��EQ?_��������c��&��B��3����g�1��������:$��H���� ����w	�����������<M���]��/�@p]�x�.A���>"���������[������J���b�U���%�Z����.�xU��~�uu�be%[�"l��[��Sf��t<����E��r5ik.�G.�`h]����A�����/E)�P,���
��
1]��G����Zc�B
c� Z]��w��?
�R|0��Y�:20�!xV�3�u�W�q
l,����y9bh�����h��%PV�e���6	���\�ou��V��[]4$�j��G$�8M�/��J154�mu}C���h�e�������9T��p|d����)��%Z���u	��EC��Zyo<N�<��m,�����3D��;YoOYq��P�6M��P#-5�{����.a��2���b�����F�fe�V��J�z�3n��1��E��I6��5E~{�����=k�Y���%���N�D�������*��������#*�m���%k����Z%�"V+*M_c�yT�O�>�U���]��"�!�G�d��L6�L1p�'#���lw�q�c��k&M�BvuAm�WD� ��j�d�l�����V�e�����#~2����
�r���62�T����958	��EbQ�d2�G	��<�%gb�c�N�{fIyD��������\'BLP�X���N�nZBv1Y�*O������&v�C����%dv$�c�4&���P�?����b���w����1�z���$��f��K`��)v�jHl���\@$���Vw9g�o�)���]�`����w>��P>S����.)����Zf���zx��z�d�'�����1������C��K������.�j|B�t�Ar�� �xW�q58<>|g�S|������|r�%phW�C�S-�>�f��K��n����.)�=��A�6Z��r�2��ps<�
Lp<�!1�u��"���rm5��������xe��l�fu��|.Oq����\3���ks�g�
�S����p�#�e���}L���)^l-�m@��2b6�<��(O��A��q�*��TSD�������@x]3��%^�I���u�zi8+0����@)�%�[WG���`o���y��N�]�~����+�B�5�#y\�k����[� ��-�8htrX[�+�l��aF������F���<�Fq������i|@�@~=�-�?���Z���n%���#(^�x�i����MF����O�"��I��X�J�[�H�Gp����kTa�����-���R�<�.)D��^u<����	|��t�����1��3M� �w�r����<o��]9W��x��/��a�`�$�e���C��JRF�^b<�����BR���n?��u���i'�f���<��l�sM���j�����d���n@ ����t��Y��rw�#h^����u2�k�������9_B���{o�
n<r��'-=����o� Re��Y������H*�q��5\����GF>���Gp��X���&*��zP�#�]�w�$��z:��Lj8p0�C�{���E����^H�UKP��k��#� ���p�~���#H^������6��3k��:��cD����0`)����7<��+?f0`�����.������=����a���=Cz��=�����	l����j��`�=C\M���)��e��t9N���=8����X(����=�H�p�H�YcL�U[�����sM���6,����Llc�tM�D�zb��TN��"@g�5\�5W��1����Ep�k�H��"i��\I��N ��D7�-����8**&{p�QC@�k��d��W{��
o�q<��|x��Y��N�*��k7X&��L�i|��%�mR�DF�X{����VSPB*EH�q�a���]�����v�� �|�zz��=y}D���� �k�>�rZ#Hp#��l��b�i����W�xT�8��fSlsI�z�4@�An{�!���f
�((�L�,C��e�!��p!"�Kq�a����$������a���{��B��[��my{��e��f����
IB6}��!<���0�H�8I��,�.6iV��X$7�zB�L�m����o�#�m���l��?7���9�d�����C��a]-�{�=��A��R9^�Rw�:��F�n�GV��
9�P�����'[{���Xn�
r�%�n/x���|<����r�'
��SP3�/����?.�����V�[?(�iAq)�6!��{..gC��q�j��.�D-���[Q�[sS���E�^#5S(�#� �=�0/��^	e��r95����g������=�������6%!Q��FAS3XD���gJa+�?AR{T`�e���A��l��
��Tv�,����
��/��g�"��zTCM���mtt��]�U"�E�uKU;�_�*AH����!�����v�tW�z�Zpd�����=������,�*Ad��iw�0�!�q�,D�G�������5G�����
������3��$PoO�z�[V���!��V��SD����y'{^F�d��9zA��r�BR�F.�� ���e�k��=�������zjB�_��k}�*B6�1Y�N�����=]h�j��r�:�LcD�.ffU��M�f����?��]#$��7,H'���[~�N6�	B����T
v
��Ja�"a�����t�wM9N����*��eq4���	��D����'�v��l�&q����������'`x���l���wM��Z�
���D�T�s~�P�,����r,$��l������8e�O�Vz�wl�$`]A/y��n�4A|�?���WL�����!��W;�k|�6d!���q���M��J<�V�c�������GX]������VW�RY���!�u+��Y<G��`�}Sf�L�T��7A����E����fD&k��O��>��!�!�M(����O��>P[�'��h�l,?B;��My����r6���f�#�5�vR|�x�u��<#��o�����w;#I��� �}�pD��1�%a�,����w� mc-������.F0��cf�}���� �}����!�*�������}��_���Y��3�|������l���O��>&���c��F�'�w��{w}��h�3�*��O��>��jB��k6��'��M�5}y]�$�6d�+��B����q<���}��������f����C<�����A	11���wp�#�u�-gC�
[>�����[�>��C ��
��b2�G��g�[WQ���rY�����i��G0�>�|+�GP�>Fy�n��$$�3��a	<�>�8�1��J�f�l���w������X�!�A��>Fw��Gp��itl{d�������>�w�����n��7�������	��7z��N3,��5J�+�x����$�c�����_�>��{�j�#���@�)%����z(��V~�,�	"��
�
��C(��|�|����j����X�?���}w-[�;��E��	���g9BL�U��a�����/�F%4�G4AiPB���r6�`<�2�D���[�A�GV�.�C�k<����[�|�}s~���[G&[]^&��\b��������4�Xg���J'�FPM^�<�G!z��WP��	�Gc7+3?��\�������F��@��������Z�k,k+]�W��m>�M�)|\|}�o�'`�e�'(^?0��B���W�y��S�Iyb�cq����OQ"G��	|���]9b���:�J�C�N��_��~Y>N����6�t!b?m[���p��'�\�����1�r���O���8�Z/j?a��,�c/�����p�u#XBC��@���.��<�����X7�g�q��M���&�3p�|�E
��H��~�f���"�R�D(�J��1����u[�����q��VW�������4���0
�o8���D���i�Nk�_�7���<�(a�T#�`C��lA
��wzK�D��������]5�`��d��v*�uP�,_���vC��iv����1�`�a?{{t��������t����Hg h`����s:���� �����T~��m4/��bc_?�'�a�,gC��6��U�~�A&A&����2)���&p.��.����Y�L���`\������h<~b&�0N>���9�_�	i55]c��O@����P��7�k4=������9=Kv������$^�,t�O��~�����|�/�-%0\���(�W��~����3P����A����d�>?�!f���|�
�f��f9�a
A(�.Ft���l	�1�le1#�Y����]m�.�����o��	�~�#[�L�����( �����5��PZ1q7�D�:A��X�kR����! h����
�60�i�'z��< H�����*�f��l�]s���b��n@��Y�i�Q���7 H�����*-j1��k����p��}U|�=Q?��f�m���_�\����K6Z�4�A�X%L��& �r����f
J���&�4�B N��b�w����T���k���W�A�ifd��k*g3�l�q�r6�`��WH��Y@0�
z��P[����tc��!���\�d�g�&
6pp1��F.��������/��NO���[��T�h4}8.�D��������?���<opqxi�S�
�%T�i]U.�����U8���rpz28{788���y[��F&q���h�������a����\�O����0����=��@@@��Y ���ngM9��Xt���a�A������C@����z��Y�6���ln���rK`n���	�6�hZ9bc���'��������7��'9����Z�+1�ZY2I������[~$[��2�::=��CB���b��,�nb�T?%�8�F���O��h$he�!���E�\�D� x@P��i�e�]�L0��rw��XxL���-��B�o������)���X���r���a��s* �@��������x��x1���2y"!��@�X�qG��pi8��#� _.f�����8e���J�� x��|��w�V�e�]�<���_���G�����{��/�@��@h��m�nQ���U�<���W��������x}zz��#�):�� ����L�e`�-��%�=��7�8��\O���g|,�|I��-t
�X���w@H�)`�P�BY���*Tm`g8 x�@����p����,�C����<����� c�����K�I������	�Lm��!�C7Ta��N_�_Z�8D��a����R���d:������6!��ZY�3�f7f�D��f����aY�z��X^m�NN���}zt"�����h[���r�����tzbu�[Y�������U!�
��uU�����!+t�KO���	Ml�x�{'��	*��Km=P��)A	�`���<�Y>���\.�@�96,�x���R� ���4Vh���VdC���%��$d�[��w�������y���w0�#���z,gC�e�Z/y��8[�}_d�w��{����N�<�|v|x��r	�9���ZH,Nkw��O���	::���:�|>�n:bH
��xdM�r2�#a	%N�o�[���l�iV|]@���1�B;��Q��D@��\���9�|����7B1�Z��:�����k6����!6w��Km���9��-���>�����,{��I4���AB��jy�|�+���<����N�0�|��f..���O�a;�tl��,��BC/v}d�C�������j>����,��Cp�WY���J
_M�U���Uz�|�6�O��o<��F�d�/a������b���'v/ 0��48���{Y���]Q0V.����n�������m4oj%���=
[��Z�!�����bd�F�a�kF��o#�
0*\���C3 �Z}p,��Y�D�������^���pG��0s� ��-gC�!�|C*�������� x��4�smm�qn�x�3���/A�����'�H�<��'Qy�{�qL��Ao�u0o(8��%#�dBk0����I�M���ith���b�$y���N����c|��!@��8�t�
�)���E[Qs=�2!]������&k����4W��,����{��g���V6G
���t��
v�.�#/���.E��Z�q����J�%��i�!�����M~���B<�l�f�Y<�|i�p��u���/����tB:���.K?=?8<�^��|'�k�����V	Ji�`�a��yRI���`6��x>`���tZx�����tY��W�X��tO�y�h���=+Z�e�[����������nAVR^(�����k{#�m����'b������!���a��X~����1		?4�b�8�H���~�&�5����������1$����H�z�[�t�,�>�U)1)���ym��elQ1a����V4�H+���������`J]m����V��e�$0.xu��w�g�h�+M��[V��=fZP�3��`�C���M�e6r���2-EkJh��ZS�5e6t�=�H~h�����!����?��}h�U�|H0�������
� _�CH�m�9/BmC'"���RkEA�����eY!q �
=�j|���s�t%a���K��)��0��TO#�bH�75�
4e��Y�'#o���:$�&��1��j5BDM{��k�l��i:njOjd,gB�T���)����j�
��h0�a������mJB��/�����^&�H�V��1��������8�:�>P���=����1��M����-yl6�7�Z�X��ijM���2�o2-�y��_����'��g.'e�S���3!v�B��8R�:�:l������m��8�b�1 �!�E��ax�G�
�3!v�C���w��inx=�!7>!E��9Bh#���3�8�v�8 ����N��A�%��m��|~5$�d���)�HF���O3����=z�e���{+��d9�U�����:��#K�}�L�Rb�b����lY�f�"[f�t*.�����T��[��<1�=���<�� B�z*$N>�^������W�Y�:BC�����;!�����.�8�~�����S����l8��-A�r�S�L�R����Pw��<K�B,8$��~
�e
�}4J��`����o��AX����h���	�f�@b�,��
1ru�v�1S����V��^�
����S.v|�t���WK9a�>)?R�9�E�
����*U+��*��b~��(�c!v�@��8�N@@�E^Y_vq	9����@����)J,f�\�;d�h���bA��ds6�XC
!�6mfXh������� �Y�e)���Av�{��I���=$��0XMe�������a)�.��	�3��l~H�E��W���c��W�l�K&��=����_H���z8~��
����������C�X~ojx�zej�'��������uyj]\����:~0?�V�TRI����8�N�Q�0�/��rg�6*�Z���k&��PG��a,C��1�l�H~���,%��0�8�-�
[�M��q���u[�8�d�Rq#I�`0wvv�t~���;��v���"��:�R{�[PpQ��u#�L�R/ ���P����B�.�����b�b������U��-�F�"a��;S�~=���j�[�k��&7]��b�C�G������e��-�:�b�:ep�TH��)P�V&#�}X/,{���Eb�{r��Mtc��u�>$��N@��)����c
��i��p�%H����o~7���/��:�_yO�����o�6v�9��> ����������C���m�G�h��"�A�a�wL�G��M)}����E������lLd��X��$����X���M��$���,��6��?�89B��7����<�ty�$�r�I��x}9B����N�\������I�����"�i4��e�|��o-�G�$���r`��B@���h�/�y�������oJ�LK ��#����F�{������_��|Oq�����A�������S��o�)lq�������5|������m�-�h��
���J=�����-� ��0>i�K����B�7��e���F�H�\=3(�G@�=�`���iS�M;�Y\�����G0�=s&��G��=���w���{X8w��#���[��6���]����>����G�K�xy�Fq9#b��xX��k��n�����3t�OZ��%�����_+���-�P�]����\�1�m4v����b��Z����w����#1�}����\�8	���C��x�~���p1�C��I<I�����+D��]�E��a����)3����r�BcW���U�������}r�rV�t���m���4k^:�,��El���1�u��������,e� �{�(����:D���R,e
FC]�#X���5���e��q&_c�#0��.�|��dM�l,2:$��G��=���!�+��Bb&��m���<�M���2�x���i�w���jA �=S�weD�}9������t2��#l�#������qi&�����U6�'3��2���K-��g1��Y\%HL������	�'?���P4`�Y�.���VVu��k�%CJZ�)B�����&�����d
����o�d^ )�K�,���>$!p��e�"�x���U"$��lP%���jSLc��u[{f�w�D%����K��a���
����������{�@�b]!M?`��4����2g���
3��1��"X���4��+D(
_�
����?�[cV;8�F�i���l��uE�y��tK`�=�5������!�@�e�������63A[�0��LC6Z�������{����������rI���x�������bm������?������ ���kF�t�L��vcn �lz���8J�(��F��M}�`�{:��9/B<0r[���:D[4��i����~j]-����*�}��Uu�p	Sk�0�E�6'{��J�����\)B����,�/��U��Bd�V���q��rU,�D�?��:b��
��/1[����B����c[�[v��'(�x���g�Y�;G��\=�h<����5"$G~���bY���Ia��~���o�4�
oc�`w�x���,��V�!N���[p|�N�2���y=X����:��s�8B��A������{:��������E>�;���z]�D6�����l�^q	�pu�!�E�$o��tx��w��(5�$���i~]�]?���*�U��R����X����\�����7�w����X���!���s�z����`�{���*�s��'h��m�#h���6�3#D,����L�`���O�+7�4��f���U^���(k�y�-�Z��Y��Jn 8�,-������(Tl%S�w�?{����������
�[���,�@f�-EV�:��R&���S*��$S$� -kf�{�������WS����	jw"�T�F����O_}�,g�U�Yj[(�]���f��{���",��u%�o��>�xf5v74��Y���$�6X�����F���hgVk���5�D�
S{@����kCWfR��	�������V������qZ����ao�I��!�����9{�;��z������{���N���c�x�?@c3��H�N��2��g��S�O8�	��l{�v<�9��\|@E�wtw��<�~,Dc������K�<���G�8]O�cf��~�����3��E����rh�8#a��>|y���?|���Y��������Z5z�=�n��ZX�l{�A�)~o8D��g�c��]��(��f�u?�g���<����P5�R�;k�u0 ���'��f+o��,�YM����]1]6;7p������4��j��%�����cm�Y��
��x�o�o�V{s�q~�z2@�r+�O,Xwn��E�(��5��q���]���D<���������o�D^�4k*�EU�v�=�C��B�Q�MH�n�S���_��,n�������S�\��B{I��1��"����������[M*�,EF}	���)$QI^	�-��]:����������dUW%�/�����"F������������+P����5x~��C}&�7��^����7r���9��s�i]O�.N9���i���6��5�\t��E��21��%�������|����uR�V�~���o*�N�5+?�����zE	�����^7�l���D"m�_��1��K�1�e�2fs���O�D$��������j��&�f�,F7j�����z>\C��[w]����%��9��^j����NNUYz{����������{s���7z�����=g�����7��P�&��l�������s�9 �s�Z>v�1P|�B��5t�����o�J�y��~nu`�����K�7.�E5iB$�_�n������Z�����]���4�~8�c9��s���h�Y���zd%�"���[������vo�7.G������i��F���/���X	H�
��v���h�F7��7�!����^�l?h�It���S)��d�Y���f��i]iZ�����������$�!f�vLo���[=��v���Q�Y�e��>���"�z�T�^6��:zv�������������y�]����mc@6�`c������]�M��L��}�������C!�������Pv���_sN9��o�5�c+�������Q�vK���\���@����s���~���kS��~:i�X�H�<t�mO���n!���>]oK1��v������@�sw�?xXw�����Z?����}5���
6z����9�_OH����~�M�E�f��\l���G[KZ3�{4[ME�d*�����j3��s�4]O����7�����0����Y�C�w�m�^�yb�j��/��[�}����m]���R���[�~&�-)"~9<9��o-1n�����P�uG�	!�l>���z2@�l�����`���4@����O�M��U�a
R�_k1t�����z�N��+`�@�4�Rh5�\vL)�7����1�1�{r��]g�����o:���j�hv�\��1�!��^%>:<?���wl9o-�M�wL������f/������_�/��r9V?�����/uV����7��f��/o^=��|�]�D�q���Qy�d���k^���_M�"�?.�m����\����k�^�^=�����k���|z���[�����)~(&�W��R�m�F��Q��l��������������z��f�����d9��g���e���������H��._N�O����3�u�;�����<�����?y�}<��(W{QY������=e���v�������:7�{~yY�>�����b��f�Pe�g�G�W+��&�{��I��4>���~� �?L� �S?�� �C?����>��V��z����vy-�-��������x������{?����{��woN��g/^��%<z����k�A��{r��b!
����gTo�x!����S�������[1`��*�(Ew���gG�������|�[�zF�>���Am�u����_�VE*[�UV��;������G��c9�����oS���k�2i�����.��Y��_�)��g����5��SC!�L*�|���R��Ve��6#F����^OM��u�	U}�|�D��t���:�O�o�`�ozI�@db�H��b��,E�z�xE}���o6m�@
a����b�Y�n~�9SM��'�v��]����
�'�%����%P�?��I�O�����+Q)^�!��x�������]�,^sC�B�t_nJ�f���"'�_��7��!:N���G��g��X-����%e����QSK�D65�R�J�yY/�6T�L��[&���Ft%k��l�����
�l�?
�`�?��z�m����O���Eq������?�|�� �&~p�,���U���3�!C�.tt���r�My-4�9�����F3�����;$�����k��RH�r��f����;�����D~���<�=�!F���0�������G%M�~k�I������/����.��4f�m��������$���;����7������8����Mf���
�s1�\7�rd.�t��b?]?�@��y�`��DSGF�b}�����xQ��_��V}��W���n���7m8�Sw>�<qa>����Y-��2���������,��tx)vA2P��8���q�>��E$�?��<q��t���?`N&�l�5��O���6���i@�����(����d��8��;�%6���Aw������T�g��Y��=�'7$�r8\9�.+�<<<uYN�����-��A,hd�d���-Zl�����a[h�6�-%[�n��p]��/������vV�=����j��ID����=�������}�����������}���������������f��l�����O�H��e�m'�?���?��?�,��q2������5�����_^��']��]��'[*����O���N�7����4
)�e�����(:��X����$�}���q�?1�����'�����6��.�`�_�������kB�������g����&��x�?��������3�j���,����3����[f��vX���S���}3��O�����s2~U��.�`q�n��r�Q�u�8j��}�f�:{��6��+���^<����o7��<��Q���;dx�����J���[exg���~�.��������
j�=��('t�V1�V�T�}=�������j���I�<�C�nxg���4_-����s����{o.��i�pq�����y?
����qM�R�������
^�jJ�'u�������hy ����q�M,�\�Rm�XnV���t9�"�6.�;�%�:�Es��p~C�Z���br��eE��`�������	�=g��U%``�������s��d`N������tY�T`�D�EUJ+�/c�
�/=2���
Y��h@�U!-�4>�4%w$�������\����������O��
0��#�����\��w^���j��
�y�X�����9K�D�Z��n�,K:�Vj��=���	�F��+}zT}VEV�P��UZ!>2yP����rZ�a�P:(�������3���!�A�����������*�Q�����h���@����������2������{����RL3�9��Y��L�x����������Y��4�g5�{'��4D��f�{UhvF��_����u��~��.E��j�lN^y���j��-��qn��h�=�=u���������{��Q1�DO+��Ok������Y{����w��������#i}�%�������}��wO 8����%@0X#����������i�����j:��D'B9����7���M@�w��6�g��������yq;/���B�^��tA}�v���L�������T�������$:��}@m����������t1-&�
AU�����3
�
B���n�r3w�=d*;����0u�4{�I�v�f\�r~M�$/3��i�,���u���-����MC����e�3P���V�D��g��$r������)s; '����v�a��]��p6���nZ�w���K����{:v�����7�e`��q����:.���^s��x%����hg��W-�6�����<�b.�	���|d9]����z��[d/����FF:���6@���1S�3���0����':��8�������~x��(�Y��1f�g&L�\5�>����QK@��v��@���9II7|Zz����=#��]!@�?�$�8Kb3�E��iB��3$)x���*pv�oN�/�&$�01�0fHgJ��&*�L1�LfTM,?�OS�����!k��J���j:�>Vc��W��sM�/����y��H��%����y2����)��c����:��)B��8���rR>MH@VgY=_���'�' �������,��+R@\gq=��#�U��}����&NS<	P��Y���x0�L�Haj;mu��4��i���r,�������u�
;{��D���l���h�X�c=�S �)#�f�^�TB�2��NC��UP���.����"�Swx���nu��@�5y�f��%@~S��f
�2u:N:��2�&�T�f��P:��Y�^�'��_H�b��}Bz��t���]Z�����������r���PE@~Q������w�/e�O��fe�f�g�d@[2~i����*?����0W���������5�y�����Kb�R�W46��-�4��:�n1��j9_-��duM�������R�����I-�&��5R:�-�f@2�����_�:���t&��*��+\B`�iY�%!qYz��)���0h�u��5�q��]�7j^L���]�&�#�p!:�����*Q�V�R�Y2 5��}N9s[��Ge��
����|5V��U�
���"1���1��3�CO��a����So��J�81fj_A.�x�NC��T���2����_e�p������7@��Tr��U!z���.G���r��������W��������TD6�A�XA��U5"JdrO�k.�f:�^��j���x�,��,������8����*�@�r^����I=����v�(U��&��tU������n����y��j>��T�n��D-���N9!
����S��B�Ks���"��:~�6��������-g�z
ZJ@����g^E�O���F���4�]Uc6�9FhV�h�6��������k-TP�M�v�����GK�
��u@|��Y���^|�M�K��~������m��+l�c��r�����">HLje8��)��."���D���+o��2xo�j���^�]3���>�J|S�����Z���r�cW�����5@|�W.}����@|n�z'
�[��/PWd���I#�=����Vj��B�V]�zuI��A�����B�*�N$��P��F ���T+E��uz#)��R������U<����9���Z�nt>��>K/��_������:��'�'�����Px>�J�8�Y�X\��"�����?������e�<��&���D�/���v�g�atQ���7��;�T�{���9�����5x_t~Es\2��l������������d�E����������`�`O�W>���
I�.�^!��#�R%���[�^/����d
���`����/og������y��{�i�E��R����.�Y��s
!@�}5�^��{�b�h��hwP���g�{��q�UVS=�0,h���R'�%�M*����@�������#��KT��{����o��pq�Z�{�d?p;�8J�D{��"h���f�m�hh��Nh��R��?pw���z2@'8.�xO�G��	��~���h,�d��I�*n�	�F�0�������p����E����������{�U�za���q{[@@B������-�+��������������]0��m�ohc[�DA��_��9�)�"��CWc8��,����z2@8�x�M
�(P�#��Ky�W��gt{O@B"�i��6�	�
�*��9�����$�'G�o~:�Ub���zrf}�m�hA�w0g?v�[Dd{g}�9�YO:w/���g���{W�O��5K������U�z��.�f�tflk"�!�ZL�r�(k�}���x�����1���i-��!��}~�����s�r{��8?��~�r��Pd=���	`�}�)�������@�
�94XO���7g���n*=G
�7���7�'�p��>�
��x�������>��7������[5��x��<<�q(yPi9"WO������A���Z3�����/G������P���r����;.|P�9�(:�3��>�Y������N�/�����0����x�����'��p�>�L���������V��X���rW����S��w��F1�D��g��l}
��.�y������ }��3��YKdf�<�s���e�l���:"����!x����z�;=�u��"�O����losS}lT�n�����o��5�QW����Q/o���]T�o�/W�k{^3�^S�������>���/����!j-!w�m8`�����'��#U�q@L}b��V�7�5��kL��j0����r��������l��q�N�9:�C^�)9�\�`=���m�L�������E��mF�3�6-��I_:�_���B:�3�E�S�3�g[/����2k~M��=����I���K������������^�tx��L�*�9�TO�E�4�}�6�ZV0�5@L}1�\������>��������M:9��$S�b�,�E}����b���2��DwR�C��$�+m���w|�������|��O��������YE&b�����]z1�A9����?��=9=�8�����o9��4��9R���������������uZ��Y���I�i��rw�����Z0P�@���'c����Mv���
�����������n�58p�P����������]����|zzr��\�>k��j2�n���wY�r������b�@��&�Q��-���A�#%������[�b_��]?���L��8ip�����w[i�jV������o���@��"��.�go�c���J��np������F���[��^������H��{������xw��������B_O0D���
����������R���}`�e��|���Q@�G�g�'��}�������Z���HTx�u�U(��Fi��lU��A��M<z��?��{���7�{��D ����
��[w���g�o��'��C7���5�3lozF�S=bz&�f��H��#1���P�A�V�����PL�����j��`��#�����bz&�f��:a��#m��������r��E' s��?/��Te{<�]-��*)�R���'�������,���P?@�Vt�kk}��r��/��cEg1P�^�)�}J���bz/�At�@=�#����6]��l!tqh����du�tg�MUt���Z�Ds�����(l������w��|�����?�ysA�<���9#�
"W)��N�fQ�qL)Di�����qqcAE�q�<�'��
J���4p�A7��V��N+�B�i�z��w���@����3�i���M��sT�~��I����h/t{�~���o�wS�2�	��f^���,�F��lF��K�o5Wk�e�Tr7�3\g�U�����#�" ���lt��_�b����h�u+��Q�Q6w���}ulO=R�n��#����z2@`��Oo������������%M���`�OT��R�!)B�h'��������U�9�*�dL%�i��w5)����beL��4@��&��� v��0�z�p������+��i���#��;D���{����q��bR�LYDk�fvn5H�XL�P������"��}�o��$��y`L����G!2Nd$ H�����>6H����P��d>��@
9�VO_��0��g�*gP����`�)~�k�9��@�M|�a��w� >�.}f ����������7����
��\�lW�t������
\}_��Vc�:�dn�B�v������.7pui�q@��D�KGP���ik�� u����u��7%�������:�'���<������
��`�.�,��O�5�m�9-�:�]�����v
tM�dn�B���4]���'�����.c����5����`��Bn7�(o�z��=�X�� k�����8B��r��^�
=�P@����7p�~�\���?�X� w�1Q�����
X�������&�������N~8���d^�p���-7�WO�"w��\~�
b��\7��de=Y[|��Y7��AxoQ��TWk��}�;��)7DZ��'�����}����X�6[�NP6/]M�C���6���-_����&t�|if��
��c�
k9�������H��o�{��"�8���W=M�T�.�/]h�8t���0��~�����k���(!�����5$<p��y8������7tD~�P�����u9�9<XKp�!���KoB�����]_����r�Z*�X�Q��P�q�&����P?o����
Dw�Y������v5)Tl�dH)O]��(�����b�������rpY������!��7w��@����\E��u��z����xV�n@������O�E;~�
.�
�'"��)["���<�tk��z�ky�9D\�v���I@C�z�<�_J�e�����f�
�a���z�P)R�1�u�X[B��B�t���#u���M1�.ks*_"���v���.����Y/]���ixi��?]4�(5�-F�
�:�SXt�J���m9�D#�U.7��a���s���b����h���k�^z��{����mX
��sn/mX��+�R������.�S��0|�re{����ex��������H���+���C���^���������BN��S��������6�������Py�hI,�1�g��6S��X9��e�P��D��82����\���n��:R���m������K�X@���Cz2@zlLz�j�aOI
�i����#qE�>������=t�@�#�@s&��D	���g�a��w�tn.�! �CWR^�����5WGV�2�x�e5ZM�a3�������rs(��������C�UM���-�������~���L��N*,#;-����j���t�)���9��K�ZMM������L��8�?t��	Nyh���:��
.4��;*-�a��
]��-2�C����;�h� ��?�~��G����\?���?D��Co��!���yBZ���<�L��yj�P{#g��p��F:@%����!�&m��f=d}��V��m���=d1v�������Y7���a��Mi���O�y*C�@�!���?���q�����0t���7a=�?!p|�dZ��e���P�!G����q���=�C]���G�r��z2@�!����-�!�C�}�^�f�;lz�}z��!g �����e�(g�
�t�X��t���[�E�W�l&\��g\�r��z2@]2�����_�q����%{�������La�
����qE��� ��-vI*����l��y�+�e9�OF����_���Pw�����T�G���.:uT��^����{��V������!���+R���f����*�K����)O�U�z���tTR���&r��<n}�Z�gw�y�!@�Ca���WN�/fW/(�>����o�do�)���V.���d�������<��^i���DS!2j�oq�s`6-

������n�@���g�,���0���"=�����h�CJ�B�7!�����E��/f��]�(@�!��7��G��s�)�aM��P�/o�DoK�f�vs�������G��s��A33l�&�����!���G��wgK:�jY��Kgp����g������'#�����[�������X�vq����i%�J�����U�GdW������v�
�=���B��q��S��Dt�|�Z��J!k��Y��������1���e������t(���j�j��6D~��z2@R8�����G,����lO�~��f�������	x��s�f�4#��G���N{xt�������qv���r���wO�~�Q=�-�gV��O#�G�S'���������DH���t�F�����#��[O��q�}:��N���m�3�y�f
D<�!���.<b����F�����Mr����>�������%�<��f5�3=$�q�C�
I�{i��%GjS�����8�pu]�n�y�g1mIR�x�Y����1��T��R����V�p���-p����qP��73�n����`��
���T@[G���XUOK�^-Hg��UXX�(t�{���8��������Q��0�����K��W��6�������:9��L������U'���Xxp�@���m]m �@�Q�li�����/G��D`���a�Z�j���\��O���C���w#���;r��#�aG��j}����?��X���4w�Fss������;�=nKunl����&rV�z2@�"'���V!y�8������W�����?9��LE��8�r-@?G.�F7�����N�7���
���r=	2�N���S�������?�&/k�]4QC=(�)N������_�}�f�b�0��4���x
c�(�9��M������M5���t,���\���ec�P��E����[W� G��'4��R��@���w�cz��m�-���SJ'��fa~s_S
�����R����t����96�jI�W�Cq�(���26��#��������u|���8�/�q�8�2�[�9>����)!S�����Nz��@��Hb�^:9;����o@8���
-�^��������9r��N���t9b���b���%�q�H[p�j� ����[�C�9�|��t�Np4����e�h>�����#��<�����7� _~Y�����ANG�v���rU������j�F.���Y����B�R���~�O�*P���p����
49��d=�A.4r��>����#7�8�q��c�e�.��#�G��Y�qdE��V�������vY�cC�4e��$q�}9>��@�#mUe�4��(wU���8��!�G�'�����"��v��ai�������U�F�����5�&x����
�� 4�^�x���F�C�j8R�T�q�p���;�m�����v�d����\��S�����Z���b�� ��c�JXP���-&�/j�|�n��U]^�&z,v��98W���{�R[B���y�?e4[�^�m�A�����x�B�P�"���,B��N�r������R:S�n�"���>?8v��_����)FL�&N"����o��*F��v#�&vm�-��f�"���0Pyb��;�pFJ�R��e)F*88vr����1{c7g�0�1r������7�f�Lo��x��y�F��q�6�.�`��b�����
������-C��X�8�c=�V`����?�
���c7�XV��]��^9�rD�)X���t��X�_�-���-�X���� ��`;�:9}����(Y����q���M1�uc'c�]�d:����Q��� ��\x�rA�>�ET��4�������9�!�pr���������RM+�s��jg3���}�����T�{��D�E?T�������L����f1��nJ�R=�SVrx�z+6�s.�K}z(�pl%����j�-x�L_��[s���N�(+�`�.��_O^��,����Y
���-�j������u����h`������zGoO�~~������?���&0:�C0�q�6���Jn������?��\�t����B.���u��#��|�{����&�v����H����*����@�`r�b���5�v6���7GY53�B�ht�����p:�HS��W�1�h^��B����&2���+"��r2��x)ttH��;�L��-nf�+�K�v���'�����2FQ��[�� ��B:����A�ce�8����O�(���G��d�]]��}�m=!�,�����@G�i�fP����/��a����|�i����N6�V6/���1��c�[���E�������^2��cG�9�k�S�����WEJW��7/GG��U���C0J����@<��$���:\t�=d�&:�����rw��;���4��k�r2}P�
����c';�0��m���1G\��A��0��9�x9?	��0�-pt�G�/��0����S�MOhG=��Z�f��9�@�����}�r6����%y�����������h��{_���N��L�pQ1+�1��cw������C7b��tY��/�/�']�q��?�y�=!���z1+��v�x(�����d@�N���mc73��nAC�i���	�0p��^�s��z2@8���|Ls�+��W�����������1`�c�U6�{@*�����1`�����i/������T_?!,:��h]�;�*7OL�S��1@�cG�NX�z�d���;���-;�|v����	����c@g���,M��e:�0���k@S����%N���^�S	������ir�l���c��Z����6���9Cg=�r�V*'�����/^�ab�Z\�t|vv��1#b���-\��?��%Cm�V�@Z�n�uH�����[U>��z�����9vl���.��c;d�9 ������y2o$�XA�ul�l6�T���]�i�����3�1���|k��0��[0m��S��s����|k�i�.e�a4h��1����M_t�@�� w���yd_������d�T�������YS���r�P��n���V��/W��V��t���U",w����d@o'�r�0o���]L�)T���U���(}���*	����� [	@���Qf[�������eW���U�S�Q�'tc���==;9;?>e[����������:T:��������I ���e����EW_u.WvM8�h=�&N@8]�����A��d�����!e�3
b����������	��*W�O�1����l�B9�C�p��PO]�;��_l��b��? ���=Z���b,�:�]��f"z��bo}��#W�'cP��=a�������K��������{e�����<��r��	�������wZHH��p��(;z��	G�i=�b9n��f����h�{Y.��Rq�����"�Rz
��	�P_���~��Q{���M�����?�tr.�����x���������P<�{������}���Ki��\�d���@�8����O8�]XI��g7��fa�H�,�0m@�'$^~���u��ES�w�?����=���~������pT�_�t���.Mq=e�n������������7�32	��$pwB:��R/E��/�^\,��DL*�����C���*����
�y���j% }b�iR��*mu!`|���z2@�B�.V'i��}�r��1�!]�U.I�xxv����'�������l�����u@�u���z�y����2��$t�#��a�P���o\T@�lP��(z�u�B�8"�������:�4�����,~E�d�!9g������Ky��^���/����3���Tyb3�����SBc��}R:����)�F�f�G�F��y9��D��3��Sf�d��4��������r�}��+�#(���e{`��FFb���xi/�����'V�|��w:���$�!O"[?IT�;�l����F�1��}��D�`Q�#4K����Y��C/W�Ldw����o����X��w=�9kgy��2ao[�j.
��������9;m����W���Uwt�^������}�s,�FD�!r�VS����X�q�_��b����-��6G����H�����H
�HuU������/�����������������R/�����]+2��j��������r�Zz��J��j�z`�k��0Bzf��w��y{�:�W[@�o������+@.O�[���O��<�u������-���w���tG����
>o��Na����������Ty'��K��+= �x���6T�,�?T���w�a��b��P'�f��x������K�[�+�Y*�j�G����S�=�r���lY'��j�c����	����!��>XZT��%o6��
���Q+Jo-^"V����+I�5�m�	`�����Z�e��Dcr�o��[�EUQ��V�"�u�3$0���r|���<�Z�tb��e�����5�*m���T�]u�Zju-�5�zo�PY�<ww��u���i����6.^�r��W������5%��H���� �V�B�g��� ����fT'�G�c]���T�H���\���Z}��YO�'�N�Q��k�WW�����5a�C I��Q����LK��4Szo���X���,��Q��Z|�{gnIw�x�\����A�L�-$6�u}�l\HR�sy����$;���;��W)�nh�/>��k�!�����lo/�A���N��y~�2{���ea�^Lo�O��]Q���HX����������qSCp���u��-��'`�C�����9u����g3&`�A���, I��= ��/w������M��W���G��]]��F�v$.�T��?�(���p�t�j�$.6���E\�R��G7��f*�9��������h��;�$/����<���v$n;�C �f�@v$`��V_��j���z�^^���j�����Uz���3��P�E`y7{1�!�X
��{�Rtkn+�\��������j��������-��\�&�B
��};A����q�Iq/lk��������'+�\�HHr���^��@-��'`GA����;��-�����{�����p�k���B JcIF[T�_�c;�o ���{���j����Sr�~��^��]��R�N�����~W�L����r��f.�m^��fy2����:�v$���&�I���T��ue�Yn�g�b���N�F�y���=��sm-5.��	����HF�������:\�uF 6�V���T1��1C,�zY-�w���!�E�`�Aj����d)�/�:�����#K$~M���.o<�|�l')��M�e����9��K��(c����t	����l/�i?��L�6���(�W=-�������t6�snlv
����~��`"�y����w7S�U =p�9R�����^z71�]�z��^Q���,q�5�JW�T��,h%I���fiN���h)���h4)�����K%���� ��d��]-�?��&�
�p��q�p��"�y�e����*�Sy��
�����������)������\J
���������sz���K��)�9��bP����=81���V�M�.8��(���O9B������G4���=4�z
_L3�'t�dab�{�,��O9p^Oh�����w=P�9�w�U`z����
)�)�x�;�n+��}���&?m9&�o
D�ue7��)����)H@�:�^p���4��d�
R���N^����v��l+x��TM��U�i���Oe0]���-`���#`�	�i�����<
�J�L{�������S�� z�
�w���O�Pc%>���������s�����ih����r
�����?9���<���&�Ax9zj������) ��^Nz�"����S@���"bx9�yj����^N������r
h������S@���xx9�{
(��������_��y
��5�,���hA�g�����Z3�)`�S��/?����8�6����m���v��]Y��!k4��r�� U�$
�j�l�H�Z��Si��V'�J�h�VQ����
����^��kg�t��~�Q�~r
d!�\#���|?���p��[Bo��#����zD@�.��bZ�W� �������f��A����E���9�N��#\��n.�)@���q�'�yZ}65 z�bO�w���o`�)�����N�y
>CM���8v}|��b'oys���d�S�QLb
��4q�o&�7c�	S@��6Z]�Q�z���z2@�8ky��YE�����������d�>�O��c��z����j��;����6��-�rr�fG���<�2�wj�i����P�i;Em-����w6�6a�2�T��.k1�<��������`5�1�!�m8�����I�C=����S+4��T�K�tpF��)gz������N������x6�x�ZTG�wr��������M3��.��][��) ��-H����I��<$y�an�*�tuj��B��2��X���%�s_����']��y��z[�U|Yn����=���#l��`s������?��Y��z@�8��	�i�
gfF(��Z�s�:���2hO�@�Q��De����ucKN`�)���Zh�����d�Vp(����n�6�o�,����������W������bA��^Q5T��:���\��&th���Ow�2���G+'�G�6���������X4����z �������w��{�^n�k~�q\�s�k��vW�����r�m��&k�g����P�=`�pC��rL7j*;�]g��vW5W�8�
����Y���� 
U����4<;`$��z:v
�,c&�r`�����_ob~��d�����}O�p1�)suj�f�#�����{��:;p��6��46�`�3�mT�re�u�a	e�2�_gouk@���2�og6|�7=-�X�����'z�w�2�^g6�Zm������Ko��Ps~��Wr���T�Q��@}Wd���os�"e�*�������4j�$���z%�u�zL@X8��&&�����^���v3���o�|wx������YYn�/�;se�vS�����=��a�O�,�v8������r�����) �������`���{�z@}�u�������vZ����Lt@�m��z���,p��%�7����0�������@,]	���_FuVc5Ww���;���:p�;�����X���n5��3�-;]z/����{`����cvg�z{S���$ �3��]O���s7^P�p��w��3�1�^�����t���. �3+�N���K�3����O�=�2�.=�V��o�� ��
����># ���ui,`W]?�^� ��i��{!3��g�k�<������W����dB>�����=��v=�16$��n�����(r=ju=���QP�,�j*�7�&@�8�\O�
��2
��E�w�wT��������.NN����{;4����7�'�o�x�^R�����*b�(��37
�w�����rY�$
t5-�T�F������Qp�q�5�b�`:�F A������i]>h���x:��i������}V��lC�I9��)N����G7�����,Ct��F������<��3(yS�Ro�]y*�����~O�QZ@g6������,��,���������@l��}0M���X���u�fu���|_M�S�W�,h7��T�l�����%H�R�D�7���^�����3�C��5�F �!@�3��\Og���y�6{w���x�u#R{;r"<+�v���M9�����Eq�{5���������=<;o��,j��~_���M������z[x�����~��aN�_&����,�j��������7?�Ox����P�|qz�������3�$2@
g�������E��8����d�n�`�o��}�
[��f�����	AH�[�*��d6�p^\������z����N~�F�/'5��\\���QY��?��R���g@
g)?������;7�8Hq����\�D��9��>4�	�P����7��gy�L'�� ��Rl�����`���4�9C����yh�������e�5���zyIu�D�������Z�4W�9�e�N?0�����������o�O����(�2=��<�-�����5`������_��������9������&X.������yhg���Ig}������o�{O��P*�]�e���2PO�Y���r�;����n����~GV��S��l9�_�����1k?���q�sO��
��"���������)�"jT���4S�#	����v�/Bk�����o�~�������&�\��}��{��-bb�$�{�<����W��DVo��)�/s��}������z2@���EKM�Q����.�{m�����y��q���dr@Q�[K�����9���/�D������F�[������9�ZO�.V��
v��zE���c<���vZ���Q2��G��Xn��f/�Z��[�*w���Z=x�����9Wr����{oO���^;�����KZ��U���9Kk}>,t�b������^���6����y���wwS�n�M���."�&���A�u&g��w�=�j�kO�d�+&jej����o�Y��6����z��9g9�v��Z���������W����r�#S�	��m���>��yM0������wH��#���������#��T5�=�O�~��:���������}�~V��>��9����r@��?�>� :�-����39��s��tw�n`�.s��M�GpY��g�sF�z2��;���e���/�y�'p��Z�����������;1���i��s@����"����9@�s*��i�@�9g��'����>�4w����T�x�����%�[(���]���n�o8�<t���������4�;�a�2��z�\`@�����s�9���pk��{L�����9VZO�\�����~��J��r:��@@m���]xtn5��\��0�9b�����9���'�+r�8�����9��s79r�������H���s��z2��;Q���m?��29���C��'�P���s����������=�>��������-[�XLt�1�z2@i���I��1o��fZ�O�6�j�u�tnu���Z�=�V�h��uLt�f%��9���t���S+uf��$kP�"[1j(��1��T���ix���C�Ydk��Y��N3���:
�������K�K�������5R����$��Q?��4���94[O����s��4z6[�Mh���YP?�����w���^;Owh�u�f�}%�����z�	x�����d�z%.������
���C��d���������K����6���,io�y�o	T����R	������d�.��.����Y�es��������-f=��>���������Aca�a!P�9KQ]%�@�S���X�/�N�r��t�V�����b��s�����s�B<o����/��9��#�ksc���B,��s����8�m���b)��s��XO���q�����q_,�p����9Wc��X
��<�J565��K3��0����x����~�r��s7>8|p�:*{���p�9k�L��w�����v�<y��b!����.���D�Bk�
:����/*:����������V�*(�3����,�f���%�o�y(�[�����Q���;����e~~w�~���v��>�R.R�j}�R����"B����q7�b��y����]��B��7lU ���
[��>�6O�w�*}���\�U�����E�?"����V���������}��c ���1�F:@�X`�Q��(X:g���~��2�AZ�+��$n
�f��W#1�S���v��~���X�rYLP�r�)tb���>beA��!�-��r�k6����I����������"�ztK�����z:v��>��o��"���f����'[YA5�������9�"����H�}�y3)��Q���������8���uyG�He`T�f$��sA6����~��u��#�t��{��l=p���;������zuG�	+d�XF��3������C������&���\�9����ag���������������K!�O@}]��������j:z�mi"P�9�c�j�v��m������v��>�"�&���DD���_�5���n�����N�o�K�,s^����F���9��.[��h�:o2�?�|�	��@,��#f��t0}��2��po�2�v����y���/�c����x�v@�>w{�vB�>�6O�j��N����e�������s�jfc{���Q�����\�GG�gG���mD��@c��f�nz���i��GE=�0��C���g����g����/������1������Yi�6���-����3���i�����tQN*9h��j�;6:�����Ct��t�{/?0F�\�@���.�4�^F�*c'���m#n�S@V�ubl~�E�
W���S#��`k����N�c�4.n���1!�����z���	;	N�;N��1p���sd���s�	;�M�[HU�s�����:.z�����]M���s��������M a,�����r���yw�d�������W+:(J���d�C�FvZ�>�z�Nn��N
��Y�����^��!":����<������yd�1e���s<%+���������u�����|��I�1��2��rZ�H�G{���uR$��x��bV
$���t���L{�Ja������v��>w|FP�9��.�������t=�c���i2J�Q��=��To�n���y����fd8����5�-�=�M_��E]�K�>��'<5Z=;"N�oA���p"�{�����1%q��=?�B��6;�M��U
;�M�?�������F���a�_8����@m\����QQ[<�/�
,w=Y�����&�9�7��y:�E�I'�}��.��*+��[~���`r��sL��=�9��H�.F����h_}{�6Rh����>Lg��)��th�*9�6n��gr4wI����!���%��>�7Q5cGiM,-#J�R�i�4�|1/�K�nQ�A����dc�,>�F������U f}�+�������� 8�������k �M�B�m���o������;Q��Gi��;�|�m��[�O�[����T���*��'�?tg"U�U|�������!��������F�����m�n�%�D]9�P��������p�������4���`���y-��������B�E����b������nT�3�6�U�w[����\�z�b�8k���f�{�Q��@t��k[t[��n�;��>��}'��x��6�,����
��(l#�LN.���Klf��T��w������oN~8�.?�sUs��6�k�./�X}u��������=���"�Z�����t��}�w�N|���;ZP�������	]�[>�~�" ��m��,t[�����_[��4���H5�mGW$�_^�z��7�9��HH��7��]eK�����G�<�&�����}'�i#O�������;+�'�����l`�,���_r�p��L������M�)�9C�KN��������W���\Z��G;f�E)�.o/'�>I�E1�U7�X��i�#�����r?zP��,��rR��]����p�P��#��
�w�����7���{v�
-�������I�������u��Q�)�Y����[������N8YRs��u�k�=��(���K�9.]��QI�I�&@�l�z���,���w��[^���+~5�-D��_n~�)���}��&���
(W��T/��������h`����P��YY�e6�/����h`��B�<�u���KH�T��jf�e��/�������FH@���ix���k�m��E��{���
�a���=�.����L��@�}��=��@��t�Y�]Q�����:��T1��c�?L����
x��������{�����f���;���m����s���e��.���#b-rG?�a����.��@���@�}G4Zk����?��kKD����!�F:@�8B���@�vF'����)����^N���Ml����=<k��+���gql����K�;�i�g����)����PP��xl�����2�!�F:@+8�������\
���~��'�����<��ht#����FJ�6�F�
W�7_����h���U1��
b��8f��+EA�.����n1]�R��A�LqC C�z^�V�)~ly�b����l��������j��q�����,�L���*UO��������A-w��^?2]��}�Y�6�m��a�vm�k�P	G��<�o����F}�\�N��%�i|�����dN�n|Ok���������z��W�a��O�\����U��������?�G�pn�s�6���_w���0�^�n���bs~�������������
D2�Z$O�����o-�P����}���en����$<�vp�4S�����6��8��H�._����y��o(<�B��e�n	�p��>�������� ���Z�������l�Q�s�����~9|C�;\����F�{�+y~�WL�^�x#( �{�j��v����w�������z}��:`ak3cAegIk�(�rz_M��W2�hN`���+�����f���]��	�������[�	�^a]������z47������Y	*<Kd����^�s�����d9�Q|cX��nu���-�wm�7�U
�p|�>S@;�,��jw'7x� ��:pa�79J��/Z��l��V�es�BL:p���I.�t?;��k�J��� �G�9�s�B<��i�� �����a�e�� p�W89��d���./.�r?����p�F:@18�������������]F�i�1�r�]��am�UN�3��I���&���{��2���P{�9}N�Fr�U��X���,�����F�e8���5���A8�sgb��P6���b}m��� ����J6����"�����/b��|�%�>9����N�8a��(L��z<�,9p���%.X�P'�aH�!?$9p��v�(��F�7.x�%�.3�o�E/����&��zwy9_����*)|����O�I���W����
�W�������Wt^����K9:|u����D$>���_�b<��j��q9nC\�&�B<\u{�����g��/��wM���O��5wEV�7K�C1Y�z&��"n�5R�HNF���x��������B���J����z��f�����d9��g�����-���i�|9��D��r"~�����5����]�2|��������x��Q���2���//g�{������������k��EY|yY�>����E9�T#	��l�})�-�G�W��h����=��$��O�����(����4��Tt��@�q��w��{}������e>��������>�'������G���������7����{sr�={���/������_����;��OE�B������nSQ�&��g�T�F�������H���(���>�z&)�������|&
G��^=#2GJ�����������
��[-��H�y��S�����x~4.?��K��+�J�W���H�t����@�o�����IY���R(���mT�����5�$}&�M>���QI�S72�gm.H�����������0<a��N�mC�]���~����7Y0�7��n 2�y$���9�g�6�������f����z
*&+������3�t\~�j����l�M��Q�'^���_E����@���,t��]���������/� :1;�[��W��V���)�������%D����ooBt���}�h���?V�G���lI�b��p��=�M����2v^�K�l
�'S�����������%�6�	��hY���� ���~��}������S��k�y�"���������<��&H���Gf��m�z@���7���L
��t�����������z�oUN����vK�D��nH)�f9�Mu�Db�/�^�P���=�h��G;�i��a~�1���������o��H"���.w{$�V������s��#�u�}����>y�b����o����s^���d����?'��m�����Q���t���g�����M����D���DF[NA����/FM�>����xv7u{��6��;�t�0y��h����O��l�K���0C��~x){�����+�-M*�S�������hv{[m`���g�0�}*��kg�����9���#�o��I���t��b�6�0������Fy����.� ��s'��������6?���:��8�c�(�_�3�����|'F��b�
u�4'��u�XmY�`��������sYL���[��,���2\���t����?w>�DV�����b�F�;F��m��m��$��npU���[�m������Ug1Zn���g�0&��F�-32���s�I���}���QY�j���E��DK)����7����U��W-�����uG��p�/����8M������������������������������������~_����'[�\-{�e���,����q+��]��v��s.��Ls�s�c�������_�����s,\������,���O���,��3���i���?�O���O���O���������?����WO�������������NM����a��������X�������~_������-�`�O5u�����l���k���������?`
0�^����~��"����^<���A+��~����k������s�o��o�����{������E�>�j�H��
�%�������j��W����i<���<���x���}�TW�����%.G�a�}��x��^�����5K\��p?�i���3��%���^.V�%�*��2V2R�����y#2��]�'�of�����������du�{�wv%-9|���Qz��B�J�l�L�H/��v��g�G����p��c5����D������������O\��O������o47.����v���~jH�7
��Te�SN�n���t��%��]���V�������E�-G����p�1�u���zFk�%s2}��J���v���.*A)���
/�[��$�o�@��y�5���R��M�����d["����N��e!F*�O::����l%����Bbb�$/��2��-��c��;�u;���������t�XpN�������BG�nQ���������RV~��G���y�h��U���/n�������T�3�������L���%aT���O��pN�����B�������g�jD��W+:O����^�����;u*�-��|wm�_���+����j�S��[���������M�'�h)�B��bB����w�{�Bs����TY�� ������kF�D�TTSuZX�AZ@6YS����_���I=����m
���������bQ��	�Ng�[5���&��."��3c��X�;�A��-O8��X�2�REI�::�Zf����x�ZL@8^8����R�:��K6�����%YA�$5B+=����)���b��Y�����2q�R]���{��N_�9�����:�jFL�]v�w
��k{E���qyUM��������(���jz���.`�9�(B�����(�d�mU_���=����*��
~U��d7�^V��q����]��K��C4^��A��T�Q�i*���t�=:���-�$�����=�/<���5{v�w��������-'��b�-���,\�������s:�PT��Z���n�.�8����p_�d�I��,g�"3��8T�����r��,����x-$ ��s��z��$�y��Z5������[:}��m������5�J���j1a
yamRJ�XM��u���� �v)��XH9Z�u�m����~���?^��k�$K��7�*.���*�����(�B�.���W�t��j�Qe}�w�O����U�.��������\�����f��0���L-*��!���f�V�J�bS��<�&���
e�^Oq�_��!����,��:���p1�E����B��D����y��z#:QH��m���6�v�Q�	�,�n���������mY��RIIZMD�N�I�x��>S&��i!�����y���K^�<������g]�V�b����A_��Uim�H+�{/���T��|R~���rd��]�g�d��,�E�/���h]o�O�1_�~���OZG���/"�(j7��J�u8[wUe uY���m9��\�����������V��Bv"��^��?�j��4z�d���h1)e��w=���8�/GB������P�X�������M���$��Y�A���\��C@H9k}MyuU��r:���E��N���,S�D	�:�C�7��������>����H�����w&GYG���j\���d8��BT<����8��%��
����K�PW<Do��4j#��]�j������dd6&v~��s������~�����P���!�G���W*��/��f���j��4,H����RF�R�/���cy$��H<��A-#��\�q)���Z�d�(�1��������(�9��g%kz+P4���M��M�H~�9|op���y��=KY��K���E!3p=~��QQ�z^�@
c���(Xl9'q&GBE3!�h�9�/ul=/�ck���y�:+�����D����H��yY,h���k�B�\ f��ted(feHK�G������1_���YS53yK��������X����d����� 4��e���U���Z@@b~`+d9�2#I��_/�9��z�u������BV��Z�\3o��.�aI�E�	���5�b��v/ 	�-jo��������w��7���t$�tP��)���M�3P��W��������w��XPug�%�u�t}�f"��������.w��X�D��@�^�{g L	/L����|�hP�wn�-k�G��@�R�j�����~>y���������wg?�Uz�S�b�����V{���@�R,c���1sg c�g���X
/�h]o#<[6?�1P�'���$�o����m x�s�)^������`���m��1Y������2��$n3�~U��Ys���s|�};�b���s��^���`�+:`��_������Xz?q�C���.j8���[�I1�$��9x^/������&�'�<���X7�[9LM&�^�fw����r)'���K�>�� �����O_�z�������/~|��+1�y��^�S���kz��:}��_�������������[����1i��?	c�����Q]0��rV&���x��H;�}��=���]��4w�A��q��^�,mn���sD����U�����h[U��F8"���R��s����=�Y�(�	?���>EP�o����~:jE~����L�P�~��QG��F�"���Cj^+N����g�<n���I�Pp��n�6v�h���9�ia��Z�d���Ee|[���w5��s�FW�#j�Sq�!m�B��U]�?�"Tm����[1�C3d�?�bV�+1(j�w*$) ~*��c9���t���ppvc����	J������T-�I�0b:���������\P�yK���PY��C3,����}�V����������BM�Z�o��#)qS|��V��)��L��5����8V�nR��F���������&�xI�Tv�������5�5x�9���������-�[iiaI�x��-\�'P[�,E�"���E�*f�"shQw"�����n������Hq�tc��|pZ8���SoD�W�o�2�m���VddT����d��"���4Y��L���=�����GC+�;X�IIw4|Yw�ioV�~��vQ@��
R��	$�sG$0���p��W��N%�p@g==�\5��#������������_p#*����q��7��[2E�'������������{���3�9�J�k��y���O�y@�r�u��yd�����o���������w9z"�j����S7����BHr�]�oH#G%��w�, k��/&drp�������s��Jj6:�8��c��J'�U�}��I�������a%�X�D�6�^�������[{MF��g�kYp���Ik��a
Xp��..�8Z�L:���go~<���/o��O}Q9�tp��HIu��v-�1��qX����+�8O�f.���v��@�m�l�KPp�Q2��q�������a`����.�����������Ac��YQm�e�Mp����;p���&��U��V�d��'�Z�P�c\^s� I6H�I��k#e2V0�y2��6�^���ZM[�L��|W�Z�2���d=�W�@�|����(a�����J
@������v0��<(��G���B��0P�"��jm4c��`o�kV?W�0�M���	�3p�������0 ���O��n�"���@�$�P �������I�j����=p��g5��f,L4wys�������b����x=��=���j�:oL2��l�P��cJL�X&�x��������1@���<�a'�"�~����4�W|6�Q��r>Y\|��p����b��:�1�r������P�	���d�Q|`MT�f������^D����Y�����oi����/�����5@],���4����������_h�rJ�u1:=�Nb�qx�`B���tV���\���Si������� t��Ij���m��w`��kI��o���C�oV>@��T<�����,����g���2�MG�`����'u�l��9�$1t�F�Mk�xl@�g�>��L�������P:������bk���?�h���,��
x&��Xv���g���-�h��<ll�������������;��e�o�5�i?���mh�����}��y!|]
�`�@�%�o���8{�%�I@7�;��C�l������
���[��=�[� v���%�q��9�����X��N� v���!��j;�/v��B��&��e�]�b�1v�{*;;��6#��X�!�v!��y��50�X�� ��+�����*l�
�{k��C-'��
h�P�jE��n;�Na�����sG�����7c����n���nK(�
��E@����pS�$a����v��YW2!�4�x)�l>7�P4n7C[K����/������jR_y��4u3Y��rc�����Z��C�C�_�c��lb�{Ym���	����H��$�����zw-�e$�������@���,�s&�t��`�n)�����	6E�����B;+]������}�����b ���3{z/k3�������	�>���}�|bI��
�nG�^��yC�����<��'h7��^..*`{G�to��m�,&��9k����=��i�hwsw������h��,z���$L�lo=�� o���C�+�cE/H]���uw/��}�G����q�K�b��d����wNvvWrRv����8�h�Y��ja��������w���������������mV�&�����H)/?k��1���"`�Z:.�����=��O��JK��X�n�l��
���3���R��v�As@w���7�r�$�R^2P��puN��7��J���hi�`���<���P�kM�������HF�e�"���*�;YODS��r:"})��hp@;����_���M�\�	O�T�L���vSQE������v�R�������[�;+.�{[C���J�K��{������������IZR��CPmwh����e���
��u%
����2AQ9�o�M]R^F����
��m�����*jWK���������\4	���b�*��mn�M���M�%��(8Te�]n&�����Y+�e��|����v�m�%��ns5c3k��
mX��R�����`����R�%�%��=ZOOQ�-!���[�M����Ia&[��.����bI6b�L�
{�T+=kxv��'����%2d�sz�Uvs�a�T'	���S�E��j��v�Q�e@�����t��:�{=�X�N�%�ng���H4�j������a�]%Cn�0Ht$�4h�A���l�|�
��)d��z���Y��`:�����h���b������=/�[-��]!�6��=�z�mP�6�#d�`��l��-n��B��"t�j���o�<Jex�J6[��f�������k����W/�0B���������������G�uK����?����g�����}��~��B�a"t���`��\{!�"r	������uGC��R��3��q
��3B��!�e�w��m�-��kQA�n��J5n���f�=I_6��
�|�|��l<��z1@ol[\2������]��	0����z1���0}QQ��Z�����M��k�ur�f�u���eE����Q����h��
��V��I !��w����f�z�w�I�N�L�R��eB��c&�E��k	@a��z1@E�
h����b�����.,/������(�KPO^P��M�����?�O�kB�����}���S����������]�&�����+��o�n/��W��D�U�L=!�C��!�C.u�~������!��W/�D�'(�l���M����%��r��^��q��v		�<}CC��T.���x��{;*iE�������l;4����~�����w3Q����������/q0o��p�aLfK���������6��#q�����K�v���Y�	����!����Q�Z��X����)�����[a�7�3:���}��>��z�tP�{ t=G���O���g������wO_���o^�l�O
T���Fg���I0)�Jq�8���D�`�@��6�T�;�.	���������yd?��\M�t�\m�vG��F�������s�C���6��v{U���gZ+&�Of��.��x�jI�n-Q7a������'�[�(�p`�%}�MI��Z}�hsU���s�e�����u%wq[a�tI��d8���`�Ah�a���`�A�8*�a�=:� ���9�v'���Z&�����SQ��4�����&$k�<�v���&���4�_��-����dS/@���z1@�\w���F������X:K��]�P8{7�h��w
G����uD0��`1�uv �ab�/=�
8������nb�;�O����L}��~�b4O�kP�M@�\��J���X>���w�,�Bn<{x��v���!�!@�C���{�gM���k��V���{�+��@��YO.���-��$%��W���SG[�I�����g�Y,�G��[�W��A��M@^R��%��T�j��Tus����c�R���������Vd��C�Oon��Q?�����Lt�S�*�����I�N���7�+QKb �����Q�vm�)Pj�&���.2�99t^)����������JL%�E��!��d;�����
��@����x��~��^�p���Z*�j%�r~��M.W�G����	jF&��Eq���	�����51�rg{vD}�]6�Ce��v�O����Z��R�	����-��]O�W�/W����j��J���.�����/���bfu7��!8���Z��U��X4k10��Eu#$U�H"8wa���9s�$lr��������OU���Dx-l��Ghxq��V�U9�=R��p1�d��?��V[�%�|�;GE�-j6�?#l����9�{�y�-fX|����|����n-�a
�mT�.U`tK���H���{*�9�a�"�y�V�]��V�71pp��d1������VM�="�ev~�g�E@8mg2x����"��y*�t����Pd�e^�-=�,.�����&�d�(�ZQnN$^��?;-����$$�m�����&��nNv+���n5�������fs���uN�~K���vu�������Q�?���|����<5R�t��Bv��4D/�������C@��,�/K�bO@�#��v!���)�>r;f!�|�z����4��	�f8���S��)�h{���D���=|�>:qf>�����7������t_g�]i���a�~�b�������z�������.R��Q
~�=�`�]����E�g5�G�5���z1@���q~m���)����-8�M���(dh��9��#w��?��9�(��v�B����Y�K�e����\�o�t ���(��S1�����C��b����n�{u|���1��y���Qo�B���Q���'��k`�F��V����x4����:q[u[�<q4>��]��7*��`�#k�X�*�V���
{m��^4��#����U+������uh��G����t������gv�2}�������}��l<�z'���<��6	D��7���o�l4���z1@����������l�[
�Qg!u��xUL���0O
6Dn�"�� 
�����Q���2�������;;@/�����R��(��M;%/�&.��]��cYmj�P��������iy�m���z?����������f���8�(�7�}����0�mr���Pr������[�J3��Q23lsR4G��fqQ�wv1��W���|�Y,��'t�@�gy��$y��<��������)���kF�p(�WT��?�z��������,r��.h�w^���E#�Gl|��$>b��3	v��������zpH�cT
	�j7��������`���F�o�.��G�(V���VP����jB�f������M��E/�`61|S)��q�����\+��41s�D��b����^����I[�;�����[Z>�\1W=�q�-=���[�TR���0 �#��6���,���r�D���X�3�����6�&u�R��u��h�l9a"�QGl�v�&Pt��Z7��i���n9a"@)G�c�!>�%�<B��u�����#@;G���(��`"@6G���(2L��0n��0���<,��#�<P�w��b:JG�W���J:JGV�'�J0�(e��z9@�lt�{v���>���H%HtdC�]��D���8JZ/��kvI?H9r������q�G"*Gn��#�G6��^��D(�2�X��F���D;�,i��OH?�9rK�:9����%\����D)��b�M��F�Gm�pO{(9r���y��2{�#�!GV�s�9�@@��
��X�������,fF�v=I�2�&�MN�<����zZ��j�?g�e�g�d(� 3d&�MN��	�<���h	<��t�=�*��pW���<�����\�k���N�p	S�=���;er���A��'���	�|����SD���x@m&�&5��,��<|^K6�eH��Fr�*���w���(��������#��B�%��B����|�&����L}�p�\��x7@#l]��v��.O�{����1qI�*+op7����r��E�� ����M��\"���u��Q���	$���$�C�b�:p�Ww�O/�r���d��xB���#9�CV;�� >���p������4hO�+��I��;6�s���r�@��	h���fTW���7����	��9��C�M��!�P�	GE���'Wen����?����\^��-��u
�`D��@��[����	�6z=���-���=�������1\�z�VT��e1�#(E���v�B<}P�W�;�c)��d�;���LB�1��7+J����7��p|����V��p��^��6�Q4�����W������0tg�����I���
y��q��v���
�!�o]2a�����?�H��p���u@�kZ��	��h7�d9dOi����7����?��������_�~n���-%�~������K��D.�Qz/�(u�����%����q%�F����$f�hc���5��"Em0�
��aL���t�3
@&m�^B}�uN�T �����}�A(�+F�]�>S�������qV�!�j'p����>��s:����#�Yc�.���	�����3u���6�����e�>�J��=��Q�
�S�����rU
@f&�k2j����2`G��[��r6�rA:yV��OV��Y1�h�NcN�"lV+QKgC������z1@�\�N��wr�[v|�p1�S�+7�8���e6����:Zm���zR���j���j9/>�3�X�e���k�j���	�E��
��W��P�W��
��E�j���is�<@�&,A:T�%"���� @��Mh�86�r��{	��B�4���TTf7)�a��=��P��,��u�������03I�����g?�C`2�d��=�����p8�^���.Xc��L\r��+�����1����H�[f��I:z��x�M$��������4@]\��j-���p�ru�@g�t&�L���F@ �	����������K���y{<A�i�^��z-	�1U{�\U�����PN}�p|��M~+y���^�P���ol�%�.���|�����_Wc���	�wRdPF��5u�U}��m!��]�r�0�z���������[��%{	Z2����	����p���b&_�7����z�*�{6�����E���Jq�:M8�T�49�>R���N�;���Z}���@/��j>��U������0SWb�/����H��X���h�H���5qX�&��F��5q�X�<������vUMEo~�:�c���;b���lM\�_�S�FQ�R�_�\��r�:�C��@�2F���_[2���U-P1��&�4��������`+]��j�
��(�<B��R�x��[n��yg��M��B��v��o$�V�mt,j��	v't���~��,Ne��\������d~�� =I���r���I���$�ib�N�M�4�j�����]<w	j�I@���Umz�]����FM���'���3�W������A�����U3r�.5uJ�*��$��G���.���f5��4��4�i�a��:�N�m�R65��y���|�Q�����`|�Bj������t� �,��D����O�":�\�7�Wy��s�����.�'�B���)�`�����j}��G�r��7�cwVu�)�*��g�J$���dS�������}�����1ai��N�S=�]'9��O.k�oO�w?O�����g��<=}����U�����ZR�Y�t��9r������](�!/��6u!kU��x/����7��\&��!�MisN�����&e�<���q�h&�U/����5�`����I/����|GGT�9@��YIw�>-���4F����k��Z\��k]�����tp���k����T<��8���i�K��J�OF�_�H(���7
���[O[�S��L���4���U/HB��~8����=5�B�w�7�ZML�i1S�=;=G�f�}�	�^6��n�	�����Z;A���h���q�YyH������i�e����8q�� N��)�S�nC����r�v��N&zRqQ������1�`�)����.������$�����b�����4�R�6q3Q���KZ��U��c,5������S@&�62Y���7Z�^���\�����6��V�en%�����(�J\J��(2����[��H:��Ve��vY��������-�Z���a{��b>��W%�S7��T�0�^�����.!R�R����9����P���%���eiw7j�AI����
���p����%���'��wG�o:�����d��
������ ��n$KYV�Qs+��� �.������S��Q3�r�7����m?t�a�z1@�8���F�2��n��\�T�X.����*q-���*.�Jt6�������72Tu�k@�S.)��k��m`N�����EiN9���!@5�\RWs]68r��#�'~�F*
�F����lZ>������M�]�Y�R��9����3>�#����r����8u#�S@�N)a�
���~����l*���\'�N^g�r��^P
.G,]�4�7�2�����K�W��q;)hi������Dz����-�N��a�R1�ZKv�T.�dJ�E�-+�Y�<��� ��b�����!�;��q�u�Lqjc�M�Hq����w�sZ-���Zw��)�H�3CM�Nm��`K�p��om/�xV�,���5�f�2�N��B�'����6�;0��p�fq�$����g�*���o�)����lu@<8BxwS��qK��gIp>�J�����s��T�NnR�� ��I�q/9��HH����N4��S.�^'���t"�`7S
���%U,{s��8u��S�.�a����X�M�����d�3���e�A����D[�W�W���S���D)@�S���}��h�jY���/����.����y{{��n�iS��,"�u7��R�C����kH8@���e�B��&/*r�8�q�����c��b�q�pS�
��/���gMl�� Q,�:�����
�j��q�z1@�l$�c���wM{������_1���=:�@��(9]�S�.	s��WX��+������8��b}��p���N9t�p�6�Z������h����� ���t��)�Wo9N98x `�����Wf$��U���A���p��p�������������XY� D�_*������(�zMz(*h�*.�O�@2�e��UAI6�f.O<���|�G���:���i����^������n�O�3�g6��x
��d���t�Y����9[�{�>v�8���,�Y�Vv��8�X/��.������u�fB;�w9Pzj�t�34����Av�����X�/���6qzS��W�+��
�)�*�q��^���gN�����j�y����cr8s$����hd������8=}���: (N���:y��u}y��a��^�f����[m@���ahXw�u�l�����F�&�9�Z�VPA���yb�0]�X�<����*��z1@}���)���]���E��'��hp�����b�>:��tX�w��>�Y����a��R����2g\�_�t'	��Y�*C����K�m6���,p8�+]C>V���snqF(�����U@j8�X/HA��pY}m��H@O�3�)��}�;Tr�)�3��u��,�����h����b��W�����"���Z����5?���Z���it���-�K`�,t��zV�������y����
M�K�.@'g�^�EW��7��.&*��
��&�M���9W@�8zY/�R��O����f���h[V����g�����Y�K��
�����~��(�+o<�D��X�[���l�
H�,��$j�����?�{���[����zKHp���K���[���Yd	.T����#�2�f���:�����q36���R/�b�+�?�������+�l��.3�Jw�n i�V�����\� �sy��{#����H��Nd�pW�d��[l���D��.2�f�k��,��U��UK�v'u�N�M��x�Z��z6c�Y��k���l�� ��cJ^1&�����3�l����P�,f:�?����������S�dw��������?�7���OW!)�Z��G�����9��m^]�+���u5�3��6sL�k3�|R�U�#MTm�Q�z1@���OO�{y���m�������5Nh��
��f6�V�����fYoO�[�nVM*WU���BV�iG�V���m3n;���s,�������q��^%D����6�I'���M���n3�D��Kn���7�%|q�@��-�o��,�<n�R���������������K�T�W�E���*$n���W�r���S��@���+��Y�<�|�������
��5
�gn,pX�,���gj@{�H�����v=��5��n�U�r���[�6��i@�R��o����Ah�QK�	��Pr���b�
���wM����Y<����
�z��[�������s
{��3H�|�M���+�;�}s��'r�+2�g\c��g�^Rg���M�y2�g6�X�|R8����!��-?���������-���[	���F�n%��3,X\q���F���Cm�?��y�_D������a�b�[g\�a��.���t	��
��g�s�k��^@g��$i�o����RpJA��,ws�FZ5��s���z1@������o6`����s��y\Z�o=�~�b��������Z{0������k@%9�Z/�b�����9��<�����`����������Yh��Cp��3��9�s0�w�,�8�0�Ob�Y^���%*?q~����Vx������s���E���9�1�]�R���V����~l�i��9��XL��&���X��Z;��!�H-�r~��&�q�3�hvi&�p���m
�Q�u�d��8�D8����W��}�[p]����s)}�+49�hs�m0o�	]�R�R_��ox����:��7��$�o���j[���D����hq����Y������qp[L����k�b@�wJ��y�w�-;��L�Y[��ko9����s�x�b�88q��C7����.r������3m�Z��'F�d���u��]��o3k).A�^)KV`�SvS�_e���[�wF�(IS&�>�4T��-��������Z1��� ]��m��-P�9G����r�w�����q��QX�A07����U���<fP��K����g� fs6����s��|V��F���������*}�-�T�n����������1��k1���}��n���V\~l$���WH-hl������s��z�8�d���-Gm��<r��������g����9pV/���q#��!�ch&��Yq�\����g�#���#��8~o5�9������6�Z���]��6�{��T�9�U1�����X�9n���q�Cq�b�������Ax�Z���;"��1w�/�7�]wn�B'u���9�Ww���������s�l
7�9���sC~����I���i��r@t��	h���l�R����uk�����|~��6�E�+g���R���������,n����H��!a�8�����(��1���������_�m��;�is��=l@I��a��n�#���l��z/K�z��b���~QDy��y]-�u������#������d^�Z�M��0S6�����:�pS��<q��z2���4�B�l�Yl
�p�9�����V&�]sG�������<9;�s�����S0���P�rr���s��^���di�����4w�;��3w�;�������@�wIH+-�3�=�l����N�V��%��z�=��An�a`�syf��[3�{+K�l��f���L���'!{L+hj����g���-�l��HV�?�;����1��b���;7���6�^����dC���9rTa��g�������zk�/X�������Kn�\��%���"&�����C�b��syc���3s.s�>�z*�gW�\o�������zA����`�_&���BHrSH�N�@Q1���O��n�X8�21N]�d�J�G�2���6�]�E�^/�$���F�^/@sa'�1��[����H|�i=w�{��1���c�R���u#@w������F��n>�(�����1wPh�yk����������(n����,�*��{�����}�>wQl�=k����t�F$��n_S�8��0�*��Xcp�fo��4������x���D�;�(����c���c�4�2�V�R��A�;3^�������\�������z���Q��?.��� *��z��JL�n)<VMK��B�)��u��*c@E�q��P�r�Y����t��(�������o�np&��A����_�Q*��6�r{�S��TOnT0��?���f�z~��e���Q��!����b��y�[��u�����?����n�����lZ�����]?X��sQ	[������j�1�s����j*�3+�XbaSB�NOw*�mbFo����3+%��!p
l��9pLh�y�s#�����|rY{^���8{yzv���=}���w�'N�$kQ��N�'��fr%�f4������]{O�x'|C��FD�'v$�>���J��$l^aAX$�;	����.n����Tp����������`�����*�(�*
�c��N�Y�kc�������������7Q6/7�R|H�-s�x�o�ex�N/����i�b���=�]��a�x����$>e��?�k��������������n D���cG9�sku�2L>�n���d�N;�I����N+��x�<\�Sc������c�Y�pN�I�s���[hz�E�}���T^�=4�Az������7�v'������%����8���lF���Zd��p� >)j:#Y�M<�?��q��D�ttK�������j���MV�R�SU�����-�S��9oQc��2?��e�j�C��&�e�h�`��{r[��0	��%'+��x�nK�+�7TU(_�(q���!�����5,J�'n5_�~W�mQ�1O��?%lir^m�NkO�s��K��>����\z�B�bFlLI�4)}���(}n�����u�}jmTK��V��d5��a���3`Cq7 w*"�f-FE1�K�lEhN��,n���|�E)�O'F`q�ql�X�j��qdi{����	��j��o���)���
uU���r$���2u����&�qH��$�Z��q�fqZ
��{�H5�~���)����yV��5gUL�R���f%_��0���V��q�R|h
������p����'�7����
�[����?�+�L��0G�[;�J���h���F?����)~d=e;�J��9�v"�>�|���y���Z�S��e5^�g����vw����z:�oc������Go�:�e��3Fq���q����J�J�9��c���SX��E���E�����k������QZ���"��7m�����5�o�#o�Y���4-����Z������,}���;�;DK��g5�)�!R��.[�����y1}W\L��b=\� `gj����i.
K�5�i�m�s����������?;4K��#Hv��>w��r���)�O*���?}�����#�mA��i2���GF�8�����-�������F�rJ�z�]WeXS�od�`mL2E�2i�v���d�N�����J��`�S��� p�������O�t0}~�%o1.��yA9���o�-� u,~.������U|�#J�J�)�v��>w��~'�b���FqjZV2�����}{(�)�������17�C���^�c�����[XO��JhI�z����)C������
���P>`�};M��>���m�4]���h�FF��;�>@�}
�^����Z�����p�.DqG^]��T����~�>�z�?�)�X�9�f��b�3��f�~M�.����Lj���?�:R�|@s�lB��2��.�
�6
���j7���=g� ^~m����
������U���m����f "������������Y���[���1�U�U����'���k%�Z�'k�2�U>@�}�6�>G`��������+����������/a��N$w��[�T"�ii�����:�m[���������k��_�n����e�����w�2��?�9f�FF�Y�����2F���(��*��(n��XK�)����[�N��f�=GLS�n�����M*g4�l�9��Z16����_�S��cj�]���������
���S:Zq%Y��nG�s��%�����m��nU�x�b�2��Yg���>��QTMi�)?i���H���(����1{-��}�nz�S��[�6�����S�:�p��������Z�aPX-l"���5����,��PC����z�kS������v�r�N(\  3m�g
�s�o�r�*pX�Q���lv�5�l��C�t�\V��wQ	7����������c&kyV�G��N�KT^s��a��,�?]��m�Dc>ph�����@�Q��;��Ku4�><�S��������
e��>�c����%�:1f|��-X����U3�Z
�jOj��O[�rV^�=L�~�B�A�iuK�_6��a��UL�R�*������^I��Q��IcH�z+'�r��z2��QQ���-d4
v����Z�m�y��(	��I������.:�!��r1+?���d���Ug�����=��D����c����c����r�fp��)���-����O��r�*k�$/6����DQ@:xB��G��9D��IKc��_?v�)��$zh��Z�GD��'���79��?�/����������X_# x^?q���}D������5��k��'���;E'��\����<�gm�0��>��j<Y����UY��&_�O���b���~l����;A�V#��<F��`d� :��������G���t�|^��r��	��2��l��J�"�A�p0-v����@�,t�uI�y����&n����7w|��5����u�c�K�k�$���QP�>K����|^MI.�Rc�}����v���p���~T~�kRy���aP>�/��I�p5�*�����2�f}1`�^�7���$9��<5V�u�D)y��6��B���b�7��GW�,��~�O��ds�:}����	��hw!��;/��M��}���b�v���c�`����M��?�={������h�Z����S�v5Q��������������w�}�Q@\R~jt>V�����n�$���>��n��O�b9��r�g����(�{}���w7����3�(�O'����{)�dh�n�>'���<��,���j.�k.����F�p�>��2��}�������4�����Z�v��`��#��XY�o�����6�j����e�0lI^n�����,��[^���o�l�w�i�������F9@8����D����C$�@Vp�~��=sW�O��x�$ �d�s��Q�k���}����xA��t3���4E��j��q��<J����W�i����<���) g�����,�������P������^��[���#_��^XS�
7�9AFx'!����!w���Y�*��_�����n�X�K����C��d��t��-���G?x'�d2�kp���&;������M�.6��X���'��:w�������^�a��}�V�:p	��������	 �M��y*}�����
4X�Q��������90o��,'yk��p��Q�]5JkK}B=0��7
\�kCmS�l�(5p�R��#@a7O0F1�8&��(��=P5����!�����������`V�T.���l +��jg�tZ��^�
�1I���~)����3!m������bU����3s����h8U]�RV|=��E� �A0������K�>�_��X�hl���F9@�,��mmT	�6D���} Q��l��vKh�n��l������w�n�lq��QP��-�[���C�����P A�:U.?���������5�=����E�v������������/(�1�
~6p�d�fQ]\.*K,�m�e3��Vp)��r�|p�,]�AP�������Ty�'
�D?�����xV����C'���j��u�Tx��47� \� "6����.��n%��,�3�5��gG��f����YV*�C�	���h���lc���
����������K m`K>����R+��{M3;uVD<�+mo��-��y��JN������������D��ho����_����^K�iR�b�������y���)�pl�2�h��
�
"�|�����L o���F9@XX��$��/)���������h�#V	���Ez�7����6W��bme`MU�1L�6�0[�������s��85c���������6��JU�Tm���L4���4�]S$�o���
b����0�L��fm��R���7,o;�l��b����[�V��7���p������.9m0�3,`�\��`����q���n������{ob��+�q�f�J�a����[�r��6�������fK����o�Ln���5{��q�F�����8:MNv�cx@\c��U�)��O��D�fg[���?��}{�0��XKj�@�������.C�=$�h�����qI58�9G�v����X��e_@eL��
�8@y���}�p(�QP�����}�}zp��QP��MI�gq ��7��
�\�d;U,p���z9��
,���J-7]j������%����pi�k��	��A����z����>!�P�A�<���
L��8����S�ap9L�4���9%��Q�p��K>b�	�q`�F���g`�d8�y�k��6�L��A��ls��E�^�=��4������#fd�Ajw����s�����V��r����t�gv��e.6�	�����6J�A�F9@�l�3M���%t�	k��]�g4:�Fw��9����9�vC�G��|�����:t��C�C�\v`c�3s��t2�\P�ECJ�MDrU,��p�l8!��C�6���;�B<��ll?
���{=U}���$YK�E�O�6����w�l���.g��!G/7;�x�=�:��I�9����hpp���7f��9�f}��9����5<�;��N����}��wH���y3 6���:��!��C�'/*�|�9i�92��>�M���m��DG��)�)��Ln�z����M]��,�{
(d��9�d���N�p���6:�@
-I��'����>�X8\���y�	-d��,��vC
��E9DG�X�w-�u�@���j�����Tb�UL>3��v��rU|,�M����e�0�����%���m�o�A����z���`�\/8r�����B�Qd/��z�H:wC��J�)}m��lY-U��l��c�pr�'k��H�T-��^Sa�����C�)��i���=U��7^�<@>��L�>��:f
�:q�����Ra;�-�^�r(�QP-�lN�d�2��)�f9������E��V�����M���D�4�3�rr��y��R���������}�A�)���@��#T�8�C��?)��G�������9���
�MxzR��+V"��&Z,�Y	���Z��Cm�a���
N6�I���@��y;QUAO���E;`��!	U�^%���9�\��SN��"����	�f*����
�W�>G�:uhc��N�InN�U�X����*�+'T��M��F�%�u�0�a��.�Z�|;��=�xl��Y���Tj��*���|2-��K�H����6,mf����vhC�
�`�!�m��2!�����T�����#�����fe�a���:t��� ��.5��-��M�	�?=>7��:�q��
����o
���C���������;�t	��2'jgRA�
�mUL�8"F�������{$��U@Ob�n~�M�NI�dK�1�b�h�I��L�b~��ew�c2��zS��61 �C���M?1���y���
 �CG�z�����:��������r&����������F=A1\t�f���!R�^��w
�����Z���N�Qq�������\tu%R��`�����J�y�8����8�t�S���K������T w��.:w�:��Y����<x���_
����e�a���}�o�5���G�_���8;:z<d�O�a��r�H���]ip��7�_�BrC�C���:�x���z�5�y��-��0�z�����}�Y,@�Ci7�R���z���f��eQ�jcV��K;=\P'7��dc�E��>���YzL�m�>�@���5�Q����2�ux�}�@����F9@F2�1v��V�r����:f�,r��X@���?�y������`��KSh��T�,�|����a��-5?<s��o�,e}5��q�
t^k�����l�����	!��C���0��������#q�	�����rH9� e��7R9rM��(�>37���Ju�>�]�>�c=[�����U] �q+$td������1�s���%����1q������&g=�����E���l9��>p����\q@��1�s8��5�s_��qn��#?m����E�����6���s������>��;����f�;��3����;z+O]�Y(��L(�v�;�!������+��#��������P��cJ��X�� ��Xz��8��R�FY�kyuo����7Z��\��K�A+�9@R����x���rM'$�rr.K��;#HG, m�������T+Ds��g9�e��@�^�*p=N�s��S����-m�yU���\T����M�hb������|^�S���{���G2�B�yX��'����a����+���^�n��(�R���Eg�����=�
��(�rd����p���+{C����f���q�F9@�{)u�8i���E��lI�"@/G��r���F/�~�������C�T�QDGO�7��$J�����&[�6������O�( ����(���Z������Uv5Y.��\��f���	
�Ro|{
j��VG�#�0G����&:��@���.T.fy^,qJ�NfE��(:o��=��T�������1���w�	�6���Eu�8��I4�U�����(r����n�"g������5@�8��(��cf��8��|����rd��lL�9� �#.�2]��]?�R����91������[v�y3���l��1��7B����q�QP'���6����MxP|o���F9@0��@wS|+�1�y���.���D����rk� �K��x�
wH�d=����q8�^`x��%ta<�=��	q9��2�`�Q���<�m���5�iHl�X?����qZ�jr�l���f�T(.���z2�.�z1Y�W�'L$j�S�������������b;Y������5�,`f�����$>9�m�	7��6�R�e�i���l���3��5��V���S������;.��1
�������an���������L$FV�w`�r�Rl%Q[�H��9s���"!	c��&H�������W7@XX�4e�1gvlQ�O�����{1�P��~�>
`����q���O�j�:Mrp(u>����������������7��^�zvs~q}vyq������������#Z�:�t��n"�{�o��5��%�^P���r��(�N����Y=������*�=�m�W39�Y��y�g��Z\����>���f;h����K�:4��F6F�uh��2�%�0��z��g2L�?�-&���d����Xl<�27���S����:5��&����,�Y�\y��h$�Mv[�i��eq���=���j���2�$�Th�6��y����3O��X5a�����(w�;�����:����5r����u�_VEA���:�
_H��o��W@�F�����D��5�^A��a��PT���o����j�T���l�I4�j,���+���M>>��.��o�85.+O��{������h�N�f-��=c�yY@v��m��
�N����L���������T2���E������]<b����Hw��z�����#��,�T����7Q�~�yhK���
��XPv�E�_�6I�����������vl%hklC[y'�5>qo�:	y�a]����%:���.0�����6�OM~���3�I�ip[�)��"���a�"|W7U1^M���q��������_p'��u@T�7:��B})�b8@�$��vz������?�g�����m>��K?�fuJ�5��������
�-��H��\7��4�,����V0���4�������4v2�0�iL��K��>:�4k���ub����ip�>���/��i;���w7�H�p�k`���2x����U���c������1`Lcc����@�g��=1��xz ���W��1����#���1 Tc�P�o>�>^�}<�<c@���j6������*��rU�`����%����Ds;����`0@<D���w~�+�#��E4���FV�������n�N��'~�k�k�qc����J�;�������B�6h���@Z���4I�e�46��X]�	������\�.��������.@�][�^�������Kb����M/]��q��XBb�}���h���c@���cF1���-�� �����W�!�\��q��Xg���l���K���w��9p�6�Zv����G�d�1�������q�L��=p6�t[1�8��#��Il���v�N�|/�8$l���+��/p�6vDic���Nf������8>��&���m�����%d�������?�-�.�Fap��Kk�����q���(1dc�.�pvV+m,A��>@��������	�W�Z1`ccG66ll��m�5�&6vr�e��%�V������Q������P�Il�S�k��&*S�8N^��L_�87�\v�1��<�=��#e1rc��-��BK-��=���������#��x�����w�k���IGK������@��P�:�k�V�=U"W��jr���_R
���FR�h��Q=8����e@��Vjv��y�dh;�Q�MJ�I�k� ���>�i�ZEv��1eQ��:��<r���VU�����
!Y�
o�������b��X���U���"?�k�9v{^�����u�e>���aZ
��D�������z�[UM�����tr� ��{�� �����b>m>�����mx�
����������yi�6������.���S:)����U��d�:�2Y��m[s�Wg�������PY�G*lz������X~���i���:g�p��l�K�6��4��J7����Q m�����W���=���:�T3A2���33:��}����3l{6fxr(n��7v�z���
Al�S@{M���k��g5�9�����<;}�V��7�|d�}�x9:��^��Fuc	����O��h]�������5��|�K8����zme�3�X@@���	�����XD@�$��VD.�u1�|c�k���`}cW��������>�.�o,��6k��m�ns��������������A����/Y��p,a���$�N�,p}}��� ���-���x�zN�������%�o���8��cv��%VNxWC.��(����L%'�#�
^���8q4�M6��8nq�g�������E���*��8�����h8�\q�}���h_6�K��X�^2�W�����V����M���A�<��8��	��0�i09(�*
	����vT{6��\Mkg�X��z�QD�I�V�	�j�����O:r��c�<�m�4��M�FP�����Q
yV:���b�����?�65���@'6�x@�@�8�|��������%�"N$[����������u�d�%05�}�!Ea$q��f�h��M9K��8	,C(�h�N�l@'L�O�	e�����������,�G��3)��m�'�[�A�5���'�KND.y�!�8	\�%=�)�>l~��$p=��_4������w�rtvy~����kG	`��Ef��#b����h�&6�\}������p/�������T��4[����3�pb���E5/t;�������JE4{n�e����	|�M����7P$���S�&�F���i��Ymqy����R^o]>l�8Dqbs�5�0�lq���Og�u�nn��;;��Z`��M/�X��H`2�P�����Ff��*L�QUu�x����9Z
��2o
����o�_\��3t�w)tt��#�~;P�����|��a�n9�l~�}��H\1Wm@'"Y<����76s=/���;�&�%-�?��z����'h
A��lz��D��G�o�xyM������������G�O����&�:ND��7"�
'+,'�$N"�#���KgP���.�7������f{��'-N"ap����Nl����'��2����7��^�(����;�g=���g���H��>��;��n�=]�&3�W���[I�)0�A��:tn���j ���mq7;&��lGI�5�H�0�'"��<�f�P�z�k6e���#������lqw��L��'���$v��������S�L��:�����Qk���C����Yp��g�����Wx���G�a�e�"�-��4_�b@�a����
6��@
'6�\��VE3���%�o�ov#����m�����m���	��D�!�3�U?(qb%��������W�z�P���$�k/����m�V����
�|�3������z)E�j��������������?����U��7.����fn 	���/f��8	����@*�KH�`��
SfC:@)'VJ���=�zO�=��D9�"����������+��9�"9A>�{�	���3f�������_~�@����X74q�f����i�r"��>@*�h����h	s�*�6��]��x��&`���)��_������ ]�� �D���}��dN������;^3j�)�j�T������>H�����qc��x�.?.J5�^�Y,���}���Z�~8�P�w�jj}'M28q$�@'ndp�P	4{���'o$
$.�d���n8t�M�G��{��������#��#��%�.3`%�o���_��_�����b���8q3��g�he�Qx�2Nd��1N�\
�{E�����?�s�������8q4�F7�b�n�E�hX���d�n�x�PaX#�G�(�97�
�sb�0�7��F1@����@�N�������v�I�~��E���jH^�35g�>{����t����4�5����v�����R��8�����ct����Z����n_��F��OJ%�Z�������NO�$)��~���]e�$�{S���1��]ER	5���S�O�W'�<>�JU�u���j�<�O.�"�����4t���j����S����c":S'RV��qk�����C��.-�
��R�����r-=�vQ��s��o�������%�M�$����:8������v	�,2�U���:4{dz3�]���l ��9u�8x��`���9�b���S@,�6b��(�/(�+�6f�#9
s��p*z
�<���L���rZ��Ck�5�H�sG��\��5`GI�&	�<�g�~L�NEVxd@�&��|�v�o��Q`�����S
#���c1%k��_����H��N�&�Z���<NE���3��S���&T����nZd�� n�=�.�����:�S�p���>�H����1NE���K�q�J�����tA�4f�x������JL]T^L����`D��NG�nw:�j�� J��8�q�W��-��@o*��>���n�L�|��C���*�:�~��
h�#��'�<�:������h�k�����z�b`@e��T�s�hL#�|q�`��N�����*[U����7/9%���U�]��lx�	����X<N��Z�[*>Xa���fI���>@�\�n����$#��:5Tz@����C�O�sqc�=z���f~���qMW��
������w�*7.v�������xW��t�(�^N���r+D%5��*�����V%�r*���>@��������g<��o�k|L;*-??PB	�f�Jh�>�S�&����������5wP[�/D����m1[�u��b���\<=��5�a�����*0���Pi���\����O4}����l)����4c+@�KJ*��kd�3��I
��S	�6��&6��N]Y�)�_���8��mRk0t�C�q�W�������Bt_�=�@��#$�H:�-�/�e�9�yS�T�@����l�iA�a����}�6v:���-j����[^;��X=�Gj�@���V�b�\��,�`�Wz����w�u�K�;�>�IE�B
&Y+�!�G����S�5��U}x+�|��g��@��5��-q2>V��
���i)��S�����"���>n��H��&n(a
(���RN��r�7<�#@r�d��V�:~���PNW39&�.�Q�~ 1�i2]lM ���$��h��F#�IS������mM z��z�8��i������A�j��"����/zkj9^���P��mN%���h�
`���]�e11���j�����@�J��C�f��U���M ������d%�{Q�d�5K�������Y��q����1N%��9?����#�N^7�������=�V����b��@�'�����d*,'���M3���,f��mO��m�P�I� �
��0�`C @������C�E�bo�O|
��3����YZ�z�As@��e1����M��^�=��'���Nx[L.�;=Y��!s�y�oT�oH��J/��	��R��T������ O��;PcC4n]����y���:�R�M����!��#����~ w�cO�_��
��\U���{�*��C��q
@�4������y��U�?/>u7uQ��lv
@�T��J��T2�(=@`��qb��inG��t�������) kS��e�p�����A(-�>h�F
�M���>�= �=�Yzb�40nv�x<��]f8���Dr`;|/�n&���>v)�\M}w�3P�N���9��f��������Ky��^`�L�i���������]�f����Z��j��.�������U��fW�LBs�}�j�!��u�@e3������t����������i�j�0���������e�I&����6J����J�sH*����S��������aWGf����D��#��`����>���������������u� �r�,l�������O�t,&�@`3��j���2N[��QNOgD��!����S�o����u*����I�-]���p����w`Z��6�x[��j�6s6�
#Q����D�|��C�P�n�=[]�}@�f"y;T+t�{�f�	ex�L�q�}�H�����l&��#�J���R������S��eZs�X��*6��Xv:Z��L�bG�?�Y3���fL��W���~�fv�����@��*o�Y�~6��f��D�v�������~,���}Q�%-�o��Py��^��<��I����x���&���Q������i�dB�G�s��M�J� s3��U3���u ��bU�D�>&��2&�z:{��T�{st�z�oL\d8��b"������j���������8��7��PEu�x:b+.��]��3��uc1M�zMz���6���
�A�3>-��U���_�X�r��:��K|^>��R�����S�%�ic�v�������#�?��
:���xv*��>L�^�jj�*���^P@P%2X��5�B*����LJ�ng�>4����	�W�F��
��������Vk.���PL2��O�w�T���G�ZW��8�8e��Y(9��|.�n��bs��>Vc���2�gVL�,�U9m[��*���������*�A�S/��B�b����lG��C�i6�D��Nei��Z�Q:��:����3��X���vz�0��\u������LV�������t�����F,���1�yg���l6[<�FjV,������l�����b�L]�)*��F����z]S
*��������MPZ�3�
#~�t;�5T���jm�epZz��y��j`��Lb�[�Q�Q��m\�7U�B�Rwd�rn�Qu;ba�f:f~I�f��X������Di�^,[��y��N���B�1��T����b���,x7k4�z ��DG�^�?�A!�8XL@iEz�c0C���Y�[��~]�:8��:a�:��6����2���F���gzE��e~Xo)���t&R�z0�[\kN}<�*�j�*o�2�Vg�3��:���B9��������.�tap��H����e���� �2A����~��L�����3����h��\o���f���Z�fH�S��z~�}x���j��P�T���WY�(�q���3'�y[�����N�:��f<�:��)=qw��Y���y�9��z��Lrzf�:`�y�Dsf7e�J�Y�j���0Cr�>J�s�6}�9���WZ��I�s���]��a!�5f$����@�0��${���9s�}�$����w��������k����5�a�����Ug�k�l\���@���EcT��ydy�}vY$�I�4[K�s&�1����hd��G�D����Q:��J�������5E�Z�3�Zg��L�B��I~�������� �3'9	�0����)�O}��+*$�|d�Y�$fY�.��N�R������K�����9��g��V���^�c$�t&!�l�	���ay�����l+�����,sI���QfT&����n����O��E[glM����:s"���2���Zu��Y�(9�2��Xv?��I���>@�$�d�/D:����������0�Y&.��� ��dy�{�Eg�+0�'�;4|�W�����5C����iW�����~��J"B��e��P���LB�Y�h���6���t]z
�Yg?��E�����h��NM����G ����$����n`��3�<��F���1�]�<��J�
��3	���'|v.�%�^%�u.$e�"�%����Ekd������%�v*9��sg���r�7���}J ��i__��9��s�7���9���WY������O�%B��f�S������[�u��%?q2������q������+�H�,<�X�[u��Pbyh�<���_V<udR��:<���]o���
�������}g�<^��'J�����K���k��^+$Y/�D9�s	Hg��'�?���s�i�/�Hh�h�BL�	�W[w���*��&`�h�;�[]���}��X��uT��A�U ��l^|i�Y�
�����s�8>[h�9�M��������)�1��1��������cY@����r�6��f�9��sW3�P�y�� �v��d��'���_$)�{.���>@dDB��YM����b�'A�.n�'c���f(�<t\���K����!r������)�L��c�`j7����K�y.����	@���Z��R#<���<�G��J�M������n�}#�0�y�����Nr@����4I���8 ����g�D
%��m:+z�,h���6T_=��-�J3eEt�6�)
P���un!��me�O�A_
���[��m�7L�))e!	�l&����%~Z���H�G�6�Y^�tZ�R���
`�s����)�;������54�_	��f���?�-�/�����Sw[S�*�F KVhv
.�iX[���`�s�qf#>�%�6�f��(��6������,H��7\6
�"T�����sG���e�<��7VCqn����m9<qnw[�
�f3�W�������d]����O)gi���(` ��	�P���P����9�Hfv�6Y�'h����N�k2W�)"�^����P�I�]������N��:�ji���<�(	�%@#��:��u�T�ovT��a��9�s���5�Vi��e#��@e��\��}��� e��(�h7��W�8w�n����^%@�����j�#q�@�	����������.��v�.����lAuBCEd$�[F�kq���.{u�������p�*�b��K�2�P	Z�c,�-���2�5���8�nil���\F����z���wE+���r��a�u|5��pX���(��H�3�=�YS:�r��/����vx�����q���V�;I�@{Y��?�z?�����r��c��/��������d=%o�~���Q�g�f�����N�	�>�{S�&����%T��j��^{8�x��(��3�'@sn�|�X���y�������>/�u��o���be^)�����5�h��I�c��^-6��F#9`�s������������3J�`%	��\��y��<��(
���8w��k��o�2�W��\9�pe�j-Z���e7��,]Mv\Dtn3��~��*��s.@OD:�i6Dt.Eoc�/��7�t�+h�<w<T"�2
�C"����j�����g���y�x��.�N=_�VzE�${�mU�^[���`t�s���Vt�Zm����%���v��t$��rk�'��:O[[�[�jTd����qc����,���-����b��6��fe��������6ht.�\��&��d�oFXj�qK��C��.��I�Vp��I��g��:u�;��%������/SNW`�/8�����KQ����&}n-j�-n�J2��������e�C�|�B�����>�����;o��������U�r�ZU�>w,W���������C��7&�e�ftjAUg�����j2�<�*OV����~k����M�ZP_eU���C1���Th���b���'��2w�T���_1_l�<���Wtn���EeU�����X��{���J/�B�@��WeQ��=�SeV>.g������-���*b��5�R
-Vs}c�w������R�N�Pp*�b~�XM
���X�<��@�bAYu�>�Ua���������}�/���	UU���Q
�������m_����j�MF�����U��/Pu�N����������_�Kf����o;OM�[��;}m)���
��,g��\�V\L`�t�v�V�&�vl�>�W��xV-����_~&i8E�O��n"m����}��,BX":e����������������. ����v���-r�h��]��7����b�La�j����/���f_mn[�y����{�rf~R��5�if���Y�@v�>��s"�t#p�	�q�+�+�VQ��n�}�1j�����@�����0G�4����d��7�����W�g���?��@o��������\2uo�<�l���_r����[�d4}����{O��Z_9����o�X������K�3y5V�z0�����w��~P�]����uMvV�>wx�$	����n-��5�-K�/�&�s��jd����l�������iv�]�Gv��SH;�L�;,kU�����X�t�@������������.V�
�!�EIt��F4�$��5�����I��z��_Z�I�w����R�����9~R����2}�(`'A�P=��g�@CD�����
��+��������5���T�b�5�K���XR,;�L�m�}ph�ZK��b'g������-���P8W�����ra��M�>I4�$�\��������6��+S��jEG���Jm_SG��zw�������Kv���> GV@zO\���mq���s}H������������5}�����]���_\�]^{_h��������j��E������_�����7�=�����
��1[��kW�����[������K�����}&����H6���s}�d��j������*�&'�hq��[UQ��w�h���#E�;�M��`�d���s����c���>l��Je����	����tT_D�U}-P3�Gu}���o��n
�Ni��/���S��<x��iz"��y�m{Bc��;U�v�.�>\�=�����_~Q�y�����g�G�>��8}ws��������_��(�d��t&q5
�j�|�j��Y��l�q�%��>��:���)�d�Z�L��+����"#��BPz3�N*�/Th�zr��������'� m�l�������h�pPj�X��v+	l\d���}��uav��>w���q�����e)��/���n�\������
5=8���pyT�N�f���W�z<
62�T8}��%[�@w���#Z/���9��)q1P �����%��r�.T1 ��[z��7�2� ��8�H�(K�z�>vV�>�G�5V1��"�����R��FLx������7�)�Y�h'�������?�=�>��U�����g����h���vGR�&6{;�N�;�_�s����Tf;B�)���Y�������8���)�S�NW�����0�����/�.]��)����n����4J"��.>p�c���^���t��t�p��f���#��O����, f{�}��p��x{vZ=X����GY��)���2����������5��W���]�8�B����������p��>�k[�����z,�P�5��,���9���Wd@��7#q��|?s~m=�o�g}���FFk*sU,2I`������U�����c�X��6����8�0����?����k3��)\h�������������i�|�]��T�
�YYP@O%6^zs�TZlfS�B�����s5���'}�cm���|�#���^��SU�=����djJ���z�:FUf�D��;Z}'�c��\zo�6Rl|&[:�������*}@��]?I\���?qY"Tc�8�|�iV�����0h~E��p��H�w�����~���^^<u&d����?q���7-�����]�}[��u�Pu��s3]����<��i�2��Tzy���K,� �'��+W-&Z{H��
x�������W�H�0�����������5�@���/��5Vx����X,v��Y.������_>\������P�	��P����}�3�l5��5����b�����������m���wB�lO����3V��y�<����������������(���5z�o�Z*������L��y��.�A��c��w�D@~����-�.  �����}���.���4�w�H���:4 �Rz�t���Q��i<�P5��}�bK >�V�Q�B�y��s�q_<.�����	�j���s�f����;o3��G6��A��/Y�!�=�W����oc�nDHJm�b6�~�6�S��)tC��������?�z���{������if�R#��cz��E��g�i
G�+p/���KC�@�cs�g���E��<y�iiU��M1��9���w�|��_T��s?&�����-|��G�V�\4�^z1�I54�=�P�����^(��T�T�M�f�V!�8��}���g�}W<���*���������R1���K��������?t��Q�E�S9�s���w�q~������~v��L��[�[����������\����r�����4�z�^�j���=�:7���m���������w7S��@��/z�\}~��~��q�C��!�r6�������S2�~�2��k ��w����.�
�A����� �����~h�0����=�R>P�+1_+���=�����������f����i������ �����������h�`��%%)�x���0�B������dq��9�{�M5��h$�:_���K)�72�~��X7G�K������ 7�R������i�Fa�&��?v��$7=��e^�* �NI&����}�1�;e���O���Q[�F1EE`���MtZ�
c����$�c�]�0;�@�=�����u����������R.�tq�Y&�d�/^#��v%��smyJ@�����z��*(���_�Q�!iS�9u�Q��)����X�sU���O6� m�wI[�e�
���A�k�fF���f[�m�|�����i���P5��c���?���M��r���T|�<�^8w+U�~���� Y��j�XP�-���X@@)�}��&�iY}b=(H_�m��C���'�r��n�&m����9d�����->Ho�!<��u����@���}^S}��	�2q����1���������,�R�\�k7��E�SJ����d��RvM�> ��w��i����a���RJ���}�X�P������xg����1���t#2�S3#�c���I.�xt�pL5��j0��
xEK�^�K�R�5yc��1��)>�<��2O����>'��"�T0p��-H��>a��or.�
HM�f��u�{|�z���z�d���Rn	����'����i�F������*5��]3E��>D0��f�������V���Y�r���A����9	����\2��U�VMk?��$�1����k��s�p���D|)O��(�>	|�5�=5��F,���i1��
�u�LKY���f1��3[�eXV�M\w���RJ�q���m3�aAq�3E�$�W�zk�9M�����{j����
�=�4�8������J$����XP@����`��k
 ��/�����y�l����h����B(�<Xz),(�t{&V����?*>���]�d�of�2?�OE�_�@�E�_��d�0�v����S2�)�hK�k�~�X�������9��l�\�t��%��O%�C
���������`���Gk/$�f���t�W1=�����x�W��-p�x����>�@�C`=�w���H<N���^����� 7!8���{��:�@�B`=��w�G��`��
�}�YO5�]��Q{��3���b7-m'��U���)G���^�>�<�����[�>m������xF�%�T(���AH\,S����/�N I �����b������J5������,��w=�W��c������������fN����"U!�H8��g�c3�{���q�'1���o�h{� a!��}�H9�J�]�K:hrr��5"V����.���
2���m��m�Rb�A�_�Wj���������4�y
:��1�Y�s�Ah{z�z�}W�n�5E�_��o
����`����y��x�CR�#
�%�S��@>C �3��IO������%�C��7��2A�LO�2��R���F�����}3��|����"��r.�� �!p�l@fC�������F(?���J4ZR8��@���h]ln	�)���������R�������&� !p<N��x�u�����[��	@�A���ZU��������KN�@C %1���r=1�E#�z�ol��� �fc���e��}H`�v�V���u��-.�V4K�hHgW���rM�,��?��:�����d� v4a�|�%�R����q;[�V��.�N�R8�X���&�H��>@�\�@��O���=�>�p��:P�}�T�|���l�E���=��h����������A�@���4;�P�P�[QV����w������uA���Y�N�{�� �N��G)A���gy��&)����k���sq�����	@
A�r�Es�v�e�i�0(5#�
d���4�����j�x��V��^���w�Y%o���� q��i�c����n�
Z���bY��]p�`��-D�?H�e������*
z�f_��;�A��hg���Z$����=Vfo��m����	\�[�$�`�s;������o�JY/ ���9�18Hs�4v s��&0�1��~����R6G�|v�==�x�����Z�����[����� ���#�X��WV��z��n&�m����}mY5^�`�7��}2TP��)������|��V�gh�s}C]BT`�	
A�`��U��8� P~����p��@fB�Of��@fB��ku�a;."�Kz�����N�A�C�O����2r%�le2r�������?P�}�x=����@�5��[���$��@�C %;��> vRD�> �!����Yk�����>��x�(��k6H|r��y��:d���=������'���a\=��h�N��+|r���7����	v�.��.��7��?�x{s����AzC�;�+�~J�y:���S^����7�����4�|��v a�����[Tv�7S�B��+ +"�'+��8�nr�!��&�B����>�x�B)�����u����a5����Hr��r�s�B)��v�T�B)+MJ��lq;�����P�}NR!���7�f�EWo!R;�����B������B��$N��$N���� y"�Nm`��kYh�j�����YM��`��b�x&C2B�3���l�9i��M����B�Z�_������w�� �!tKk��+l���5��i
!Hk�AZ�vtz���wgo��4�
��
��8I�TFO�H7:u��Y��)6��h�$M��$M4���
@�\�K`�d2����
��H)�>���S��u�Bx�e��:�{8svU<��y9�?����\_�p�G���W�-�B�C��
Fu���X�q[so�"t�-�6yO�S��^��wr{YH@E`���9���J�t��'�n�Q~�%�/���J#w����6Rdh!���}���0�a(�m`����!H��L��J'��}A��������r]��53����	�^����s���\����z���������~�������|/�@`����!:��<u!�>����Z/�n�+�m��a��f�b\sj����/$�VPh�?�'2-����R��[��{����_h����5N>t�����-�)�9���{g�@����
��I����Q�i����������J��y20j��L�>�b�Cr������M�B����ph�2W���b,@|\�Y��y���@(���e�P"��P��E�Z(����~���f^a�H?^�}<�<3�5����4wn�_W�dDJ{9����XT@6b�e^���I���q�C&����|�<�G
'�8}���������Q���<�]Y�D���2����*P,]������qU�$�<�y����B��^p;����}�� ��FfL��O���j</��j,UL>y���������3��5G%�j�_.6C��s���������b���6�wZt>=�;�?�6o;K����4��Z^������\h)H�q�a<���;�'��������aVXxg�����w>����v���0�z��fK�9���_���e;�`Gq����s�MR�J���o`����#x��I
������s������sq���v�g������~�p�@�RGE�����{��S�)������p���y���ADw�p�Y�{ JN�����p�0���d��B<5WZ�P�)^�9�]�����v������:lM%<:t�{zX�
}��#-�����C�K+�-TK_��J��O�xJ�2�TAL����$�bU��?y�������vP2����>�������7�%E��Z4Io��?kc�[�����
�HU���#�y�,V�����Wj0Y���B�G�����p���'<k��"�����,JD4��D>+!3G�gq���9�TL��,f����c�����-���6E/�v�W�����r�h4:��B>��GV�(-�����s���4b��z����������S]��6�N����2������~0�!t������p��1�:7��I1��E�7�?�>���x��r��Qk���3{A�;�oh�S/0�O �z������r��_�~�Z���>���C��b2-�G����@v5��5���+�H5����By�~w������+�U�>R/��	��,����v�YP@�����j}�x��DI��L
��O�c��X����Y�?Q���^��z���g�z��$�\uV��P�#�b���t��>��Tv�cX����3�����+z[z<���#NG�Y^�),*����\��D�d�<{n���Z"[G�n���H��)��+j���Etv��v����sA���Ge�2������On��"�lH�9:y��q�-�����?7WS* t�B{^������n���D�����������~�i���Q3��B�QNOg���z����i(��7/vK�	����-�P������u/wkb`1ts�a-����
�m?���D �#z�*T�B;����*�
��
�A��f�����}`��3d:�V%k�#�;+2@AG�����v���7oN���&NV#�IG6N��r�8G�6�������7���6����F;?�z*������D}(?*����Z���j�X��a�7*V+=�Q���K���0iv�)��\	�������)�\V���q��<xU��/�SH������].s���yhR��CS�R��l���;���T9�G������;�(n�X����� �F����x:��b5)�5��E�z��jt :�U9���o���,&�B!^cR_M�c�X;�_A�Y5U4�(����wy����:��]���:r6"�T�}7"`G���dQ�/�`�6E/:q��B9;����M���H���}������r����E��'N|��Fp�(r���{E�2K �#HP�4?#���yg��o>����N/�N��
����w�~>}����4��Lb]�����l:��D��U�m�������i�&��uT���6qw��G[����Go�4\�3�~5�f���GBu�R�
���i��2��9�;5�e}��#�siDx��;�4��<�����_u}|��Z%%5�x�q6�<x�����p���P�>�n��^��j�
p��P�ZL��D5X��o.<r��3�����jm���;={PM�v�hYT@�\�p}T�@����
mS�����)(�������ewj�d��.��#GC��i�������?��������Q���_:�m���
h�d!��4��\
������u��\�����������:@�#W7p^���6��^���/�GQ�����?�VE�[m�;9�p���<����%u7�y��h�H�R�f�������k�#&�M���}�tA���%N�.��#��n����p��m�Vmp�i��f�g��(��K�hg��{�c�j�����g��Du������@XT@���GTm���O��$����0��w��X�����X}�Q#�#G \Mj���zN�>w6���"�G{w�k����L����<����]-s�p8�r���.@�#����?��G��' �#�yv/Im��������M���/���Sc��W����Ck�Zp�h/�����/��u�������>�����������D���XF-��r��v�y�b��k96D��!��
�p����s�p�t9�l������3���#b*�����q5�R�^�R=(�co;?��gV��8�q�����kH �S���*�������M�������W��(v����5����h���F �l�T#r�������\NFD�S��E�O�y�9�\�}�|L38<��Be	��z�K:�SWNe��&�:hJ�/�e�=��&�=6O�vAmz�N90ZO��)�?���S�^,���V}����b=�H)Y���2FQ���B:�u�����)�V;�G��<Xf�sN9��gp�s����Y����+��nVWs����I��&��2�#']N����X���7K������M�;�s�}�p�!��4�k��G�t\fcTIP�:��]�M^O�~�K	���2��8����YI���x;�
�����|;jv�3���%����Z�����lK�����j������^��v�W���(��1�m
��4qu>�/�i������^^��-�|���u�L����Y����,Y�wa�I�����X�D66�KR��lx]�g{��)a�y����\6K�+�bc���![���,���/�\�S���}Y�B�����Y���rS��v#s]����J��"
w!���n�Kc�P�)����05��
��y�fk�f.@1�x�F:����?�������wx3������)�Vuv������Q@���}�lOG���\��SA@����4�w����7�B��A��m��jB��*_G"M�R1[�z�p@����BM����	�Y���v�&����7My��c��0�)���eJ=��������+��L����'Gg��?���?���"&��(\My;����R�/1B�?���`��
{��������[���|
���c�a�Mm�z�^�,�7{�d�������Zg]]!��r
 ��B^o��)�5�-r$!
�D��F��(|���~��������8E8�P��j�t��W?j�]=b��4s�!���tAj��Nd��Z`��Y@�ul����L���	u_�������M�%����se*s����k
��#t��8�"�vj�`Uy�/�9U��m �y���*�E���������������zkm�aj4�r������5��9�g#��d�I~��7A��*��������g�b�k�e.���v�W���Bb	�/����n�he{��Oq���%s�E/6��KP��&�T����I���mQ,
%�r��������m���s)����>zkF_`C|d!��6O&mZ�2�X.��goW�lo�d��$}&���R�)6r:3-g�Fa���,���/NF[��;�%o�6[�c}W2<r��#����smy��x\�c#�<6V��G�w��z���B$�Y�
-]N����8s��W�������2|i�r��3�g��q��N�n��o�^��6�qf�\��d��{)I��*Dd�����q��F:@6#��g��o����t{}kL���,��{]���H��==^�_��j �����u?o������V{�x�)�ae\q�5;�m�������!��Y�|>k���3�=g�l�����l�5`�3�E^]�����\T��w�3��}�����;u5g{��n���r�[{k���dQ�XD�Pd�q�q���8;y����u�����Q'/e5�lJ���a�*��3[�f�1�}:�sf���?t��<��]�����}�B����K6[��Q�mu���/��bx�����
��YhqW6���9��f6#.fl��,���8�77~H����������b-wn�.������D�3������m.o=oWW�T6W��`�3�8���3[esd0qf��,�u����4���$��U>:E�^��������.o���
��0X�0����F��<���Wsy1�j�K�y��y���N5L�K#�����zp�_?��8�k���4�Y�g��if)�����d>��������9@^�jX���y�;;�������N�-�l��d�P�]�Ra�L�W�/�}��M@��!�����q:�9C��ka����4���M�V5,$P��]�vhX�����k�j�0\Kg�h���(��5�J�h�w���w����i��Hixz�*l:��i#�c�E�6����9s���-��-���Y�tg���8R�HI����z�}-�+�fU��bz���@�����v�3�=g6��!��Yc"o{W���W^�$����E�������{�Bx����f^����5l�a
����+oF����Y�}`��3��lY�E�T�g��V[���fM�����^g���Gd��3:6��`���5�������05��o�Rey�W�@	,���=X!
����h��7���(Ir_%ijO���_,H3O��V�W���j(�����t��
��V��0P{/nA���&A��3�=g���3�7g�Lw����[#=�9�Q��%;r���`�3�9�`��������@�3.hp��F����qr��u6�r��z:�:�l�M����5���p���!#�^���������%u~1s��j���4���3�g����j�L"&'��/��-G��T�������5Di�C��{k������GFi�U�Q��*s�J���P��zI�"�{4��&�0�������3�1gVX�t��Y�_Fi�)�Q�.�`�����V�8��n����+���\^o��I��Ym�3<�Nu���X������d�n��M=j4���&�d���.�"�l��2W�e����Q���+@
g����
WS�}7��H[f�6c{ �Y����t�q:���J�2��q	�`]��p�� 6��
��J�
O��,�m<u�g�=dP��5>���r��?G ;�������&�=;L�ojA�Jy��ZU�>�o�M?>��������#���iN?�����/�o���{U���O�����xr�b����k(���W���b�IX���z��:_�����������������������&WO��B���W���!�/��?$�[c������������?�b��K��D���yq���u]��}�|\�v��b>���bwR���S�@~U<��vg��'Mh�O�gtb�~�����?yM>���GT�kVE������u}3��?~��y�gY[�=�������������C������\�x�2��p�r1�RU�<D�K���������~��q�?�$~2�����_�w�[�K�y��M��v����?��_�<>8���C�������������{����_����_��TD�BUtXj�A<?|���%�o��zRL�o�\
�n�Z���g����O�������Y�D���_/����Zv��������_�~|&�B��e-�@J�3������~����x_��/7l��/����_�32y�hL��b��IU�����(j!^����aU=ipMqI��3�N����J�Z���O����X����
�k��9������������%�};U���tO��W�/��N�����\F�y_�������XO�O�x!�v9�uK����R��|_Og���Y����h���j�R�#���xr):�3����K���N�%[��t-�
��X��2���V�r�,E"J�������Q��,����������lZS�m)4�EOd�c,������)%���Z�)_���rO����Z���!]:��{?��{�U����������p7��~�2)��
�<�����3���o���8��B���W�(�4~���B����p����j�\���O��N����ew&�1����]�VH�r��f��@��_�Ut_���=������4�\�4~�i�2�yX�����w�H�]���j�^s.��Z��a:iG���(���{C� �jx��S��8����Ma��*
�s1�]�r!@8w�O��.��g���U��Vb���B�#�7�u�E��6l����_�F�����_�&o���������U�#�k�8����k��h@�+�S2����^�t�O`�<�����k1�)^U�c��f�Q�7}�eUe��oo�gm����X$������!g7���'{�d��N����G�Y����~�=���-��8�y35���8�%���n���vy�c3�C?���p�����!�����@��MoOm�/���:&m���DT�P���q�e��u�i\T��fF'� L�y�2&�i�����M
�6��@_��gs�{m�F�M�
�u�����\��7~?��~�����������zC�3��l����SB	N��E�Id��(a2�}i=�a���i%��J|z�n)���������e�bO��������tY��v~>Vk}������bU����A�E�6~FAd��q�O��b^
����H�6���D���r?
tuo�Vo8�������2����=V)��Q���{������{��|D�y���Ai���V��7��4w�wuWB�"���|��.��#����X*$��T~,'eu�v^��i,����H\��:f�����PIa�6�v?�C��>k7����i4�t�����T8?�U�P����QQ��~�f��0:�)je�ob�����Gj�q����O(������5���{J�W7w���*�P���&��_�.V�I�c�G�^�5��nn�Q���ju�h���H��z������O�w�#yl�.���h/M��0����n�)���������w��3��G���^�����{�@�X������I��c	w�=�a����,|$�cb�=�|A!}�A2�3}�l&.�c-U���{����]��<��u������������u����j?�r���@��{��&�Aj��7F�Vn83��$��4K��Mq��5�������A��ri���&��&M�����1U�k�}/�KY\O��k~�x�{��k�Y��^�c1RQ�����[�lJ�=��Nb6:�y�u�8<V���R������m���x7)������>����;���B��@�����0J�
��~����8��c����1�4����M�
�/�o����-�����	�c�x`@1���y����pc�+����iQ��W���h<e���)-v��&2F����cav�XS��} ��Z���*���|�$x�.�,l��n�I��Gq��XOb
��t��yU��{���w����E�G���'M����������������b6�����q��,��M�$L�G����N�HQ��s6}+��l�G����ic��>z������|B���~����� K�G[k�X/ue�A^������&������}��"a�pW>��0�m���}o��>�z��N?�+	DI/f�����t$fT�cm�D!��$M�����Mm~,�S�������7>�Zr���lo��M�������.7l���\�~�����K�����e�O��nxK0�[k��� ���1�/�������K��/�������o_���|�����K���T��%��C��EK{�0i�
�.m��?.P�����4��c~��V�>;,�K\�G����W-���/��gw��s�E?��7�g������<i������x���z�����<c4|��k>W���=���� ��	��2��g��|V��>�x)��������gS�s����]*�S��%����]�lC�~��|�'O>�X��g���/X�g��3��3
>�9������
�A>����_���;*�gy��g��s�����5�,#�~��/~	X���>��[�������<���;���z������oa�\c}���~�1�>��>���{����������5.�gh�s�����}�y��������G��Lo���o�Loz������K��/O�Y�Ck:/��I��:�a���{A������%��c<�������?|	��%����`o�F�����{��k)�z��'�w������%��z��pp� oQ��`��.Ft:�K����{y6�*U��n�%��CE������A���XwA�hn��}��?�KO���a6/��Qci����2�]x�\���9U#94x.��/
r�raM��x�n��[�fq�L�����m�z���=�C��������m�'�dB.��+��^���������u�o6��������nD�z�bFn��z�1�8v+�$�����9uok������;���O���x��4Og7�8�4��#�+������ ��}����o��~<:�?;:~��jR�l�>G�fIT�-O1�:�����aa�����S-5�}���.��]����<��F��}9*D/�.���x<�]�[-'�9}�sn{���\����/��BHe=��z�)e\N�|��VD�h�N�����.��1.�;����,���@�;O�N^>y3Z)��~!'�A���|w�V&��G���n����k���.�;vi�i���'�"Q���>diM��/E��z��Y]�/v�EU��o.E����:�������l:Tp�W]9�MwQy������2��]]x�I��~%^��E���EQT�E1�������������_����E���J�Eu�z�X�&���Fn����S1��53�W��t1�"���v=���\t�[Q@�K�+��4��j�j	Y�KyY�H9R����r��j!�M�-&��]!�����RP,e�&�����T�y[��T7$"�����V�$Z�)L���]T�PP�i�x��o����o,TXd��i���5��>�NLd-�e��W9�K5��i�m#�.�f����;o�c�i-��Oe���
{3�����;O�P"�!!F3
����|�4<������o�����yJo]�����xy��E��������	�4�~?�+�t�j�j��$:�l^�q�����5�ZV�H������ o��j�j9/���N���i���t���m������Ryu�����Oo���\�c�����������J��U�&U{4�� ��ySE�Bl����Y3�SL(�eS/��h*5h/�-EU�� �*V_�H<�m
Ye��dx=�N��M�@0C^0U!T��~�#z��I�(jey����J� �!/�Wr�x,F���t$�����j(<�QE5e(�T���/'��t�\�hh���4����iqi�T�����N�wE1�3YW��j
A�k������.g��@�C^���R�U-;�:EA��Pm>�����W��s�D��*�fP��Wg��5�woL�����8���z�]B-Kx}z32�2M}��	�����6ZW���+�Z������^~����f	P��Ugm%�k���to����M-�@
�����(���L6�
��������5��u��M��~��S�����S��n�B���6R;�,��$4�@���jdSU9��� s&Tb6^�L��	d����4BS;�\k�cQ\�� ���e%�|Z��Y�b�����-����|n��T�5	�����x�V����N���%�hW�j�������mr1����|q4��zt��TL��c���y��7�������]�wr='��:����rzD��uCN4k��D���*��)u6aV-�cAc6E��&���`�,GL��y}�U�Q}��U�����y��V�"�ZZL���f�����J�B�L��[kF��y���W����D���B���MmL��]���:�]�5������x��_�����;j\�Ui]>z/�*���pc�P�m���^����B�BoZoa�*���r���"�u1�u�k�5�T=*�7z o����@�b���F/\�r�7�gN�T+��!�KWy4kBb|��3��H�f����u��&,��D��o{�];���Z��n6/o�y)�����)�h*��Hl!�x�o�%��lQ��	�S�����Cc,H��%�{����CLWD��|�R�9)3`�x����5��)�w����V<	��MOJ}[�N;Y�O���j:���j1�Q3�jVN���E@1�b�"(e�f#��r1�&r=���jt&���
��7����M!���b�+��
o���T��`���%@�^�l�X�'N��%��=��0��d)ad�?��oG��_[L��Zv�Z�[��M����
b�(���q��bRV�H�y�9}~��������9-*�f�����Z*��~�UX>�5��W�H�����N����p-�GE��c��@g0��??�2�dz�����}��5�����j�U������@7�u�6/	����?3��Lw�2#L5���y�������Do�j��n������)��2Q�����P�OL^���L��������,�6�n��C�0��H�$��_3-��(D���B�[��7����o����E���lgwb���4#u��6���
���M.�l����(E��m>��+��2 ��\,j1<=�X�c1�c�q4#s����x�i9�����I��jvET&# ��p���#���M.T.�l�7���R49��1dx��C�!c�AK�������O:vf����j����k��4�iS����]H��3������S��Y���S[3��+���{���?��������yo����������OL��+���$ ���o���={_���4������|j������#-o��������]���g�d��M����������0�#�jU���K.���+{j������� `z}���o��W'��m=�:�z%���h_=��X���������������h���{m���-
�C��T\?����SI��Q�k�X�2������d��pD�21�](����X���8a��������z����o�=��y���\�7/����'�,���5^��,\�j���Kj����o*���S�S����=��C�� ���,N�[���YV<����}���;���s\����,��[��k������v��s��"i�9��T���Ri��:A�d!���=�%N��|�?���V����*���I�d@�d	��*��b��J������b(7��F������fo���g)Ko�A/�Y�+
��>�K��p���A���G=�)9f�l
�OYP������`;B���7Z�s<Gh�>�V��b�������@���>}F�>���<�9��|���\d����gvU�F���qD���c?�'����&vP�G�2�s����@�b�9��&Gx�-�����SIfe�D���,"���E������?{S�%/[��b1�������������|o�<���_���W������|��d����_�px�W�}�s@���D[3}���d@W�p������K����A��5����e
�u���^scI����i�
5+U�]�a@8�l]`���4n�'F�� ����-�fv�w�u�-a���5��[�����f��01���Q�l/����B��sjy�k�eW��VM\��9��-�X�,���|��2`��~���o��ky������b�S0��,o*H�]�U(�<5��[b�KO(	Oj�]�Z>�jQ�)�����f�lUzo(���Xz2��r �����s,��kxZ��X��Y�+���j�hV�Z�,
����a[f�t���-�r���W��Cs@T��Q]z2�Cs��Q���9|K*�\�{&Wv<�'9��q!���-���)��NVG�t�@�8/y��=y��S4�w���t�M�<�_�|x�k�n�k�-O�5���~�h���q ��
X�8
]xo~y��h�@>8LO�z�����J�	��i�zA�P�]�����x�
����}���9?=�?;<���������F<��>�A<x������7�������\M}���zO������o�h�L�T��qA�N��l^��Q1�[��Xm��}���i�?�q8r�H ��O����`.�������:/���p�����LJ_�{K +����t�>o�]dO�B�n����0y!�0��*�s
�F��<<?���9<8;s|v~���������rtz$z��COwtK��p���Kn���2��I��-
ycu������:��[��J���������3-���+�l�V����u[�I��������L������}*���4�>O�_��y�����~u~v���t��$�h��W�9^��aZ��2~G[�l����nz2@'��t�_L��s��|@����'d�����c;>�|t��Ac�kD�����}
L��7U1~ol��8D�S�g
QpD����8�NO�.G��s�0r��m{��5Y�y�&"=��z�"���u�,\g����F'I'���jO��
d4�"��++��_4��
�����=��S������"`y<���#�`���PO����]g�]�r^��/��)�-�����O����*��Q?�^�Az���]�
����67�Z��nS�N�t���R��P�k�f�v��,7�z�������g�"9�sM�5�H���{������P���&]��#��d��������`Q�5yFt�^���o��>~�U	���������c�J�r����:��6���O�~z���pg����
�7�zh��l�G����oIc��B ��	���� �'��$_���[��<�~:<;=;����y[��D�b�f�Jd�����%�k���q%g���bLa���(��R���P��u����bL��4�����Q�V�K�D7c`c��1���������l�d��m���
��?k�����'_W��hZ�`02����a�5-��4��1�Ip$����pc��Kw���@(��{�%������������%��A6G
:���������o�b�����qo���0�\>n�>�\d�q����A���9��4*����s@QE��"���p�^��A�J��N� �f�M��r��
R�I7Au,z�B�x�q=��
�-6D�xN�Xx/�R��?(L����6S8P��|����^%~0�}��\R�,���+L�A��`K-@O�F=�[�����2��J=��cG�}�V�UU'��:[)��]�z����z�':0��3O����{�lS��IQ�H_��M1/�����8�'eMK;�3��!�M�����L�S����YA���cO(��9l�u5A�(|9�o��}U7��O�������,�U�-&*��*�<��N���8�w�2`#�)�:wmP;	�U���w��s:�9���|����
?k�K���q6���f������<A^>:mB�7!7=�N<U��&+J3	���r*�.s�8�[]�o,�)�j�2��Nw�OF�b|x'�_���`�Zt��e�d�w^}�T7(�r*��AP��	zx������v�V�����2f�n�:U��p�hsJ��@�Jt��T1����U�K�GU�/�-���SQ������������k*���]/���T��������R�z/�Q@2���(���~�F#�o�����X�\���GE��f�RE-�{A�p�j[P�7���e���<�L\�5HWSW����T�e�in�����`���Y�d�X
W%:>\1�����bL�t�nQS�vjp4����D'��w"��0�k0p�O����V����{��n�dK1��?��r���O�_�������g���
X �3����$k��}���,���/o��������N�[nm��t�=]��a�@\8"�������.�3���z
���v�~><�����'���������t��j�
8$VO�J�*+ko�LkI1�$�&����9�b���{��t����7�X_�m��@�{~8;��"����1^��T.�����������6r����}����.���H�����|[s�`�QO(��w����	q��H��tQ�$~�%�����#z�#�=��-�(�h���B��|z�G�~xF�q�a�G��5�����������m4.@�\TE= IX����A40cK|����zR|�����K>h
8�UOHO���C��U�Du�*��
rcN�������������������"5����c,�>5`�,v�m����pc��2iW��Q��(����f.�.��c�G�>:5`��u����FA�u����.������!��$��|���=]�N
Y:u���z�v��6�&��
a��;#}�ooG���^n�_p�R5t#UC@��{o3�O��c��{���z\	��&\����t2�����k�]�B�����S��� ���^�A��./9���j��T#t�0�LOw���Q�A�v�����M�=P��h+5J�r6)xn����	�9����BF�b~D��iXn��7t�zMK.�E���p��-�&=����h�w����q�4�d����Z�v�Y��������	r�g<�m�;O[�����U�����X�����
�����!7�����Z!�oC����=Bb7��i��A��=9��d������Gg����<|��u! sC[��g�������K1�T����������
9\WO�+��jUjto�'��
7t�pW���v!�nC.������
9rVO�SO�S����f
-w�����S�j5��v���r!���^(���v]�hr!8�X!�G�����������x��$1�e��D
9�z:���w�{�(N����S=����I
Y&�[�W	���1�z2���~vx�*�j�q�Z2V
m�<
��ih��.<����+�T��o�{�W �)�m�nT\��h�!`WC��2K���(�iK]6g�u�dk�^�-3�`�a�����n;�k��z2����V�A�o�W�D9�i;�oZ:�)p���U�����t�Y�SW���;�&�����7�v��������:����Z��<�k����x;Pk����LU��}��l�����M[^�rz��C�EP6t�2�5�]O�E��Ie�2��W*��7t����/B@��6rVw48��������j�r����x{�_����a�z2@Nb��������������
3������U"+��<������PW[U,k�bQ��/S��&�����������+6U���
�2�
�����v���m�����!����mHx��f�����a�����,��4������;zs���,�hhEG��YO�/@(���������L%�,z�[�����!�RC+�
C,�����U7��#*=K�_
�����!XU� �!���d�ob�FZn��tHa^���]�;-'�b�T��u{]���?�6�b���6�Se�����r�����P����{���)�F��q���f��S���lz:.���EKg4U��)D���&9Dxk����trE��M���O���F^gDMV�}jO�����]�$���y��Djb��*���_�z|�O��"�u�����5�k;N#��}�����{�R�(����%��0��-�WS��&zF����i�6t�&6L��\�7�o�[��;�������7g�����F�/O����D���������f���$�����w1��! fC�V��C�y��&GG�~����W�����!�_C�U��5t�Z?���M�+����@��-.k���9.�1?S/#,���v���{��:�����=%?7��6\ok��n�V�+��%�m�j�=<�?'K����S�dM�����6eu�����%�����������l�`���i��.>��
����d�� L������Edu��<kn��k����6�\�2��zV��.b�	HX���e3ku	c�
B����\a3�a�� ������[<-\@����<����^d���H��H�3��������Z9��Kc�� ���rY3�R���g#6��g�
D�k�l7�2�mg��HTM]�������]=K��D�{�����%!r�O�CU�����DY_�f5�7�s��?6�^�5���^Tw�������������Bl8D�Y��#E+�z�\b�5�������������z2@k�i���t����)�V������U�W���m|@{|����Y�����q���������o��G�������.���g�nS��o����=9�����+"�F�}�z2@�����q��1B�5b��������(��2���R!�&����r#`~������x��-�����\�*�+��7�����H�r<�*�p*�����O����i#O��������m��JE���%���!zn����F��m��l	m������O�/���]E<���*�5ff������u}u�6�Q��+����7� \=��C���k'����=.G�oK??���
������-i�R��[La�Z��P,����������mp�1�z2@5"��|�Jn�.����X�_�1fwf�[�H
G��Eq�����������l�lu��o9���.�Q���� �pm�Y�4�a��	9����JF�UK���feV�Q�i@���E���fm:��b�F6��a��F�����b��q���G_��p�����zr@.�����}��Q|O�o��y�;<p����:��vP�1"�F���;���{#��v�����
������_���.7cD��Qm�g9��8��#�
��b��Ni$ON������V�<�G\Z#�\�X=���':;G�~��0#��F,C���)#$������F���<����,
@�FMk�m�6��Z=��9|����>��5r�����G���=������������`����^u�����iG�}z��6?�^��'�%�pL���\j,��4���g������e�������E~1�������S���S�jw���\�n��O��t[x���L����99���$8����]��E-���(��������tj��em��m<O�����7�eD�ZH .�OG�!1���d�O��
h���Q�)��(5�@�L� �����=��4�/f���5�n	����K�d���KK#'�htN?n����V|Y��rK?xP�wTKg#�EW�8\UOHP����y Y�`�Y�U~��u	2�E����5J�L�E�[.�_#.0���t5��|��F\U3�I����
m!`}���s�3��O���g����tc����z�h�!�z2�C;2�My�s�A0��#��1c���>�-2u=����#L���<?~>�)*F��{��%�2v������������nVN�nF����������{,�}	)�g��A�xz�2������3vd9���(�6�l����b�x�n�g��18�&"_�"e,��Vl#A���e���xz��N1 >�=7�8f��cj����"Ep�k>2��_1�f���O�'$�w#4�1��*45^7g~^�zv�������cs�g�g���uq#3c@f�ndf�kW�?@����1��}��h��-j4�"=X���i�*�15Y>>�.���X8�����3�x�Z'�����NNT�����E0���<9��
dE`�f21`Gc.pjoZ@�F��t���!�1@<c���[1Fc��������{��8n6��sz��������Z���8����r�n&{�Z2�K���v��t2�Z�u�^>�� ��������iCu�Fu�a�n�l-�y=�ra��~!l�;T�����MZV�Myu]�M������2�7�c����a@�B��T�4j8���G^^�	��O'�EeDK����1\c�����z��b(K�j���i{#'5�\�K�5��V��M��=��z>������������z��::��������y�*9�k>����R�{pr����?��I��)��Gk�6��,H���N%�NqKd�����N��D�P[&4��C��hp����m���U�[�zo�u#u������n��7��
���5
�m�QQ���,�^�T����\��O�)�7Y]{�����k�����#G�m�wx@j�����!Z�Vc���\ ^}� �8r�����3�������of�\x/���������:����0v��o2�.�m���|q���j�([d9"�]�q��X����m��u�`��+��*���ON��-��� ���r��J��-�����|C���~"�m������9���R���p���]�A�m�?�m���z2@���5���������D�ssg�����':�-���[�,gF��Ga���f�]�c9��9T^N�u6�4���Ay��b"��n����jmkO����!�SX��������3�=1�/����o��0M���c��q-UK���S������?�[!����Y^&p'M���U��������[���^�6�o�����H6
c�>�n��z2@Ql�yu2+<nl
��-{�D�.���b2��.f�����q�+����FF��F���\8��Zz��M@rl�.��\�� ������9rW%���wcK�^F^�7Cw$�����h�K�I5,�;8�O�^�;#�����p�H���_O��������O�?��?������$����eW��[FQ-'��\�����]��+*�c�FQ���f�**��c�h���+*��c�?���c�������u��b*`������UOP���>���c��m��T]�	���'5�����#rz���	�;W��X9F�p���fU��rD����V%M�p���#����!�S��H�d��a�S�@��>����}~���x�O�Nl��'�����l0��}#��H0��c+��y�%��q��1�e=�T���rG��9����w1����]�:8~��3��I���P���1��c[@]�O�y�����M`rt�=���N0S������W��z�����'�/���7`r���M����67�Q^�fd9v��;H��
z&�1G@������3 ���������q��
���b��-:�}/J�K�XjzVAQ=o��n�7�:Lu��Tk�I�F[���L=�]�b���Y��B1��	� �u�F['��N���Jh����p ����>����K>l�����-/����'Pu�US���Q7��pmN����lF��o�!1@u��z2v�I�#�v�p}��/����N�Th�d���V��d�9�d�Wu��
b��a����O������O��t�{�n(���[~���]}�=��i��uC`��j�����5)qC��f'��r�.����2\?v�`'�NP��M����N�w��G���
'&q���yEL���	���^�DN���Eu����"���I`�������O	 �7�Z��Y���`=�-F1V��v�z2�s���m���A 3=���:�����M.�]	�K]���\��@_'�G������l;�����y�+\v�r�}���%��NX>���k`	��n���y��i=�?��|��%5 �\�����=���#��d�����.}��>�s$�jN8�YO��#{��2��h/_T�9N9	7�����
���''��+���rnL2���[MW�X�[��`��b���q�8a���y`o ��%`�G��%jN,P����%hN,��?vg-t�I���s��
6��,��������*��� f}p�r����8q�
�.~ti�A;]��Z��O��X�|o:��������$�X���`�����F������.����[pa�@�N�x�b"q�)-#��ZE���Q_P�Hw�����=��&�N,����4Z��
4��%.������s�%�p��u�<s��@�M����	K�����}�~'W'����.�����Y��\����W��qIT��n�
6�������	X/?�',j����E�IQ�*�,�:�[{�&��Lv�m!@��~i7��~�)W�$ ,��-o�os����
�{�Kgp>��s����������;�l^��z�e]�o�������c�*�+,���D�$���yH���h-T���82;��ul����
�����@�9�B���&'���)�w������Q���	�X��8���=oB�'9[�9�����o��� ��
9��L���;-j=�9���U��p��������nQ�/��-���:F9�1��3~��4�%�SN���z�}I%8�J0���a������p����$�
6��x�"�����0�0�	��gy9*.�_N��OOO��~><����������$�3�*�c�B�r�-�,I=; \�c-@�&�����
�����|W��&��za �J���������K=1��9hWO���qn���{��M�����^��}���M�����	`u����t;m�b7�������&�k������������A����2�R�����6 y����R���v^�9������'/�N���!�	����a�u��1@q.��|���O�b�+�g���'y��7_Lvh�u[��4��B$��4���nSS���MFz��	���uwp��#w�i��x�y����,yh�����Dmg�d�2����n�����`s�H���;`�`��
o;p`�,�N��m��;��`��������0��p\�az��A����6��YUVr�%G's���������+)���� o���;}f9���������f_�w���k;+��N��;��G����]�{�Y�z1�L\w`�u�8w���}��\�w�X^�5�`���P��;��%�F��;���[c�;������r�D�0�b|��~p|r������f�����v^v�%?�zD�c����i]@�������Rl���f��g#����@{��>�0���.\��P(�k����W��23Hxw��R*�N
��+����n��!�����6UR_/�rR����c1�
���MI���qud�)�8�WOh�Sf!w�w��N!��@�4�YaR�1y��iU��� �.�����|����?*��#:����x�I]��~��>-���B&�
dyM��V�����s{M��Ek�O�d���Q�=�$���O�<����w�1�z2@!\�� \������#x�E)��]��1�<�eH���\������B���Z��OJ^8p����k�p�	\���#��hu}��],j�"73�A)�Q������v�T�������S��2
���,�������tu0�����9j���g�����HZz��h��%
�����;�E���y= n\0��������������w�0\���v�x���kx`[[����x�k����8[z��h����}�����=��
����}>��`u�a�����v����@�Pxa�kU�X]-���
�':�9����|�d�%���. �����!���{.�����X�
k��fzp�����~9Qk��^�Q�����=+��p8o��h����{���X�A��XN(^]U��_������.����)n1/NgwjJWO��:'4����;�1����^�=1�V]���H@fx ���Kw�7����7�c-`��{I�������($�XO(�+F�n�D�^��rY��F/��|"$�V^E#����_�f�9y����:^>fk��	��|�|������ E����=������o@o����y0p
��l
�b*hG.��M���rY�V����7�l&y`�]�,�8�y�
1�o-�����2g��b��?z���������y�����"�#j��^�x��������`�������{�Lz�@PR�Um@E���#������`�V��M�GV����@�.���<��� ����}�LH�A�v��`�7lz���M�g�X������>��M�T��|T��@���'��.=�pi}}
�������
�����%E7i#�T�G�[������������7?���?�#3R�z���2�X`i�}���^{[��\uI���UY
�E'���E��tN$����O�U:=<�V����S����������G���y�_Q���c����h�)`�SW�y�'���V�2������`��
tN��Z�
;m�����|2,�d��8�4�</G�Cp
���:���LXM�v�3����������e���=;���X�C�Q�
. �DWzv�Jmd4x��P2���W�xJ�(zf��kWz_��},��)��S��k���&�5�F���������:u��u$s�J2��]#�(�^��gN��>��uS@8���t'�rj�@�8���DW{Uz��m�@~8�YO���XX�����MN]�����)���}��E�{�H
���`N������V���.�j�D_�K����Q�>���{���`0-���&���V�iS
YB�2����AJt�)b��S�=7�pm�W�@�S��+;W��pfu��A����k"���|���>6Gp�������0uj����%?r2i-/ �����������|1�u7��%P2F!hR��z��������b�n2��x�G��f�R�
q���S�m�++w�[��m)��S����j���Z2�N#�C��"R�3��%`1�:�QBn���HQ��t��������-��e��U���V���W�o�f�ir�*���j��w�@��H���x��r�r1��������	[E1i�~4����_��dW,TH�T��e}���
U�MQ���]�+ZUENn�O��X��=@�X���bB��TT.��\�S�!e����/T�zjA2��hG���&�RT�nP.&2�C�^t�_<E?��7
������(�;b.r
{=�N���i���]��O,�mtU�@,���\o�4�n��\}$��T�e�AE{l�*�O[^����e�)��(����4
 #��h?��(&�JP������������B�v��>����Q���W<��~�U���[!HfWk�l���UL��\xd*����w��L����elT�����P�1�8w3�d�����Nc�K�����vu�3��<X�tV�s}��h�����S���d&[F�����;�7�4��^=�3��!��d��)&�+��N�a��%!��!c1H��������:M����M"&�� #\pe= ��y�}R:`_�^A��P���*c����m��	�A������2�wqd�gN���FL^� -���{����YN�t�8���=�h�(�n:M�&OkNm������S[�c1����o�z?N��g�M�[y��������Nz2jf!&��7#����nFn��i9$b�<�L�t�^��TF<�+�?�|=�U]�t����e,���f&����c`[������������E�\����.F����n�����)}%y&����9
���A@����p�K1e-)���V-Z'���+u���'z������d6d`��[����)��,=���K�wtoI�M~��?����LYLn=�h���yf;8r:�XV�o|��_M^�|����_N��t��qN9�YO�ZY���`���.?�E>^�
��TQB&�X���s��S3�� ��:�3���N7
hPj��EU���"�V�hc)NSWd������pj��u���sp�Aw1�t�{����	���-s
��4�g��<�����{�xHx�2jcA+oy�&������@��wW;�f�g����2P��;AP3}j����z��p��������N[�L��8��Od�V���pa�S@$��@����KN�X�z2@U����e��#�qj��\��z��+����c�L�>F�r��9L6lM@l\qOk�[�ql�����g�i*<7���������$�����}K8v%�B�q����g{��r��g�A��[n��U�g�q��d�z�!&X4����������R����Mq1��]p5���g�1��[EM�+�\U��\������!��zzvu�>��m�^����/�g���\���1N������9���P?���7�$�1������G/)dJY-/j"�_�������Y�����&7O3�?���8�Otd���c@�3�����37�8Tqf��-ba��2�g���F!M����)�]�E��Q���VMB��`�NQ��
���}y`�po�����Q�3�g�l��q^V+JV��?_��86� ���}q��E�\���)�#k|����v�w��8�0c= Vz�>����@�l�u'0���}��XN�������J�?�nf�p-+=�J�6������Qp"0�Y����.��"N���8/���f
H����|C���+�����J�  �6Y�B���Ei��������������Y�t�+4qf���o%e�)���������Z
`���dV������6����@)�K�TM��{�5���q�rc�Bf���7�J"FcZ��*v����&���;e�-���<u-�n�7�yV6�������vq��f+b2m�J�w�j?bW7h�5��1�X9�6�t�EX�a}����#{�CX��t6���|�#�q�������'�������AF�"�E�<��iN���g�=W��	�d��.������#��d�V���E�
5r�r�o�Y"@��OD�U]��y�-uW�SO�����.����F��b= #��M����������;}�R���d���L�+�8�XO��
	6���f�H����������t#��g�<�����$��j��u��H�Q�������S�(*�������a:����t������|*'MX}��q������]]R__O
d&�p��mVy��`3���k�@G�����t�tX�6�1�`�3g�8�F�5���:��]�nY����gN�P�`�Y��a��q���tX�q�e�[2@Ag�S�|wOlwm�J���l�Z":)�����h�����?�~:����?�~�)o[���b@�'W���'6�zZ�++��vk����vb��M�	jy��m)�H
�2�����!�����d�.�k�O���]/;�56�:g���8�r��(�P�G9�����EK^��� ���X��	���>o�O����A�����f�`��)�3q@(g6B�vZ+�qf	����v�J�1:2@)g����]���$�s�EO�����&?�����������/
Ri��������4Zi���3g����L����'�/9�s�F0g�`�,����{=tsf�lh�!3�tEr�Q�z2@X��������E9�S�e��:b�r�`mh�����Go]K�����}z6f{�d�"��,s	@#g��"�>k�J���{1 �<�aQw����q���0��d�=;�L�\sZ��9M�D�VI�����%��}�����d��C�o��?�J}���*�l����p�=�>���r�����*���[Y�9}��[��Oh���V}�o��$�[�<}�j\��d��G2pki[�v|�>w+f;�L�oZ��(��t\$
D�c���dg��s����c|��%�M��9�W�%�����w���lZU����)�X.��-"5���X�f�B��i���
��G�*>������,
����WAo�H`��aO�'��:S�)�{2D�P�����b�E�C���2�7^>gX����N������7m�u�]���<,�=|o
�"r}��z�t@�D�e��������,ph����q��/���YP��D%�2pu��.8s�����3�8u$�m���
�z�Q6����_jc	�kya�Y�wm}�f�t����\����������^�=���.}����s���n�����k�N��]�I���z;$K�;��gJv�V���{N7��+}~��f�?���h�W5���!���Yv���+R�����������7l'V�s��<�����{|�R0�wW�|1�4�Zx�m����n	��b62T�:>3���
u�D��Y1,/�_�]@��`��0�a������]����G�a��x^�$�HH�%oOp�66�CV�����k������}9]T�;4��U�����Q��(Z��]���N�� w�idXD����.Y/��4NV`��e5�({���wh�^R�Sm�-&������z�W��W�@p� ��4������I�����+}�����w[i����
Dv�>w���qU��wV������7�����j����.".=�=t��@���iO~��	����
�v��>w(�+}�����+�~�����|5����F�&�C���)��B��zvSv�=�������}l3Z4�G0gD6@Yl��1]����[+�������p_?�?9,����P���{@'������L��fg?�����;��I�I�gs�J��x%oOO������������3#|��
�
�s��I�|B�����{N�Y��:QL���c���e1��c�E�s+����JEv@/�>�b��,�����-T�9�C�����$�JE�@�8�����W*��-�n_K��7��"��c8<v��>w{G�����}D�R�N���n�����v��w�R�4PWJu��6l����v�����O�<�����&lU��������Y\��I7^�m{�I�2�'������)}����� ��,����)}�V��$p�N���"7�+��� �c��&`�T�s��s���c#�\{
�h�[�E�'�
�U�@(,��C]�����g���"m (�����0����P�:��<}�z��f�|���������W�-����]~����+������+�-k{p_��]�|�jV�]k|m5��K��H���j������bKqs��~\5�u�������"q�P�(,o�|TlG��]�=�p��K?jpG� ��I`z��;�T�@9����E�@�|Wi[����;�������~��^E�u��(�����bW��[KX�����w��D�Sb���(LR��n�mn��'
��>����[�p��(r��[$����sX���I.T��P�`cEsk���(�R�6�*lLn�z.^�5�6�E����+?�0���d# �P�����"�h�H6�d2B��ee�T2��C
��VP�&T4[�g��qf>�2�0�>��W>�}(�aj#k�}*��e������()6��}K�����5_���(��r1Qw���i��5�R�������j^����!���A��<��P�=]p�>n������Eg)A����X�s`������E��=�dL�*�s������am��,�p�\S�d@�>����m��������O�����Y�W{>�����������L��~�9��/.<���`=���s|�`��{����;:
�-�r�w|��Z>@���Y�Z����e�uw��"���s�u�t�����Lo��s!z�t�0���<��L"��;���n@e��k�t�#��I��}3�a��&��w;���z}�3?�����U-\s��#}��!���'���AY`���<�9������O�t���>��B����E>kC7/*���| 	;[1�!�9��o`?q����
��&��"\r��
R�������s0�1�4�?ptD��F0Z[\b���"�w%��M���N�����a�5��������MW ����QYYk�.�����@������r�G�ka������IQ�#�0�+>���w�Zs	/�R�
X�G��`?�{nW@[���l���	k�>GX�.���w���w���|x-������V������������N'Ty�]�������h�e����E�W$�v��gq�p�3�����Z�O����-��;�5^��Bh����pX��������!m,R��������)�����=�9>�~���~��}G�����?/_����5��0A��������qpe���xi�ZTZ7?�5�T��N����Fz@h��m�d�9D����������>u��P��#u�:��X]c���c���8��H�.K�#>|:p���O{����m���=�U�p�����Z��}i-�����j�.dP��E�u�<�s����2X�;�����Z��w��s��y�������W^>�n����*�_We`� ��d��U���%�
�=�r�6�}�C��)Z�������u��g_�=J>)�2���H��H�X|Z^M�YZ��{xZ��V��l���\�`T�H�fg��#lv�����������t��/o�A����N��=Dm����+4m������P��M���� �����=J�,����^�8Tl��K���rb-/�S�b^�JU�E�u��+���X�	0��c0����k0���Q����P����aUS,��-���+{�OE����9����D{>��������
�� �[�����-����nP�F,���w�?|����N���6/���T��$S��]�m���5;�N��|�(3�JI�P���G�����	(_:9<=<�����7��MV)�m����������?�������mI��� ��-,p�#����9��)�QQ���)����W������?�^�?=}�?)8�``# ,
����	��D/=����w�(��*�{���y)����wh����P�y!WaE�z&�����7`�_e�p:�K����AW������0�t�3p��2��H�
\P��,/G��w�����R���(��#uvxz&��d���H�W�z��8V�uG�|2��m)�UY��M����m-o�O6�X����j����7Go~Z�m�`�����N���qy�������2T��7�������FLs�1��YLF���P/.v������[z{�Sp]�Oy�R�9g����H��;���3:�	�����F:@?���_�����/'o���^n5����6�6p!��t�p��9Z�7�"�'F&��.�l�`S�`p1�{'���
���Z&�t���������N6�8Yz�����-����@�A���{
���>��@����~�8���1e�m�A�F:��'30_��O�i2�P�"�A���T�����m�-�'���k��#����N�J-\�z����������-I6:`H�������k���9�n���'r{�7��A�������k��=2��wR�ani�Onw��M/{��1\�l\���f0�d�P�����]��w~DI9�>�����8t����2�E�u�}ep����w)�]�.�>�����'�t@/���e��S�?��,L&_�s�����8.�HhI�����0�T3�PM=d����������1�X���rZ7\6[��f��3
���D��@>�������HL��	��0��
2k�oNa|����_��<�p�_{=��!���
��	�r �����Dm�_��0�XCz��**3y{G	�����u��AC�>B�3��8cp��9�c���s-ak
���I��po��|6��%o�]�B�����c�o������������c�t��8�a]���B@C�
i����8���I���#uW&�c���F:��s������r}�O�����=w����^�5|���!3��2!��`�:�/�"���2{9���e��s�=R}�q`��#��1t�-�PsQ-�.��;��!���y�w�����p��]��4��k�������MQ�����*�c���F:@�Bg]�t��b�A���8����t�Tp��9�H0���9��g�E1iFB4�V��B�A0�!�������Jh�SYm8��`����H��0��:�,��90�Ht�����~8>~���(��	�r���2������!�@za���4R]������S]]�j#>H�j���������	��B6��r�Os8�����G������x�|N���7��������n��JC�F�N���T�d5�NFz ��}!��	��
�:�-l�'U�"KK#���!�	iv1�]�s[T��b,�MWw���|:s+�,�*(�o�������k�s
������A��2�8�0�Cc�,a����C�	C�p�x����:p:1�"x�
C���,�?��&�7�xc�H��i���Z%���:Hc�!�F:@���y>�lp��c����a�1��1!�C@���O���g������A�C�D�l�Q�y� �!@"C.V��P�dc�����C[��Vl���/5����t�[�;���f�%I+t2tD'C�N��N�:}�)^�U�>�2�!`(CG�2e�1����(C���j���u� ��
g���9��p�m��\-������v}jW����p����2t�_���y_0��c����a����G�(���O��4d�cj��(���p���p.
HC.����`����v�}	(W�pAk�+!;;z}(Z���g�����y h\4O# h����������i����3�%\@p�\�� ��%
���r��lo �!�xc<`<CKt������
9`�HAv���l�0��%(�=�s�}dj���`�H��{l9JR���COuI�fq����]E"G�4<i�E����p���L^�CU-Li����U-�g��{����P�������������o��Kh������J�]*`��X�#�g�F���9��R���q�/�t@w�7��|��.4r|�3�POz�o���6��)�mG}��=(A1�EW,�k��2v�"@�F���Mju�Y92�8k?<�x'������*����"��FWj����x�B]��yv������U�t�������+jd������ohk|R�t�ES]n�6�%�2��U��^^{���2�I6H�	���-2���|o�^&��
��{W��-������m��(�rG��F��E�F��<]"zs9)�R�O�W�u��bQ�uf��d����6�Xe�!��, �������v�����Y��q`,�H.BH��h����m>���v����J�aRU�Xb�]���(����_&�O"�������F:@�,��G�!���"��F�8kp���i���S����%�����D��|_��������q���VE�r�,���C�"��F<��)��p�[������Y��)q�l������u>��)RT6�PYc�
X���ZI:��5��Vz
�LH�����$(lmJ��i�`������m|��i:n���6��Z#��������,}�3�`m�x��~3�/ �6��Z#�%����>6�] �(������!�"�F\�O# H�Q>D��-��Y)�����HH��@��)���Q|��=�Y0��1�h���=��})�@�.���\6JWm�9F���q�j��	���=g��S]Oo{i�Zd��p�5��r'�M#���������iQM.����*��/�-������>jn�������W�������-f��eQ
���b6?�%T���+�����P����D���`�����Zt����t"��Y1?�������f>�w,k���L�������|D������4SZ����9*.�7�������e7:>9�I�d��R����W�^d�T�cX�t�`�Q@=��t[x��D.p4w��	����A�O�������g�
@����x����,�u.���1j�D���u��(�\z�����Q���zg]��P{�_#���Hf0�����H���b�tr(���_��=>9;|ip��c#[���g��
��\��v���x�nlTI-�
�	����a1��>�"��g#=kiE_5
F�g���m1z-��Rw�z�?��>-��y)�����dE����b��%t%j����~r���%������Me�_�b�6��2�l@���q��FfiK@n7�t�-���Z�������Y����)R;�Q5�������o�w����DM�
����70�	H���=@jOB���,��vws���VW���	������m��R5E/{�$�����\��7�O�����O'W+9��RI?��$q�HG�$�,�`��o]G��MV*\4������|������4��(3f��ha���}*o_;�n��	_@��FR�,�2��5�Z�]���/
���|���v���T��i�6�d�������������|�	��D��*�v��o�F��d��+���������5��k�-\3�6��k��6�9�0g#�����s(�������a��c�����v�E94QU�w]�2+F�d2pt�lw�0�J���P�_&C��/����"�C����;��sp�*_#5�>�{�+1����Q}=�.����f����5�C��v�QY��>�#�uqG{��Iv����dz�%M�v�7Si��i��h*}�:'����f�5�T������!�Y]��O�a�]7b!M��_z#*&95T�C��S9.k��4s����������i�u���*����k\������R�����z:��iz�e� ��
�����O���l���)�t����\�o~V��/���i�n�.1Y������}���p�{�����#��9��������g5�5��/�_���1P���9�B��o�\Qg�!��'��R�M1Ps���[��N`Qn��m��{c�1�
(�8��X6U������������0�q���B�����R%6;�����az����b@-��1�{������Kk�/,��q,9�<80=����U�����.����'1+!�7�5��@�6�L�Rxi�����Y�hg=;���>��
���������k��p��l�R����p�:r�N,��
�
���p�t�����L� }���n�!%[�����n���2cC�1��R��t�+���
j�Y������Q�t9��������Y��y���qx/����:���D
����i^������-���aJp|(8��9��}t�'�u���G6�g}%A��U;�#�:�xk# $�����&P��:}�7B�+\v�K�T�`��v��8�um��)��/`y��^�@���m�|C\�@v)�P��;�m���b\�UoeKp�1��HhO|/��lo��jE[���N�n��������
D��b���6@�c.����
����g$�.������]���1��H��{D�������W��
��R�����j��47���
9�VOf����p��9�2s��
Y2�#���8����P�5����o�����IL�OW7��������r��
��q�A:���7��_�:@8��y����,a��yZ������+*���.��|g���nL������+S�� �C'�\����e������{���k�����)�� �~9� ��r^���m�Q�sb�Aw��Tn<��0�(f����O���q0��z`w���%9�^�:@���_��Q/Hc��������
�
���
 {C���R��C����6��@��/���D�u�0�!�s���5�
$�c����p��5{��Vo�Y�D��D��D8��u�@|���<:�Q��1����a|@�z������)�&�P���XN�'�q�A/���j�vt/[J��A�'6���	%��T��������&�v���E�;4�JFMl\����#H���n�z����
9���}���I���?��i�P�5-ps�l��I���9�
kd|3�&aU���=�qxtW��g�#��pWr�(E��8�K���jv��+��#��Ys�^mm��]���#��@~8�b�:@~|�IsK���e'(Tt����5��6����6������C�@�q����Q������wu�����u�������[6���j���d/�:�Gd�w�b�:
����v��	vlV�{O�E��Ae��;����;�!�zz��M�W���^����r>����������t.��1�d<��q�:@��i4��������&���#+n}�@
9�d�:@��I��|(*�v8���$��1<v�����`v����u�Hp~�uzAv��b~'(�Vz�TH�����:
����4i���"/5I;��#�6�$t�>��G=�o�I�2i�
 ������(�|�����vd��)�$@pG��J��( v�@l�v��kZ�#2�$G����mOm*�y�4�^�����������Qu��x]^u�I�(�������(����WwFD@_lF���&�{}�Q(�(r�Jx���L�z����a>[�4r�j\Kv�?�v9�h�Wt.�����U9�3i%��(r#pG��LB����+U	�����|:)�S���`"��c�?�w�������*��O�H�@
M�,�Y�
0�1������=�v�����uH���3)���|1@1\���:���=����_����<�l���w;n!�D��B.���|P@�8���~���g��m��*�����7�Jw��B��(v�f��^��4�I��0��	#�I�/�A� �#Gb<�x����� �#H��z���������D���7D	�U} �[}��!R�>���5�x�����}��^����a.Da�@�����}��5�)��]��aB��"���W�<eQdIy+]�i!���H���=YwF=���F�o�]�p�(A�$�~y3�;\���i9��'�op%�_HWV���	�S�����������w�����F�`�Q��~L!�y��
Q�����9y\y���qf�\V���Z}�4p���\��:�x����������<ys�w|�?����1��@�8���P��q�K�$���]�S������T���A���:=��@o�_N���������W���������?�/�/����B=��PoGBo���������������7�����������mO���u�3o
�j�����a��|���_�{����_�oN��T"��G��� ����)������fY���������1{8�(u4#	|B�CdE�+,��Lz���w����p������=r���B9!������i��<O�w�4���<���e�����TV�s����#��4���QQ�.������'?���[^q��lqpx��G������cS�6�-�@4nGP�]xpsx	x������&W���,M.n�XO����A&}�6�G�o�!<r��wW'���qZ
������q=!8�W��^�'�b!m�Xq-�u4���q9��BU,�+���D�iScp ����[�m?5y-�~1������]��L�������%/����#�yZ��]�X�{V�4����`�@�~���8Q\����Q��<��|z3��t1QT%�_N��.�#o���N<r��)$v��t�9�.��lx�s��;�C���/�uG=7�ZM����	��r&�D�.}� 7��Pi��8�=8w��&`r	�fIp��������U)�9=��������M�(���v
/�\�V
1`�c��n&�9��;���OI�I�N���8����:�+�Z����9?j�:v���GI�Z�������ou����|7^�o���(sQIG����&sf�'w�)����zf�?����9����"W����'��ZD2��.
�Z�;�����ij�1d?;��4��c�����)��vWL��'�I�����)�h��T*�X��`�1g�m\���N��VK��!�Y�m�s������;v��-����g[� Z��h�l�zm�Z�R��R=�u��e��N�
1��(����nK�n�`���7a����8xy��
1,I1���f�����b�����[K�$W`�x[X;�v�����
�l����r��)5��X�����7���M��@�q���8�����9l�������h��co4���2 ������5�Sa�����6����`vl�;��}1��sP�qP�9���6�?����w�f�.���5sW����3�':�;%`1i�@Me��]�;�����wt����\e�N/�z�9G��o�YL���9���di������d���4�����M'����TwJ#?2��f��s��T��I��Z��,���h]����S����NL�ojuzbC���	���C���������W��v��6"�5j����H��Y�n�i�S�di7���3��g�m<st�)b��������@8"K��^nqr�?��c�y5s��������*�U�m�r����<3P�6���
���c@�����M���b�!���k5�h��)F���-�x
�tl\Tojlv1%[)a�5��eIX���u����UK��e����@���%��$x���B8�������Q��������Vk��mx4�h@�V
�l��'N��u�JINh�Nc��u���
x������Ku�O�P����9gB^8�L���UH��z'4G9�oD&��2�.���vJ����7
���"Z�����D��d_�k��Y7��(��d�����A��L�����/�/�~=YS�L;����!��x��1���`ol{M!xnl�sw��d���K�����V��Y{��Lq���h����+����/�:x����nM---�������bH;xc�UaB8����5��,�L	o�:	Q��&G���6:�@�s��u�����z���c-�����z��zdU�F���`3��#��7v�~��]��u����2����y��4��,��(�l�p�����7�p�L�n��7��^�:@O8����J�i6���i�������>i���=�����(,p�������}��+s��'s���>9�uUx��;4(s���zQgU���C@@�2��4Nh�������(�����a���������5���M=���:����������]1@�����m6��q�dm-���,����j����@9����1'w��������Z%�6@��YXp�������r6����p��x1�2�B`���,�6>N^�������C��p�[C�N����_e�z\(	`�[�/M' Nlq������9���|ke�h��v�c3�.!���s8���nJ�������i�Y������d��d�mhf�,`�H���<��=�����.O����kl���26�OKlvIJ��d��V�@��Bb��.s����hp��I��>�%4��%��3�r�&sT��+�+7`6P�0'�s��
an>p>����z�S��)����z��Dv�H���S��i�������Z�������1������6���s��@M!���E2�6f�o?�}��{����|#�����z���j:��H0��1�GH�1��!���g��	��S6����Jn�����:�b�Z���2������	��9�pd�:N��a}|��!.�6	���m�G�����U��k�'������4���8�����[S�:�&Y�'N��Fv�T�ir4�q8u��~k��C'`r4pr�������#�� 9	:�XK0���0�J��^��+�����|��{3-���w_���=��/E%4N8�gs�4$q���<	��8q3y�"?!��
T��\]�q�:��1	������t1e@'6��fq����"��
�M(U��R�Y��z���o`wy@�������ls_��_ns����[�a��J��2}��L�r�������c�Y��$t�*j�+�����K�3��q�'��^�,�ZVg�y�0�	� ����K�+����2A�}��k���vY�rrE$A�x����&��7|��4��7'������;��`�����l��G�e�k�zA���D���������o-'����V-����J��p3%�CX��c�����A���EQ�*RE:����x���?r;8r�f��,�����H9q��)'������T��V���L>rj�y>SG�����r1����9�Hg�:@-l@���sbC�E���@Y������K����G��J��LEA��|*�]�&o�Y�;!%vN����@���v������P��a�������w������������F?`��������]�F��UJ�V���
�6Z�7'�����.�g?q���N��q���
]^
j&�*7�l���������A����F�@��S������t&z;�����5n5�E�������J!�[u�N$��N8����+Em-�t�D���;���t:I,�d�)�8�	 ���:(2�4�j\Z�QQ�I�7uN��������y�=��?��\�:=6��J'vV�h���w^m����D�;�N�9���F;���7j;�����l'�p4�\���.2
���t�S�)m�P�:����%K�I�On(5�d�<�|N8�c�:@g8v��|~S����J�	�������Jg�~{1���9q�����e.t�!wu����>'|N���q5 �}�S7�tX�l0](���7�����
��i�`�Wr��FFS���jpL�u��g9��R���7��u���=�X=�;J�7�X��������n�,u������%v��O�o^
5�����	����:M6��`�~���z����F)S��5��w���(X�znx3�lQ^�/QtDY}�G���,�Qwt�0��������Y����&���u97_���]�hv����It�b'���	`�g����;tb\�pp��>O'v�+����k�g\�6�a���N\���i�>%`��@Z�N
��t�U�[���G��;�Q�z�2w����-�uj�l^�~R�S������tu2�iG�2�v����7�c�4�*F���G� ��Nm�� ���N-��}\M������M�vY�
��$�m&���h�`�SS@c��.f[h1jx
 �t�".f���E��7���r�Q�u���\i��L�o�Z�������k���9u6m^U���<�����d%P�������}c�mVs0����f�:@o���f�]B)�ST6���FK'yhy)�wwb0:��Q���>?��V��))��S��]~���B�wt��j�p�X�*�J�:;����K��(:�7����L��t��hO����g!�'G����9u$�S@8�����t�jp
$@�S7��Rs�{(_��J
��4p�*9u��=V�$f�^R�$������8�<���y��`c�,Xp�a�t��`<�[q�/�V�|(X(O�N���)�w�`�S�%���P�����W��6���z�����`�����=�UR��v��y9�"��^���)�]a@���qc������`������1N��r�qP�9�����)N9{b��w��P�>u`�6����[�=B����>���8���}��8��t6��q�5"����8�pjq0~-����'������^_�C��:,����6H����������z8����>]��x���C�M���������v�uWP�yc�J����������1����tB����U������ 
�t�y���)������0�����ki��R��k�����sV�q)���������Lp�M�{S�+�����l����7���������n�����L��{%������:�|Q�g^�f�O�k�7��_�u�_����t�On!�n����^-���Y�q��������qP?c�Z�J�Rs�������C��`b7��]�:�R�X]2Q(&r�UQO���C�X�_�*�������!�\Tg��h����@��|0��T�p#n@��V:�!�
f�t����)zS���C���������G���hX
��r7%�2'��<��`D�J��B��,w�R����w:kv������:�,��N�,�;*���;@{S�5������r���6?��w�	����X�����]-nR���������������a����<����87���]�HL�8�.\rN��G���������7�N����^����4q�a���������\���'��&�(�S�7�"�]O��4u�KL4��2�r�o����D@���4������#�9]��#Y�c6�������3����y�y<@������a"Y���-D��MS�������)�wS�k>P G"7Dnj!r�o:����o���C���zX�[�7���M9B�����d6��Mm�.�r��x���m�<SU�gT��P~S��$/�By$aat�b7�����bQ��k;����r$����h 7�\�>gM���%$\x�����qS+���R���M9�������p�)t���i?�[��+M�.hz��M�W���p>-K1F/����Zi�j��|mG�U��K�����u]�:�j�wN�VD9�-���F�=�FT@:zh��Y���:��-c�G�7�05uS������|#;e����z���'}����}�A�n~2)�<�)qb����F����,yM��S���~��G����j�R5����N��a�6r(*%�g�5�wx�f�2��]QI�����e����w�??�����0��#���5�w�U�AtE���Z���U<�����^m���g��\��6m�5�����$MJ����t)�"� V��P���d�Z��<���[�#����:~�;_�����1+�X37;ak����[-3��f�j\�.A���o����%=���m|3�f�u���<�Z#�����5��A��4�X3g�u�{M0e�b�8��h<��qf��u��p�������$������6����BK���%r+�1P/�gK2��f,/jf��*m)#�����v�#-���������������#83��f��,K�wt(��-4�h8����3�xO��������<��(���@A3�'2��f
Ji�"�h���� m�|4c�Q��C3��\4c�g=�#�h��zn�J�>3�4�*�+�����������F(�t�
�8@�A�8l��l0@$B���`�w��@@@M8R���
J�dg�u
�T�����@�Y�8����Ogo2��f,��� {�~5c�U���7`X3�auL�&�k�Q��u�x����11�r5�9�*�D�mU.�+cf����m��t;�O��49�����Pd�N��e\(	���K���Y�8�����n�*�k�i���7'M����^3�{������6]]���r�����6�k�le�z���_�����`�#�n�`c3���'W�,���9
�W�7���c�i�,v]�I�XW��4��s�\v��Z�j	�zz��4�$	�������X�p�L����@}\ZrXNGy�z:���?���yj�7����Vj��W�������g'1����������.�{��u��d���\�w�$��L��@��#T��6K\�sV����s��9Z�f��lX���?����$�[qG�:i��lz�D4�N��f��J���v��<_
/���{���7'.lnIs����m���7��5�\��gp��d��������`���JF�He�K��w3�lW��rZ��������^X4��cw�m�#�+9r:�(�{g�U0��H���:}Q#*�S�������jL����[���P���l{;*8�p3gwU��S��\�
����Y�q��s��������TrS;�F�h��fp�Y��p���m����#1�b8s&�[��j���z��,�������,�K�I���0��O<0���Xrf���i�>.`�3+��L�{���p�H\�����p(���r^�
W~ ��4���3g6���#���`�3�Y�����/
�����%zp0�Z��J�N���><5:\�Xp��,�8�hc�
:��3'�8�� 2�;�q^5lq�s]5�o�����R���v���f�����z�.�<q��Ma��x3G�7of3�}!A��X-�	���\_�p�����z�a��\7���@Z/����B���{���9y�vET��ZZ������{��q���8�W�h�jv���hW;��_O����^n���������Q���Bn� �6�h=@��8�W����y{�KJ�e_�:Z,���h���~�;;�(n���5�c��� ����e��9��em���#�%��7��!�!��&��n=K0�_!1�v��yN�zP�����f�4N.���[��h��n����S-����$ E�CN�TnS���x��=��t{6N��������q�|*;Z���{m n�w;�>�C�����#���n�cvUm����G��r�M_21CU�E�:�������]��{�r�����cs�
���b35u�=�UUX9�W���f@��x�h�q�/���Cc��&�����c�^��5��0�io4������p�P��`U����\��h\E�2����*�A�`����)�<R���OJC�A��buo�j@I{��L�"3��R�-��C���2��H��v�*O��|�������M��=�����64�z�-����n�:EQf�lu��@�l�q����8���[5�����k���������3��$r�F"���@�{6����!�L������m��R����u6��l�Mp�o������9��l���x7w*R�7��f�q43%}���^����<r/���!O��B�#P���8����N��%���}Hr�I^�
��mR��=����������V������q��u� �R��m���� c+
�����~���y{�m,��.o�����������Y5�T-������j���6d4�����_u���m����k�=�\�p;����1���0P��w�g�\�g��5�<��qco��N>����
�zB�gs�5Z�)��1�`�M�w_z�6���]-����0��u�#^{�t�9���H����J�<�8���*���h
�Hs�i5.����������@3�8�����l��;�YM{w��*Dk���i�`m�����B��
�r�v�M#"�3V�"�����:�Y�c��$p��m�K5���|<^/>qj�^�|��nI�	���^�
��qpF�?!BU�X�r��@���l+9������q���#0������{gl�4�q��q�)b,'��2���Or(#�.�bR/O�#���b���a�$RL��]�H��=W�_?i�"�M_��`�����B@�=6z����z�����|2���u��me�P�������U�������z����*T�<�`�{�9�q ��XG�|^��c9Zj�Q=�pp��(I�8��SS�������(�z��gAl;���zT9�A�<n/s,)W;���\�:@v8 ��q������V��~m�%���r�q����<T�V����`e�h�Ng���c�~�����������c^=�}���!��O=�l0���F@@I8�Vnu��:��:�l���r�Y��)��OH���s�9
���t
�Pn��ry����^�u��S;��66P=���8V��[@��z���
l�z���q.%�x�\n���� m���XU��qL+�A�v���m���n��V3��l��n����$h��n������9�����?k���h.}�i���\��V������U����&;]`�D���%R��*4���%�@��T0�L���U��s�Z-�g����G�	E�
/Z��R^��d!5x��aG�Z��>�4K��������U���McA'�Ts��d�}�BH�o�v��cV������������X��>w�v��>w����A�����yXHG5�s����y���U�&@�|��r�Sj�K�������Oaj��(GV���]U��6�U�{t�[�lsT��P>���3�#���~��W���
�'p��cG��s�)
�M�-��p=D����3.
�����yu��qu�����n<�(�b��������X������e�C��
�6���,����)I_�.����aU���^1"8���mQ�AUg��w����������j0�>�;������L�������������7�����w[EDu�����I����l��C�D�=\,fT���E-��]g��&���tY�WG�u�&.��5G6��(�����,�j���_�r��x�p������ei�f�W����x:� ^�s���1}�gq����bUY�����1���WYV5��]���f@�8X��;g����
����r��=��v��������_�7wBO�>��J�������6��d��8q :6�cd�.��Gh�.2�����>q�6�b8����d���7cG���MJ/�j���������|�wk����-h�������[q�GFF����[QqE��P�[�_�i-;�L�;�0�6c1��:r���3w�1"�jx���O��7�}��jq N6��%����������XS�p��'���oj.�
���F6jX�5���7o@ga������ql?�j�����k�xz�st���O~
2P��>�����D����tCPL7�Y|�^l��n\
���[�>V��jpR
\JO��|nwml^QU�z.��%�������b;nZ[]x;�L�o@����������=���p�N@��n���>����F;�L�[����Q6����Z)�^���Z�#=9����0������K��{��<��);
r����w�����L���PU����t�=y~�L�l��MG���N2��U	;�L�oZ%h��ihi����
�'�,2}n��g��*s���
���-/x����q;�L�����:�����R`��s��z�c����X��U,����v���)P���2(���b��u���T�O9�\��G����:��S�17��a��u���B���3�Cw9�\�
H���p���Ve����1��n9l�����J�z���t�c���������{@��W���qU����Q�Md{�����>���+F���1g�|�����nKqy i�����7��u;�
P�l+�n��T�����ah����R7�4�������g��7���qf�v�Z�|I�����D�:�U�����D_�~�P�H�FD�X`�����v��>���a���s�z'�]���j���_����m���m����������wq|�����#>& .6����v��>wk���3}����7��t��3}�I����� g�:vy�9l�*p�>k#��d4-�.��X���
W;���Xy�;i��;�KG�f���@���f>�����6q��|���l
��t`�e���>��T��|	������>}}�I��|����\�P��(f�������z����
�~�"�@]��&�u{��;q�}-�`�_��9X|��+��~���#�����d-#���xQ�����L���"�>PW��E�
-�#����(]���k\(���]@��2��ZZ`����6������bE��,�i�t��6=�6(���������-���zK��7'�����s+s�=�5����Z�*&���t�wr`�t������=���������	~��N�~��a�?����7F4@Tl��T�d��=<Y�47���C!-K��qu�#<�ktrX�s`���[���w}��Y
�1EM5_,�JyVN��}Z�sh���p�����<Tg�Q��+��>G��*���v�i�@�~��)&��?4�:l�	g��[�ps��`Cj_�k�b0��g��� ��������P�J�X�;�w(@o�p����I�;����<����1,����j3G���i������]��L>B������Z|3��;��l�������3p����W��<���E�J���q�J�}����I�Xvd P'�_����*+)��(����a3��H]��B�A�9��*++���g�	|u}gp����~=9z���@��\~7��l��[�:���<|�K�z6�"���c��:{�{X@��N4�*Hn4XY?v[����sf��hP�>G��Yy@��������s�����Vo�3gk�����Y���s�*��	�"vuL�y�EM����
T�CP�7�O}W�t}	w;zx����z���j[zw \������u@�v6��!WU��@Q}g#]k�a�L�(�O\9������>Pg�]Cu��'g����T��Q��u��p@��U��w�������8�w?�!�j6���)�r�9��K��������Z���gx#z{;G���NR�����S=@��k(8�_��Ql�5+��IQ��_��T����D��"vV���m���6���q $6���}w:/hs�����=xs|tpypz��������|s)�**�y>|����S�g����WW{���zqtU�@8��/������?r&�'Ko��k�(F�d�-��?�vKI���bA����AE��@���&��jTmf���~U�2��F�b��P���+�Q�F�l���W_7bL���^����jz���X������;'?�����w!��3���_O�B�Q�����	�R��(��������7*r��@0\����_
�n��S59�'Q�S���I��u������������U�6���k.Hn�LD?��.����~":ru��j�0/>�_T��yW���QJ�^���i��=���)��T; ���S7o��q�g�����f=0�����u����tp���)s9�.��l9���Z���hL�Qc��AGo�N��������r��d��f�/���c�=��`�wV�`�]�6��:n�S;T�R�B�������n~��{@h�����s��\\���������z���:D��dMr��m����b��~mLC>�]��Bsy�@�����zd3����	��4W/������k������<�4��o�+=
�r��c��UL�FL}���g�������� 6������O������{��������O��
���G����\����]����Cy��hZ<��������K������MNV�hUX���z8+F;�z��
��*�]�W����%&�%�7|�v!X w���Q�rp��q�|�K����
8G]��H��������M�:�2�}�B���������N�s��K=&�6w^���?�y��n��d���\��l^XQ����	�Y}����X�)����m5��e���8��w[��p0�*���F� �b"�����Z���Vw�	�L�,��M4�9�~�<Y���F�\Ng�I2N��p��sR�����"�d3�uJ��g��$���_�?��_����S!)�����CSy�������`2b��@�Wd8h��|��� ��#b�871�����E����M�7\��r�Z��l?=sl>�~��|��Y����gn_
�|���T3�nv�r����;�g?q2���e������w�M=��V����As����%���w_����G]��"�:3���%���|k4+�rFg����r.����X��t�2�L����OM�t�i-J4���+;�|����{����gIl\hZ���&9[!l$U\�@:7V�-������R��\����+�y>!�����:��)"0���A��9��7����a�$���
��J��<*JY0�4�����+uT�U��:p���[��O�ua�e���u@N��t�����
��dx;�Nh�a(��t^�p8�+�_�t��wt�dy�q���� '��s��c��X��|Gq�/�/q����f��'�?xup����Gx����j��cs0�Oy>�V:'����k��M	��:��h�q��`x���741�x(�����x|5~��x�������������#��P;�����1�������Za�������]3�����0����p�_3��/$.-�d���<�z4�b����d�.@���p�5�.�������~��%�������I�����n@��rGb�{�MN��>������\GiisF�0�
t����Y�a�;v��W��
���;y������8����W���B��&���e8[e�:@mv���Ui�,1Ne�&�js��8u���;\��unv����V���-������nJ��^-���o��^��+�Y$��b�TG1*���]����7n������Y��s���0E��� ��T7�t�4����������)��:�M~��6=H�h��Q3#������\R�r�hG�N�(pV��p��q =6z�z+�iIQ��8�Nn������k��p����d*��iM�_��h9k-{�����@q�����V��'�6�9=�3o��@>���)u�&�p�;%C��0��h��B�Z��@�,�=@��B�d�?U���8�-�z��vb>O���y^�1�dqyzb�	C�����.�y}P�����A����5iKH���a��e�a8�����V���B�< 3�����%��h�[i�K=��)�j�E2���r0�w��?��7������T��������a)�z�2
Dgx=������\�0���`�<���i"_d�(��v�>j�	��~���W��])�:��fp=Z�oc�j�zb��������<���zjO�
7
�'��T�|�e^M�w
��8�"#8��N�U�f���u1?�!`W`���r�dmJJ��A]�v�d&������������WO��������N2�O����(�z��� -z��`��AM6���>��[�]�
L�hVZ�
�C�����p
IwN��8����"�����|;����J����b���n�J��x*�x���vGdW�p�����������W�HS����f.��c��.>������df���������9t��n?�����DfS�i��_��X&���t��T &}Z�{K��+�-7Z� �����)vu�U����ue�E[t�B���w����%�����a{�|5�������Sz7f)8���?��<�qE�o�olU����D�t����Rm��/�E~'������<����RMB�~WD����T��U[�����u�����h��H.5�j4�>��e�BC������6'j���qZ�W��6�k�������#�)�sm!��C���������R=���^=V �Z]������������'?J*���g#&��'�������tjeL���
 ��
��e��a�:'���w�������;���y�)}�N�2
���Il�z�
Z_�x�����CY7n�x(z���u{r�9%1l�,D��*�O���w:�! �CWb}]����!Yi����+�n�����I�M�; _�����z�����P,W��C������?�FF�/�E��D�
��Y-���C��>�h\����}�����.���O'���������.�����o�>>�������������(�>����Q��2A�{������e�g��	B�*&��Cw��36B��N����{�G~�	����|�s9u*FY�;!����N����,�[9]Ws
kF��\���(�������,R�w�@��Yg�t����C$�p*n��H����-�C����j����f�`�!���%@�����!@�Cy7J�]Yv}�;)WgUe������1n
*9G�ww������t�����F�i����IYU�A����r8�����c��ZX����'s�SN���������b6��G��?�X�����9�$�K�1����
�W�v��hq�����������|�r�����
������Wuaq�?.Fu���8�������_>y�����n�;�g��'��B�6/nn���x����C����z��������S>���i��A���w;��_>�],f��x1^��+�����&>�/�Sq��M�b,~z>��=������������O�����j��Q���2���/�����������������;��M{�����t�����������-�����=h��$��O���E
����?L�@�E?�a?��(�o�3>�*-���y� zZ���>�B��9:;����c���7�����NO�'�^��%<|����H}=����pQ ^�}B����7	{�O�o����?��!#B@��]_>9T��=�|��O���^>!�TV�$W
y�����H=-B��<����p���bL<��>�\�����Q���.�A�h2���r�0���<_<��MU�
��I��-&����X����?*�z��@�����UW]O�f�������VQ������W9�|:����t�
D&V�D��O�h�+�N�c���n�;����L��	oX��Q�|��b:�������������@Q���.��_"��'��:<#��{/����?�Y�b����U�������u�^w�,� ����x��&��������s������C�`��+V�C����X�[G���S�����T���r��	O.]������x�������U�o�W?O~���S�����#�!� ���`���?��=��>H���O���u��"��`�������y�x�~���r�v��'���*���l�d��n"{�$R�����H!~���n��+K��i�U���T�l
?�0ou�y(Z�\���I����f�d�L�z?����a����/��\W'�-�����[�3����W��G*7���v�"��������i;Z��S�O%�:r����9��>Vmz�S/g������o�X�Gn|�x�������Cd�|=�z�U7���1H��?�IxqR��������������#G.].g��dr:��L�����(K�Z�G�7�L�)_�W��drg�w���j�f"7�3��� �dn�*����>wnr���a!*+/�\�
+Uf~��#�����Q!df1��,�(M��s����>�<�'�_�7��T���hJl}+Le6�a�f��|���8�31"�����6���^R��\3ZA��q��7U�0�>{�i�w��/��7l�d%����<n�e��bB��o�ZA�;�oD�UG�[����,����G=�V�jm�=����?��eF-�4a�b���@��Y���t}��*Y�V�5��Y��g�fg��0��
�����k��Q�w?���:9}p5�X�>{i�v]�n�W��������h���s�b=�gS�+�
u���j�#�[������G����v1_N�@g�B��!���2�2j�$���M?{1��U;�����~�b���R�p7��$����'�z�R�n�O���0�y8s9�N������rp�o8s}v%�}�j�<X�&i7�U��D�P7��E���p�b@��[L�}����
��`������o�!
i�������R�������������A�~���tA�'������F�E�K��;wH����z�,o���z6�
��}RN:��-4:�6s�4>{�AB]������D��~s����V�s����<��:�G4��W�5�2����k�2(Q�\����A�����Z�q�_��Mg��������|e�w�m���k���go�"�7�a�������]��(�Za���(��.��YSq���Uq~J[��{���������/�-�����_�S��
tZ�2��i#�p��WaNg��f���L*��y!z�J�7�
H���b�2���:6���^�h�R���\��}�dd~�������7�z����p�����>����0e�����d��}<��>��+����U���iBk�p�Oh��}����?�C��������������|����(c_����������������c�O�n�G�&������2����B���i��?�s����~I3�_$>���_��%n���&���K������3�/����(�/d����������e,�~)[����/g���})������_�uC�����4�#�?;��H��/a���1�6��-_�������\�����"�I�(k�/mS�W������-��X����~![�����_���/jS����������v�?~���n���*h���n�������?A�>���'������#�?��g]v��u���]?�]�~���QM��b}�����46�
~���� m+O/�����w���E�g�cnE����l��j��?�����?��+K���o��c��z/����n���}���{������y^?�j�H��
�%���G�f����Q��:c�����A��S�EbN�~����R�`a���"�GG7��������)5Z�����zw��L�o8�����N���
��m=CZ$�i������J��3|��~��}��p?��W����=�������|���c�?-s���(���Eb��������^��>9<�<9{����,w�������d������w��s�������j���.�{vy|�]
�j�?�]�s�7��WL��mQz���^L����c�-��h*Tsq�*��������E�P�F8�R�|�iz-j�^������R;�;��b�������<��.r��/����s����������gk�g�eT}���/����3�_�
��>S�����K1����.����������M@������h�jL��B�yY�(��b1�8�>`�}��d����S����<��.J='�n��L�e�6�iQi�x��T�����}����r1���`��|�Ty�n��+_�
>�4}����������:�U;�}_P���[���fV�s_ ��\�vq\��Zc�@rYD#`D��V��#�H!���MV�}�BVd������MW����;�s+�,����[���2��]�D�����{#z�9����[o�9��o���/���OO���`��`hWR2RP�����+oW��	�������.f��@��
�]�����/��\�j�����>fQ�0��������jw55b{��u@5��.�/���x�:�|���1�X��_���+���]>�B�4�{�4���y�
���`,1�������E��X��8\�#.����bx+B�������������!��64h���w��q���T��Uc�Q9c��J/��I-  ?5���\9���&4(�^�3N�"�q+����@|EhJ���sT|,F�=53�nP�<G��u.��\\-E.M�wW�|�����Dc@x"~�a}�j�+���~�kY}O�,���N���@m������De�^F.�u6iC��W�NMh#��W��h�*�6q1����3���n��H^�H�l��kq��r����f���������v D1'D"���o�M����_��O!�����,@zbNzD���S��I��_P��Q��3�D�<��ib"�;f*��e�A���������^O���jfk�%)���5u���5I@�O�^�VHP�{-�(��vAPU�UU5�%���[dY n�������WG�9;G���Nt n�d���q�V`�b'n;;�T��Z��U�Py�D�e�5��T��E��B��T~n�P/�t��M��6�=�K�(0���$�4J�|��O~��#�����c
�p��M���}��;Q���)��)7I�)��)7"���?�/���qyp��B����jD
�o�4%����F����Ay�/��
u*;�<�����a���O��T
�{�T�SP�SK}�$�������,��]�\h���r&�Q�����@��PC12����(����!o��:��|0�6_���Tx�H�z�\��JzM����r"���qJT�eN<����r����Bm�^t&��
��=e�#�����OI�|J)d�\�y�]2 A�c��;�U[���V�b�Z�����<|`n�-��N��w��@�2�0��r@�2�D����qS"�c}b$J�Y�@�@2n�P����A.�u���j-����U���S���������z���i������-p���S��*`��_X��wi{������������X���������~�=���������o���s����?�jZL9��v��b&�v�o8����+�������3�BU>�},�^���N�������O*~xy*.�:0�o�K�s'�g�^�����4%��~���r�m����M�����2A�+��|� (��F^��*�C��sH��5c��}X�I-�3��>�������0���^_���_�?��_������z������e���p�'�s��Rk��`��5���8Y�.a���r��B#���=��aM�w�?�2]# ���w�@�������P��*X'#6�dk�q�hd��j��
���0RT�C���T9. ���_����u�(���\Qv���[��NI%2�d{��wg�;}��tHr�t|z�������������go/��4�0�>��������L)�5R���6�\Q�R����a���9�`+!�J��7b�*�[	�V��Q��G����`��h_����b<�����h��N��U���� o+��*o���@a|����i;a��A��e��q�i��!����Z]�?��Uv�����&����MQ.�y��X�����l�e�� V����So�|�@���d�������V�2@y8r��K\��q����z���TM�M�����
@X}a5K2�X}b�'T��G�K<�I��������-n&S~����s�>��U��W=�/����b�#�
p�>�����2r����PZE,�*����i���@��HR�2��F����3��p����3)gt?�'�O?r�P�!�k
�G�j�L�;�%�gQ`I+;�E��O���~���JI�����9��N�d��Y�����9�S������$���fv
Q9��
���"�;���������Q3�(�z]E\DC�-���w��V��XQ�2@7��u=�@@<�����3��a�T�X.�a�'���]��~&�s�U�T��q�R���n,i�������@b��FG	1c�rT�:{"�`:��[���U��>��z@��	���`4�&�}���{����;�UPH'��S��	;��/�9�������#������)E������Am�Z^���6��vo0V?�W��t���C�%C;8����"��9����W��:=[�����~4��X;�OV�S�
���'�*v�cF����K��X�9b�R{��t<.~��e�e�m,����U?�PdW������t9u���Z;�4�S���C����W�TF�1����w�I[����� �3�eb@~��u�X���V�s���}{y|^mg(���~]{�y.E_�qj����y�N�V���5$��~����.5 Wap�����'Y�����>J�#f>P���:�J��T��K!1�d��sx�~����G��������~o��%��bn[������d�- +��w�]�/T���n
�}< @n�8V�g�������\��G�p:{��T��P������K��2��i\���������6���To;�j��>�u�{�u���D�=�jU7+��r��+�f���	���5H0�\���8v���;�_�����+�o��q/�+�z��ro@b����
���E��eh�`��Sc�	1��	�`���y����>-R�x?y�6Ts�o�L��~-�����?i�l�m����g+"�&
�;6
NS!R�vN����(�MC�Q����_�.��+Z�4r6.���v���{V����5��4,��	c�<l4a,;p���Qv�!T��0����dwO���8�Z��>�����`���+��0��'M���<��(/�����C�^H���:W��@�����W��g>
���I���P�x8�>8��
)&������u��3'0�����t+���S.<���
���r�9��
%��5l�����������������B8�\`)���#��gQs6�f)���,=�=�.a4a�p
@�W85���I:Yp:o��3���w5�����u��Q%��.�U<��L���p�3 �����u�_������RcC�by�|.��4~W��j���(K�����+�N����N�t���`��7N��
8V�qklX������+�&�vO��~�g�P�K�Du�S�����bq7�������$m~��To���� �A���OZ�=�c_���n�x�i>#��
XP�3�����k�.D�srCf�\f#����aR��B�Z7�7���rL��('`f�6���F������Jm��@+�����7��m�����r�3�N�E5\�.�����\W�5K��|�P�k�Z%������J�U��������y��V�^z>����
,<.��OF�������G�G�����A��S�oo)`l�x���B�)�f@����x>�K��6|m@G-,��������i)��ynPq���[3.W������s�����i��#Ft���������^���]�O������������ci�t�tO�H��#n%�]y����������vd���?_x"�������|�}[v�OcX�\{�|�RZ��1�B��j��{yM��#u���9�nt�jI��7PP���%��9�O���O ����)��H� qe~+g(�������ic6����>���7��K�e��BVT9��8
���@E<�_�y�n��s:
i������
�����w?(V�#Z]Q�}ttj<�����tt�Y�2X�~uPu�,~� ���.��x�>5���N���F�����D#��	<p��>��t�H�t ��S �@��
��dn���
�48A��Fish:�t`���=����`��6l:�c����|��+-��t}Td����c2e �]=I�����#c;��y����\8�~|c��@G���M����mu�c{��z�T������#����'�5�#�,S]���	���s�/���������������H:� iJBn�'t�w�{����Q�A9~�>�<���w����?��n���L=�u����YO:k�|)��������[�2�6&�X�&�X�v�b@r8'`����(��LF��(^���Gs�����3���rp��s�����p��UL���N�D�����P�������hS��0���k��B;�-�r�����9�(7~� P1�0�d@{�o1�y�5`C�3Q�D��!�-d}�UHe���S�bJB^>�od���9V7z�_��Y�����r�����@�!gsL���k��(t�{���a��1�G42�W}in��o��h���l�"�7�w�I1�]W�"�o�L�7�w�s&�=y���<=��<��������4� ����<>�!�uCd��w�z@c|�N�V��e�[��o��\��oh�Bn����A��mJ>(U�2�>O'7�������cY+�����������#�����;���C�e�e|}lfg,@�X��+�@�8���Z@����C@������v1��V��]���#��x��4X��X����'*���-�T����5���*�y���I�{��\���fDG��(��
�[��f?l�a��N���d�f	��0��_��0S7��_E��>]8�q�����[�>��do������������T���*o3����N8�8�:�h���!��m�����qP��\uC��6W]J��SC�k�I���]�*!�h�G����I�|N��;��������9�j�n�!���E�{2.�r���n3u/>{�|���N�R�s��2��
C@.����q�s����&�����}C�/���FH�0�='���zY����,N����>I�8����y)�����NT%�u�(���]��k�	U���]��,�������\�,�{b�s*j7Vhv�z������������q�!��C���N��9��������l.��� ���F`���1r;e��r8/fT�UBU�F�F9�o2p����[�}������ �_�k����;�L�Wi�{%�=�3��J���Kb�K�UK������H�vW7��Vc�}�X�9Q���mu��@��wh�b�BRw�.�T
��xA��g"&t����[N��E5f�W���!�mT{@i�J�R�h�B]��]K���V��������������]}��F�r��R��_�s�2w���
d��@na�"<��t��!��E�8�-����;>k�X�R��qvT����!F+/��j��`$�?y��T�t0�r�y��\^X;t����C�����Bh���:D��>;x����n�M�V������t��7��������]�v����~����9���\�H�E}��j�2@�7	�xL����JV��j����y�n�I>���+?@�8#�f2����yx�;�S2|��gyWy$q��Cp���~����B����[��w�B����N�;�8n�2@ER��g����z/��X,V�^���a�8�4�sIoEB2��J
����������
r��;����d������=33�J���X���+S���irB��;�D�5w-�U��� �6*��Q��|q �~ D�V.�R�0Mjn�f�0�a�&l�B3�=��+1�-F�V����V�b�SH\2	(
G���Jc���:�<B:��^���j�����A����(/�{���<�W�p��7���P�o��
`�:�7��3���x�f��@x�����s>i�\Lg�R��z}�w����E������R�U��>�N���k}>����Gd���nD�S���]^.w��r����X���S�1���1�j
���w�=`�C�����\�������,�h��������e"}@�!��o�_(g��x�#9������N#h��D���Y"7���
��e=�Z���G�AK{�$���)��t2V{�9 j�P��9�,^��1�^�TL�|.�kB�����#�*�DN�8,���f$�����a������U�Kr�n&=t6o�!�����	ya��Za]4!P@���N���?��<��:�|s��rt�����/�?��X��|he�=�]��N����wR��f1
�;gk����&��o��Y/|d��=c {�j����>Bx���=�����,���O��w�`����_�.k���������� �?���S�Z�	U���_#�h�d����n��n)����kQi��w��V� wu��U����Ww�]��6�?������c��NO����� v��\]���k@���Y������|�����:�{�o5f,,s���#�u�=hNB��n������,k4�L�"��G�v��f0���io���11e,i��.�?x����tz��������^zx+W�����c���-�]=��G&"��E=�I�z:/:�G�q�#��RMq�(���������w�������\���
�w�G��������x��Y�~KR7+�E~�"C�p�ju���Q��N�/�dE�-��P���8�q�2@��;��J���[t��g��t��u^;["n�~ ���5���/�������������yR ����vukj����t�X/��i#�����@Iof���(p��y�������,��x����wg����f�{�`;D��C�b�����bn����a����<�l;�������:��#@���`[��j>�1���W����k�??>8�S��.���=�:�����'�
�@��Ov�<4-�d��	E`A��&>uo��b[�m�t$�XFf���}	�- �������V�N�����^+��#���+)��#��/t��J7��E��H�!m�V3[��c1]�ce5���u���0�oD��@7���q��y�E���bc�,���q���1�(���a������� �#��x��Rjn��;u���#��G�s��_��v��� �#'�}u�n����Q�V�Qq��F��w��y�9r�\U���L��P�W+���#xG�m�z�xG���st;rE�=/kW~�;���k�|�U.�L]�v�yj�u��Q�����]��${����P���Q�8[� ������vS�l�����#�������X$z�;��I~_�����~�PJ[�z(����6�=��#������9J\+�<����*����f<�W�
(�����
0��5���L"2G�C��4^��q}P]9��rA���n����+<����n��������#>������#��/�����P?G��L��������y5@T���^����uf!��y� 1���[}`����UL�&8�8S3%�n��Jd�������,;?P��s����-���#�Gn|q��������T�-v����Ud���v��S�!e*:��#7�� ��
~���|�U_��b��8��h��U�XV��I����x����@�����u�r�������g�XA�:+���>(��:0�r�d���V�"���( �Q�u��
m��������XGF��umq��~�o=����#���f2r��j#_���@r�&4�j
����#�U�o����C�b�]:�FNb�n�E��
����1��g��Fi,��Jz�R���
W����sM���S��Vf�7j�8X�H�|�������@.J���:Gm	V�q;���1K���f���+��$f-��V&n�1����\/1`gc�ug��l�o�E%f;��������1�pc��/�:����[�qc0�����qch���e���+yls�}���'������^���L1���q�Q,1�Z�m|������H�D"�?�������XN�������e
����~�1a���P�)���WO�`*1�b�
�c��n��EAuF��L��%iw�OI��T���Gu��z��<�m��7�_z,@|7���\����=��?D�1�����I��6>�k�\�Uz�@klZ!4Z-Y�����,�+dr90���)���Dt�G���3��������HU�2/��-Os���z:��������I�m���A��9�S�����U�7����o`�3F����23F�k�q*(��Q(KN�-cY���^��=��eq5~ 85��-���. �g���zjrNFF�uj���SxAj��0����a2�_=������U~�ye>���im���K���j�1���8E^RD{zH@@��4
>���=�U^.�7������2F.����2���<��>3�����63vc3c�f�6k���JQL��HL��q�O�s�1�9c������1�?c�imRlG
���-���XY�&q`�v�d������v����4��*3��5�ug���Sg�1�g���!����5@���s���c:��
�:c�����`�(��(9gI>��r��Vt����z@vx T�P���O�����
,�[���5�\?����6�~�"��4�uC�]�!�P�[���Dn����\��4�������?����5S�]?����G�g~�� �2,n�)�fK�:J�h�����R�m��#��!�%Hm��O'y�<W�ETC����rY���;c�K�Ye��Di�� ��232a�o��)g��u�j����:4����j���,g4��������6�
���6������`����f�S���O����4��R}P�8��^68)>ph����e@}����.��"�	�7W4��uv\�����T��R{���><K��]GAI�t�~
�Ac748h�����_uH�7/�j��[1F��u��X��eA�$�1��*�����J���s�'s��~P�]�v���UD�������B�u�'�����m�(%:����lO�~����2�]O������R�bJ�GcW�v��N�h��n�t�
��z�%<}Ju�"$M�mhjW�4j�����w��0G���7�m�S�5�9��lK��i��hc���',�PZ��E>�k�3.  ]�'Z�X6�������l��A6 ^c+��u= Y6\�Z@�l0*��_(�R�k Mc�d�<O�]��*���d������a�h������-k�c4��6�9Z�2�.ZQ�#�>�$�Gc>�y��5��ku��g�9��������Z�����1c��E��r������v����y`S�.�_�~��R3�mt�F�,����|`�1�]j�H���f�J�~xy*.�?���W��#'����R���3�u*�!�5����	�-oi9N��{�Da�������P������|Qp(�)�R�l����	�6'����������JK��8y�v���������������������}|���&M�@���	A[��Z�X� �d\@����tU�D�K��L���J��zG��[wL���VWu �n��@�2�e�j� ���R$3�3,�j������3���e�U����m�������).����_�����uL	�N�������qW/�^�N����"��E'p�Ij[����o��W�g�.����<����%��&nN�	W��9�:�D���Z���e�$�4���2A�k��MQ.�y�tvX-���E#�u���u��*�So�x��M,lG�t��J��pd�~ Q���B��&���[�V�my:�~�=,�j�
4��U�
[t��p�si\{7-�nV�}���5�N�Y��T#����\�h�����O��#d���u�B�G��u>�]]T�[���z@BX���` ,�����[��,lg��nU��h�\U�3��I��!!`��]o8��M'���3ZP��EQ
Yj�9z2�A��Am];$��=�]oY�4Q�$j���|�O���]z���0�/vE%y��h���Cw��&6o�
0����
4���z��Q���ap�]��Y�{�]���mEyi����sX���ltq'u�|9�S��S_���h�*[���I�:n[�U�a�v�'��2���X��(�g�$5��B)5�7(����QJ�H	J7�4@i�����]=N�Lh�$rviNFqs�\�<j����e���4�'�4j_u{��?�@�&��
�V��zM[s�2��o1�%�P�C�2��@Tz�O�5aY�-���P���#S��r2:�����
	 X'kS���T���E����2�b��6q3BM����#��wtl�������'^�M8|V�/�7�S�%]c��O'7���X�����C����2u�:{i�5\���[�2�M����PjY
����;~��7�\�2@XK��MH�v�LY�3���G�Cp��l��lV)���v2�$q�dn��J<7L����L�@�&�d�=:�nK;��l��M��mP��A�GqA�L\�����#���,���4�XJ���u��J)4����a��jYN:����gg\��a����1,t���xPA�|y��^��m�^�W��L����
��'�,����^����+�������/4�k#�0"�g�%h8`J�����
hd}����T�p�
4��^oV�i��:G�|9��#O���"�f"��:���aGITv��M?AD[$?�Z�:��k�2@�l����������cY-|W�W�Jo4%��;�����}��J�i�n<�#�e#�ED�\���#[���~�gT����l��E��<���G������������|0����� �FlZ.��y��������P_�x�A��v�������������������)��VC��r6������YS����b���N�ptx
��}z�����y�3]���B�������>�����6��'��(�r[�-��I��nOT���[�9��z-�'=�I���F��T/��Ol��9&�x�l
�j
h�O�<vI>���*���=Z�phy������Z���*@�9�/$�����:���2��~����dS��q>��5�OOz�k������<U���t�������\����U2J����es�Y/[�~����:ov�R@��!�_�.��+7j<&>KwJ����������S����������[����]R�wXf��C/{��<NC2a]>O��r�~�����0��<�x�~�=�]<R�h�/c���R�O�t� �)gR�_H�o7�P��&#�>YU������W��{c���O1���unqA
X���O+�������3_S�w��]i���<���/��b1R?������/�#����
�WU��b��}�������w5�����}o����^C1���5�F�_���u�������|.�/�n�r>|���������<�Mn�Tw�����]����'�)�:X�������z�����I>%�����x��������b������y9]��������'���p*.0��_��O�g��'���'}5ly������W��s?�\mEed�_\MG���������_�oZ�gQ_{�OF/�� ����>�^N����9�c_�$���~�7�K?�Q���q�B���b?J�����O^���l�;���2'/��������o��/���������������C���/~	_�8�<RD�E��r>��r~L��o�P�����B�����| ���w�BH��g�_����Od�4Y<�|��O�0�B����Q��|����3�S��[,��H�y�����Q�1�_��]�e*���O���2�C�h����r�0���<_<���Ut��|R�fK�I�3)i�����J�����>��a8�^]yig:_��GlnV���Nq��C��_�@���ut�M��7�X=ex>?Y���?\��d��"��~��w\!��+��vq�'t��b2��j����t&:u��QC'^���_����u�T�Y�^>�U�-�~�e��� �1;��4X��v���W}�.���KF�|����0����]�;�Y����?�O�������h��aUG�������u����\0���L�z���L.w��T�pK��l�����^��a����E�g��������H�\7��`?���������������u]����p���8�.�����/�O��`��.���5�R�������l��M�����Bj���T4����K����%�{�4��Q�|��F��}<��?�<��@��I"���67{%��������r}
�3�5�}����:yo	'k����.D�9/DM�2��Tn����t� �	���>���i;��S�O%�:�![�.���:Zs
�Y)>~6����^�F���������������;�n�
�Z�z��l���?�}����f���I���Q�7\i~�l��,����J�=�3�Y������v�/�RH���Or��{��pD{������B
�4�}���E�yU��~���i��y�����M�w���7�w(�~w�[D��?����K\��0��������Y��]Q��;%��r����_2��-j�i������i��8-�=�����N:sx�bl�t�b����]���t%���?��
�\�IBm�'	����������|]�����u��;����|]�����u�g������?���h�?�j��-�l�>��\��kD��������]����k����/n��[�������|�(_�������f�+l_��?o��.K�_~�.�l�������ko�{���}Q����(M���?�G�b�gUv�.�|]��o[�Y�B����5���fk�5Y��z�����>_�}>u-�����
M����������-�����x����%��K>������smg����x���?�q6���Q����A�dP�P�r"]���:3���y�'
��3��1�,/v�=Vj{��=[���n����n��y�z��l��v�rC�.(�k��#`u������y�Iz������t�0�*���z��>#��k~O��#Q�;��Q���t9��c@���������{������'�O.O����j�V�<��:�65�	��<�II�`���VJ�������j�����
�y������y��{����D�����������|0������(���'�B5��k�M��f�o��-�e���r!��l�=�N&�d:��@g��0���gZD�~����J���V��x�g�4���z����k��=�v�����|����2��N��4�1��
o:y6y����@���|��l�k]���/�����6"��b��xLQ��C��~��������[�"~��z?x�{�����hR���Z��sp��������\��V9��e�[Pp���<���q'��O�]��\ti�)�w�O�j�sQ�
��r*��+��ch)��]7��w/������N�(�_n����-z<��s�xW�EGE��kTo:�U��EY)[��������X_�r�������T��y!�V�B�����
�h���Z4�z��}.
��p��hh!������uI�j]���������%�*�A��o"�)�����p�������]3��z����t:U]�sQ��;�.��:^opGf�UT���0�s�2�������`�?&X��9����6$��H`��H`p"����]HM�H�?��x9����r���^�x~����w��v �Q�Z��!uk^�&���wT�v���Fw�g�}��A��]&bp\�����'�^�z�L����1D`��o��_����H��@=�z��o�(��������V��\�;*W�
��������\�=w��e^,:uSWL�f�z:�����q������"�6SL����b�]�Z�nC��7��1|�2dG���mU�mUm���]�q�p��B����Vp���k&y�r�{�_����r5�wp�Y\���B����o�y>�������������f�������ae��^Oi���bTu���h�Q����;�1����<cd�u�i�)5P�c�A;Z���z������=c����������b)i��94]��2�/vN�������w�3����^�z\E+���E�j��d���
���l��j>���X/v�}��z������I�x>�:!��7Gd^?(���, ��j����v�eBZ�	K��.���P�g>��I�C�x���@#���+o��84�������k���>�t"��|�Ov�/�z�|&/��8*�ES��*��z��a��*�7(Q��U.��'�@������j�����q���T�v�3E�1�2{����g��^{���$>��Dv����*��jf���Y:�0��9��Q�U9}�z��)C 4��8K�K���j�B0�]c�P$B���.������"_�n�t>z=��;4��?y{x����v~���[��
�v)�M��T�����CUgT��U��M_<�������pG{r���]������O��c�<���2�:��N$�RfH������wf�G���d�(z$P��?h2t2���T$7��'�y�^��=��T����4����j���
�Y__���_��\�Q�E����z�sY������P�;;T���SC����t>�e1Yd����5V���4���L�k���������g�7��f�,Q�;KT�
���
��4��O���n��~P�3���FB)�j�������j6M+e�{�����~��
)��o���{�e12N���C��]��bW�����N���[&/���KX�d�\��(\C���^��::���[������{~>S�b�������?�i��b�����#����������@_�^��-@_���n��	�+��n��,��	���M@_����������~���ttQ���"�������
������G���j��u��p�r�g�<������'_�~��u����������?9�u����������?�L�:��u�g]>���c��1�I���?����=��ELM}���:��??9���Bo������h?��1�#��~����wI�O������I���>�m�>�bf}L���~��:����B��ME��x�B�=�)�_�P�uf����:�����9��9��F����kk#�f?��<�)5��z�R�+��O�������#�?�O�i��lk�����#�[
~��
���/W��v~,��N�����E�{����%I��F
i��;����bx�Xv���e������E�,[�M���|P.���b��i�a�����=���Ke�(�/�	�	#�\^��y!mu�����R�3���N����m���Gx#��m�������Qtg���x���e��X���l�����?��q6��Z�v�Ov��,��ewS[�m]x���|1���D���[������hd�����v_���
�[��j�W������,n&����>�:�n�����h/9k�-=s;4�n@&?/�nT������u�����C���?��.O�x0_��L}_��
�?~V�����C�9={;�k-D���[����||��{:�v����g��m�S�g_�\V������[.�rY��
��l�����j?s�-��-7����n[Z[�fk��5�n�Ng/����:�n%l������^�����q�4��:w���S��jK���{������Q�m]�;�DG�L�L���Y�q<�_�zw���r��3u�I}���a8�Ko'����������4��i�����������w�'o�������9<=�M"2N�5�f���B�W�����n��z����/��L�~��a2�o���6�9b�
��:�[�lK���h;�g�����7��^zo����q��&���z!������q���B�n[/��}4��O���X���(��9u�����������X�d�����sw����mmt��x�O|v�����y�Ww��������:=����������������d�}��wE�������w6�m_�#>o��K=��V+Z����z�Ye�{��o���m=d)���o�����Y�~�y>6���)lG����o5�����bZ�Y�������e`[#X�ku
&y�8�g����e;p�<M~E�c��"K��(�0v+������B~<��x��g���E�+�v����_s��#�@
:,c�ww��,������uJ�Ta�/F�hf��ym;���������l�}����*V�jQ�t��([�[��9�����x0���M�����e���>��I��R�x���Ny
c��kK�[��n��������Yw�Hf8��+}z�����������P����cf���i�/�e�!_�v���Z�=P�{Lo�����_��N����N!��(��Df.������T�S�����
���������8���r�=F3�7���D�ds���`D�������1"Q=�<Tk]�V��^XQ���2�#J�s����L��P�H%Z��������w?�_�{H����tad>����N��v���]���s�F���We�9�d�N#��T=�f~�����:+>w���uQ|�F|F��u��G��U^!F(zAM��	B��|]����W���5S|������������������B�������6�uR|^_��V��=�����>-��������V���3>@�|��0,����Q���qr�m��t�g�=������<9���%���,�'�$�/����U����J��u"��f�K:�����qiP�YnN�7��t� �����R.���o��:�C����
��p^����a��{��NWG#z��\��b�\(����u�t��x��� ��O�j�X+�2
�������
�w*su&�r��I����i�^9��$���/�`������@}D�&�������BWw@���W��K���I������
#�o���^�(����<�^�Y7|t��D�(��U�;�� ��^_���<b8�/J�X�y_�����y;u%�O��c:��~UL�#�����z�<����IoF���������.����PR�$_<�/'�P'�
\g�!�:��r�Z�/3/�u��n����MG���O29���O�`��q��
F�P3�	������o��W�FC���W;�cIv��J�D�^-on�{���
J��D����h�l�,o����s����4K,Z�:�#�J��^8���g�|���7QD��c<��
��UT����4�B��������^-z�T=?��h��yV�#=5$�(��@�~�q?H�f���CQ(�,r�7��A�C������#�<��*q|@96Q�P�����f{,OU��Y:�';�s�<�y�A����t:��>(���?|���z�����N�M���w��y�{r2�
H�0��b�*�R}L��m!z�;���I�g��;9�A
M-LD�h�z����G�]�sY��>e�3�e�9���>���:����~������5������������{�����/)��w��������@�������:�����cL`F������P�:]_@#���J��As�F����\R��we=������D�
G)6�h���+������d���&w�Mg��M���\�V��j�.��s@c#�z�6P�����r��3�}�@�N�KQz��C�����������FD����P_�fSr�)��������Xt�Fu���5 D�@�Y�����#U	���f���/��K:���r�I�#$������M��F�����b�Bq��N�����r�	�J?v�U�.2-o��[��7\$@�b����#]���GW|8@c���1�	j��_s���c��V������B����wE7���|������X�w��[|���a�uz��J�Y�z��>��/����	���1�9�T���x+�T3���LF$�{��lnU\c����-�J��H��s`*��������e��%��/���x14b��6;�Z%�&����i��������>������8
�������e�9�U�������|BU��H��Tw7����1��XP�>G��|��X������W����`� ��~-'�o��;�?�W�%^�g5��[�c�	�@-���|�x,� r>�N�-u���2o@czD�&sP,���}�?��F����1���r\l�����k�X]���+�����z+�"yd����^w�h�oCk-i)z0~��(�a��~_�6����[���,��8�vo������K�O�����n�U�r��v��S=s�9��R����+�p]}@���J��e�����4�z�%�U}�V�Rs;����t�����e@e���J(�t����������'o/�������`{(Q��U0�~�q���$��P7�`��e1�
�����w�?�qO�V�CZ����q�u�Q���X����
��	��Ru�Y�/���t���1"���s���3���v�Q���>�[��o^���^��1���X�:��P�~�un@({�-QE�<����h��	HK�m��#&�)�n$��\��w���s��~�=��|����6pc`��['Q�o���Q������Wt�B^�kE�Vk	��-G���[��>@m��8=��l��^�*���zW�vT_j���������;��.IG����kH��8I�~HR��A_z�~��������w��`��.=��V�����e#��e�r�]��[�`;y�������?�o��1�6�(���R�'��+1����H6l���d��F�4�J�]��s���z�BE����?���]��}^�`d�����]q�8����elIr���C0*[�>9������_G$��g��)�sZ`����Nr�B�;���XuT�Z�UK��N����������Ck����-W!
�(��^TAb-Y�*�����@�
S�%Q��������qf���t� r�;��m�|(c��9���y� ��=��9�A���������h�/���?�}��}���_O.�go������.��y@AV
�L�������������^�E�}v���s�v�_��o�
�p��~��V-jt�����VF3� �+���]���o�i��</�vi�����N�a�i�����D���P�A�8��.0g����F�@68�W�`z�8�h��p��r]f���q���yrTU����6W���I�)��*~yS|�'z ��sl�r0X�4��B���Ho��p>�f��35W
��z��S?��n5;��b<�C�*��eO������w������yn�Y�������6���H�UGJ��CJ�k���)���\�#_���5��&x�rP��|���?�����7���Sq�n@l���������2,���jUX�K����c�8� r���{I=L�+-�( �?x�������O|�`*l`�W:�������q�h�;C���W�gK�kW������#e��}���q&:%���}s�^>��[��\�/����K��������j�
�f�H���|�T{���^h��"�5���^e�R�i�?�R�{&W<����}����wW�x�9i���$ ]�}��Z@e8R�2@`��F�������9����p6�F� ��J����b<�x�|����-W~���y&�@7���7��=����#�?�t�I�������F���3�r�����t�CG�4����r�o��K2�7���_�� q\�U9GiC�����i� ql���r@{n�gh���=)=f���A���%�<r@�6�T��x4H]7����Cp���I��aI]� [��)�mz��[
RW��kkC�=W��p����~`S�M�/D(uDR��{~,
8�T��X�_�G�=?�,�2�������'Md~�j�lZ�����s�4`�RC�UpT��u*
2g�+��U~���5ZM�V���T�R����K�V���Bi��L���uQ��j���~'Pw9�U�<j����tx����7�S5�3S��~
����$e�
�g�=�T!�2�������Tk'��+`�
�~;:K����-�H���Je}sn��}������?���&�;t�=��p���.���\�.��z��Y�A�w�F
*�\����F.To*�G�����3����sEm��|>�+ov���D�D,����Co�����o������4���������
!�RCK�/c��5f����\�<��)Zu!��
���V�,�f9�Q�q�|�1r>>��������Z#:�4�j�Hs$�����yZ�S+�V����������4D�i����������x����O��������7��zx||t|��(�v-
Y?X=���p0�M�j$_�JeS��6o_]��V�e.��U�2v��-[���s��3�����M�M�����^w���:8}:�P��h1<4d��W1��
�[���������we3�5�0���j���	<,4t�AC@��
*����������B�_bt���!�j�LW��Z�P��C�0�,3c!@=C��_���.�B�a{o�c����
������a��������Z�j;^�8CW�����;0��!@7���G���9�����3���-��!�:C�K�g���mx�q- ���~���������[��A��,���r����@\KX��������6�pz���U����*���
9T���,��ut����&�]��ZnV(F����A��y������V�����W4^N�c���\��u�� �!�|V��Z\R��amX��{�.�w����:�-o�bQ�1���Y�V1Ug}���|P��:��J[9�I5���*��/�[�&��\Y��^��
th���*$���0t���#��QsP����j0_��m����!w�<%�%i�.�xM�MU�����8�����rN��e��D�k��U"J���`j��Z����0%�N�<no��!�����W���@�!�O�$[��VX�fw��4T%�#&�j3�jVVH��
! LC���b��{���^����
��;]�}][�F��b�J���7�v���
-p��~�U������E[8�������W+�}��D���D0i��q�A��*6�������O��q�Bx)%]�bZ�T��AN�����=y���@�82�3���M"�]

�~����[W�����a.�j�k�
c��eU�(�1��`��9�����n���%��
-��N!��?�����7dY�F�G���
mF���2�f�Z6t5Cmw:�Mh�6��pmh�k������+�o�w��h;���D�_�l������l��2��y��N�"�G�l��;�w��s���t7�j6gtZS�a�d1~�������U	��_��	h�xS���/JZG����L��`���Crm��Ut���6�A�f�
���3�?�����ll�����Z�`kO���a��R��O%�l�z�z�.z5�\W��)����.�4�z��z?O����:vm^��{w��%g�kj%�;���W���P�����[;����J����V��uA�j�m%�2���5����U�a�:E��tl_�f��9�xj"�>��lh3dm�OcRu"+��C7��n�U���d-��#& ;�\�����,���2]V=��(&C����,�s�`n��1�.@�v0������	�����_��
�}���Y>�7��90��ao�5��1]cH��������ATSi�Q*�wCw~�#�5��?p����o��k��C����[�6n_C��;�7�9/Vu�^�7�����#�!T����.�A����T�	j'Dw��}�����2�K������S2D�<�����g���B��`���E����t@��:f�W��9�����T@�#WK[F/������3rG��x�?O��7��o�^�t�vG���'#)�L'0`r�j��IP�|���]c��O�X��<l����Yv��%�5��:�q���#;���R���|���������Jtd6���p�\�yo1�k����Dp�,�Ln7�x��{&��q\�~ ��Z����E{�8�Z�2@�8F�*:��
��jm�=��\kG�4s�o�T���� ������<��:�R��s"�	�����s��w=����"�������������7&�M�wu�W�:~{�?:�<xup!��o�i ��~O ?����:��c=yu������
=���^��@.��Mse������\��e��A����`���;�Q���zz�v�Zz�M���<���\�@���e��Z��X����6P�L~3(&����"���i���������Sq@���UC��l��eFN1�
8��Z���R�.�5�z#��>��^<�Y���w�_��+��8��_�bD��khe���@�����F&-f���P��������#w�]o� �f� �����E�u~,m���Jl���bJn�����yx7����v@�G�������`~#�4�0�)_
���J��6�#7 >@|n<���z�����	�^[�Se������l����"h�=�!N�k��r�����s���#������'�d�*�[Ms��\�Y�"Zg����my�J������y@������s���b�������?�|���I���#D�7�\h4O��/F��"p�C3����*��F.�.��@?�>~�?�����$���9�}d�E��vz�N�YN��������hE|��/��pp�p��5��-8��� q����}�M�$>��x���L;Y�MU7�\��������u6����A�`}�w���rU�����N�������:K�%p����Vl �����0�>�����x
��
����l����3�8�fJ���.�!�l��)Y����������n��V�g<�����#��Y��������[�0V���=���}����w�n���~������we�r�l�N&��r\N�3u���y����%��D��"U����6HJ$���]5���f:.K7A�������s�Z����r|\����r��G�;FG��1�������tz��&�]��C��9{�3�zY������u�{xz�-��<0A��(� 6�8�7l�nF���n)��.�����N"�>����J�2���b�\H����.���� ���n<0q�zO�x�~���7�}p��Z��&�\oe��d�<\s%'���E]J6]��)��&1�<�):�y��-(�<0c��p���M�0��#���Jn2��B����j�#���z��W�n�;��iA��R�hB&]���$"36�=`#7;����
�\]�H����,�<�(s� ����hi@����u�"+��������|)����v�s;�;P�d������
��7w\Kw�<0a����n@�\�g� ^��f�Lxl<H����0�����S��)o@�Z�N��p���T'@�Aj��!�v��<KP�tuk����5�@���~��8��B���.�s������N�@�:��h3b�9.��MF#�Sr'eH�
>�W�q���1���OT��B�H�e"��%|�J����.[��Q�dV��o�u�U��bO>�\� ��O���_���,����a g����#<�p+�����x)N�L���e�Et�#! �C1M�����
Xg����!��C�Q�L5!�Cc�gGGB;�v�s`��;�E����F��l(�����
�l3�H�94����X��B@=��6	C�����	o�Z.t�Z��|j��f�l~;��1��(������@�R��5)Z�P_z������(��7�28���_B���x��
������E��(��A���&TW]����j�@LL�:�6��@6�Y��MQ��wF�`�z:�������Y(X-x��Q��������g��V�v����Zb�"���}�;�1
��0��)�r��e�*i�[�9m��:�T!`OC��*����B�������/S������?�U��B@��\0f� @�Q��/v�U�@��)�2]����\��V�B@����t��J4D0��O�3��N� *���u�~���1C����O�F@G8S-�A����c 8C���Z.4�Z�����>r��
�����E&��wt5�$G�"
���:���
�l0���v\C�_�lDbmP�e���:<j24��`[���^���U�F�2�XKuH�eZ�~^�;�l��u��f�M;���?���$�[�����l��<�7C.R0]j�>:�w�<W��gsH&]odF�����~-��x8(~��=�C5�b@�7���c~��EtC��p'�c��bR(�A�r%Y|���c�W�y�c����0�������A:}������pH.�;�Y��T�rMP��]t�P�!GqV�^�h�j������"w��3P0��ddf�||���EB(���zdM�.���<y5�TTJ�2��2�S�@�L���k%e�nK��~�Um����.vq��p���!`>C���"�\���}����rK;;���l�������N'eU���RiU������W�n����@�j#tg������axhh��i�)h7z�6���i���!�NCC�beZ�\��3-�����}gW4�!��������`�
��|��5�@T������z������B0�?s���Ue�Y��{{��#����&�U/N�9��p���I�vq�P�������A�����T���j�UCS�dG���������UM��`������O�=^454��/.��:hV54������0Y{u'�P�%R���*	(��@�����&�����-"j��
�#�i	?B/e������5��g�l2d���]{Ts\#�lh39pk��:I�g�����C�a�n�T.�`6�V��C������5W�����-�_2
K��(�=�5:JO�k|���5L�y�&�D
9�c-0��u�af-�NT�b���!�YCgU�������t�<
�4�CN#��F�A|�J��"W#\U�1���:��j�[��.�Yg�,������vd:w�^�d���Ix����2������Y17���4q��Z���G�qq�����V��@�v��=vF��8T-h��cg0���@��S2����E�UC�n�l�� ��]�8
�4����F8ZU�g
��4H�~p��\�Z�%��=P
��T��p��y�N@X��6�,%r���!����C#��Dgd��|B��-�R\~������0����C#�U�=���V�v^3/����h�&&�h�Y���F1�	����J�b�>p��&� �X�t�)cF�$�]R"��F\�R]7�qtg�n�3�
9����QF����s���4�G_��qF�I�	���6��U;�?�0p�~e@��6�hYmt=R$���>T-�K���_=`����[�#z�<�_�~\|��|�$�b�����6��,M^������(�gz��pm�C@#��F���A%�?m�0�o��8P��%���|�x���AO����u��@@��������=9���}w�;}�{vrv��N�O/N�Jj ����f1AA�A��!��ng�����+�X���=ljd.���`�E��g�y�#��Fv(iP��%m�����`��)��%��>?�u
6T'- k�����p-,k�������H��w.����@����U��?�y�������S@�Fl|RGW��F����5yJ�4�e��c6�
c�2�L���on���3���%�	}�������C���6�Z� h����:_�D�n�kM��y=h#��� ]#�t����u�X�u�y������#Y�\I}@��su���(b�b�)'@R#I�H0�A����Q�������Cwf|��F���I�LH�Z�~�F�1J�P��CU�b��2���M���pV��b�F�)��
Wm6Qa��F�L�Q�Q�� J#.z����b~u��N6�P[Ul�q�H�~����G��-�+����u�Zswp��i�
@�(�$���������
M`,d��w�8T-t�d��K��e��{����Wh��PS��p��-�y�*�{����CF��lY�N#�i��@wl��N3j*[4�����S�@�l�P��������q�K�q
������V�@� b����/�q�3��8pSc,���G+�B��6Y.�j��9Ou�pf�����
 �(�����=f�5R�C��QZ��s�z���_3�5�=#.��:^D3Jm���>x���-��L;y��1�<cCD�S�#�gl�y�������b�������b��E��q�F�l4w������r�;� f�F�T� te|�LX9��5�r���b�2>�C�bI��a��`�y5�����S]����jA����������|4��zv2v��8H�vN1�8���]8l����r��7\VFm#{��9HS-�vC�M��j!�}�������3/����F�?����r����3��4c."�Z��R��B����(�������=Y��<�+���\�B�wh�L��,�������bB���9�+\F5hmsWz��V�sud�e����$��1Us�����&�O���Q���+C��V�wQe���X����ag��Z0����7��Ro7�r��
���.�b@���-W���H���2��K��wzy�������	`���y}�z�1��:�3��@23�$3J����;���Y!��{�������=�fl	i�t����+��{1�c�����d�������Ul��U���`�������U}�<���^�R�p�����t	����J�d�0���D�.�z���O2ZW��z`j0�1�L�5���.\f���C5/
@�1��=�s"b�G���Ae����?Oog�����O�U�M�������A���UK@/�r����c�F����������_3o�I�=�����h0������o(d���00��@���L���my��;����v,�7W�_)"@1_�:�������4�QGKu�����.�P�5@�b;t�n�����'j1���G��F��v�^�)o�~>z}��[��;����8�q��p����������'p����p�)�9���9:��/���uU
��t=8��b��S�b�v�.c������
��b��k+�Y�T����;Y
�u��A�?��z.��V�j	F��;�1�c6��&_���9��c1 qd�hW���wc�s'�A`{�)����7\M�.����x���(���e������^@Si:v��?@�bK��a]��_1 �b.y�Z������a��!��(Oz�&�������'�����2Nd�q�}����q����Y1�������;�Ic��,X]��}t�'o_�^]}t~�;~���������Z�"b��
d��^1�%��������_�e���l~��L��1 M��24d��|2��&�l����r������{T�=�{�>y�W�{�M��Z�����H�%�h2���f-m����hb6P�qb�8a2Y����1{}����!}U�����n���~fU@\�1^@�-Y����!� �[��L���� 8���-9�������������g�8#��%��[b��g��A��Z����eL�N��1��c�;���cp���;��_����-
���G���(�,%���1����f�k�G$2��Hu�0���;f>�s�
��[
<I���Hz!���"c.t�:�(c���t]N���7W�#UE|c���l�61�d������F������vW}(���r�2���[/�^�������4u;!2����i�Y"��]��es��z��fjW.)����NC8���,�b���Z��w*���63/�� �)5�����j�p�e}5!:B���Y%���	@%���J��f��.,��\se<��	G`��}@i�v���pY�r�g��������D7B4�F ������m.�HC��8�u������3Ay��N����tS����&�x��"S%��L��������UG�p������N���]����A�����p��l|�(2.�{����������������XF��4��M���E�L��pt�:�I���)���hw7�S�hd�pn��P��h���Ig$]��$l������WK}�=u��g������E��i�W]�]?��~���|�u%W���� �	�|�C`>[��Y�k��+������.'8�L8�S�{&v��k&\�t���2a��2�u�e�*�L�J���L�
�����q��%�I&&��$�L�PIW�s�tG��;�O��K�#'���X�����v	��nT�=�`��^�:zH���I������4���t����:�M�I�u^�J��Y4�-;�2�eb����ri
83S4H��3a!��z=���lL�b���V�@8�R-���f�v���6�
3���0p��K{@!A
����S���J���\1�8@9�T�]?�No�>/E��#���^��2� ��M�F�=�3Y/��z1��)���h���9����]�P������Ww6��V���v�:�y���r����x�31%5�,��]�@��d	�qP�����(�^&���#�u�q�6�f���]���	Go��Y`�MG��@j&������<(A�"fW�,c��}�9-��>��-�4iCS5��g��4(�%�A��P��������^O%9:u.3����O�S�	 <S�q}V ��
�lyulj\>�UX��*�#c����1�@�&�ZSfE ��2�d4�b6j=
��	�	|�yW���D��_g���\S��{����)@A8T�2����#���i�l���g�D4��6�O<����9��?�>�f�.��0��t�Le|���he��#A������r�����R7�VN��h�dOn�.g��� ��������|L�x�ti	�����5��pH�Z���l������vWw��=u7Z������TNJh8N�>5S1������c'���e���U�e���E#@*&R����j!I����
P��elT]&2?�b���n11�qd���D����*�{T Q��l���!J#���J^
d�
`�����11$�f��b1@�	E:� ��.�� c�������J�t�A4�;�8��W>"I���������
��2#_\6&,{�S^�a6�����DS��?����-C���F���IT��v�)�W&
������3qWj�b�v"�x1����<������2����E-�z�����_��QJ�\&�����I�l�������&���vw���)�8]W�3�Mj@2�M�����E�����L
��a��8f������+��,S�Y���X���t���1KMjH1���e��;�E{[
����35���4kN���tQ�f�8���\�K��~���.u�M��"�K\�'�������������r=P�>[N]��������.�b|�D�K1�0J�VV�������N����b<���M�Y_�����Lk��;)�-Smi�W�O����Y63����2e��;z�e�eW���K=���A�6B����2���Q�F�_2Q�P/
��t��N��Z�#$�rZ��6m0������t^8�����;�S����2
-�Og�W
���
����%�zxr��i�\K�ejHT���.��&�)�4S��T�,S��)�*S��)�*SC��Ss����u m�'�.��^ |�r�  
F)��F6�2���ny66����i�G�����B��i����%Y���)�1S�S��V���(�I'�[Vq.CL6�C�/� �`��)3�pL��[��Y�78�U$g{��V�L����k�
~3��MM���r�����3�m��H��R�V���M��rL��P�3��G������5i����e�L�0m`���"4�M]^���\4��ES���\L6@
���6$�3������r����������f��C�`��L��O�'�����x��s�O�p~�r��C��62lh�M�����M9>Ts_!��Y�w
����b0'�^Y�&���a]����6��3�"mv�u@h��ZIm��6e,
�����
�w�\TN��F3
-��*Y�������.�&�o^��%1BR�u�\DO��1C���z�:�����?��v�O���h��)S����d��lXE������)�f���o��Y����J�,��;|gj�w���Lm�f����Fy�^��H�����O�U�r���K�����L���������r<�o�i����%O��>�B��v�����_�������� ��Si����tr5�>���>�N��/M��E�i���������)`JS��������>L.)�@Y79��(M9�T-��9C�Z�.�������E�-f������U����#,Xx@��_�$W-�d
*,Q�
���)@AS-�W{ BS� ��6�
^q��������Y)�i�A�j1@LX@�)ukB�{o���^�@X2��bg|��s�H��Y���7�������<�!���)�������D�����gsy���`O���L9�����]�����A��zx�&�S��t<����(he���L���) ?S���2�qc7o:��������W'?}x}�{{�����I���Y����_�0Jp���T�z�m���C�tr�����N��1�����svh�f����2�M��~�.���1+��E�e����]���)GX�����#�e����2�f�t2��>�����Fb`�?��F�����I>@5S������E�e��]]�)�0���A�
�eN�G����j3��M��PW�����)����� ��	�l4�y���lJ����U�E��p�.��+r7��
�?�\M;�+�
MS��J���v�E�:?�\t����y�/�����58p�DM�e����1
�&��*��^�~w|~r�������{r�-��4�����2���f��wv~7*7���]sR
��CP�b��B����'_�lo����~�i�.pWo�i������������4�G���)��L���6�+�gT;���vePz�D��<�-������&b��
������Q+�s���\u|,�������M��H��*��V�Q��s�'s�	�u����i	��m��9, �0J}n'i����e-���U�>H���rQ�r�j�l�1
}n�2���/�Gnc\���S��]��1V���N;C_�<�G��.F��4�{�;�����N^�m�L������
z���d�S��)���N>RR�w��<s����6_��N��q&������a@��R������!���2��2�(@�L�l����������	�����rM��9�������i�m&O��@
�R���Asm���z�:y�W���C�����������L����8+��&�v%F�A��%!U���t���j.h��j�Mb�6.�n��G&��0����6��N�Z3�K�o:s����6V��\0}nU�����+e]t�P+����^�����sRq3�L,J�6U Q,\]��$������k-A�?���t��]+t}�m]�hgr�qF�e�q4]���~Pq�r��/fe�"g\2���f
�6vk������29T��kd`����ST�U@����m���?S/�OD����M
����sy�$/��r�X
�$n����F%���#�(�((�o�<�bf��sKA��z@f~�>_Sg�$1}n7$��a��L��:+���������1����&[��xd����v����r^��3��������������������'m���P�����������oz�N�.~�����Wf�D��1���z�#�`'<y{�{�����#��o�:9��9:}���f��>�l�@6���i2)�d�z����$��k�f��>�|� R^�Y�g����BfR�>���uv��C3�L����fd�>�k}fd�>����@i}Mx�}6�&~-nD�����g&��s��`���2G#�Ra�C�y_���U?������{%Q�7G�9(����z����f�}�C�d�S��&5��p������;\3rL�w��������L
�����s"3L�35����������]�]�|��k�}�
(���@�����Y�wA��l�O%�yZ'cY�B�����A���]M7��.}�.yZ%a��>���A��^�G�a]���G�1]���G��Z���Qg�~��X��hoBkw����u 3fK�w%��h������e"d[�IO�����A8����wt��9����.}n���8.}�]��fW���0��������E������g������^�S_�nz�ex��"�z'�h}�����f��;=��������7���������w��v8�\����mb�#m�f&|��u���l�Yq:��:����3���;W��
��/����3&L��UpA�
^e�@�\,fC��^@^Lbe�y1����������w�b�L���gbT��"���t�Z�4'1L�uh��Q�	�";�W|�q�.�ed�<�#�I�B����1J
�Q
�2�G�B���6c�����A�1T}`����4?�~���1�9�5����f��������;����0}n��DGN�aFw;~�������������w�����;�8y�&bv�\0}n���:��g����l�\	G��p2�;5[+��
��t0�f�?���aXa
�*C�XC+��j0R0)��]6|#�2���8{�{#�g^�Zb[��D��?aK2���'nI@���VK�]U�s+��Y�Fs�
�\�h��|o��d�����Y/��"k
M+D���h<�����\��\��Os����~���f�>��G��J��W��G7�)����������B\���rm�c����������M7\1���H�W�?Yw����v��w�WU��#3�L�[5������`�E`C����W������\���Tv
!r3W����{.���C�c��{f�tm�/��G)�����]f��I�&��v(�RE.�9��-�j�P�b^�heg��MX��i�Ef�t
�x�Eb�?+�������
1��k�ke����l�b��0�������1K�k�2{]���T��osf�s-�f0��)4������h�;��4�����������Mctnp�����J}��d�����������[?�<l>��|�������������{�%�:��������.@�]�>���2��F%���7r��h0@S�(��U�w*/9���l\.�����4vM~��9��k���7�1(��^����\EWnD����8�R�K���]�
��Rb���t���*����~G��5U&��oPT�����0�=p��
,�\��+M��	��a���-mD����+W-�P��l2|?�4>,������<���:PK��/p4� �������Y�9*�n4<��Hk�;r�y#�x�����u1�+��=+P-p�X�@�8�X+(��Q����7\����r@�V�-��j>��-��b������Z5��<g=0��mH�.s�x��F��`�/�_���[O��U��
���g��t�������������elR���2�4��Ev
���>T�������Z���|��.��]����W'��,n���H������������&�c����m e�=�,���l�������o���t�o�O��	�i���^M�o�Go
 �.B�����]����CU-@>����jO[V7��@,m�+���i%���1����@!��������l|�@��-U{��/��
��,_��F�"���0�-��K�m�B���"(��).�
7r�M!�i
�W�6�}�����h1��86�
7���MD�M��jg��g������V��>.P�p��n���w
������q��r��"g�S������&Hu�#��r��E)_cOo�(����w�����=���3P�Y��q�V�)��.
v���T{}���2
r��^���<�;/_�!��e��vF�����w94�a��\�j(�(�zU����w7����\���������%��m�@aL��U2�d�kI����w�X+����q\��E��}�������y5C�%���z�������~���N}w��.�Z+��x}N���e
�U!�H��#���w�/��A1,?��'�_:�%������8w�aqC��������|��_:�L�D��@Tz��Y8���
�~[k[��q6��n��|>x�l�`��������Yu7���F�7���?.^>H��j�����}����$��)����x}�f�]�|vS����l?�.��/r��O��`0������iv3{�}���s�L?��������T�q�}7�ZmY�U�����=U�Mq;�����^��oJ����r��ATd�A}k�O��>��{�+���P�����w�����'������s?��W����8��f����g9�=��������c"A��.��v����������������W����C�����#j�A��}F}��3��z�M��d})���f�!/�������&����Y��$~'Bw�]�gp���B'>\��B8IR5G�)/���k8����)/���}�_���\��,��!��1n�_>��{1�dY�L2��u�<V�^KhI��3)J���m)R�-���gu=�����)qs��?���,���]�Y�~Y��sWU��7��;n *�z$��l~Zdb����K����p�Q��%��f��	�ox4f�J�{�e����i�/a��/��X���I��lv/�]�.��vG�v��/�����&
�����ze�����x*B���.���6�OvA��'��g��4*������h*���#j!�~bhs��B��L��jS�^u�-S��-|�����N&����w��m�G�����D�'�~�|�XU.��l�*������=>��^�u�<�������E��o~=�N���������ZH�u���?k��������dK^�M�<���B����T� t����v/��J��i2U����T�lM?��ny�y�Qp��4�$�����n�j�ny��l�������!��J�����3��^~�\84�u!�zUnP������N3��_xt�D��.��c�l<���Tb��B� �'������������{1N�&v~S[�=t�������|�* {���g[�*�Gf�R�.g��|��$����kfLk�-3��k:�0_�� H?��t��e�9�w��}kZ�$]��#[K4�n���w���
��k�|���
�����^���AJi��G����v��}z�Z�=*N��/.��^��5+8u?�*$�����
�$�rH��2���7��r����0�72N�5GE���Y��gh�tv����d4[���$_m2��!_������<���\i=S���7����y�s�9�C>
{����N� }�5-N?�RDt8DO�����y���{Q�i��/]{h�����;�~������R��5u8��G����%������Iyus7�������v)%����O�B"i�mR&�w}7���5��
�n�i����\���Y��{���A���v�o h��7�����\�����\��[!�R*��)�M���r%��?�@�?"?��������������[�C��-���?��������-��.��-���CK<&�)��K���a��C��u��O~|�e����'������Z�D0��(<����&<�
�'I�<�M�-��i�}Z|��B:���}J ���
�����9�Og��:g���rUp����,�Q]���l>-��fk#����{�*�G��������Q��o������>�6h�r�rV���W?$� ���m��-���a�$�Q�n���~�>*�f�W�Q�������:���������}�w��]{��~��l��S\�����A����'(��rx#g����������S\����&��s�����v�PcpX�B�vXU8c����
��t#Z���:��+N^���4���>���i9�[�������T�1���������tv?�����;����q`�9C�q<������������r�}e�1
����B��C:�������:y{q�����}l�j��,..^s�j�gF�L�����a�0����:���������Aj�e��]����d0^��\���^S��)���tk.7�R
��\RV���`���A~���b�z3��U��h�}�>f�=*�����!��\2V������o.��r���r����_����~��
tz.��r���^v�����������M�h��E;�7L]����{1��y/p+�����J)��{LW7����r���}= �������+��9ME�����9��(����������~������������k��^9�]���*w���J�nN4*J�zhe�����b>���:����R�m��G��#���Q�;���_�]��O���uGW��9��p�R.����R��G��h�|���Nscx>*2r���yAihc��.�=�,��<�9��F�L�4�!��P�����
+@r�%fC������Wf��M��{�2�4�V��9��(��J�b:����d���{Y��]��������
�����������I@l�l�J)@l|Fl�mw��h�����$��w���3T�1�Bn�8�j=wZ3���������$��|�R���R�6Y���'z��'?��u����"�~�(De�
�U�Zs
Oq'v������� ��SRJ���/���J�#Q��</����CZ�R�����l~��[��~�t'>2'��1j�w���m�g��I���2��}�������\zo���V��2���X%87'��23�:W�_����l<��f�\�8�eb_��F�"
�/]�L�MlH"h�x)l�L�dcKyV������RXc�����k[�+�z���RXa��������*�e�TJ
2��*��D��~��L�MNA$B�<�W*��|z+5�S��	#���E>�+����}�����Q�d��1<<l���5�n;I�����$����n�6'|t.��"��|��n1���F���9M�(���5I��N������}���u�7#N��9�Q�}zee�
�|�2{�*�.�>�D|�oc��L4�����wg{����:����B4��j�E��{\
��,�����C�l�����,���V�aB1g��n����6�eiO��aNW)�a���sO�RG��r�%�TG s�IQ�9QM���ysza3S0�����v0��1S}�1g�%�"��8�k�P�,���^g�3g������RE��
�EU�	�vl7=1����NO�:��B�@R�V.�
�`�Z����V]5]5��MV�gXewbD.�>�ZcH@�O�.�/�z�
�3�u����vB����=WQ@!F!�R�:$�<�uE1�Z�ub�D��Oc����L-���X��&#�Y�������Z���p�#��RwX��K$@Z+/ ��Zz)�����yh��'�)����HAOO�}�����D��'��+����Q4��h�B��h
�#e�e�NA�O
I��sQ9�,&����s1��{��&��S�^V.]]��r� G	v�����Z9)����s�z$������pl�_�{h�������.&�l0����������]�������c��,0�s+���,���WO�C�����}��~�y�[@q��$>�]����Qf�?Ne��*!����tiJ���n��,f�s��l�!�������H|���������f�*���H��.���H}�v�����
��������yNa=�w+�r�${5���}sy��.GO�]-tx����5L��y �.�T�����Zz!������kqzpJ��)�b�p�$]����Hy�:7���o�����N��N�1�;7��,�H���������v$��|�9�����0�����������2'�u�
R������;r������*t�t9d���*v�t=K�b����<�;�K�9�B
w:��.]����l�����&��������
@G84S-fv7x wY5��x��w�Y��o�����a�W������)s~�%�*U{�(pgeO|����0-���</�~�p7q�\���q���������/�_�9���^�y����Q����1y���m����~^�~
X!�B D�z��rW�6���O�10�.��VF�G�dw#!
$I�K�{�����sU3�r�hyW���-��.�C��P9�me�����t}[q��3�zPoC������W�L��m���-���]�jg>��D�]���m�7��S7AXC@9����8����k$J����q��CJ�b�fqD(]��.�>���(38��e���.�d��E���>�
�v|�������) (�����{.�S][<�������hT�  �������zzX��h����|���Y8�d�X���v0KKQuC�^��nh���uC�#�6����qg����.`=]���X`����YOA���P��R{w�}��r{�V;~�\��q����E]�u�"���n�vk����#�
�
��c]����?�|(�X��^��h��?��E�S�y�3u#/�]�����MV7�:�����k�r,*�>U��3�b�@������"w���KQ��tc�S�94����_��m���%\�
���^��F6����(N������57���Fk��f��p�%u#[����q�Ou9>U-���6y�;��������S�O���?Oe�T����������3��>�`v��U�0roU��k:��r���K6�Y��nlz@�.�j�( S]�L]nt�_���Yk�`�nl{d�L��1��(S][0��ut�w�5�m{v������.n��}s &��t�����M�E����&�}�5AR�+�]]�vU����l]=�c-{5�j�����sT�M�S"�0 	�y�g��Vu���-����#S+���c�������T���Z��]5���X�����j�{���u9&V-Bj�����,p�j��~����H��?��_���3
�
���M� &��b���b�;���E�4�n@1R������d���e�d���nj�^���;4���7WO_���u�n*��O�rA�y����X����a���`]�C+����Z�A{l@M=[���P���<��z�V~��Q��F�u�,�x�$����M��<��z�k�J����c��������=6�������`w:�=6������c�=6���k�`����PV��t�cR�`��J����������^3-kx<�8�����^����1T) ��g�}�s�r��_�x���@U�U������<m��#\-(��\O����p����j1��z�C~}t�h�K=�/U�]�D���O=�z����)��=SP�G�*�������t���u�j��N��M��z�"���j1@L@q����g�~.�vM=�~z\(Q� (&\��z�����MXa��Oz�����E�vM=��z��<������I=� ��B=m6��wM=��z��"G]���k����������I�(��b>���N����S-�������hO=<P.j�rK����e�Z�k���8LS������v��f�t>��W�y\~�{c���"����S-������� Pu*P/� ���,�$�A%����j1�#��C(�������Z��su��3�:>GwVuU����KG� �UVs�����)|z����4=;@�������K/�3����j��[3���Mt^��,�Z-tV4�F�\� �t������w6����������bg�E �^�A?nT�:��=z��z+q����)��O�/��
"[_�-g���5\{�*z��:�V�8XQ-�pU����V�����3�� =�P�h���LG�p�X�%6��n$�Lg6��D��������%��j��C���W�s&�]G�����[��=+qY�tA��l�\�l�g�&z�M����&;�+�[�m{�]��<+C|�4P���=��������NEr(L}Q����Fn{����(���>���1�����5%������=f�b$��������k�Qw��:�cV�����R\5���"w������=���_=���+^�Y�
���������|B F��b�$K�V��7�v����4 KV!C��1S.@Nz9���c5����3VS���S�n7 ��S�J��(ENG�q.=�eKU��������*>�}.:f+��
Q��5��XC�P��A��P����sew��|�"���w�s_�M,��VWW9������<�-��hX�(C��2�e��P��P�o����?}+�p����U������b^*�N���[�R�����@����[&�GX���K9�2wa���&�P����a�R�B������b�<X��,+C�	hC�.��xB�*4f�v����PI)����������[��T��\�����v *��O� ��D����K>�}6ze�~��������>}�$P�>G	��7>`}����O��;�o�a��C�n�~���g�3�j�f:VS���t|@"����GE�=+H�~
t�5���rbF��r���g�4���:���%c2��?f���r��R��	�{�N~8������?�����1feymb�^��6���de��1�o_������a��F0�>`a����o�/6�+|>�}������!+�Y����s��Z..Qy�J�\�����!�:��
�hE�d��B�.T��,�amy$������Du t�oKR��U���a7�V���YG�"����V���UbU�C�uU@�>�o��� �>�}����!|��h>�/�=~^ G�ej�v��k�+�������7��T�@���R������"^]�
7|`�v����=����N�.�{mZ����L��2�b�Dq��6�������7�I.�)�l���I}��������v�L�d���#/��"���\Z
�����~�[��\� p%su���&��wbB/m���_h�`��e�v���)c;�X@�����0l,���PK���Y��Ycc��]�������@3}�T�rYl�P����|tu_�*�5:�]X�f����L�.e�e��?�r��;�F��������O��.��Ge��I���tNO�W�}�{��M�/����P�CC�b��p���f>������y������%ig������N�������]W!�������{>�l��3��f��fko[��M���E���.c��S�62f�9�E���h��l@��	*���H��K�yQ+��������\������(���W�C�NN|2����4�n�XHQ�@�>�Xe1>��7��T��6><���C}��������{����P�j1@�b\�]d�7�P����*@9}���>�.}.0��j�J�
JY���:y}b�7�K��9�I0}������oL2���3��0g��7`N��9��ko����p����r�8��IYW9]�n�:i`��^]��_R4�H���_�4�pS��v�-��R�f1�T8���C
����@�vA1�U��>|8��.&f������	�p��b�pq1,�W��>|���.�{G��p>$j���j1@8�B\���q�������)���j�m�5�W���pX��.%|�y������nW��
�2��
\K�Y5����b���M��(
~p��ZP#��U�k6�f�QZ�8������S==6`a����e����������l
� �������]|�H�]15��{z�Z�e�7��\NuV�����M�����4���=����G@�Zmr�
����a���&g���.�e��&�%�����l������B;*�i��&'���X>t}���l,�vY��.�e`� �\]��
7���\�t� ;v�h[��n����?�����L����K�9X��X��^��8������UH��_���-.8���82��#e��Z�-������Uw;8d��"�/}�X�%���F�t(��3��)�n:/����h��������
�Z��d/��=�����^�gy��b
��\]f�a���	 ���������� �0;�.S^q&�b@��.`e�� ��P�X`�e��Np8�Z�M�H��"����B0���
��5^���0�:��T��`�mv4��ST��>	�;l3�f��l2�39�)��UC%��#�f`�U��+5���ilx�#��������bR��,��������@��
�|�pvr~qtv�;?��s��Yg��:	�n�3�N� $v��C�]�f���P����� ���?p�(Ql�9����6�[�@�l����2������j1@�8�R�{�2��~������ay��KVj�����*_��B[,Te`GU:���c��k���^px%{����s�]�$��Gf��_�x��I`+.�Zp�Hj��_Z��\6�d�+����B���f����)�8J��2�Bn*� 2�
������f�W�O���!p�$]���a&H�7�2�����;-�[]s.[s��[�l�PK�0���V����s��v�� ��V�5������!��p0�Z���Vg��A^�H���A��Z��q6�4_����E���6T�Wv��C���k�Us��I?\4�B��S-�,P!Gr����'�L�U?��:�e���0���L�l�d�2dCy�*V2�XI}�>�dxh{�*�\���.����C@U��v�<B�E�\�N��9CT�����y���0��Ul��k������>�8}���
r9��b@G6��|�u����m��U�n��1�E���Z�
������y��M���k9�k}cC@7�ch�3Xb�E���Q���u!������*�t��zk���t'���L����OL�
�������6�2��x��_�#%�����YN}�f���VY����.�y��m�ni����Z�w41�.��9s~�9[�@�8pS-H�Y�2qu!Z��B��#r����bi��W��|Y�\�X3���^���f3��M����M��lf����2�7���K5�_�9��N"����EmU�9S�Q]����=����k���b����r�;�@0C�\�����U���hLf���G��W;����@��]�����]�����t�vZd���! .C.|�Z�������m��������6��^��|V90u�B�s��.>ea��6�x�[+�gR���! :C��l��R���~��X��tU���-+&4��� 4d����'\��G�\$�����-/��>�;�~
i����h���j1@eB��J�1�����@i8T]�tg���j1�s���sH3d!�z����U����,@7��E���@d�]�:7���ZL�����x���w�����s����=��l�K�O�2�p���e�\�����.�����g6k���$�*�c�D�e�\$L�O`eh+;�.2������!F1|/�}!��N�l�B�24��\*�Q���n_�lVrq�0���f�=����t4��X��3o������C^�\Br���A>��XEeh�Q.+v�Ut�T��2�k���2��i$24�5gjTt+��Y>�9���������������������P�l����vg%2������l��/�ZS�s:���s <�r���01�Ly����$�l}��O6���5 f�T���J�$&�����O.�/�>_���C�o�v�f���M��l��(��6��&��^n~��
9T-��)����#c�.�CCk8tYch��l�r
+~\~�V(�<���]����B;;�Q��mcU���]��p��]����e�L�M-��NX��*�����j= NC.�ZP��f��
������r�(������!]S��
9 �c����*����kLx{�����9�c�f�i�E�T'[
�!���	�7_��gh�,=|f��tk�����J
�����k�����Z������o�N
��0]���P��S-��*�q���3b#r�Kr�*�ZkA�l�{BD�/o�y�dWW�������}�/��4�G�Z$�0��S��B#
���:$8�gd_�w� �L>[
QW�Z�l�7�Af;�Y~",��Z�Y["S�2U����{��Z�Y"�K�r��|��Y��7�_���-Q�e6��v[s��L��������xl}[��FO*oe8��4�"L�
�.:��F��&����_��J#;�4@i��L!ZB����Qs��� ��]F�0����)-���*���a�^��xE����R��FV�.#�,]5{�R0�/�������|B���|8��fb���8��^g�l�/2f� �id�:N�x��I1�p)J"�FV�4e�5c�.��������H��Z�+���MW�7nT�j�����)��h�j1@r�X�N��~O^�D}��]6�.�o�[1��,���}q����@#������58�M���u�r'���-���g�L����ru���x������SW�N#.��Z�+�������Xo��S[�o�_V�P���U�t@����G)����U�6�E#-��)@E#+T��F2�F�@��H���#K��������42��K�]��h72bT��[� ��xi�#/���qhj}5�i-��������4��������tb�G@W#[tum�����5��Q5��*������� O#S�x�T�jd�<>�(�B�Mp8���^x�����3o
�����eh"B�Px���hdG�F������Iy����z��y�6c��������wz�;������������N��c�Z���k�@�L9�;[+�
6�h��VW������|�y�\�����St�!���;�����J��s�9������	oz��U�aVd�B1
 �Qh��4�Z,�R����)�y���g�����E^8}����U9i��vx��*ri�M,�yG������L�A����h|��\�3P)���/(�(���u�����������x��@��F��N�f@@�Flb��R@����-��cJ�6�%�b#5�R�5�$E���B/G�����sp:0��>�H&��1���#S����"g����/I��R9��r�[�b�������g������kWt����rS���<���i�~Rm���s'_�����2J�����
1�F&���H�n��FVV;�V��
���h��e��z�����s��"+�%�� j{D�7�h
�Z_4Z���.g��m~H��#��WsH���-�\q�R �#;X}tY�����=:�/N;�``Z��	im���U�<������^��xF&���v�,g4�8?��';�x���_������rV��c�
����������v����
��,8rd�#��W��U�����Us�^i��U����@J�"�vZz5���U�KR���R=N�GF\_zZ� ��	)�K��f8�b�C�Mbo}m�y�]�8J���.��+��.���4*S����(��SSm���p~���J
��G ��k;����w�-�5���ml���!�@3Q<\��(9���#%GVP���i`���lrd���UP!�����:��r��j1@�8<Y�|�8���
�Pb�Q�e�iw��U�;�r��������G�bY����8I���h��?Q~���|9J�UG;�8��.#@��"�U��Y�p�cS�C����|��L$u�L�V!:6��Y�����.��c3<pl����9�����Y�	���M�l�b0/n'������l.�Tg1Y���Y��z�.�m��V��a[�����i[��>�"��L*��
���"��9>\Kk4�I��p���4W��>�m4kM|h3Ebl�������w���Q1�V-��H1�&W-w�e9Ik�n���*����Y��h
0�����m��Z�Y�b�LW3��/����6drl"��2�M��O���� N����7��R������z!�AL��rlG.��\���e��]�
��[����P1Q����%uZt���	-���z��@����8;d?�y������v`��������t!_��z�E�U/���iqC;���y���S�1`�c�n���\u���K�����3@"����:S���f�rxs�qG�.��y6�F�tc'
�r����c�V���u��`kQ����U;�;49t=n��\O�:#���b^������1��c.�mG���`:]$����6;[���C�}
S��l2����Ul\�(�n��}5������V����:���-k���`�������g���Vr��%k:SW������7L��������X����d��:)��,��t��|�z[������u��B���_�������V@j�V�6m���h0�v�[�1����9�Z��� �c�@�\en8h;lw�;�z�`���Xy-Uh����[����������.��Km�����q�v/�bl4x�h�F��56j<lw�A5 �yl��[��&���v!��!c������
g�il�zl�����b\���cC�c����c����)N�'���x�<���<��v����r�UM��E�j�#8��jy@M��V�4+�|����^<����;����;������Vg�n��/[J��s8�x��MVU�17����Hl��ve\qS�f�@�1^Y]Pt�A�t�+1 �c)m�T�1�FH^%�Y3�H�8Z;������?�3�����l�!��k&Er���6�Hf�!�����(�O���O^U���UbB�v!�c;� d��W��53(:�
�-'�9:�����P��e�	|�T��
��9����l���Wx��6|r�7�?�%�qTGu�F�3�J�h�%�ql�8�qB_e�+x)�������O�eU���������2�P%����$��q���$�"Wf�j�{?�'�c@.�V����O��e7�m[��t��u8{9l7����0���������&�;�������{Gg��O/��_���8��_�6}��c��Vq\�=�aG8rl#��J@&�&2Y(�[���:P>F����qb�m��I����mr�
s(�R���R�tg���Wb�I��m�u�@U�U�d���\_���C�j4��������"�U���y.��
`��4���u$]�;��~N5�N{���?����Z
V�����m�����f��'�����</^O�N'r1���+�C�
���	����!��s�N@�l�J���d[��?^6��N�{���� w*�4�6�s�{�N��ge�; xb�'OL 8]�;���L��&��N��n���<4{L����j;�G����(��FwVa�x@��J>�XE��j���dG���6XDh�����b�z��EkUx;v�Bs�O�=s��kN��V�YD��L�a'����
�v�S:��)�z�`*���k����	����T:��<�[6	 ��tv�\�;6����{6	��6������p�3�g��9�����6\��i�ke��v��H'"���>'����
����	mF���0'&�yu�dw����A/5EUnd��-X/�^���N3=os�G�T�V�[?	�����C����\]=���@'V��[6��bE{ursvY������ZD
�5��,\r�XV�q�����\�������G�	'������7A���e,UK���VH���9o�X��
W��s�?�&��|���z����Yh����U��U���2��y��k�@Il�&{��j�%�?������_�M���ETN�������P������zx(����+�m���j1@[�$j1�8lX:+Z��VV���3�%��M��+'tM�Y��]O�l�2�'�L�:�����V��a��B ,����� �	��*�51����T�17�����U-)MBK������V��N�6����T�_��Yd�6F�nA��B4$�iMlc �wgVa��p��Z�w�Va����j#��l�2��e�3�|��Q ��7������)�N��X�Oy�.�6�0X]�����:��3\M��S����4�*TKS�INK���
��	G���@�&��,9rF�������w|tq$����7'����O�'Z`���w��]&:���T��ML�����	`^���'������{jn�6S�����q}�2�O�e�
#94Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#92)

On Mon, Nov 9, 2020 at 8:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 9, 2020 at 1:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 9, 2020 at 3:23 PM Peter Smith <smithpb2250@gmail.com> wrote:

I've looked at the patches and done some tests. Here are the comments and
questions that came up during testing and review.

+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+             xl_xact_parsed_prepare *parsed)
+{
+   XLogRecPtr  origin_lsn = parsed->origin_lsn;
+   TimestampTz commit_time = parsed->origin_timestamp;
static void
DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-           xl_xact_parsed_abort *parsed, TransactionId xid)
+           xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared)
{
int         i;
+   XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+   TimestampTz commit_time = 0;
+   XLogRecPtr  origin_id = XLogRecGetOrigin(buf->record);
-   for (i = 0; i < parsed->nsubxacts; i++)
+   if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
{
-       ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-                          buf->record->EndRecPtr);
+       origin_lsn = parsed->origin_lsn;
+       commit_time = parsed->origin_timestamp;
}

In the above two changes, parsed->origin_timestamp is used as
commit_time. But in DecodeCommit() we use parsed->xact_time instead.
Therefore, if a transaction didn't have replorigin_session_origin, the
timestamp in the logical decoding output generated by test_decoding with
the 'include-timestamp' option is invalid. Is it intentional?

I think all three of DecodePrepare/DecodeAbort/DecodeCommit should
handle this the same way, with one exception: at DecodePrepare time
we can't rely on XACT_XINFO_HAS_ORIGIN, so instead we need to check
whether origin_timestamp is non-zero and, if it is, overwrite
commit_time with it. Does that make sense to you?

Yeah, that makes sense to me.

'git show --check' of v18-0002 reports some warnings.

I have also noticed this. Actually, I have already started making some
changes to these patches apart from what you have reported so I'll
take care of these things as well.

Ok.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#95Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#93)
1 attachment(s)

FYI - I have cross-checked all the v18 patch code against the v18 code
coverage [1] resulting from running the tests.

The purpose of this study was to identify where there may be any gaps
in the testing of this patch, e.g., is there some v18 code not
currently getting executed by the tests?

I found almost all of the normal (not error) code paths are getting executed.

For details, please see the attached study results. (MS Excel file)

===

[1]: /messages/by-id/CAHut+Pu4BpUr0GfCLqJjXc=DcaKSvjDarSN89-4W2nxBeae9hQ@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v18-patch-test-coverage-20201110.xlsx (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
#96Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#95)

I was doing some testing and found two issues. The first seems to be
behaviour that might be acceptable; the second, not so much.
I was using test_decoding; I'm not sure how this behaves with the
pgoutput plugin.

Test 1:
A transaction that is rolled back immediately after the prepare.

SET synchronous_commit = on;
SELECT 'init' FROM
pg_create_logical_replication_slot('regression_slot',
'test_decoding');
CREATE TABLE stream_test(data text);
-- consume DDL
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
NULL, 'include-xids', '0', 'skip-empty-xacts', '1');

BEGIN;
INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
generate_series(1, 20) g(i);
PREPARE TRANSACTION 'test1';
ROLLBACK PREPARED 'test1';
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1', 'stream-changes', '1');
==================

Here, the transaction itself is not decoded at all, because it was
rolled back before it could be decoded, yet the ROLLBACK PREPARED is
decoded.
As a result, the standby could get a spurious ROLLBACK PREPARED. The
current code in worker.c handles this silently, so this might not be
an issue.

Test 2:
A transaction that is partially streamed and then prepared.
BEGIN;
INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
generate_series(1,800) g(i);
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1', 'stream-changes', '1');
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1', 'stream-changes', '1');
PREPARE TRANSACTION 'test1';
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1', 'stream-changes', '1');
ROLLBACK PREPARED 'test1';
==========================

Here, the transaction is streamed twice: first when it crosses the
memory threshold (usually only in the 2nd pg_logical_slot_get_changes
call), and then again after the prepare.
This cannot be right, as it would result in duplication of data on the
standby.
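
To make the expected behaviour concrete, here is a toy model of the
invariant (not the fix, and not ReorderBuffer code): once some changes
have been streamed, a later send at PREPARE time should cover only the
changes that have not been sent yet. All names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of one transaction being decoded. */
typedef struct SketchStreamTxn
{
    size_t nchanges;  /* changes decoded so far */
    size_t nstreamed; /* changes already sent downstream */
} SketchStreamTxn;

/*
 * Send every change that has not been sent yet and return how many were
 * sent. Whether the call is triggered by the memory limit or by PREPARE,
 * no change may be sent twice.
 */
static size_t
sketch_send_pending(SketchStreamTxn *txn)
{
    size_t pending;

    assert(txn != NULL && txn->nstreamed <= txn->nchanges);
    pending = txn->nchanges - txn->nstreamed;
    txn->nstreamed = txn->nchanges;
    return pending;
}
```

In Test 2, the second send behaves as if nstreamed had been reset to
zero, which is what produces the duplicate rows.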

I will be debugging the second issue and try to arrive at a fix.

regards,
Ajin Cherian
Fujitsu Australia.

On Tue, Nov 10, 2020 at 4:47 PM Peter Smith <smithpb2250@gmail.com> wrote:

FYI - I have cross-checked all the v18 patch code against the v18 code
coverage [1] resulting from running the tests.

The purpose of this study was to identify where there may be any gaps
in the testing of this patch, e.g., is there some v18 code not
currently getting executed by the tests?

I found almost all of the normal (not error) code paths are getting executed.

For details, please see the attached study results. (MS Excel file)

===

[1] /messages/by-id/CAHut+Pu4BpUr0GfCLqJjXc=DcaKSvjDarSN89-4W2nxBeae9hQ@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

#97Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#91)
6 attachment(s)

On Mon, Nov 9, 2020 at 1:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've looked at the patches and done some tests. Here are the comments and
questions that came up during testing and review.

+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+             xl_xact_parsed_prepare *parsed)
+{
+   XLogRecPtr  origin_lsn = parsed->origin_lsn;
+   TimestampTz commit_time = parsed->origin_timestamp;
static void
DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-           xl_xact_parsed_abort *parsed, TransactionId xid)
+           xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared)
{
int         i;
+   XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
+   TimestampTz commit_time = 0;
+   XLogRecPtr  origin_id = XLogRecGetOrigin(buf->record);
-   for (i = 0; i < parsed->nsubxacts; i++)
+   if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
{
-       ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-                          buf->record->EndRecPtr);
+       origin_lsn = parsed->origin_lsn;
+       commit_time = parsed->origin_timestamp;
}

In the above two changes, parsed->origin_timestamp is used as
commit_time. But in DecodeCommit() we use parsed->xact_time instead.
Therefore, if a transaction didn't have replorigin_session_origin, the
timestamp in the logical decoding output generated by test_decoding with
the 'include-timestamp' option is invalid. Is it intentional?

Changed as discussed.
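
To make the timestamp concern above concrete, here is a minimal standalone sketch (not PostgreSQL code). The type sketch_parsed_record, the flag SKETCH_HAS_ORIGIN, and both helper functions are hypothetical stand-ins for the parsed WAL record. The only assumption is that the origin fields stay zero when the record carries no origin information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the PostgreSQL definitions. */
typedef int64_t sketch_timestamp;

#define SKETCH_HAS_ORIGIN 0x01u

typedef struct
{
	uint32_t	xinfo;					/* flag bits describing the record */
	sketch_timestamp xact_time;			/* local time of the transaction event */
	sketch_timestamp origin_timestamp;	/* only meaningful with an origin */
} sketch_parsed_record;

/* The questioned pattern: report only the origin timestamp. */
sketch_timestamp
sketch_time_origin_only(const sketch_parsed_record *parsed)
{
	assert(parsed != NULL);
	return parsed->origin_timestamp;
}

/* Start from the local time and override it only when an origin is present. */
sketch_timestamp
sketch_time_with_fallback(const sketch_parsed_record *parsed)
{
	sketch_timestamp t;

	assert(parsed != NULL);
	t = parsed->xact_time;
	if (parsed->xinfo & SKETCH_HAS_ORIGIN)
		t = parsed->origin_timestamp;
	return t;
}
```

For a record without an origin, the first helper returns zero (an invalid timestamp in this sketch), while the second returns the local transaction time.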

---
+   if (is_commit)
+       txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+   else
+       txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+   if (rbtxn_commit_prepared(txn))
+       rb->commit_prepared(rb, txn, commit_lsn);
+   else if (rbtxn_rollback_prepared(txn))
+       rb->rollback_prepared(rb, txn, commit_lsn);

RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED are used only here,
and it seems to me that they are not strictly necessary.

These are used in v18-0005-Support-2PC-txn-pgoutput. So, I don't think
we can directly remove them.
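
As a rough illustration of why such flags can still be useful once set, here is a standalone sketch (not the patch code). The flag values, the sketch_txn struct, and the function names are all hypothetical. The point is only that a flag recorded on the transaction remains visible to any later consumer, which is presumably what the pgoutput patch relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; the real ones live in reorderbuffer.h. */
#define SKETCH_COMMIT_PREPARED   0x0001u
#define SKETCH_ROLLBACK_PREPARED 0x0002u

typedef struct
{
	uint32_t	txn_flags;
} sketch_txn;

typedef enum
{
	SKETCH_NOT_FINISHED,
	SKETCH_DID_COMMIT_PREPARED,
	SKETCH_DID_ROLLBACK_PREPARED
} sketch_outcome;

/* Record how the prepared transaction was finished. */
void
sketch_finish_prepared(sketch_txn *txn, bool is_commit)
{
	assert(txn != NULL);
	if (is_commit)
		txn->txn_flags |= SKETCH_COMMIT_PREPARED;
	else
		txn->txn_flags |= SKETCH_ROLLBACK_PREPARED;
}

/* A later consumer can recover the outcome from the flags alone. */
sketch_outcome
sketch_outcome_of(const sketch_txn *txn)
{
	assert(txn != NULL);
	if (txn->txn_flags & SKETCH_COMMIT_PREPARED)
		return SKETCH_DID_COMMIT_PREPARED;
	if (txn->txn_flags & SKETCH_ROLLBACK_PREPARED)
		return SKETCH_DID_ROLLBACK_PREPARED;
	return SKETCH_NOT_FINISHED;
}
```

If the flags were only ever read in the function that sets them, a local is_commit check would suffice. Keeping them on the transaction matters only when some other code inspects them later.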

---
+               /*
+                * If this is COMMIT_PREPARED and the output plugin supports
+                * two-phase commits then set the prepared flag to true.
+                */
+               prepared = ((info == XLOG_XACT_COMMIT_PREPARED) && ctx->twophase) ? true : false;

We can write instead:

prepared = ((info == XLOG_XACT_COMMIT_PREPARED) && ctx->twophase);

+               /*
+                * If this is ABORT_PREPARED and the output plugin supports
+                * two-phase commits then set the prepared flag to true.
+                */
+               prepared = ((info == XLOG_XACT_ABORT_PREPARED) && ctx->twophase) ? true : false;

The same is true here.
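
The simplification suggested above works because the && operator in C already yields 0 or 1, so the ternary adds nothing. A small standalone sketch follows. The constant SKETCH_COMMIT_PREPARED_INFO and both function names are hypothetical and only stand in for the record type check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record-type code, used only for this illustration. */
#define SKETCH_COMMIT_PREPARED_INFO 0x30

/* Verbose form, as in the reviewed patch. */
bool
sketch_prepared_verbose(uint8_t info, bool twophase)
{
	return ((info == SKETCH_COMMIT_PREPARED_INFO) && twophase) ? true : false;
}

/* Concise form: the logical AND already evaluates to 0 or 1. */
bool
sketch_prepared_concise(uint8_t info, bool twophase)
{
	return (info == SKETCH_COMMIT_PREPARED_INFO) && twophase;
}
```

Both functions return the same value for every combination of inputs.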

I have changed this code so that we can determine if the transaction
is already decoded at prepare time before calling
DecodeCommit/DecodeAbort, so these checks are gone now and I think
that makes the code look a bit cleaner.

Apart from this, I have changed v19-0001-Support-2PC-txn-base such
that it displays xid and gid consistently in all APIs. In
v19-0002-Support-2PC-txn-backend, apart from addressing the above
comments, I have rearranged the code in DecodeCommit/Abort/Prepare so
that it does only the required things (for example, DecodeCommit was
still processing subtxns even when it only had to perform
FinishPrepared; the stats were also not updated properly, which I have
fixed) and added/edited the comments. Apart from 0001 and 0002, I have
not changed anything in the remaining patches.

--
With Regards,
Amit Kapila.

Attachments:

v19-0001-Support-2PC-txn-base.patch (application/octet-stream)
From ae6a00ba9c0fefcb71ef5459e50d8e7b4ad4d67c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 5 Nov 2020 04:08:22 -0500
Subject: [PATCH v19 1/2] Support 2PC txn base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

Include logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 189 +++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 146 ++++++++++++-
 src/backend/replication/logical/logical.c | 242 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++
 src/include/replication/reorderbuffer.h   |  35 ++++
 src/tools/pgindent/typedefs.list          |  11 +
 7 files changed, 667 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 8e33614f14..13f8a18cba 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -73,6 +78,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -88,6 +96,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -112,10 +132,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +162,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +254,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+
+				errno = 0;
+				data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +294,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +354,92 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate a
+ * simple logic by checking the GID. If the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +598,26 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/*
+	 * if check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -645,6 +808,32 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 	OutputPluginWrite(ctx, true);
 }
 
+static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037fac..f5b617d37f 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     them are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,56 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a transaction commit prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a transaction rollback prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +657,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but yet
+      uncommitted) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +740,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decode
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +794,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, in case of decoding a prepared (but yet uncommitted)
+      transaction or decoding of an uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +850,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1041,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, commit prepared using the
+    <function>commit_prepared_cb</function> callback or aborted using the
+    <function>rollback_prepared_cb</function>.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d5cfbeaa4a..e9107cdf13 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -226,6 +240,21 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
+	/*
+	 * To support two-phase logical decoding, we require
+	 * prepare/commit-prepare/abort-prepare callbacks. The filter-prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
 	/*
 	 * streaming callbacks
 	 *
@@ -237,6 +266,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -782,6 +812,129 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the rollback_prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
@@ -859,6 +1012,52 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1056,6 +1255,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7ee02..7f4384b62c 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -84,6 +84,11 @@ typedef struct LogicalDecodingContext
 	 */
 	bool		streaming;
 
+	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796450..032e35a2e1 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or deferred until COMMIT
+ * PREPARED and sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -123,6 +156,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index dfdda938b2..66c89d16bd 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -244,6 +245,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -405,6 +409,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -431,6 +455,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -497,6 +527,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -505,6 +539,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f2ba92be53..1086e51869 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1314,9 +1314,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
2.28.0.windows.1
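
For readers following along: the return convention of filter_prepare_cb_wrapper() in the patch above is easy to get backwards, so here is a minimal standalone sketch of it. It does not use the real PostgreSQL headers. MockDecodingContext, MockFilterPrepareCB, mock_filter_by_gid_prefix() and the "skip_" GID prefix are all hypothetical names made up for illustration. Only the three-way decision mirrors the wrapper: two-phase disabled means "filtered", no callback means "not filtered", otherwise the plugin decides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for LogicalDecodeFilterPrepareCB; illustration only. */
typedef bool (*MockFilterPrepareCB) (const char *gid);

/* Hypothetical stand-in for the two LogicalDecodingContext fields we need. */
typedef struct MockDecodingContext
{
	bool		twophase;		/* two-phase decoding supported and enabled? */
	MockFilterPrepareCB filter_prepare_cb;	/* optional, may be NULL */
} MockDecodingContext;

/* Example plugin policy (an assumption): filter GIDs starting with "skip_". */
static bool
mock_filter_by_gid_prefix(const char *gid)
{
	return strncmp(gid, "skip_", 5) == 0;
}

/*
 * Returns true when the transaction is "filtered out", i.e. it should NOT be
 * decoded at PREPARE time and is instead sent as a regular transaction at
 * COMMIT PREPARED.
 */
static bool
mock_filter_prepare(const MockDecodingContext *ctx, const char *gid)
{
	/* two-phase decoding disabled: everything is considered filtered */
	if (!ctx->twophase)
		return true;

	/* the callback is optional: without it, every prepare goes through */
	if (ctx->filter_prepare_cb == NULL)
		return false;

	/* otherwise the plugin decides */
	return ctx->filter_prepare_cb(gid);
}
```

With this convention, "true" always means "do not decode at PREPARE", which matches how the wrapper's result is consumed on the decode.c side in the second patch.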

Attachment: v19-0002-Support-2PC-txn-backend.patch (application/octet-stream)
From 42c45c69d4ceba6d1ab431a38492e9de3b6226d3 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 6 Nov 2020 10:59:08 +0530
Subject: [PATCH v19 2/2] Support 2PC txn backend.

Until now, two-phase transactions were decoded at COMMIT PREPARED, just
like regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.
---
 src/backend/replication/logical/decode.c      | 213 ++++++++++--
 .../replication/logical/reorderbuffer.c       | 318 +++++++++++++++---
 src/include/replication/reorderbuffer.h       |  34 ++
 3 files changed, 494 insertions(+), 71 deletions(-)
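
A minimal standalone sketch of the dispatch described in the commit message above, as I read the DecodeXactOp() changes below. It assumes none of the usual skip conditions (snapshot not yet consistent, other database, origin filtering, fast-forward) apply. MockRecord, MockAction and mock_dispatch() are hypothetical names, not part of the patch. The "filtered" argument stands for "filter_prepare_cb is set and ReorderBufferPrepareNeedSkip() returned true".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record kinds, standing in for the XLOG_XACT_* info values. */
typedef enum MockRecord
{
	MOCK_PREPARE,
	MOCK_COMMIT_PREPARED,
	MOCK_ROLLBACK_PREPARED
} MockRecord;

/* What the decoder ends up doing for a given record (illustrative labels). */
typedef enum MockAction
{
	MOCK_TRACK_XID_ONLY,		/* ReorderBufferProcessXid, nothing sent */
	MOCK_REPLAY_THEN_PREPARE,	/* replay changes, then prepare callback */
	MOCK_REPLAY_THEN_COMMIT,	/* replay as a regular txn, commit callback */
	MOCK_COMMIT_PREPARED_ONLY,	/* changes already sent, finish with commit */
	MOCK_ROLLBACK_PREPARED_ONLY,	/* changes already sent, finish with rollback */
	MOCK_DISCARD				/* plain abort: throw the changes away */
} MockAction;

static MockAction
mock_dispatch(MockRecord rec, bool twophase, bool filtered)
{
	/* same shape as the already_decoded computation in the patch */
	bool		decoded_at_prepare = twophase && !filtered;

	switch (rec)
	{
		case MOCK_PREPARE:
			return decoded_at_prepare ? MOCK_REPLAY_THEN_PREPARE
				: MOCK_TRACK_XID_ONLY;
		case MOCK_COMMIT_PREPARED:
			return decoded_at_prepare ? MOCK_COMMIT_PREPARED_ONLY
				: MOCK_REPLAY_THEN_COMMIT;
		case MOCK_ROLLBACK_PREPARED:
			return decoded_at_prepare ? MOCK_ROLLBACK_PREPARED_ONLY
				: MOCK_DISCARD;
	}
	return MOCK_DISCARD;		/* not reached; keeps compilers quiet */
}
```

The point of the sketch is that the same predicate has to be evaluated identically at PREPARE and at COMMIT/ROLLBACK PREPARED time, otherwise a transaction could be sent twice or not at all.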

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee99b8..1b65d4a16e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,9 +67,14 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool already_decoded);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool already_decoded);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -244,6 +249,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +259,19 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction data then
+				 * DecodeCommit doesn't need to decode it again. This is
+				 * possible iff output plugin supports two-phase commits and
+				 * doesn't skip the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+								ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeCommit(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +290,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction during prepare
+				 * then DecodeAbort needs to call rollback prepared.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+						ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeAbort(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -582,10 +629,14 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool already_decoded)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -609,8 +660,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * There can be several reasons we might not be interested in this
 	 * transaction:
 	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
 	 * 2) The transaction happened in another database.
 	 * 3) The output plugin is not interested in the origin.
 	 * 4) We are doing fast-forwarding
@@ -640,7 +691,79 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		return;
 	}
 
-	/* tell the reorderbuffer about the surviving subtransactions */
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (already_decoded)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* tell the reorderbuffer about the surviving subtransactions */
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr, buf->endptr);
+		}
+
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		commit_time = parsed->origin_timestamp;
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeCommit for
+	 * the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit because the
+	 * txn hasn't been committed yet; removing it before the commit might
+	 * result in the computation of an incorrect restart_lsn (see
+	 * SnapBuildProcessRunningXacts). We still need to process any cache
+	 * invalidations, for the reasons mentioned in DecodeCommit.
+	 */
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
@@ -648,33 +771,67 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool already_decoded)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool		skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
 	}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	skip_xact = SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id);
+
+	/*
+	 * Send the final rollback record if the transaction data is already
+	 * decoded and we don't need to skip it, otherwise, perform the cleanup of
+	 * the transaction.
+	 */
+	if (already_decoded && !skip_xact)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
+
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c1bd68011c..6092df39f0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -421,6 +422,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1514,12 +1521,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or after a PREPARE. The flag txn_prepared indicates whether this
+ * is called after a PREPARE. If streaming, keep the remaining info:
+ * transactions, tuplecids, invalidations and snapshots. If after a PREPARE,
+ * keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1538,7 +1547,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1572,9 +1581,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1768,9 +1801,24 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPARED, so for
+		 * now just truncate the txn by removing changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1898,8 +1946,10 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/*
+	 * Discard the changes that we just streamed.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2006,7 +2056,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2297,7 +2347,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2331,18 +2390,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of four scenarios: 1. Prepare of a
+		 * two-phase commit. 2. Prepare of a two-phase commit that is also
+		 * part of a streamed in-progress txn. 3. Streaming of an in-progress
+		 * txn. 4. Commit of a transaction.
+		 *
+		 * Scenarios 1 and 2 are handled the same way: pass prepared as true
+		 * to ReorderBufferTruncateTXN and allow a more elaborate truncation
+		 * of txn data, as the entire transaction has been decoded and only
+		 * the commit is pending. In scenario 3, pass prepared as false to
+		 * ReorderBufferTruncateTXN, as the txn is not yet completely decoded.
+		 * In scenario 4, the whole txn has been decoded and we can fully
+		 * clean up the txn's reorder buffer state.
 		 */
-		if (streaming)
+		if (rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, true);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn, false);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2372,17 +2445,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2390,10 +2466,23 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream
+			 * remaining data. Streaming can also be on a prepared txn, handle
+			 * it the same way.
+			 */
+			if (streaming)
+			{
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid[0] != '\0' ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2415,23 +2504,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2472,6 +2554,120 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							command_id, false);
 }
 
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, two-phase transactions do not contain any reorderbuffers. So
+	 * allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -2515,7 +2711,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
@@ -2603,6 +2804,37 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
+/*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ * Note that this is only allowed to be called when a transaction prepare
+ * has just been read, not otherwise.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
 /*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 66c89d16bd..13c802b456 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -234,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -614,12 +635,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -637,6 +664,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
2.28.0.windows.1
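
For anyone following the cleanup logic in the patch above, here is a minimal standalone sketch of the decision made at the end of ReorderBufferProcessTXN(): a prepared transaction is truncated with prepared = true, a streamed in-progress transaction with prepared = false, and anything else gets a full cleanup. The flag values are copied from the reorderbuffer.h hunk. SketchTXN, CleanupAction and choose_cleanup() are hypothetical names invented for illustration only; they are not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as added to reorderbuffer.h by the patch above. */
#define RBTXN_PREPARE             0x0080
#define RBTXN_COMMIT_PREPARED     0x0100
#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Hypothetical stand-in for ReorderBufferTXN: only the flags field. */
typedef struct SketchTXN
{
	uint32_t	txn_flags;
} SketchTXN;

/* Same shape as the rbtxn_prepared() macro in the patch. */
#define rbtxn_prepared(txn) \
( \
	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
)

/* Hypothetical labels for the three cleanup paths. */
typedef enum CleanupAction
{
	TRUNCATE_PREPARED,		/* ReorderBufferTruncateTXN(rb, txn, true) */
	TRUNCATE_STREAMING,		/* ReorderBufferTruncateTXN(rb, txn, false) */
	CLEANUP_FULL			/* ReorderBufferCleanupTXN(rb, txn) */
} CleanupAction;

/*
 * Pick the cleanup path in the same order as the patch: the prepared
 * check comes first, so a streamed transaction that has also been
 * prepared takes the "prepared" path.
 */
static CleanupAction
choose_cleanup(const SketchTXN *txn, bool streaming)
{
	assert(txn != NULL);

	if (rbtxn_prepared(txn))
		return TRUNCATE_PREPARED;
	else if (streaming)
		return TRUNCATE_STREAMING;
	else
		return CLEANUP_FULL;
}
```

The ordering is the point of the sketch: because rbtxn_prepared() is tested before streaming, scenarios 1 and 2 from the comment both land on the same branch.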

Attachment: v19-0003-Support-2PC-test-cases-for-test_decoding.patch (application/octet-stream)
From 99ea1ff6c83a75d116d4ced5da9677806b6db916 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 9 Nov 2020 12:22:31 +1100
Subject: [PATCH v18] Support 2PC test cases for test_decoding.

Add sql and tap tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                     |   4 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 ++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 contrib/test_decoding/t/001_twophase.pl            | 121 +++++++++++
 6 files changed, 711 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# Test logical decoding of two-phase (2PC) transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# Test a concurrent abort while logical decoding is in progress. We pass
+# the XID of the 2PC transaction to the plugin as an option. When it receives
+# a valid "check-xid-aborted" value, the change callback in the test_decoding
+# plugin waits for that transaction to abort.
+#
+# While the decode is waiting, we issue a ROLLBACK PREPARED from another
+# session.
+#
+# The transaction named by "check-xid-aborted" then goes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+ok(looks_like_number($xid2pc), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
-- 
1.8.3.1
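
A note for readers of the TAP test above: poll_output_until is a bounded poll loop. It re-checks a condition (here, a message appearing in the server log), pauses briefly between attempts, and gives up after a fixed number of tries, so the test waits on an observable event instead of a fixed sleep. Below is a minimal C sketch of the same pattern. It is not part of the patch; the names poll_until and countdown_reaches_zero are made up for illustration, and nanosleep is assumed to be available (POSIX).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical predicate type: returns true once the awaited event is seen. */
typedef bool (*poll_predicate) (void *arg);

/*
 * Re-evaluate pred up to max_attempts times, pausing delay_ms milliseconds
 * after each failed attempt. Returns true as soon as pred succeeds, false
 * if every attempt failed.
 */
static bool
poll_until(poll_predicate pred, void *arg, int max_attempts, long delay_ms)
{
	struct timespec delay;

	delay.tv_sec = delay_ms / 1000;
	delay.tv_nsec = (delay_ms % 1000) * 1000000L;

	for (int attempt = 0; attempt < max_attempts; attempt++)
	{
		if (pred(arg))
			return true;
		nanosleep(&delay, NULL);
	}
	return false;
}

/*
 * Illustrative predicate: decrements a counter on each call and reports
 * success once it reaches zero. It stands in for "the log now contains
 * the expected line".
 */
static bool
countdown_reaches_zero(void *arg)
{
	int		   *remaining = arg;

	if (*remaining > 0)
		(*remaining)--;
	return *remaining == 0;
}
```

In the TAP test the predicate is "the server log matches $expected", checked up to 1800 times at 0.1 second intervals.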

Attachment: v19-0004-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From f7a8909795f313ca3cdc5d2b3b5b9a7ed588906d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 9 Nov 2020 14:10:07 +1100
Subject: [PATCH v18] Support 2PC txn - spoolfile.

This patch is a refactoring only: it isolates the streaming spool-file processing in a separate function.
The two-phase commit logic added later needs to call this common processing from multiple places.
---
 src/backend/replication/logical/worker.c | 58 +++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0468491..9fa816c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -933,30 +935,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +957,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +972,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1041,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1077,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
-- 
1.8.3.1

Attachment: v19-0005-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 6d360e67a130738462f55e94f61fda8fe0a9d856 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 9 Nov 2020 16:53:00 +1100
Subject: [PATCH v18] Support 2PC txn - pgoutput.

This patch adds support for two-phase commits to the pgoutput plugin and to
the subscriber.
---
 src/backend/access/transam/twophase.c       |  33 +++-
 src/backend/replication/logical/proto.c     | 141 ++++++++++++++++-
 src/backend/replication/logical/worker.c    | 236 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c |  74 +++++++++
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++-
 src/tools/pgindent/typedefs.list            |   1 +
 7 files changed, 518 insertions(+), 5 deletions(-)
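
A note on the message body that logicalrep_write_prepare (in proto.c below) produces: after the message-type byte come one flags byte, three 64-bit fields (prepare_lsn, end_lsn and the timestamp) and the NUL-terminated GID. pq_sendint64 writes in network byte order. The standalone sketch below mirrors that layout and shows a reader that validates the flags and bounds the GID copy. It is an illustration only, not code from the patch: every Sketch*/sketch_* name, the flag values and the 200-byte buffer size are assumptions, since the real definitions are in logicalproto.h and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; the real constants are defined by the patch. */
#define SKETCH_GIDSIZE				200
#define SKETCH_IS_PREPARE			0x01
#define SKETCH_IS_COMMIT_PREPARED	0x02
#define SKETCH_IS_ROLLBACK_PREPARED	0x04

typedef struct SketchPrepare
{
	uint8_t		flags;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Store a 64-bit value in network (big-endian) byte order. */
static void
put_u64(unsigned char *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (unsigned char) (v >> (56 - 8 * i));
}

/* Load a 64-bit value stored in network byte order. */
static uint64_t
get_u64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Write the body (after the type byte). Returns bytes written, 0 if no room. */
static size_t
sketch_encode(unsigned char *buf, size_t cap, const SketchPrepare *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */

	if (1 + 24 + gidlen > cap)
		return 0;
	buf[0] = m->flags;
	put_u64(buf + 1, m->prepare_lsn);
	put_u64(buf + 9, m->end_lsn);
	put_u64(buf + 17, (uint64_t) m->prepare_time);
	memcpy(buf + 25, m->gid, gidlen);
	return 25 + gidlen;
}

/* Read the body. Returns 0 on success, -1 if the message is malformed. */
static int
sketch_decode(const unsigned char *buf, size_t len, SketchPrepare *m)
{
	const unsigned char *nul;
	size_t		gidlen;

	if (len < 1 + 24 + 1)
		return -1;				/* too short for the fixed fields and a NUL */
	if (buf[0] != SKETCH_IS_PREPARE &&
		buf[0] != SKETCH_IS_COMMIT_PREPARED &&
		buf[0] != SKETCH_IS_ROLLBACK_PREPARED)
		return -1;				/* unrecognized flags */

	nul = memchr(buf + 25, '\0', len - 25);
	if (nul == NULL)
		return -1;				/* GID is not terminated */
	gidlen = (size_t) (nul - (buf + 25));
	if (gidlen >= SKETCH_GIDSIZE)
		return -1;				/* GID would overflow the fixed buffer */

	m->flags = buf[0];
	m->prepare_lsn = get_u64(buf + 1);
	m->end_lsn = get_u64(buf + 9);
	m->prepare_time = (int64_t) get_u64(buf + 17);
	memcpy(m->gid, buf + 25, gidlen + 1);
	return 0;
}
```

The length check on the GID is the point of the sketch: the reader copies into a fixed-size buffer, so the copy needs a bound that does not depend on what the sender put on the wire.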

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..847f85d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check whether a prepared transaction with the given GID exists.
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..cfb94d1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * The streaming API supports only PREPARE here; [COMMIT|ROLLBACK]
+	 * PREPARED is sent through the non-streaming API.
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9fa816c..f1e94ad 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -742,6 +742,234 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * The prepared transaction may have been aborted concurrently while it
+	 * was being decoded. Decoding then stops and the transaction is aborted
+	 * immediately, so the PREPARE may never have been applied here, yet the
+	 * ROLLBACK PREPARED still reaches the subscriber. It is therefore fine
+	 * for the GID to be missing.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 * ----------------------------------------------------------------------
+	 * 1. Replay all the spooled operations.
+	 *
+	 * Similar to apply_handle_stream_commit (non two-phase stream commit).
+	 * ----------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * ----------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared.
+	 *
+	 * Similar to apply_handle_prepare_txn (two-phase non-streamed prepare).
+	 * ----------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1969,6 +2197,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..71ac431 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +392,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +913,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index cca13da..7c6686c 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -54,10 +54,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +116,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +124,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare_type must be exactly one of PREPARE, COMMIT PREPARED or ROLLBACK PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +153,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +200,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1086e51..f9df33c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1340,6 +1340,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1
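
A note on the protocol header changes in the patch above: the three prepare_type values are distinct bits, but PrepareFlagsAreValid() accepts only an exact match, so a combination such as 0x03 is rejected. The following is a minimal standalone sketch of that check. The flag values and the macro are copied from the patch; prepare_type_name() is a hypothetical helper added here purely for illustration and is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values and validity macro as they appear in logicalproto.h in the patch. */
#define LOGICALREP_IS_PREPARE			0x01
#define LOGICALREP_IS_COMMIT_PREPARED	0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04

#define PrepareFlagsAreValid(flags) \
	(((flags) == LOGICALREP_IS_PREPARE) || \
	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))

/*
 * Hypothetical helper (not in the patch): map a prepare_type byte to a
 * readable name, or return NULL if it is not exactly one of the three flags.
 */
static const char *
prepare_type_name(uint8_t flags)
{
	if (!PrepareFlagsAreValid(flags))
		return NULL;

	switch (flags)
	{
		case LOGICALREP_IS_PREPARE:
			return "PREPARE";
		case LOGICALREP_IS_COMMIT_PREPARED:
			return "COMMIT PREPARED";
		default:
			/* Only one valid value is left once the check above has passed. */
			assert(flags == LOGICALREP_IS_ROLLBACK_PREPARED);
			return "ROLLBACK PREPARED";
	}
}
```

A reader of the message would presumably run such a check right after reading the prepare_type byte and raise an error when it fails; that usage is an assumption here, since the reader side is not shown in this part of the patch.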

Attachment: v19-0006-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From b4fd33329e2e5677ab58132012638432d3f64e61 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 9 Nov 2020 16:57:10 +1100
Subject: [PATCH v18] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber tests (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl           | 345 ++++++++++++++
 src/test/subscription/t/021_twophase_streaming.pl | 521 ++++++++++++++++++++++
 2 files changed, 866 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_streaming.pl
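
The streaming test in this patch repeatedly expects 3334 rows on the subscriber after COMMIT PREPARED. That figure follows from the workload: keys 1 and 2 already exist, keys 3..5000 are inserted, and every key with mod(a,3) = 0 is deleted, so 5000 - 1666 = 3334 rows remain. A small sketch of that arithmetic, where expected_rows_after_delete() is a hypothetical helper used only for illustration and not part of the test suite:

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): number of rows left in
 * test_tab when keys 1..max_key exist and every key divisible by 3 has
 * been deleted.
 */
static int
expected_rows_after_delete(int max_key)
{
	int			deleted;

	assert(max_key >= 0);

	/* Keys 3, 6, 9, ... up to max_key are removed by the DELETE. */
	deleted = max_key / 3;

	return max_key - deleted;
}
```

Calling expected_rows_after_delete(5000) gives the 3334 used in the test assertions, and expected_rows_after_delete(2) gives the 2 rows expected after a rollback.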

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..f489f47
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,345 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 21;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full SELECT generate_series(1,10)");
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_full2 VALUES ('a'), ('b'), ('b')");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full2 (x text)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full, tab_full2");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are rolled back on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_streaming.pl b/src/test/subscription/t/021_twophase_streaming.pl
new file mode 100644
index 0000000..9a03b83
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_streaming.pl
@@ -0,0 +1,521 @@
+# logical replication of 2PC test (with streaming = on)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 28;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After the servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After the servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1

#98Ajin Cherian
itsajin@gmail.com
In reply to: Ajin Cherian (#96)

I did some further tests on the problem I saw, and it does not have
anything to do with this patch. I used the code from the tip of HEAD.
If a transaction has enough changes to initiate streaming, then the
same changes are streamed again after the commit.

BEGIN;
INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
generate_series(1,800) g(i);
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1', 'stream-changes', '1');
** see streamed output here **
END;
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1', 'stream-changes', '1');
** see the same streamed output here **

I think this is because the transaction has not yet been committed, so
SnapBuildCommitTxn is not called, and that is what moves
"builder->start_decoding_at". As a result, later calls to
pg_logical_slot_get_changes start from the previous LSN. I did a quick
test in pgoutput using pub/sub and I don't see duplication of data
there, but I haven't verified exactly what happens.
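
As a rough illustration of that hypothesis (a toy model, not PostgreSQL
code; ToySlot and its methods are invented names, and the real slot
bookkeeping is more involved), a slot whose start position only advances
when it sees a commit record hands back the same in-progress changes on
every call until the commit arrives:

```python
# Toy model of the hypothesis above; this is NOT PostgreSQL code.
# ToySlot, append() and get_changes() are hypothetical names for illustration.
from dataclasses import dataclass, field


@dataclass
class ToySlot:
    # Each record is (lsn, kind, payload); kind is "change" or "commit".
    wal: list = field(default_factory=list)
    start_decoding_at: int = 0  # first LSN the next call will read

    def append(self, kind, payload=None):
        lsn = len(self.wal)
        self.wal.append((lsn, kind, payload))

    def get_changes(self):
        """Return every change from start_decoding_at onward.

        The start position only moves past a commit record, mirroring the
        claim that it is advanced at commit time and not while the
        transaction is still in progress.
        """
        out = []
        new_start = self.start_decoding_at
        for lsn, kind, payload in self.wal[self.start_decoding_at:]:
            if kind == "change":
                out.append(payload)
            elif kind == "commit":
                new_start = lsn + 1
        self.start_decoding_at = new_start
        return out


slot = ToySlot()
for i in range(1, 4):
    slot.append("change", f"row {i}")

first = slot.get_changes()   # transaction still open: changes are streamed
slot.append("commit")
second = slot.get_changes()  # same changes again, start had not advanced
third = slot.get_changes()   # the commit was seen, so nothing is left
```

In this model the first and second calls return identical changes and
only the third call comes back empty, which matches the duplicated
output described above.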

regards,
Ajin Cherian
Fujitsu Australia

#99Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#98)
1 attachment(s)

The subscriber tests are updated to include test cases for "cascading"
pub/sub scenarios.

i.e., NODE_A publisher => subscriber NODE_B publisher => subscriber NODE_C

PSA only the modified v20-0006 patch (the other 5 patches remain unchanged)

Kind Regards,
Peter Smith.
Fujitsu Australia.

Attachments:

v20-0006-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 9b041cd0f4707d42b718d93a405a2a5f7e827bb7 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 12 Nov 2020 17:53:23 +1100
Subject: [PATCH v20] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and not streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are rolled back on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test (streaming)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming enabled.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#100Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#97)

On Wed, Nov 11, 2020 at 12:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
I have rearranged the code in DecodeCommit/Abort/Prepare so that it
does only the required things (for example, DecodeCommit was still
processing subtxns even when it only had to perform FinishPrepared;
the stats were also not updated properly, which I have fixed) and
added/edited the comments. Apart from 0001 and 0002, I have not
changed anything in the remaining patches.

One small comment on the patch:

- DecodeCommit(ctx, buf, &parsed, xid);
+ /*
+ * If we have already decoded this transaction data then
+ * DecodeCommit doesn't need to decode it again. This is
+ * possible iff output plugin supports two-phase commits and
+ * doesn't skip the transaction at prepare time.
+ */
+ if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+ {
+ already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+ }
+

Just a small nitpick but the way already_decoded is assigned here is a
bit misleading. It appears that the callbacks determine if the
transaction is already decoded when in
reality the callbacks only decide if the transaction should skip two
phase commits. I think it's better to either move it to the if
condition or if that is too long then have one more variable
skip_twophase.

if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase &&
!(ctx->callbacks.filter_prepare_cb &&
ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid)))
already_decoded = true;

OR
bool skip_twophase = false;
skip_twophase = !(ctx->callbacks.filter_prepare_cb &&
ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase && skip_twophase)
already_decoded = true;
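
To make the comparison concrete, here is a minimal standalone sketch, not PostgreSQL code, that models the patch's form and the first suggested form as pure boolean functions and checks that they agree for every input. The struct and function names are hypothetical stand-ins for the expressions quoted above. Two things the model does not capture: in the second alternative the variable named skip_twophase is true when the transaction is NOT skipped, and that alternative evaluates the filter call unconditionally rather than only for COMMIT PREPARED records.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-ins for the decoding state; these names are not part
 * of PostgreSQL. Each field models one sub-expression from the patch.
 */
typedef struct DecodeState
{
	bool		is_commit_prepared; /* info == XLOG_XACT_COMMIT_PREPARED */
	bool		twophase;			/* ctx->twophase */
	bool		has_filter_cb;		/* ctx->callbacks.filter_prepare_cb set */
	bool		filter_says_skip;	/* ReorderBufferPrepareNeedSkip() result */
} DecodeState;

/* Form used in the patch: assign inside the if block. */
static bool
already_decoded_patch(const DecodeState *s)
{
	bool		already_decoded = false;

	if (s->is_commit_prepared && s->twophase)
		already_decoded = !(s->has_filter_cb && s->filter_says_skip);

	return already_decoded;
}

/* First suggested form: fold the filter check into the condition. */
static bool
already_decoded_single_if(const DecodeState *s)
{
	bool		already_decoded = false;

	if (s->is_commit_prepared && s->twophase &&
		!(s->has_filter_cb && s->filter_says_skip))
		already_decoded = true;

	return already_decoded;
}

/* Walk all 16 input combinations; assert fires on any disagreement. */
static void
verify_forms_agree(void)
{
	for (int bits = 0; bits < 16; bits++)
	{
		DecodeState s;

		s.is_commit_prepared = (bits & 1) != 0;
		s.twophase = (bits & 2) != 0;
		s.has_filter_cb = (bits & 4) != 0;
		s.filter_says_skip = (bits & 8) != 0;

		assert(already_decoded_patch(&s) == already_decoded_single_if(&s));
	}
}
```

Since the two forms are logically equivalent, the choice between them is purely about readability.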

regards,
Ajin Cherian
Fujitsu Australia

#101Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#100)

On Thu, Nov 12, 2020 at 2:28 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Wed, Nov 11, 2020 at 12:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
I have rearranged the code in DecodeCommit/Abort/Prepare so that it
does only the required things (for example, DecodeCommit was still
processing subtxns even when it only had to perform FinishPrepared;
the stats were also not updated properly, which I have fixed) and
added/edited the comments. Apart from 0001 and 0002, I have not
changed anything in the remaining patches.

One small comment on the patch:

- DecodeCommit(ctx, buf, &parsed, xid);
+ /*
+ * If we have already decoded this transaction data then
+ * DecodeCommit doesn't need to decode it again. This is
+ * possible iff output plugin supports two-phase commits and
+ * doesn't skip the transaction at prepare time.
+ */
+ if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+ {
+ already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+ ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+ }
+

Just a small nitpick but the way already_decoded is assigned here is a
bit misleading. It appears that the callbacks determine if the
transaction is already decoded when in
reality the callbacks only decide if the transaction should skip two
phase commits. I think it's better to either move it to the if
condition or if that is too long then have one more variable
skip_twophase.

if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase &&
!(ctx->callbacks.filter_prepare_cb &&
ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid)))
already_decoded = true;

OR
bool skip_twophase = false;
skip_twophase = !(ctx->callbacks.filter_prepare_cb &&
ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase && skip_twophase)
already_decoded = true;

Hmm, neither introducing an additional boolean variable for this nor
the other alternative you suggested seems like a good idea. How about
we change the comment to make it clear instead? How about: "If output
plugin supports two-phase commits and doesn't skip the transaction at
prepare time then we don't need to decode the transaction data at
commit prepared time as it would have already been decoded at prepare
time."?

--
With Regards,
Amit Kapila.

#102Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#96)

On Tue, Nov 10, 2020 at 4:19 PM Ajin Cherian <itsajin@gmail.com> wrote:

I was doing some testing, and I found two issues. The first seems to
be behaviour that might be acceptable; the second, not so much.
I was using test_decoding; I am not sure how this might behave with the
pgoutput plugin.

Test 1:
A transaction that is rolled back immediately after the prepare.

SET synchronous_commit = on;
SELECT 'init' FROM
pg_create_logical_replication_slot('regression_slot',
'test_decoding');
CREATE TABLE stream_test(data text);
-- consume DDL
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
NULL, 'include-xids', '0', 'skip-empty-xacts', '1');

BEGIN;
INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
generate_series(1, 20) g(i);
PREPARE TRANSACTION 'test1';
ROLLBACK PREPARED 'test1';
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1', 'stream-changes', '1');
==================

Here, what is seen is that while the transaction was not decoded at
all, since it was rolled back before it could get decoded, the ROLLBACK
PREPARED is actually decoded.
The result is that the standby could get a spurious ROLLBACK
PREPARED. The current code in worker.c does handle this silently. So,
this might not be an issue.

Yeah, this seems okay because it is quite possible that such a
rollback is encountered after processing a few records, in which
case sending the rollback is required. This can happen when a rollback
is issued concurrently while we are decoding the prepare. If the
output plugin wants, it can detect that the transaction has not written
any data and ignore the rollback; we already do something similar in
test_decoding. So, I think this should be fine.
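
As an illustration of that last point, here is a small standalone sketch of how a plugin could suppress a ROLLBACK PREPARED for a transaction that wrote nothing. It is a toy model, not the test_decoding source: the ToyPluginState type and the toy_* functions are hypothetical, and it assumes the plugin resets a "wrote changes" flag at transaction begin, in the spirit of the skip_empty_xacts / xact_wrote_changes pair in test_decoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-plugin state, loosely modelled on TestDecodingData. */
typedef struct ToyPluginState
{
	bool		skip_empty_xacts;	/* option: hide transactions with no data */
	bool		xact_wrote_changes; /* set once a change has been emitted */
} ToyPluginState;

/* Assumed to run when decoding of a transaction starts. */
static void
toy_begin(ToyPluginState *st)
{
	st->xact_wrote_changes = false;
}

/* Assumed to run for every decoded change of the transaction. */
static void
toy_change(ToyPluginState *st)
{
	st->xact_wrote_changes = true;
}

/*
 * Format a ROLLBACK PREPARED message into "out". Returns the number of
 * characters written, or 0 when the message is suppressed because the
 * transaction never produced any output.
 */
static int
toy_rollback_prepared(const ToyPluginState *st, const char *gid,
					  char *out, size_t outlen)
{
	assert(out != NULL && outlen > 0);
	out[0] = '\0';

	if (st->skip_empty_xacts && !st->xact_wrote_changes)
		return 0;

	return snprintf(out, outlen, "ROLLBACK PREPARED '%s'", gid);
}
```

With skip_empty_xacts enabled, a rollback that arrives before any change was decoded produces no output, while a rollback that arrives after a few changes is still sent.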

--
With Regards,
Amit Kapila.

#103Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#101)

On Fri, Nov 13, 2020 at 9:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 12, 2020 at 2:28 PM Ajin Cherian <itsajin@gmail.com> wrote:

Hmm, neither introducing an additional boolean variable for this nor
the other alternative you suggested seems like a good idea. How about
we change the comment to make it clear instead? How about: "If output
plugin supports two-phase commits and doesn't skip the transaction at
prepare time then we don't need to decode the transaction data at
commit prepared time as it would have already been decoded at prepare
time."?

Yes, that works for me.

regards,
Ajin Cherian
Fujitsu Australia

#104Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#98)

On Wed, Nov 11, 2020 at 4:30 PM Ajin Cherian <itsajin@gmail.com> wrote:

Did some further tests on the problem I saw, and I see that it does not
have anything to do with this patch. I picked up the code from the top
of HEAD. If I have enough changes in a transaction to initiate
streaming, then it will also stream the same changes after a commit.

BEGIN;
INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
generate_series(1,800) g(i);
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1', 'stream-changes', '1');
** see streamed output here **
END;
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1', 'stream-changes', '1');
** see the same streamed output here **

I think this is because since the transaction has not been committed,
SnapBuildCommitTxn is not called which is what moves the
"builder->start_decoding_at", and as a result
later calls to pg_logical_slot_get_changes will start from the
previous lsn.

No, we always move start_decoding_at after streaming changes. It will
be moved because we have advanced the confirmed_flush location after
streaming all the changes (via LogicalConfirmReceivedLocation()), and
that location is used to set 'start_decoding_at' when we next create a
decoding context (CreateDecodingContext). However, we don't advance
'restart_lsn', so decoding starts from the same point and accumulates
all the changes for the transaction each time. Now, after the commit we
get an extra record which is ahead of 'start_decoding_at', and when we
decode it, it gets all the changes of the transaction. We could update
the documentation for pg_logical_slot_get_changes() to mention this,
but I don't think this is a problem.
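
The interplay of the two slot positions can be sketched with a toy model. This is a deliberately simplified illustration of the explanation above, not how PostgreSQL implements it: ToySlot, WalRecord and toy_get_changes are hypothetical, an LSN is just an array index, and only a single transaction is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	REC_CHANGE,
	REC_COMMIT
} RecKind;

/* A toy WAL record; its "LSN" is simply its index in the array. */
typedef struct WalRecord
{
	RecKind		kind;
} WalRecord;

typedef struct ToySlot
{
	size_t		restart_lsn;		/* where the next pass starts reading */
	size_t		confirmed_flush;	/* everything before this was delivered */
} ToySlot;

/*
 * Run one decoding pass over wal[0 .. wal_len) and return how many changes
 * were handed to the consumer during this pass.
 */
static size_t
toy_get_changes(ToySlot *slot, const WalRecord *wal, size_t wal_len)
{
	size_t		start_decoding_at = slot->confirmed_flush;
	size_t		queued = 0;		/* changes accumulated for the open txn */
	size_t		emitted = 0;
	size_t		txn_first_lsn = 0;
	bool		txn_open = false;

	for (size_t lsn = slot->restart_lsn; lsn < wal_len; lsn++)
	{
		if (wal[lsn].kind == REC_CHANGE)
		{
			if (!txn_open)
			{
				txn_open = true;
				txn_first_lsn = lsn;
			}
			queued++;
		}
		else
		{
			/* A commit at or past start_decoding_at replays the whole txn. */
			if (lsn >= start_decoding_at)
				emitted += queued;
			queued = 0;
			txn_open = false;
		}
	}

	/* Streaming: an in-progress transaction is sent before its commit. */
	if (txn_open)
		emitted += queued;

	/* confirmed_flush always advances; restart_lsn waits for the open txn. */
	slot->confirmed_flush = wal_len;
	slot->restart_lsn = txn_open ? txn_first_lsn : wal_len;
	assert(slot->restart_lsn <= slot->confirmed_flush);

	return emitted;
}
```

In this model, a first pass over three uncommitted changes emits all three and leaves restart_lsn at the start of the transaction. A second pass, after the commit record has been appended, re-reads the same three changes and emits them again because the commit lies at start_decoding_at, which matches the duplicated output seen in the test. A third pass emits nothing.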

I did do a quick test in pgoutput using pub/sub and I
don't see duplication of data there, but I haven't
verified exactly what happens.

Yeah, because we always move ahead for WAL locations in that unless
the subscriber/publisher is restarted in which case it should start
from the required location. But still, we can try to see if there is
any bug.

--
With Regards,
Amit Kapila.

#105Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#104)
6 attachment(s)

Updated with a new test case
(contrib/test_decoding/t/002_twophase-streaming.pl) that tests
concurrent aborts during streaming prepare. Had to make a few changes
to the test_decoding stream_start callbacks to handle
"check-xid-aborted"
the same way it is handled in the non-stream callbacks. Merged
Peter's v20-0006 as well.
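
For readers following along, the "check-xid-aborted" option carries a transaction ID as a string. Below is a minimal standalone sketch of a strict parse for such a value. It is an illustration only: parse_xid_option and ToyTransactionId are hypothetical names, not part of the patch, and the sketch assumes a 32-bit XID where 0 is the invalid value. Unlike a bare strtoul() call with a NULL end pointer, it rejects trailing garbage, signs, and values that do not fit in 32 bits.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t ToyTransactionId;	/* stand-in for TransactionId */
#define TOY_INVALID_XID ((ToyTransactionId) 0)

/*
 * Parse "arg" as a transaction ID. Returns true and stores the value in
 * *xid_out on success; returns false and leaves *xid_out alone otherwise.
 */
static bool
parse_xid_option(const char *arg, ToyTransactionId *xid_out)
{
	char	   *end = NULL;
	unsigned long val;

	assert(xid_out != NULL);

	/* Require a leading digit: no empty string, whitespace, or sign. */
	if (arg == NULL || arg[0] < '0' || arg[0] > '9')
		return false;

	errno = 0;
	val = strtoul(arg, &end, 0);	/* base 0 also accepts 0x... and 0... */

	/* Reject overflow, no digits consumed, or trailing characters. */
	if (errno != 0 || end == arg || *end != '\0')
		return false;

	/* Reject values outside the 32-bit range and the invalid XID. */
	if (val > UINT32_MAX || val == TOY_INVALID_XID)
		return false;

	*xid_out = (ToyTransactionId) val;
	return true;
}
```

For example, "742" and "0x10" are accepted, while "742abc", "-1", "0" and the empty string are rejected.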

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v20-0001-Support-2PC-txn-base.patch (application/octet-stream)
From a160453cf0a79a5c979d1b90cf1149d1c63512b3 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 16 Nov 2020 01:25:14 -0500
Subject: [PATCH v20] Support 2PC txn base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

Include logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 210 ++++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 146 +++++++++++++++++-
 src/backend/replication/logical/logical.c | 242 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  35 +++++
 src/tools/pgindent/typedefs.list          |  11 ++
 7 files changed, 688 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 8e33614..10ab260 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -36,6 +40,7 @@ typedef struct
 	bool		skip_empty_xacts;
 	bool		xact_wrote_changes;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 static void pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -73,6 +78,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -88,6 +96,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -112,10 +132,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -136,6 +162,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -227,6 +254,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+
+				errno = 0;
+				data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -238,6 +294,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -297,6 +354,92 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate a
+ * simple logic by checking the GID. If the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -455,6 +598,26 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	data->xact_wrote_changes = true;
 
+	/*
+	 * if check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -603,6 +766,27 @@ static void
 pg_output_stream_start(LogicalDecodingContext *ctx, TestDecodingData *data, ReorderBufferTXN *txn, bool last_write)
 {
 	OutputPluginPrepareWrite(ctx, last_write);
+
+	/*
+	 * if check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
 	else
@@ -646,6 +830,32 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	if (data->skip_empty_xacts && !data->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..f5b617d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. The transaction that is currently being decoded may
+     be aborted concurrently by a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,56 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a transaction <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a transaction <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +657,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or an uncommitted transaction, this
+      change callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +740,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded
+       at this prepare stage or later as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +794,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or an uncommitted transaction, this message
+      callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +850,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1041,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, then committed using the
+    <function>commit_prepared_cb</function> callback or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d5cfbea..e9107cd 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -227,6 +241,21 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require
+	 * prepare/commit-prepare/abort-prepare callbacks. The filter-prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +266,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +813,129 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the commit prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the rollback prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1012,52 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1256,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..032e35a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index dfdda93..66c89d1 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -244,6 +245,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -405,6 +409,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -431,6 +455,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -497,6 +527,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -505,6 +539,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b146b3e..84f3d3e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1316,9 +1316,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1

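To make the enablement rules in the patch above easier to follow, here is a minimal standalone sketch of how ctx->twophase is derived and how filter_prepare_cb_wrapper() decides. It is only a model under stated assumptions, not the patch code: TwoPhaseCallbacks, twophase_enabled(), filter_prepare(), skip_prefixed_gids() and noop_event() are hypothetical names invented for illustration, and the callback signatures are simplified to (xid, gid).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified callback types; the real ones take ctx and txn. */
typedef bool (*filter_prepare_fn) (unsigned int xid, const char *gid);
typedef void (*txn_event_fn) (unsigned int xid, const char *gid);

/* Hypothetical stand-in for the two-phase members of OutputPluginCallbacks. */
typedef struct TwoPhaseCallbacks
{
	filter_prepare_fn filter_prepare_cb;	/* optional */
	txn_event_fn prepare_cb;				/* required */
	txn_event_fn commit_prepared_cb;		/* required */
	txn_event_fn rollback_prepared_cb;		/* required */
	txn_event_fn stream_prepare_cb;			/* required when streaming */
} TwoPhaseCallbacks;

/*
 * Models the ctx->twophase assignment in StartupDecodingContext(): two-phase
 * decoding is switched on as soon as any one of the callbacks is registered,
 * so that missing ones can be reported later by the wrappers.
 */
bool
twophase_enabled(const TwoPhaseCallbacks *cb)
{
	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/*
 * Models filter_prepare_cb_wrapper(): returning true means "do not decode at
 * PREPARE time, send it as a regular transaction at COMMIT PREPARED".
 */
bool
filter_prepare(const TwoPhaseCallbacks *cb, unsigned int xid, const char *gid)
{
	/* Two-phase decoding not enabled: everything is filtered out. */
	if (!twophase_enabled(cb))
		return true;

	/* The filter is optional; without it nothing is filtered. */
	if (cb->filter_prepare_cb == NULL)
		return false;

	return cb->filter_prepare_cb(xid, gid);
}

/* Hypothetical filter: skip prepared transactions whose GID starts "skip_". */
bool
skip_prefixed_gids(unsigned int xid, const char *gid)
{
	(void) xid;
	return strncmp(gid, "skip_", 5) == 0;
}

/* Hypothetical do-nothing callback, used only to register "something". */
void
noop_event(unsigned int xid, const char *gid)
{
	(void) xid;
	(void) gid;
}
```

With no callbacks registered the sketch filters every prepared transaction; registering only noop_event as prepare_cb lets everything through; adding skip_prefixed_gids filters by GID. Note that the example filter depends only on its arguments, which matches the documented requirement that the answer for a given xid and gid be the same on every call.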
Attachment: v20-0004-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From 2d927ce3cdd145b3817d5c1f988d903ad74c1272 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 16 Nov 2020 01:57:30 -0500
Subject: [PATCH v20] Support 2PC txn - spoolfile.

This patch only refactors to isolate the streaming spool-file processing to a separate function.
Later, two-phase commit logic will require this common processing to be called from multiple places.
---
 src/backend/replication/logical/worker.c | 58 +++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0468491..9fa816c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -933,30 +935,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +957,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +972,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1041,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1077,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
-- 
1.8.3.1
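
Editorial note on the refactoring in the patch above: the spool-replay body of apply_handle_stream_commit() is moved into apply_spooled_messages(xid, lsn), which returns the number of changes applied so that a later STREAM PREPARE handler can reuse it. The following is only a minimal sketch of that shape, not the worker code. Every name prefixed with sketch_ or Sketch is hypothetical, and the "changes" are plain integers standing in for spooled protocol messages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the apply worker's state; not a PostgreSQL type. */
typedef struct SketchApplyState
{
	uint64_t	remote_final_lsn;	/* LSN passed in by the caller */
	long		applied_sum;		/* visible effect of the replayed changes */
	int			committed;
	int			prepared;
} SketchApplyState;

/*
 * Common step: replay all spooled changes and return how many were applied.
 * The caller decides which LSN is the "final" one (commit or prepare LSN).
 */
static int
sketch_apply_spooled(SketchApplyState *st, const int *changes, size_t n,
					 uint64_t lsn)
{
	int			nchanges = 0;

	assert(st != NULL);
	assert(changes != NULL || n == 0);

	st->remote_final_lsn = lsn;
	for (size_t i = 0; i < n; i++)
	{
		st->applied_sum += changes[i];
		nchanges++;
	}
	return nchanges;
}

/* Commit path: replay, then finish the transaction as committed. */
static int
sketch_stream_commit(SketchApplyState *st, const int *changes, size_t n,
					 uint64_t commit_lsn)
{
	int			nchanges = sketch_apply_spooled(st, changes, n, commit_lsn);

	st->committed = 1;
	return nchanges;
}

/* Prepare path: same replay, but the transaction ends up prepared instead. */
static int
sketch_stream_prepare(SketchApplyState *st, const int *changes, size_t n,
					  uint64_t prepare_lsn)
{
	int			nchanges = sketch_apply_spooled(st, changes, n, prepare_lsn);

	st->prepared = 1;
	return nchanges;
}
```

The point of returning the count from the helper is that each caller can log it under its own name after it has finished its own commit or prepare bookkeeping.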

Attachment: v20-0005-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 68bc0ad1ce275bfbd0bdd49903452e99f5ec85d7 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 16 Nov 2020 01:58:27 -0500
Subject: [PATCH v20] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.
---
 src/backend/access/transam/twophase.c       |  33 +++-
 src/backend/replication/logical/proto.c     | 141 ++++++++++++++++-
 src/backend/replication/logical/worker.c    | 236 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c |  74 +++++++++
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++-
 src/tools/pgindent/typedefs.list            |   1 +
 7 files changed, 518 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7940060..847f85d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check whether a prepared transaction with the given GID exists.
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..cfb94d1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK]
+	 * PREPARED uses non-streaming APIs
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9fa816c..f1e94ad 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -742,6 +742,234 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * A prepared transaction can be aborted concurrently while it is being
+	 * decoded. In that case decoding stops and the transaction is aborted
+	 * immediately, but the ROLLBACK PREPARED is still sent to the
+	 * subscriber. The apply side may therefore see a gid that was never
+	 * prepared locally, and it is ok for the lookup to find nothing.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare_txn (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1969,6 +2197,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..71ac431 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +392,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +913,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..0691fc5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -54,10 +54,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +116,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +124,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +153,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +200,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 84f3d3e..cc67646 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,6 +1342,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1
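
Editorial note on the message layout in the patch above: logicalrep_write_prepare() sends the message type byte 'P', one flags byte, three 64-bit fields (prepare LSN, end LSN, timestamp) and then the NUL-terminated GID, and logicalrep_read_prepare() rejects any flags value that is not exactly one of the three LOGICALREP_IS_* constants. The sketch below mirrors that layout over a plain byte buffer, assuming big-endian (network order) integers as sent by pq_sendint64. It is an illustration only: all sketch_/SKETCH_ names are hypothetical, SKETCH_GIDSIZE is the sketch's own bound, and unlike the real apply worker (where apply_dispatch consumes the type byte first) the decoder here checks the type byte itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE				200		/* sketch-local GID buffer size */
#define SKETCH_IS_PREPARE			0x01
#define SKETCH_IS_COMMIT_PREPARED	0x02
#define SKETCH_IS_ROLLBACK_PREPARED	0x04
#define SKETCH_FIXED_LEN			26		/* type + flags + 3 * 8 bytes */

typedef struct SketchPrepareMsg
{
	uint8_t		flags;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepareMsg;

/* Exactly one of the three message kinds must be set. */
static int
sketch_flags_valid(uint8_t flags)
{
	return flags == SKETCH_IS_PREPARE ||
		flags == SKETCH_IS_COMMIT_PREPARED ||
		flags == SKETCH_IS_ROLLBACK_PREPARED;
}

/* Store a 64-bit value in network byte order. */
static void
sketch_put_u64(uint8_t *buf, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t) (v >> (56 - 8 * i));
}

/* Load a 64-bit value stored in network byte order. */
static uint64_t
sketch_get_u64(const uint8_t *buf)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Returns the number of bytes written, or 0 if the buffer is too small. */
static size_t
sketch_encode_prepare(uint8_t *buf, size_t cap, const SketchPrepareMsg *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	assert(sketch_flags_valid(m->flags));
	if (cap < SKETCH_FIXED_LEN + gidlen)
		return 0;

	buf[off++] = 'P';
	buf[off++] = m->flags;
	sketch_put_u64(buf + off, m->prepare_lsn);
	off += 8;
	sketch_put_u64(buf + off, m->end_lsn);
	off += 8;
	sketch_put_u64(buf + off, (uint64_t) m->prepare_time);
	off += 8;
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Returns 0 on success, -1 if the message is malformed. */
static int
sketch_decode_prepare(const uint8_t *buf, size_t len, SketchPrepareMsg *out)
{
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	if (len < SKETCH_FIXED_LEN + 1 || buf[0] != 'P')
		return -1;
	if (!sketch_flags_valid(buf[1]))
		return -1;

	/* The GID must be NUL-terminated within the message and fit the buffer. */
	gid = buf + SKETCH_FIXED_LEN;
	nul = memchr(gid, 0, len - SKETCH_FIXED_LEN);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= SKETCH_GIDSIZE)
		return -1;

	out->flags = buf[1];
	out->prepare_lsn = sketch_get_u64(buf + 2);
	out->end_lsn = sketch_get_u64(buf + 10);
	out->prepare_time = (int64_t) sketch_get_u64(buf + 18);
	memcpy(out->gid, gid, gidlen + 1);
	return 0;
}
```

Checking the GID length before copying is a design choice of the sketch: an over-long GID is reported as a malformed message, rather than being copied into the fixed-size buffer.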

Attachment: v20-0003-Support-2PC-test-cases-for-test_decoding.patch (application/octet-stream)
From b1b4162575e41160799f3c11f4ab0fecc127128b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 16 Nov 2020 01:56:25 -0500
Subject: [PATCH v20] Support 2PC test cases for test_decoding.

Add sql and tap tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                     |   4 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 ++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 contrib/test_decoding/t/001_twophase.pl            | 121 +++++++++++
 contrib/test_decoding/t/002_twophase-streaming.pl  | 102 +++++++++
 7 files changed, 813 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase-streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the XID of the 2PC to the plugin as an option. On
+# receipt of a valid "check-xid-aborted" value, the change callback in the
+# test_decoding plugin waits for that transaction to be aborted.
+#
+# We then fire off a ROLLBACK PREPARED from another session while the
+# decode is waiting.
+#
+# The status of the "check-xid-aborted" XID changes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+ok(looks_like_number($xid2pc), 'Got a valid two-phase XID');
+
+# start decoding the above, passing its XID via the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase-streaming.pl b/contrib/test_decoding/t/002_twophase-streaming.pl
new file mode 100644
index 0000000..4d64bc4
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase-streaming.pl
@@ -0,0 +1,102 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the XID of the 2PC to the plugin as an option. On
+# receipt of a valid "check-xid-aborted" value, the change callback in the
+# test_decoding plugin waits for that transaction to be aborted.
+#
+# We then fire off a ROLLBACK PREPARED from another session while the
+# decode is waiting.
+#
+# The status of the "check-xid-aborted" XID changes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+ok(looks_like_number($xid2pc), 'Got a valid two-phase XID');
+
+# start decoding the above, passing its XID via the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
-- 
1.8.3.1

Attachment: v20-0002-Support-2PC-txn-backend.patch (application/octet-stream)
From 8fd3922e038439fbc413829005fdf66af3687b4e Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 16 Nov 2020 01:34:17 -0500
Subject: [PATCH v20] Support 2PC txn backend.

Until now two-phase transactions were decoded at COMMIT PREPARED time,
just as regular transactions are decoded at COMMIT. During replay,
two-phase transactions were translated into regular transactions on
the subscriber, and the GID was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.
---
 src/backend/replication/logical/decode.c        | 213 +++++++++++++---
 src/backend/replication/logical/reorderbuffer.c | 319 ++++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  34 +++
 3 files changed, 495 insertions(+), 71 deletions(-)
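
Reviewer aid, not part of the patch: the decision this patch makes when it
sees a COMMIT PREPARED or ROLLBACK PREPARED record can be modelled in a few
lines of standalone C. This is only a sketch under stated assumptions. Every
name prefixed with sketch_ is hypothetical and does not exist in PostgreSQL,
and the filter rule (skip any GID containing "_nodecode") is taken from the
regression tests rather than from the plugin source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for an output plugin's prepare filter. Assumption:
 * as in the regression tests, it asks to skip any GID containing "_nodecode".
 */
bool
sketch_filter_prepare(const char *gid)
{
	return gid != NULL && strstr(gid, "_nodecode") != NULL;
}

/*
 * Models the already_decoded flag: true only for a COMMIT PREPARED or
 * ROLLBACK PREPARED record, with two-phase decoding enabled, whose
 * transaction was not filtered out at PREPARE time.
 */
bool
sketch_already_decoded(bool is_prepared_record, bool twophase,
					   bool has_filter_cb, const char *gid)
{
	if (!is_prepared_record || !twophase)
		return false;
	return !(has_filter_cb && sketch_filter_prepare(gid));
}

/* What a COMMIT PREPARED record emits under this model. */
const char *
sketch_commit_output(bool twophase, bool has_filter_cb, const char *gid)
{
	if (sketch_already_decoded(true, twophase, has_filter_cb, gid))
		return "COMMIT PREPARED";
	return "BEGIN, changes, COMMIT";
}

/* Usage: the GIDs mirror those used in the regression tests. */
void
sketch_self_check(void)
{
	assert(strcmp(sketch_commit_output(true, true, "test_prepared#1"),
				  "COMMIT PREPARED") == 0);
	assert(strcmp(sketch_commit_output(true, true, "test_prepared_nodecode"),
				  "BEGIN, changes, COMMIT") == 0);
}
```

Under this model a filtered GID falls back to the pre-patch behaviour: the
whole transaction is emitted as an ordinary BEGIN/COMMIT block once the
COMMIT PREPARED record is reached, which is what Test 7 in two_phase.sql
expects.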

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..1b65d4a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,9 +67,14 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool already_decoded);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool already_decoded);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -244,6 +249,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +259,19 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction data then
+				 * DecodeCommit doesn't need to decode it again. This is
+				 * possible iff output plugin supports two-phase commits and
+				 * doesn't skip the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+								ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeCommit(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +290,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction during prepare
+				 * then DecodeAbort needs to call rollback prepared.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+						ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeAbort(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -582,10 +629,14 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool already_decoded)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -609,8 +660,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * There can be several reasons we might not be interested in this
 	 * transaction:
 	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
 	 * 2) The transaction happened in another database.
 	 * 3) The output plugin is not interested in the origin.
 	 * 4) We are doing fast-forwarding
@@ -640,7 +691,79 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		return;
 	}
 
-	/* tell the reorderbuffer about the surviving subtransactions */
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (already_decoded)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* tell the reorderbuffer about the surviving subtransactions */
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr, buf->endptr);
+		}
+
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		commit_time = parsed->origin_timestamp;
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeCommit for
+	 * the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed, removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache invalidations
+	 * if there are any for the reasons mentioned in DecodeCommit.
+	 */
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
@@ -648,33 +771,67 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool already_decoded)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool	skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
 	}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	skip_xact = SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id);
+
+	/*
+	 * Send the final rollback record if the transaction data is already
+	 * decoded and we don't need to skip it, otherwise, perform the cleanup of
+	 * the transaction.
+	 */
+	if (already_decoded && !skip_xact)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
+
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c1bd680..ce0175f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -421,6 +422,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1514,12 +1521,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.
+ * The flag txn_prepared indicates if this is called after a PREPARE.
+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots. If after a PREPARE, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1538,7 +1547,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1572,9 +1581,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1768,9 +1801,24 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
+		 * just truncate txn by removing changes and tuple_cids
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1898,8 +1946,10 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/*
+	 * Discard the changes that we just streamed.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2006,7 +2056,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2297,7 +2347,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2331,18 +2390,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the 4 scenarios: 1. Prepare of a
+		 * two-phase commit. 2. Prepare of a two-phase commit and a part of
+		 * streaming in-progress txn. 3. Streaming of an in-progress txn. 4.
+		 * Commit of a transaction.
+		 *
+		 * Scenarios 1 and 2 are handled the same way: pass prepared as true
+		 * to ReorderBufferTruncateTXN, allowing a more thorough truncation of
+		 * txn data since the entire transaction has been decoded and only the
+		 * commit is pending. In scenario 3, pass prepared as false to
+		 * ReorderBufferTruncateTXN as the txn is not yet completely decoded.
+		 * In scenario 4, the whole txn has been decoded and we can fully
+		 * clean up the ReorderBufferTXN.
 		 */
-		if (streaming)
+		if (rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, true);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn, false);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2372,17 +2445,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2390,10 +2466,24 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream
+			 * remaining data. Streaming can also be on a prepared txn, handle
+			 * it the same way.
+			 */
+			if (streaming)
+			{
+				elog(LOG, "stopping decoding of %u", txn->xid);
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2415,23 +2505,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2473,6 +2556,120 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, two-phase transactions do not contain any reorderbuffers. So
+	 * allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2515,7 +2712,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
@@ -2604,6 +2806,37 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ * Note that this is only allowed to be called when a transaction prepare
+ * has just been read, not otherwise.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 66c89d1..13c802b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -234,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -614,12 +635,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -637,6 +664,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
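
For readers following the decode.c changes in the patch above, here is a minimal standalone sketch of the decision that DecodeXactOp makes for a prepared transaction: whether its changes are replayed at PREPARE time, which is the same condition that later sets already_decoded at COMMIT PREPARED or ROLLBACK PREPARED. It is not PostgreSQL code. SketchContext, sketch_filter_skip_prefix, sketch_decoded_at_prepare and sketch_finish_flags are hypothetical names invented for illustration; only the three RBTXN_* flag values are taken from the patch. The sketch also assumes the filter gives the same answer for a GID each time it is asked, which the patch appears to rely on since it consults the filter at PREPARE and again at COMMIT/ROLLBACK PREPARED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Flag bits with the values the patch adds to reorderbuffer.h. */
#define RBTXN_PREPARE             0x0080
#define RBTXN_COMMIT_PREPARED     0x0100
#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Hypothetical stand-in for the parts of the decoding context used here. */
typedef struct SketchContext
{
	bool		twophase;	/* output plugin supports two-phase decoding */
	bool		(*filter_prepare) (const char *gid);	/* optional, may be NULL */
} SketchContext;

/* Hypothetical filter: ask to skip any GID that starts with "skip_". */
static bool
sketch_filter_skip_prefix(const char *gid)
{
	return strncmp(gid, "skip_", 5) == 0;
}

/*
 * Returns true if the transaction's changes are replayed at PREPARE time.
 * That requires a two-phase capable plugin and a filter (if any) that does
 * not ask to skip this GID. When this returns false, the transaction is
 * replayed as a regular commit once COMMIT PREPARED is decoded.
 */
static bool
sketch_decoded_at_prepare(const SketchContext *ctx, const char *gid)
{
	assert(ctx != NULL && gid != NULL);

	if (!ctx->twophase)
		return false;

	return !(ctx->filter_prepare != NULL && ctx->filter_prepare(gid));
}

/*
 * Flags set on the transaction when COMMIT PREPARED (is_commit = true) or
 * ROLLBACK PREPARED (is_commit = false) arrives for an already decoded txn.
 */
static unsigned int
sketch_finish_flags(bool is_commit)
{
	return RBTXN_PREPARE |
		(is_commit ? RBTXN_COMMIT_PREPARED : RBTXN_ROLLBACK_PREPARED);
}
```

With a context of { true, sketch_filter_skip_prefix }, a GID such as 'test_prepared_tab_full' would be decoded at PREPARE, while a GID starting with 'skip_' would be deferred and sent as a plain commit later.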

Attachment: v20-0006-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From e84784a996fef27d34bad750475cc86eeda9a9c6 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 16 Nov 2020 02:06:39 -0500
Subject: [PATCH v20] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and not streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are rolled back and not present on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction.
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (the delete at step 3 does nothing
+# because that record was not yet committed, and so not visible, at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#106Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Ajin Cherian (#105)

On Mon, Nov 16, 2020 at 4:25 PM Ajin Cherian <itsajin@gmail.com> wrote:

Updated with a new test case
(contrib/test_decoding/t/002_twophase-streaming.pl) that tests
concurrent aborts during streaming prepare. Had to make a few changes
to the test_decoding stream_start callbacks to handle
"check-xid-aborted"
the same way it was handled in the non stream callbacks. Merged
Peter's v20-0006 as well.

Thank you for updating the patch.

I have a question about the timestamp of PREPARE on a subscriber node,
although this may have already been discussed.

With the current patch, the timestamps of PREPARE are different
between the publisher and the subscriber but the timestamps of their
commits are the same. For example,

-- There is 1 prepared transaction on a publisher node.
=# select * from pg_prepared_xacts;

 transaction | gid |           prepared            |  owner   | database
-------------+-----+-------------------------------+----------+----------
         510 | h1  | 2020-11-16 16:57:13.438633+09 | masahiko | postgres
(1 row)

-- This prepared transaction is replicated to a subscriber.
=# select * from pg_prepared_xacts;

 transaction | gid |           prepared            |  owner   | database
-------------+-----+-------------------------------+----------+----------
         514 | h1  | 2020-11-16 16:57:13.440593+09 | masahiko | postgres
(1 row)

These timestamps are different. Let's commit the prepared transaction
'h1' on the publisher and check the commit timestamps on both nodes.

-- On the publisher node.
=# select pg_xact_commit_timestamp('510'::xid);

pg_xact_commit_timestamp
-------------------------------
2020-11-16 16:57:13.474275+09
(1 row)

-- Commit prepared is also replicated to the subscriber node.
=# select pg_xact_commit_timestamp('514'::xid);

pg_xact_commit_timestamp
-------------------------------
2020-11-16 16:57:13.474275+09
(1 row)

The commit timestamps are the same. At PREPARE we use the local
timestamp when PREPARE is executed as the 'prepared' time, while at COMMIT
PREPARED we use the origin's commit timestamp as the commit timestamp
if the commit WAL record has one.

This behaviour made me think of a possibility: if the clock of the
publisher is behind, then on the subscriber node the timestamp of COMMIT
PREPARED (i.e., the return value from pg_xact_commit_timestamp())
could be smaller than the timestamp of PREPARE (i.e., 'prepared' in
pg_prepared_xacts). I think it would not be a critical issue but I
think it might be worth discussing the behaviour.
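
A minimal sketch of the scenario described above, assuming a simplified
model rather than PostgreSQL's actual code: timestamps are plain
microsecond counters, and TimestampUs, SubscriberView, apply_two_phase()
and commit_precedes_prepare() are hypothetical names used only for
illustration. The subscriber stamps PREPARE with its own clock and takes
the commit time from the publisher, so a lagging publisher clock can
produce a commit time earlier than the prepare time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Microseconds since an arbitrary epoch; stands in for TimestampTz. */
typedef int64_t TimestampUs;

/* Hypothetical model of what the subscriber records for one 2PC transaction. */
typedef struct SubscriberView
{
    TimestampUs prepared;   /* subscriber's own clock when PREPARE is applied */
    TimestampUs committed;  /* commit time carried over from the publisher */
} SubscriberView;

/* Build the subscriber-side view from the two clock readings. */
SubscriberView
apply_two_phase(TimestampUs publisher_commit_time,
                TimestampUs subscriber_clock_at_prepare)
{
    SubscriberView v;

    v.prepared = subscriber_clock_at_prepare;
    v.committed = publisher_commit_time;
    return v;
}

/* True when the recorded commit time is earlier than the prepare time. */
bool
commit_precedes_prepare(SubscriberView v)
{
    return v.committed < v.prepared;
}
```

With synchronized clocks the commit time follows the prepare time as
expected; if the publisher's clock runs several seconds behind,
commit_precedes_prepare() returns true for the same sequence of events.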

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#107Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#106)

On Mon, Nov 16, 2020 at 3:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 16, 2020 at 4:25 PM Ajin Cherian <itsajin@gmail.com> wrote:

Updated with a new test case
(contrib/test_decoding/t/002_twophase-streaming.pl) that tests
concurrent aborts during streaming prepare. Had to make a few changes
to the test_decoding stream_start callbacks to handle
"check-xid-aborted"
the same way it was handled in the non stream callbacks. Merged
Peter's v20-0006 as well.

Thank you for updating the patch.

I have a question about the timestamp of PREPARE on a subscriber node,
although this may have already been discussed.

With the current patch, the timestamps of PREPARE are different
between the publisher and the subscriber but the timestamps of their
commits are the same. For example,

-- There is 1 prepared transaction on a publisher node.
=# select * from pg_prepared_xacts;

 transaction | gid |           prepared            |  owner   | database
-------------+-----+-------------------------------+----------+----------
         510 | h1  | 2020-11-16 16:57:13.438633+09 | masahiko | postgres
(1 row)

-- This prepared transaction is replicated to a subscriber.
=# select * from pg_prepared_xacts;

 transaction | gid |           prepared            |  owner   | database
-------------+-----+-------------------------------+----------+----------
         514 | h1  | 2020-11-16 16:57:13.440593+09 | masahiko | postgres
(1 row)

These timestamps are different. Let's commit the prepared transaction
'h1' on the publisher and check the commit timestamps on both nodes.

-- On the publisher node.
=# select pg_xact_commit_timestamp('510'::xid);

pg_xact_commit_timestamp
-------------------------------
2020-11-16 16:57:13.474275+09
(1 row)

-- Commit prepared is also replicated to the subscriber node.
=# select pg_xact_commit_timestamp('514'::xid);

pg_xact_commit_timestamp
-------------------------------
2020-11-16 16:57:13.474275+09
(1 row)

The commit timestamps are the same. At PREPARE we use the local
timestamp when PREPARE is executed as the 'prepared' time, while at COMMIT
PREPARED we use the origin's commit timestamp as the commit timestamp
if the commit WAL record has one.

Doesn't this happen only if you set replication origins? Because
otherwise both PrepareTransaction() and
RecordTransactionCommitPrepared() use the current timestamp.

--
With Regards,
Amit Kapila.

#108Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#107)

On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Doesn't this happen only if you set replication origins? Because
otherwise both PrepareTransaction() and
RecordTransactionCommitPrepared() use the current timestamp.

I was also checking this. Even if you set replication origins, the
prepared transaction will reflect the local prepare time in
pg_prepared_xacts. pg_prepared_xacts fetches this information
from GlobalTransaction data which does not store the origin_timestamp;
it only stores the prepared_at which is the local timestamp.
The WAL record does have the origin_timestamp but that is not updated
in the GlobalTransaction data structure:

typedef struct xl_xact_prepare
{
    uint32 magic; /* format identifier */
    uint32 total_len; /* actual file length */
    TransactionId xid; /* original transaction XID */
    Oid database; /* OID of database it was in */
    TimestampTz prepared_at; /* time of preparation */ <=== this is local time and updated in GlobalTransaction
    Oid owner; /* user running the transaction */
    int32 nsubxacts; /* number of following subxact XIDs */
    int32 ncommitrels; /* number of delete-on-commit rels */
    int32 nabortrels; /* number of delete-on-abort rels */
    int32 ninvalmsgs; /* number of cache invalidation messages */
    bool initfileinval; /* does relcache init file need invalidation? */
    uint16 gidlen; /* length of the GID - GID follows the header */
    XLogRecPtr origin_lsn; /* lsn of this record at origin node */
    TimestampTz origin_timestamp; /* time of prepare at origin node */ <=== this is the time at origin which is not updated in GlobalTransaction
} xl_xact_prepare;
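
The point above can be illustrated with a minimal, self-contained sketch.
It is not PostgreSQL source: PrepareRecordSketch, GlobalTransactionSketch,
gxact_from_record() and reported_prepared_time() are hypothetical names
that only model the behavior described here, assuming the in-memory entry
keeps prepared_at and nothing else from the record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t TimestampTz;

/* Hypothetical, reduced stand-in for the prepare WAL record header. */
typedef struct
{
	TimestampTz prepared_at;		/* local time of PREPARE */
	TimestampTz origin_timestamp;	/* time of PREPARE at the origin node */
} PrepareRecordSketch;

/* Hypothetical, reduced stand-in for the in-memory GlobalTransaction. */
typedef struct
{
	TimestampTz prepared_at;		/* the only timestamp kept in memory */
} GlobalTransactionSketch;

/* Build the in-memory entry; origin_timestamp is deliberately not copied. */
GlobalTransactionSketch
gxact_from_record(const PrepareRecordSketch *rec)
{
	GlobalTransactionSketch gxact;

	assert(rec != NULL);
	gxact.prepared_at = rec->prepared_at;
	return gxact;
}

/* What a pg_prepared_xacts-like view would report as "prepared". */
TimestampTz
reported_prepared_time(const GlobalTransactionSketch *gxact)
{
	assert(gxact != NULL);
	return gxact->prepared_at;
}
```

With a record whose prepared_at is 1000 and origin_timestamp is 900, the
reported value in this model is 1000, i.e. the local time, regardless of
what the origin recorded.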

regards,
Ajin Cherian
Fujitsu Australia

#109Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#108)

On Tue, Nov 17, 2020 at 5:02 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Doesn't this happen only if you set replication origins? Because
otherwise both PrepareTransaction() and
RecordTransactionCommitPrepared() used the current timestamp.

I was also checking this, even if you set replicating origins, the
preparedTransaction will reflect the local prepare time in
pg_prepared_xacts. pg_prepared_xacts fetches this information
from GlobalTransaction data which does not store the origin_timestamp;
it only stores the prepared_at which is the local timestamp.

Sure, but my question was: does this difference in behavior happen
without replication origins in any way? The reason is that if it
occurs only with replication origins, I don't think we need to bother
about it because that feature is not properly implemented and
not used as-is. See the discussion [1] [2]. OTOH, if this behavior can
happen without replication origins then we might want to consider
changing it.

[1]: /messages/by-id/064fab0c-915e-aede-c02e-bd4ec1f59732@2ndquadrant.com
[2]: /messages/by-id/188d15be-8699-c045-486a-f0439c9c2b7d@2ndquadrant.com

--
With Regards,
Amit Kapila.

#110Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#109)

On Tue, Nov 17, 2020 at 9:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 17, 2020 at 5:02 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Doesn't this happen only if you set replication origins? Because
otherwise both PrepareTransaction() and
RecordTransactionCommitPrepared() used the current timestamp.

I was also checking this, even if you set replicating origins, the
preparedTransaction will reflect the local prepare time in
pg_prepared_xacts. pg_prepared_xacts fetches this information
from GlobalTransaction data which does not store the origin_timestamp;
it only stores the prepared_at which is the local timestamp.

Sure, but my question was does this difference in behavior happens
without replication origins in any way? The reason is that if it
occurs only with replication origins, I don't think we need to bother
about the same because that feature is not properly implemented and
not used as-is. See the discussion [1] [2]. OTOH, if this behavior can
happen without replication origins then we might want to consider
changing it.

Logical replication workers always have replication origins, right? Is
that what you meant by 'with replication origins'?

IIUC logical replication workers always set the origin's commit
timestamp as the commit timestamp of the replicated transaction. OTOH,
the timestamp of PREPARE, ‘prepared’ in pg_prepared_xacts, always uses
the local timestamp even if the caller of PrepareTransaction() sets
replorigin_session_origin_timestamp. In terms of user-visible
timestamps of transaction operations, I think users might expect these
timestamps to match between the origin and its subscribers. But
pg_xact_commit_timestamp() is a function of the commit timestamp
feature, whereas ‘prepared’ is simply the time at which the transaction
was prepared. So I’m not sure these timestamps really need to match.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#111Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#110)

On Wed, Nov 18, 2020 at 7:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Nov 17, 2020 at 9:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 17, 2020 at 5:02 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Doesn't this happen only if you set replication origins? Because
otherwise both PrepareTransaction() and
RecordTransactionCommitPrepared() used the current timestamp.

I was also checking this, even if you set replicating origins, the
preparedTransaction will reflect the local prepare time in
pg_prepared_xacts. pg_prepared_xacts fetches this information
from GlobalTransaction data which does not store the origin_timestamp;
it only stores the prepared_at which is the local timestamp.

Sure, but my question was does this difference in behavior happens
without replication origins in any way? The reason is that if it
occurs only with replication origins, I don't think we need to bother
about the same because that feature is not properly implemented and
not used as-is. See the discussion [1] [2]. OTOH, if this behavior can
happen without replication origins then we might want to consider
changing it.

Logical replication workers always have replication origins, right? Is
that what you meant 'with replication origins'?

I was thinking with respect to the publisher side, but you are right
that logical apply workers always have replication origins, so the
effect will be visible there. However, I think the same is true on the
publisher even without this patch. Say the user has set up a
replication origin via pg_replication_origin_xact_setup and provided a
timestamp value; then the same behavior will be there as well.

IIUC logical replication workers always set the origin's commit
timestamp as the commit timestamp of the replicated transaction. OTOH,
the timestamp of PREPARE, ‘prepare’ of pg_prepared_xacts, always uses
the local timestamp even if the caller of PrepareTransaction() sets
replorigin_session_origin_timestamp. In terms of user-visible
timestamps of transaction operations, I think users might expect these
timestamps are matched between the origin and its subscribers. But the
pg_xact_commit_timestamp() is a function of the commit timestamp
feature whereas ‘prepare’ is a pure timestamp when the transaction is
prepared. So I’m not sure these timestamps really need to be matched,
though.

Yeah, I am not sure it is a good idea for users to rely on this,
especially if the same behavior is visible on the publisher as well.
We might want to think separately about whether there is value in
making the prepare time also rely on replorigin_session_origin_timestamp,
and if so, that can be done as a separate patch. What do you think?
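
To make the idea concrete, here is a rough, self-contained sketch of the
two choices. It is only a model, not a patch: OriginSessionSketch,
current_prepared_at() and proposed_prepared_at() are hypothetical names,
and treating a zero origin timestamp as "not supplied" is an assumption
of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t TimestampTz;

/* Hypothetical stand-in for the session's replication origin state. */
typedef struct
{
	bool		has_origin;			/* is a replication origin set up? */
	TimestampTz origin_timestamp;	/* 0 means "not supplied" (assumption) */
} OriginSessionSketch;

/* Behavior discussed in this thread: prepare time is always local. */
TimestampTz
current_prepared_at(const OriginSessionSketch *session, TimestampTz now)
{
	assert(session != NULL);
	return now;
}

/* Possible alternative: prefer the origin's timestamp when one is set. */
TimestampTz
proposed_prepared_at(const OriginSessionSketch *session, TimestampTz now)
{
	assert(session != NULL);
	if (session->has_origin && session->origin_timestamp != 0)
		return session->origin_timestamp;
	return now;
}
```

For a session with an origin timestamp of 900 and a local time of 1000,
the first function yields 1000 and the second 900; without an origin,
both fall back to the local time.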

--
With Regards,
Amit Kapila.

#112Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#105)
6 attachment(s)

On Mon, Nov 16, 2020 at 12:55 PM Ajin Cherian <itsajin@gmail.com> wrote:

Updated with a new test case
(contrib/test_decoding/t/002_twophase-streaming.pl) that tests
concurrent aborts during streaming prepare. Had to make a few changes
to the test_decoding stream_start callbacks to handle
"check-xid-aborted"
the same way it was handled in the non stream callbacks.

Why did you make the change in the stream_start API? I think it should
be in the *_change and *_truncate APIs, because a concurrent abort can
happen while decoding any intermediate change. If you agree, you could
move that code into a separate function and call it from the
respective APIs.
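
Roughly, the refactoring could look like the following self-contained
sketch. It is not test_decoding code: FakeXact and fake_xid_in_progress()
are hypothetical stand-ins for the server's transaction status checks,
and wait_for_xid_abort(), change_cb_sketch() and truncate_cb_sketch() are
illustrative names. The real loop would also check for interrupts and
sleep between polls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical stand-in for the server's view of one transaction. */
typedef struct
{
	TransactionId xid;
	int			polls_until_done;	/* status checks before it finishes */
	bool		will_commit;		/* final outcome once it finishes */
} FakeXact;

/* Stand-in for an "is in progress" check; each call consumes one poll. */
bool
fake_xid_in_progress(FakeXact *xact)
{
	assert(xact != NULL);
	if (xact->polls_until_done > 0)
	{
		xact->polls_until_done--;
		return true;
	}
	return false;
}

/*
 * Shared helper: if check_xid is valid, wait until that transaction is no
 * longer in progress and report whether it ended in an abort.
 */
bool
wait_for_xid_abort(TransactionId check_xid, FakeXact *xact)
{
	if (check_xid == InvalidTransactionId)
		return false;			/* option not supplied, nothing to wait for */

	assert(xact != NULL && xact->xid == check_xid);
	while (fake_xid_in_progress(xact))
		;						/* real code would check interrupts and sleep */

	return !xact->will_commit;
}

/* Each per-change callback calls the helper instead of repeating the loop. */
bool
change_cb_sketch(TransactionId check_xid, FakeXact *xact)
{
	return wait_for_xid_abort(check_xid, xact);
}

bool
truncate_cb_sketch(TransactionId check_xid, FakeXact *xact)
{
	return wait_for_xid_abort(check_xid, xact);
}
```

Keeping the wait in one helper means the change and truncate paths (and
their streaming variants) cannot drift apart, which is the main reason
for suggesting it.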

In 0003,
contrib/test_decoding/t/002_twophase-streaming.pl | 102 +++++++++

The naming of the file seems to be inconsistent with other files. It
should be 002_twophase_streaming.pl

Other than this, please find attached the rebased patch set. It needed
a rebase after the latest commit 9653f24ad8307f393de51e0a64d9b10a49efa6e3.

--
With Regards,
Amit Kapila.

Attachments:

v21-0001-Support-2PC-txn-base.patch (application/octet-stream)
From 36fe22e5fd42a2fecc1ef25026203bfac40e2cf9 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 17 Nov 2020 18:10:25 +0530
Subject: [PATCH v21 1/6] Support 2PC txn base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

Includes logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 212 ++++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 146 +++++++++++++++++-
 src/backend/replication/logical/logical.c | 242 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  35 +++++
 src/tools/pgindent/typedefs.list          |  11 ++
 7 files changed, 690 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278b..23fe7a2 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -35,6 +39,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -87,6 +92,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -102,6 +110,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -126,10 +146,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -141,6 +166,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -150,6 +176,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -241,6 +268,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+
+				errno = 0;
+				data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +308,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +377,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate a
+ * simple logic by checking the GID. If the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -480,6 +624,26 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/*
+	 * if check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -642,6 +806,27 @@ static void
 pg_output_stream_start(LogicalDecodingContext *ctx, TestDecodingData *data, ReorderBufferTXN *txn, bool last_write)
 {
 	OutputPluginPrepareWrite(ctx, last_write);
+
+	/*
+	 * if check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+			!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
 	else
@@ -702,6 +887,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..f5b617d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     them are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,56 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a transaction commit prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a transaction rollback prepared has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +657,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but yet
+      uncommitted) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +740,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decode
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +794,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, in case of decoding a prepared (but yet uncommitted)
+      transaction or decoding of an uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +850,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1041,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, commit prepared using the
+    <function>commit_prepared_cb</function> callback or aborted using the
+    <function>rollback_prepared_cb</function>.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f1f4df7..44c6324 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -227,6 +241,21 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require
+	 * prepare/commit-prepare/abort-prepare callbacks. The filter-prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +266,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +813,129 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of rollback record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then rollback prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1012,52 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1256,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..032e35a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/rollback_prepared callbacks or wait till COMMIT PREPARED
+ * and be sent as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7e..9b8eced 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -244,6 +245,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -410,6 +414,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -436,6 +460,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -502,6 +532,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -510,6 +544,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fde701b..f4d4703 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1316,9 +1316,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1
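
To make the enablement rule from StartupDecodingContext() easier to follow, here is a minimal standalone sketch. It is not code from the patch: DemoCallbacks, demo_noop(), demo_twophase_enabled() and demo_missing_mandatory() are hypothetical stand-ins for the real OutputPluginCallbacks handling, and error reporting is reduced to a count. The sketch assumes only what the patch shows: two-phase decoding is switched on as soon as any one of the five callbacks is set, and the three mandatory ones are checked later in the wrappers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a plugin callback pointer; not the real API. */
typedef void (*DemoCallback) (void);

/* Hypothetical subset of OutputPluginCallbacks touched by the patch. */
typedef struct DemoCallbacks
{
	DemoCallback prepare_cb;
	DemoCallback commit_prepared_cb;
	DemoCallback rollback_prepared_cb;
	DemoCallback stream_prepare_cb;
	DemoCallback filter_prepare_cb;
} DemoCallbacks;

/* A do-nothing callback, used only to mark a slot as "registered". */
void
demo_noop(void)
{
}

/* Mirrors the ctx->twophase computation: any single callback enables it. */
bool
demo_twophase_enabled(const DemoCallbacks *cb)
{
	assert(cb != NULL);

	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/*
 * Counts the mandatory callbacks that are missing. The real wrappers raise
 * an ERROR for each one lazily, when it is first needed; this sketch only
 * counts them. stream_prepare_cb is left out because it is required only
 * when streaming is enabled as well, and filter_prepare_cb is optional.
 */
int
demo_missing_mandatory(const DemoCallbacks *cb)
{
	int			missing = 0;

	assert(cb != NULL);

	if (cb->prepare_cb == NULL)
		missing++;
	if (cb->commit_prepared_cb == NULL)
		missing++;
	if (cb->rollback_prepared_cb == NULL)
		missing++;

	return missing;
}
```

A plugin that registers only filter_prepare_cb therefore ends up with two-phase decoding enabled and three mandatory callbacks missing, which is the case the "easily identify missing methods" comment appears to be aiming at.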

v21-0002-Support-2PC-txn-backend.patch (application/octet-stream)
From d66311703d637137ae25fdd5f5bbeb111025362c Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 17 Nov 2020 18:11:39 +0530
Subject: [PATCH v21 2/6] Support 2PC txn backend.

Until now two-phase transactions were decoded at COMMIT PREPARED, just
like regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.
---
 src/backend/replication/logical/decode.c        | 213 +++++++++++++---
 src/backend/replication/logical/reorderbuffer.c | 319 ++++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  34 +++
 3 files changed, 495 insertions(+), 71 deletions(-)
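
For reviewers, the routing that DecodeXactOp() ends up doing for the three two-phase record types can be summarised in a small standalone sketch. This is only an illustration under stated assumptions: DemoRecord, DemoAction, demo_decoded_at_prepare() and demo_decide() are hypothetical names, not part of the patch, and the usual skip checks (snapshot not yet consistent, other database, origin filtering, fast-forward) are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the WAL record types handled by the patch. */
typedef enum DemoRecord
{
	DEMO_PREPARE,
	DEMO_COMMIT_PREPARED,
	DEMO_ROLLBACK_PREPARED
} DemoRecord;

/* Hypothetical labels for what the decoder does with the transaction. */
typedef enum DemoAction
{
	DEMO_TRACK_XID_ONLY,			/* ReorderBufferProcessXid() */
	DEMO_DECODE_AT_PREPARE,			/* DecodePrepare() */
	DEMO_FINISH_PREPARED_COMMIT,	/* ReorderBufferFinishPrepared(..., true) */
	DEMO_REPLAY_AT_COMMIT,			/* ReorderBufferCommit() */
	DEMO_FINISH_PREPARED_ROLLBACK,	/* ReorderBufferFinishPrepared(..., false) */
	DEMO_DISCARD_CHANGES			/* ReorderBufferAbort() */
} DemoAction;

/*
 * A transaction is decoded at PREPARE time only if two-phase decoding is
 * enabled and the optional filter callback does not ask to skip it.
 */
bool
demo_decoded_at_prepare(bool twophase, bool has_filter, bool filter_skips)
{
	return twophase && !(has_filter && filter_skips);
}

/* Pick the action for one record, ignoring the generic skip checks. */
DemoAction
demo_decide(DemoRecord rec, bool twophase, bool has_filter, bool filter_skips)
{
	bool		decoded = demo_decoded_at_prepare(twophase, has_filter,
												  filter_skips);

	switch (rec)
	{
		case DEMO_PREPARE:
			return decoded ? DEMO_DECODE_AT_PREPARE : DEMO_TRACK_XID_ONLY;
		case DEMO_COMMIT_PREPARED:
			return decoded ? DEMO_FINISH_PREPARED_COMMIT : DEMO_REPLAY_AT_COMMIT;
		case DEMO_ROLLBACK_PREPARED:
			return decoded ? DEMO_FINISH_PREPARED_ROLLBACK : DEMO_DISCARD_CHANGES;
	}

	/* Not reached for valid input. */
	assert(false);
	return DEMO_DISCARD_CHANGES;
}
```

The point of the sketch is that the same predicate is evaluated at PREPARE and again at COMMIT PREPARED or ROLLBACK PREPARED, so the filter callback has to give a stable answer for a given xid and GID.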

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..1b65d4a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,9 +67,14 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool already_decoded);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool already_decoded);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -244,6 +249,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +259,19 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction data then
+				 * DecodeCommit doesn't need to decode it again. This is
+				 * possible iff output plugin supports two-phase commits and
+				 * doesn't skip the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+								ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeCommit(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +290,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction during prepare
+				 * then DecodeAbort needs to call rollback prepared.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+						ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeAbort(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -582,10 +629,14 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool already_decoded)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -609,8 +660,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * There can be several reasons we might not be interested in this
 	 * transaction:
 	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
 	 * 2) The transaction happened in another database.
 	 * 3) The output plugin is not interested in the origin.
 	 * 4) We are doing fast-forwarding
@@ -640,7 +691,79 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		return;
 	}
 
-	/* tell the reorderbuffer about the surviving subtransactions */
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (already_decoded)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* tell the reorderbuffer about the surviving subtransactions */
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr, buf->endptr);
+		}
+
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		commit_time = parsed->origin_timestamp;
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeCommit for
+	 * the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed, removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache invalidations
+	 * if there are any for the reasons mentioned in DecodeCommit.
+	 */
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
@@ -648,33 +771,67 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool already_decoded)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool		skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
 	}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	skip_xact = SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id);
+
+	/*
+	 * Send the final rollback record if the transaction data is already
+	 * decoded and we don't need to skip it, otherwise, perform the cleanup of
+	 * the transaction.
+	 */
+	if (already_decoded && !skip_xact)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
+
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 301baff..916ea02 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or after a PREPARE. The flag txn_prepared indicates whether this
+ * is called after a PREPARE. If streaming, keep the remaining info:
+ * transactions, tuplecids, invalidations and snapshots. If after a PREPARE,
+ * keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1548,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1582,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1769,9 +1802,24 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
+		 * just truncate txn by removing changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1899,8 +1947,10 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/*
+	 * Discard the changes that we just streamed.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2007,7 +2057,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2298,7 +2348,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2332,18 +2391,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the 4 scenarios: 1. Prepare of a
+		 * two-phase commit. 2. Prepare of a two-phase commit and a part of
+		 * streaming in-progress txn. 3. Streaming of an in-progress txn. 4.
+		 * Commit of a transaction.
+		 *
+		 * Scenario 1 and 2, we handle the same way, pass in prepared as true
+		 * to ReorderBufferTruncateTXN and allow more elaborate truncation of
+		 * txn data as the entire transaction has been decoded, only commit is
+		 * pending. Scenario 3, we pass in prepared as false to
+		 * ReorderBufferTruncateTXN as the txn is not yet completely decoded.
+		 * Scenario 4, all txn has been decoded and we can fully cleanup the
+		 * TXN reorder buffer.
 		 */
-		if (streaming)
+		if (rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, true);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn, false);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2373,17 +2446,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2391,10 +2467,24 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream
+			 * remaining data. Streaming can also be on a prepared txn, handle
+			 * it the same way.
+			 */
+			if (streaming)
+			{
+				elog(LOG, "stopping decoding of %u", txn->xid);
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2416,23 +2506,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2474,6 +2557,120 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, two-phase transactions do not contain any reorderbuffers. So
+	 * allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2516,7 +2713,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
@@ -2605,6 +2807,37 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ * Note that this is only allowed to be called when a transaction prepare
+ * has just been read, not otherwise.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9b8eced..a6b43b6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -234,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -619,12 +640,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -642,6 +669,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
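
For readers following the cleanup logic at the end of ReorderBufferProcessTXN in the patch above, here is a minimal standalone sketch of how the new RBTXN_PREPARE flag selects between the three cleanup paths. The flag values are copied from the reorderbuffer.h hunk; SketchTXN, SketchCleanup, sketch_prepared() and sketch_choose_cleanup() are hypothetical names invented for this illustration and are not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the reorderbuffer.h hunk above. */
#define RBTXN_PREPARE             0x0080
#define RBTXN_COMMIT_PREPARED     0x0100
#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Hypothetical stand-in for ReorderBufferTXN; only the flags matter here. */
typedef struct SketchTXN
{
	uint32_t	txn_flags;
} SketchTXN;

/* Hypothetical labels for the three cleanup paths taken in the patch. */
typedef enum SketchCleanup
{
	SKETCH_TRUNCATE_PREPARED,	/* ReorderBufferTruncateTXN(rb, txn, true) */
	SKETCH_TRUNCATE_STREAMED,	/* ReorderBufferTruncateTXN(rb, txn, false) */
	SKETCH_FULL_CLEANUP			/* ReorderBufferCleanupTXN(rb, txn) */
} SketchCleanup;

/* Same test as the rbtxn_prepared() macro, written as a function. */
static bool
sketch_prepared(const SketchTXN *txn)
{
	return (txn->txn_flags & RBTXN_PREPARE) != 0;
}

/*
 * Mirror the if / else if / else chain: a prepared transaction wins over
 * streaming, and only a plain commit gets the full cleanup.
 */
static SketchCleanup
sketch_choose_cleanup(const SketchTXN *txn, bool streaming)
{
	if (sketch_prepared(txn))
		return SKETCH_TRUNCATE_PREPARED;
	if (streaming)
		return SKETCH_TRUNCATE_STREAMED;
	return SKETCH_FULL_CLEANUP;
}
```

Under these assumptions, a transaction with RBTXN_PREPARE set maps to the "prepared" truncation whether or not it is also being streamed, which is how the comment in the patch treats scenarios 1 and 2 identically.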

Attachment: v21-0003-Support-2PC-test-cases-for-test_decoding.patch (application/octet-stream)
From c2380e148b9c306bbe14eaeabafd5613c16bd10c Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 17 Nov 2020 18:13:42 +0530
Subject: [PATCH v21 3/6] Support 2PC test cases for test_decoding.

Add sql and tap tests to test_decoding for 2PC.
---
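
The two_phase tests added by this patch include a case where a GID containing "_nodecode" is not decoded at PREPARE time and is instead sent as a regular transaction at COMMIT PREPARED. A rough sketch of the kind of check a filter callback could perform is shown here. The function name sketch_filter_prepare() and the plain substring match are assumptions made for illustration; the plugin's real callback is not part of this excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical filter: return true when decoding at PREPARE time should be
 * skipped. Assumption: a substring match on "_nodecode" is sufficient.
 */
static bool
sketch_filter_prepare(const char *gid)
{
	return gid != NULL && strstr(gid, "_nodecode") != NULL;
}
```

With this sketch, 'test_prepared_nodecode' would be skipped at PREPARE time, while 'test_prepared#1' would be decoded as a two-phase transaction.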
 contrib/test_decoding/Makefile                     |   4 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 ++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 contrib/test_decoding/t/001_twophase.pl            | 121 +++++++++++
 contrib/test_decoding/t/002_twophase-streaming.pl  | 102 +++++++++
 7 files changed, 813 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase-streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the xid of the 2PC to the plugin as an option. On
+# receipt of a valid "check-xid-aborted", the change callback in the
+# test_decoding plugin waits for that xid to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while this decode
+# is waiting.
+#
+# The status of the "check-xid-aborted" xid then changes from in-progress
+# to not-committed (hence aborted), and decoding stops because the
+# subsequent system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+ok(looks_like_number($xid2pc), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the xid via "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now; it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase-streaming.pl b/contrib/test_decoding/t/002_twophase-streaming.pl
new file mode 100644
index 0000000..4d64bc4
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase-streaming.pl
@@ -0,0 +1,102 @@
+# logical decoding of 2PC test, with streaming enabled
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the xid of the 2PC to the plugin as an option. On
+# receipt of a valid "check-xid-aborted", the change callback in the
+# test_decoding plugin waits for that xid to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while this decode
+# is waiting.
+#
+# The status of the "check-xid-aborted" xid then changes from in-progress
+# to not-committed (hence aborted), and decoding stops because the
+# subsequent system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+    PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+ok(looks_like_number($xid2pc), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the xid via "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now; it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
-- 
1.8.3.1

v21-0004-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From 1eae10e6d1acd279adc22f7bdeb83a3657eae8fc Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 17 Nov 2020 18:15:36 +0530
Subject: [PATCH v21 4/6] Support 2PC txn - spoolfile.

This patch is a pure refactoring: it moves the streaming spool-file processing
into a separate function, apply_spooled_messages(). The two-phase commit logic
added later needs to call this common processing from multiple places.
---
 src/backend/replication/logical/worker.c | 58 +++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 20 deletions(-)
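
The shape of this refactoring can be sketched outside the server as follows. This is only an illustration under stated assumptions, not the worker.c code: the in-memory spool array, the xid_t/lsn_t typedefs, and the names replay_spooled(), handle_stream_commit() and handle_stream_prepare() are all hypothetical stand-ins. The point it shows is the one from the commit message: one helper that replays a transaction's spooled changes and returns the count, shared by more than one caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t xid_t;  /* stand-in for TransactionId */
typedef uint64_t lsn_t;  /* stand-in for XLogRecPtr */

/* Hypothetical in-memory "spool": one entry per streamed change. */
typedef struct
{
    xid_t       xid;
    const char *change;
} spool_entry;

static const spool_entry spool[] = {
    {700, "INSERT 1"},
    {700, "INSERT 2"},
    {701, "INSERT 3"},
};

static lsn_t final_lsn;  /* plays the role of remote_final_lsn */

/* Common processing: replay all spooled changes of xid, return the count. */
static int
replay_spooled(xid_t xid, lsn_t lsn)
{
    int nchanges = 0;

    assert(xid != 0);    /* 0 plays the role of InvalidTransactionId */
    final_lsn = lsn;

    for (size_t i = 0; i < sizeof(spool) / sizeof(spool[0]); i++)
    {
        if (spool[i].xid == xid)
            nchanges++;  /* a real worker would dispatch the change here */
    }
    return nchanges;
}

/* Caller 1: commit of a streamed transaction. */
static int
handle_stream_commit(xid_t xid, lsn_t commit_lsn)
{
    return replay_spooled(xid, commit_lsn);
}

/* Caller 2 (hypothetical): prepare of a streamed transaction. */
static int
handle_stream_prepare(xid_t xid, lsn_t prepare_lsn)
{
    return replay_spooled(xid, prepare_lsn);
}
```

Each caller keeps its own bookkeeping (origin LSN, commit or prepare of the local transaction) and delegates only the replay loop, which is why the helper takes the xid and the final LSN as plain arguments.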

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0468491..9fa816c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -933,30 +935,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +957,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +972,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1041,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1077,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
-- 
1.8.3.1

v21-0005-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 47709db8af4896722bcfba4800bcf9c301bd1a57 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 17 Nov 2020 18:16:29 +0530
Subject: [PATCH v21 5/6] Support 2PC txn - pgoutput.

This patch adds handling of two-phase commits to the pgoutput plugin and to
the subscriber. It contains both the pgoutput changes and the subscriber
changes.
---
 src/backend/access/transam/twophase.c       |  33 +++-
 src/backend/replication/logical/proto.c     | 141 ++++++++++++++++-
 src/backend/replication/logical/worker.c    | 236 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c |  74 +++++++++
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++-
 src/tools/pgindent/typedefs.list            |   1 +
 7 files changed, 518 insertions(+), 5 deletions(-)
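
For readers following the proto.c part, the PREPARE message this patch writes is laid out as: one message-type byte, one flags byte, three 64-bit fields in network byte order (prepare LSN, end LSN, timestamp), then the NUL-terminated GID. Below is a standalone sketch of that layout. It is an assumption-laden illustration, not the patch's code: MSG_PREPARE, FLAG_PREPARE, GID_MAX, prepare_msg, encode_prepare() and decode_prepare() are hypothetical names and values, and the sketch works on a plain byte buffer instead of StringInfo and the pq_* routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MSG_PREPARE  'P'   /* hypothetical message-type byte */
#define FLAG_PREPARE 0x01  /* hypothetical flag value */
#define GID_MAX      200   /* hypothetical size of the GID buffer */

typedef struct
{
    uint8_t  flags;
    uint64_t prepare_lsn;
    uint64_t end_lsn;
    int64_t  prepare_time;
    char     gid[GID_MAX];
} prepare_msg;

/* Store a 64-bit value in network (big-endian) byte order. */
static void
put_u64(uint8_t *buf, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        buf[i] = (uint8_t) (v >> (56 - 8 * i));
}

/* Load a 64-bit value stored in network byte order. */
static uint64_t
get_u64(const uint8_t *buf)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | buf[i];
    return v;
}

/* Encode m into buf; returns bytes written, or 0 if it does not fit. */
static size_t
encode_prepare(uint8_t *buf, size_t cap, const prepare_msg *m)
{
    size_t gidlen = strlen(m->gid) + 1;     /* include the NUL */
    size_t off = 0;

    if (1 + 1 + 24 + gidlen > cap)
        return 0;
    buf[off++] = MSG_PREPARE;
    buf[off++] = m->flags;
    put_u64(buf + off, m->prepare_lsn);              off += 8;
    put_u64(buf + off, m->end_lsn);                  off += 8;
    put_u64(buf + off, (uint64_t) m->prepare_time);  off += 8;
    memcpy(buf + off, m->gid, gidlen);
    return off + gidlen;
}

/* Decode buf into m; returns 1 on success, 0 on malformed input. */
static int
decode_prepare(const uint8_t *buf, size_t len, prepare_msg *m)
{
    size_t         off = 0;
    size_t         gidlen;
    const uint8_t *nul;

    if (len < 1 + 1 + 24 + 1 || buf[off++] != MSG_PREPARE)
        return 0;
    m->flags = buf[off++];
    m->prepare_lsn = get_u64(buf + off);             off += 8;
    m->end_lsn = get_u64(buf + off);                 off += 8;
    m->prepare_time = (int64_t) get_u64(buf + off);  off += 8;

    /* The GID must be NUL-terminated and fit the fixed-size buffer. */
    nul = memchr(buf + off, '\0', len - off);
    if (nul == NULL)
        return 0;
    gidlen = (size_t) (nul - (buf + off)) + 1;
    if (gidlen > sizeof(m->gid))
        return 0;
    memcpy(m->gid, buf + off, gidlen);
    return 1;
}
```

Unlike the patch's read function, which is handed the message after the type byte has been consumed, this decoder checks the type byte itself so the sketch stays self-contained. It also bounds the GID copy explicitly, which is a design choice of the sketch rather than something the patch is known to do.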

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9b..00b4497 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check whether a valid prepared transaction with the given GID exists.
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..cfb94d1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK]
+	 * PREPARED uses non-streaming APIs
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9fa816c..f1e94ad 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -742,6 +742,234 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * A prepared transaction may be aborted on the publisher while it is
+	 * still being decoded. Decoding then stops and the transaction is
+	 * aborted immediately, so the PREPARE may never reach the subscriber.
+	 * The ROLLBACK PREPARED still arrives, though, so it is fine for the
+	 * GID to be missing here.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare_txn (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1969,6 +2197,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..71ac431 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +392,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +913,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..0691fc5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -54,10 +54,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +116,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +124,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +153,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +200,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f4d4703..4546572 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,6 +1342,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1
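
For readers following the protocol change in the patch above, here is a minimal standalone sketch of the PREPARE message layout as logicalrep_write_prepare() appears to emit it: a 'P' type byte, one flags byte, three 64-bit fields in network byte order (prepare LSN, end LSN, timestamp), then the NUL-terminated GID. Everything prefixed demo_ or DEMO_ is hypothetical and written for illustration only; it does not use the pq_* or StringInfo APIs, DEMO_GIDSIZE is an assumed stand-in for GIDSIZE, and unlike the patch (where apply_dispatch consumes the type byte first) the reader here checks the type byte itself. The reader also refuses a GID that would not fit the fixed buffer, which is the case an unbounded strcpy() would not catch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's constants (assumed values). */
#define DEMO_GIDSIZE              200
#define DEMO_MSG_PREPARE          'P'
#define DEMO_IS_PREPARE           0x01
#define DEMO_IS_COMMIT_PREPARED   0x02
#define DEMO_IS_ROLLBACK_PREPARED 0x04

/* Mirrors the shape of LogicalRepPrepareData, with plain C types. */
typedef struct DemoPrepareData
{
	uint8_t		prepare_type;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		preparetime;
	char		gid[DEMO_GIDSIZE];
} DemoPrepareData;

/* Store a 64-bit value big-endian at buf[off]; return the new offset. */
static size_t
put_u64(uint8_t *buf, size_t off, uint64_t v)
{
	for (int i = 7; i >= 0; i--)
		buf[off++] = (uint8_t) (v >> (i * 8));
	return off;
}

/* Load a big-endian 64-bit value from buf[*off] and advance *off. */
static uint64_t
get_u64(const uint8_t *buf, size_t *off)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[(*off)++];
	return v;
}

/* Serialize d into buf; returns bytes written, or 0 if buf is too small. */
size_t
demo_write_prepare(uint8_t *buf, size_t cap, const DemoPrepareData *d)
{
	size_t		gidlen;
	size_t		off = 0;

	assert(buf != NULL && d != NULL);
	gidlen = strlen(d->gid) + 1;	/* include the terminating NUL */
	if (cap < 1 + 1 + 24 + gidlen)
		return 0;

	buf[off++] = DEMO_MSG_PREPARE;
	buf[off++] = d->prepare_type;
	off = put_u64(buf, off, d->prepare_lsn);
	off = put_u64(buf, off, d->end_lsn);
	off = put_u64(buf, off, (uint64_t) d->preparetime);
	memcpy(buf + off, d->gid, gidlen);
	return off + gidlen;
}

/* Parse a message into d; returns 0 on success, -1 on malformed input. */
int
demo_read_prepare(const uint8_t *buf, size_t len, DemoPrepareData *d)
{
	size_t		off = 0;
	size_t		gidlen;
	const uint8_t *nul;

	assert(buf != NULL && d != NULL);
	if (len < 1 + 1 + 24 + 1 || buf[off++] != DEMO_MSG_PREPARE)
		return -1;

	/* exactly one of the three flag values is acceptable */
	d->prepare_type = buf[off++];
	if (d->prepare_type != DEMO_IS_PREPARE &&
		d->prepare_type != DEMO_IS_COMMIT_PREPARED &&
		d->prepare_type != DEMO_IS_ROLLBACK_PREPARED)
		return -1;

	d->prepare_lsn = get_u64(buf, &off);
	d->end_lsn = get_u64(buf, &off);
	d->preparetime = (int64_t) get_u64(buf, &off);

	/* the GID must be NUL-terminated inside the message and fit the buffer */
	nul = memchr(buf + off, '\0', len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off));
	if (gidlen >= DEMO_GIDSIZE)
		return -1;
	memcpy(d->gid, buf + off, gidlen + 1);
	return 0;
}
```

A caller would fill a DemoPrepareData, pass it to demo_write_prepare() with a buffer of at least 26 + strlen(gid) + 1 bytes, and hand the result to demo_read_prepare(); a flags byte that is not exactly one of the three values is rejected, analogous to the PrepareFlagsAreValid() check in the patch.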

v21-0006-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From da9bfae90749deebbe1bd2ad5f16a3743ca66775 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 17 Nov 2020 18:17:47 +0530
Subject: [PATCH v21 6/6] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works when using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#113Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#109)

Hi.

Using a tablesync debugging technique as described in another mail
thread [1][2] I have caused the tablesync worker to handle (e.g.
apply_dispatch) a 2PC PREPARE.

This exposes a problem with the current 2PC logic because if/when the
PREPARE is processed by the tablesync worker then the txn will end up
being COMMITTED, even though the 2PC PREPARE has not yet been COMMIT
PREPARED by the publisher.
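
To make the failure mode concrete, here is a toy model in C. It is not PostgreSQL code, and every type and function name in it is hypothetical. It assumes only what is described above: the intended behaviour is that a PREPARE leaves the txn prepared (rows invisible) on the subscriber, whereas the tablesync worker currently ends up committing it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not PostgreSQL APIs. */
typedef enum { TXN_IN_PROGRESS, TXN_PREPARED, TXN_COMMITTED } ToyTxnState;
typedef enum { WORKER_APPLY, WORKER_TABLESYNC } ToyWorkerKind;

/* Subscriber-side txn state after the given worker handles a PREPARE. */
static ToyTxnState
toy_handle_prepare(ToyWorkerKind worker)
{
	assert(worker == WORKER_APPLY || worker == WORKER_TABLESYNC);

	/* Behaviour reported in this mail: the txn ends up committed. */
	if (worker == WORKER_TABLESYNC)
		return TXN_COMMITTED;

	/* Intended behaviour (assumption of this model): stay prepared. */
	return TXN_PREPARED;
}

/* Rows become visible only once the txn has committed. */
static bool
toy_rows_visible(ToyTxnState state)
{
	return state == TXN_COMMITTED;
}

/* True when the subscriber shows rows the publisher has not committed. */
static bool
toy_subscriber_diverged(ToyWorkerKind worker, ToyTxnState publisher_state)
{
	ToyTxnState sub_state = toy_handle_prepare(worker);

	return toy_rows_visible(sub_state) && !toy_rows_visible(publisher_state);
}
```

With the publisher still at TXN_PREPARED, toy_subscriber_diverged(WORKER_TABLESYNC, ...) is true, which is the "subscriber has ONE record, publisher has NONE" outcome reported here.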

For example, below is some logging (using my patch [2]) which shows
this occurring:

---

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE
SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub
application_name=tap_sub' PUBLICATION tap_pub;"
2020-11-18 17:00:37.394 AEDT [15885] LOG: logical decoding found
consistent point at 0/16EF840
2020-11-18 17:00:37.394 AEDT [15885] DETAIL: There are no running transactions.
2020-11-18 17:00:37.394 AEDT [15885] STATEMENT:
CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT
NOTICE: created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
2020-11-18 17:00:37.407 AEDT [15886] LOG: logical replication apply
worker for subscription "tap_sub" has started
2020-11-18 17:00:37.407 AEDT [15886] LOG: !!>> The apply worker
process has PID = 15886
2020-11-18 17:00:37.415 AEDT [15887] LOG: starting logical decoding
for slot "tap_sub"
2020-11-18 17:00:37.415 AEDT [15887] DETAIL: Streaming transactions
committing after 0/16EF878, reading WAL from 0/16EF840.
2020-11-18 17:00:37.415 AEDT [15887] STATEMENT: START_REPLICATION
SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names
'"tap_pub"')
2020-11-18 17:00:37.415 AEDT [15887] LOG: logical decoding found
consistent point at 0/16EF840
2020-11-18 17:00:37.415 AEDT [15887] DETAIL: There are no running transactions.
2020-11-18 17:00:37.415 AEDT [15887] STATEMENT: START_REPLICATION
SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names
'"tap_pub"')
2020-11-18 17:00:37.415 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:37.415 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:37.421 AEDT [15889] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2020-11-18 17:00:37.421 AEDT [15889] LOG: !!>> The tablesync worker
process has PID = 15889
2020-11-18 17:00:37.421 AEDT [15889] LOG: !!>>

Sleeping 30 secs. For debugging, attach to process 15889 now!

[postgres@CentOS7-x64 ~]$ 2020-11-18 17:00:38.431 AEDT [15886] LOG:
!!>> apply worker: LogicalRepApplyLoop
2020-11-18 17:00:38.431 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:39.433 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:39.433 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:40.437 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:40.437 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:41.439 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:41.439 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:42.441 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:42.441 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:43.442 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:43.442 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
-- etc.
2020-11-18 17:01:03.520 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:03.520 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:04.521 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:04.521 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:05.523 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:05.523 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:06.532 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:06.532 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:07.426 AEDT [15889] LOG: !!>> tablesync worker:
About to call LogicalRepSyncTableStart to do initial syncing
2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:08.539 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:08.539 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:09.541 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:09.541 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
-- etc.
2020-11-18 17:01:23.583 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:23.583 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:24.584 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:24.584 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:25.586 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:25.586 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:26.586 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:26.586 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:27.454 AEDT [17456] LOG: logical decoding found
consistent point at 0/16EF878
2020-11-18 17:01:27.454 AEDT [17456] DETAIL: There are no running transactions.
2020-11-18 17:01:27.454 AEDT [17456] STATEMENT:
CREATE_REPLICATION_SLOT "tap_sub_24582_sync_16385" TEMPORARY LOGICAL
pgoutput USE_SNAPSHOT
2020-11-18 17:01:27.456 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:27.457 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:27.465 AEDT [15889] LOG: !!>> tablesync worker: wait
for CATCHUP state notification
2020-11-18 17:01:27.465 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:27.465 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables

#### Here, while the tablesync worker is paused in the debugger I
execute the PREPARE txn on publisher

psql -d test_pub -c "BEGIN;INSERT INTO test_tab VALUES(1,
'foo');PREPARE TRANSACTION 'test_prepared_tab';"
PREPARE TRANSACTION

2020-11-18 17:01:54.732 AEDT [15887] LOG: !!>>
pgoutput_begin_txn
2020-11-18 17:01:54.732 AEDT [15887] CONTEXT: slot "tap_sub", output
plugin "pgoutput", in the begin callback, associated LSN 0/16EF8B0
2020-11-18 17:01:54.732 AEDT [15887] STATEMENT: START_REPLICATION
SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names
'"tap_pub"')

#### And then in the debugger I let the tablesync worker continue...

2020-11-18 17:02:02.788 AEDT [15889] LOG: !!>> tablesync worker:
received CATCHUP state notification
2020-11-18 17:02:07.729 AEDT [15889] LOG: !!>> tablesync worker:
Returned from LogicalRepSyncTableStart
2020-11-18 17:02:16.284 AEDT [17456] LOG: starting logical decoding
for slot "tap_sub_24582_sync_16385"
2020-11-18 17:02:16.284 AEDT [17456] DETAIL: Streaming transactions
committing after 0/16EF8B0, reading WAL from 0/16EF878.
2020-11-18 17:02:16.284 AEDT [17456] STATEMENT: START_REPLICATION
SLOT "tap_sub_24582_sync_16385" LOGICAL 0/16EF8B0 (proto_version '2',
publication_names '"tap_pub"')
2020-11-18 17:02:16.284 AEDT [17456] LOG: logical decoding found
consistent point at 0/16EF878
2020-11-18 17:02:16.284 AEDT [17456] DETAIL: There are no running transactions.
2020-11-18 17:02:16.284 AEDT [17456] STATEMENT: START_REPLICATION
SLOT "tap_sub_24582_sync_16385" LOGICAL 0/16EF8B0 (proto_version '2',
publication_names '"tap_pub"')
2020-11-18 17:02:16.284 AEDT [17456] LOG: !!>>
pgoutput_begin_txn
2020-11-18 17:02:16.284 AEDT [17456] CONTEXT: slot
"tap_sub_24582_sync_16385", output plugin "pgoutput", in the begin
callback, associated LSN 0/16EF8B0
2020-11-18 17:02:16.284 AEDT [17456] STATEMENT: START_REPLICATION
SLOT "tap_sub_24582_sync_16385" LOGICAL 0/16EF8B0 (proto_version '2',
publication_names '"tap_pub"')
2020-11-18 17:02:40.346 AEDT [15889] LOG: !!>> tablesync worker:
LogicalRepApplyLoop

#### The tablesync worker processes the replication messages....

2020-11-18 17:02:47.992 AEDT [15889] LOG: !!>> tablesync worker:
apply_dispatch for message kind 'B'
2020-11-18 17:02:54.858 AEDT [15889] LOG: !!>> tablesync worker:
apply_dispatch for message kind 'R'
2020-11-18 17:02:56.082 AEDT [15889] LOG: !!>> tablesync worker:
apply_dispatch for message kind 'I'
2020-11-18 17:02:56.082 AEDT [15889] LOG: !!>> tablesync worker:
should_apply_changes_for_rel: true
2020-11-18 17:02:57.354 AEDT [15889] LOG: !!>> tablesync worker:
apply_dispatch for message kind 'P'
2020-11-18 17:02:57.354 AEDT [15889] LOG: !!>> tablesync worker:
called process_syncing_tables
2020-11-18 17:02:59.011 AEDT [15889] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has finished

#### Since the tablesync was "ahead", the apply worker now needs to
skip those same messages
#### Notice should_apply_changes_for_rel() is false
#### Then apply worker just waits for next messages....

2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker:
apply_dispatch for message kind 'B'
2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker:
apply_dispatch for message kind 'R'
2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker:
apply_dispatch for message kind 'I'
2020-11-18 17:02:59.065 AEDT [15886] LOG: !!>> apply worker:
should_apply_changes_for_rel: false
2020-11-18 17:02:59.065 AEDT [15886] LOG: !!>> apply worker:
apply_dispatch for message kind 'P'
2020-11-18 17:02:59.067 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:02:59.067 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:03:00.071 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:03:00.071 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:03:01.073 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:03:01.073 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:03:02.075 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:03:02.075 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:03:03.080 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:03:03.080 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:03:04.081 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:03:04.082 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:03:05.103 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:03:05.103 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
etc ...

#### At this point there is a problem because the tablesync worker has
COMMITTED that PREPARED INSERT.
#### See the subscriber node has ONE record but the publisher node has NONE!

[postgres@CentOS7-x64 ~]$ psql -d test_pub -c "SELECT count(*) FROM test_tab;"
count
-------
0
(1 row)

[postgres@CentOS7-x64 ~]$
[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "SELECT
count(*) FROM test_tab;"
count
-------
1
(1 row)

[postgres@CentOS7-x64 ~]$

-----
[1]: /messages/by-id/CAHut+Psprtsa4o89wtNnKLxxwXeDKAX9nNsdghT1Pv63siz+AA@mail.gmail.com
[2]: /messages/by-id/CAHut+Pt4PyKQCwqzQ=EFF=bpKKJD7XKt_S23F6L20ayQNxg77A@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

#114Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#113)

On Wed, Nov 18, 2020 at 1:18 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi.

Using a tablesync debugging technique as described in another mail
thread [1][2] I have caused the tablesync worker to handle (e.g.
apply_dispatch) a 2PC PREPARE

This exposes a problem with the current 2PC logic because if/when the
PREPARE is processed by the tablesync worker then the txn will end up
being COMMITTED, even though the 2PC PREPARE has not yet been COMMIT
PREPARED by the publisher.

IIUC, this is the problem with the patch being discussed here, right?
Because before this we won't decode at Prepare time.

--
With Regards,
Amit Kapila.

#115Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#114)

On Wed, Nov 18, 2020 at 7:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 18, 2020 at 1:18 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi.

Using a tablesync debugging technique as described in another mail
thread [1][2] I have caused the tablesync worker to handle (e.g.
apply_dispatch) a 2PC PREPARE

This exposes a problem with the current 2PC logic because if/when the
PREPARE is processed by the tablesync worker then the txn will end up
being COMMITTED, even though the 2PC PREPARE has not yet been COMMIT
PREPARED by the publisher.

IIUC, this is the problem with the patch being discussed here, right?
Because before this we won't decode at Prepare time.

Correct. This is new.

Kind Regards,
Peter Smith.
Fujitsu Australia.

#116Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#112)
6 attachment(s)

Why did you make a change in stream_start API? I think it should be
*_change and *_truncate APIs because the concurrent abort can happen
while decoding any intermediate change. If you agree then you can
probably take that code into a separate function and call it from the
respective APIs.

Patch 0001:
Updated this from stream_start to stream_change. I haven't updated
*_truncate as the test case written for this does not include a
truncate.
Also created a new function for this: test_concurrent_aborts().
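
A rough sketch of that refactoring, written as stand-alone C rather than the actual test_decoding code: the wait is pulled into one helper that more than one callback can call. The types, the polling callback and the two callback names below are stand-ins invented for illustration; the real function would rely on the backend's own xid status checks, sleeps and interrupt handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for illustration only; not the PostgreSQL definitions. */
typedef uint32_t ToyXid;
#define TOY_INVALID_XID ((ToyXid) 0)

/* Hypothetical probe: returns true while the xid is still in progress. */
typedef bool (*ToyInProgressFn) (ToyXid xid, void *arg);

typedef struct ToyDecodingData
{
	ToyXid		check_xid_aborted;	/* xid to wait on, or TOY_INVALID_XID */
	ToyInProgressFn in_progress;	/* how to ask whether it still runs */
	void	   *in_progress_arg;	/* opaque state for the probe */
} ToyDecodingData;

/*
 * Shared helper: if an xid was requested, poll until it is no longer in
 * progress.  Returns how many polls reported "still in progress".
 */
static int
toy_wait_for_abort(ToyDecodingData *data)
{
	int			polls = 0;

	assert(data != NULL);

	if (data->check_xid_aborted == TOY_INVALID_XID)
		return 0;				/* option not set: nothing to wait for */

	/* A real plugin would sleep and check for interrupts in this loop. */
	while (data->in_progress(data->check_xid_aborted, data->in_progress_arg))
		polls++;

	return polls;
}

/* Both callbacks delegate to the same helper. */
static int
toy_stream_change(ToyDecodingData *data)
{
	return toy_wait_for_abort(data);
}

static int
toy_stream_truncate(ToyDecodingData *data)
{
	return toy_wait_for_abort(data);
}

/* Example probe: reports "in progress" for the first *arg calls. */
static bool
toy_countdown_probe(ToyXid xid, void *arg)
{
	int		   *remaining = (int *) arg;

	(void) xid;
	if (*remaining > 0)
	{
		(*remaining)--;
		return true;
	}
	return false;
}
```

Keeping the check in one place means a truncate callback could pick it up later with a one-line call, once a test case exercises truncate.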

In 0003,
contrib/test_decoding/t/002_twophase-streaming.pl | 102 +++++++++

The naming of the file seems to be inconsistent with other files. It
should be 002_twophase_streaming.pl

Patch 0003:
Changed accordingly.

Patch 0002:
I've updated a comment that got muddled up while applying pgindent in
reorderbuffer.c

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v22-0001-Support-2PC-txn-base.patch (application/octet-stream)
From f6a030eda6f6a34d6123c0befa2185d2a009fb20 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 18 Nov 2020 22:28:47 -0500
Subject: [PATCH v22] Support 2PC txn base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

Includes logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 202 +++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 146 +++++++++++++++++-
 src/backend/replication/logical/logical.c | 242 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  35 +++++
 src/tools/pgindent/typedefs.list          |  11 ++
 7 files changed, 680 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278b..1203a5e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -35,6 +39,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -87,6 +92,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -102,6 +110,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -126,10 +146,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -141,6 +166,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -150,6 +176,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -241,6 +268,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+
+				errno = 0;
+				data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +308,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +377,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate a
+ * simple logic by checking the GID. If the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -331,6 +475,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -480,6 +648,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -642,6 +813,7 @@ static void
 pg_output_stream_start(LogicalDecodingContext *ctx, TestDecodingData *data, ReorderBufferTXN *txn, bool last_write)
 {
 	OutputPluginPrepareWrite(ctx, last_write);
+
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
 	else
@@ -702,6 +874,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
@@ -751,6 +950,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..f5b617d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. The transaction currently being decoded may be
+     aborted concurrently by a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,56 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +657,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or an uncommitted transaction, this
+      change callback might also error out due to a concurrent rollback of
+      that same transaction. In that case, the logical decoding of the
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +740,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded
+       at this prepare stage or later, as a regular one-phase transaction, at
+       <command>COMMIT PREPARED</command> time. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      parameter carries the XID separately because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +794,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or an uncommitted transaction, this message
+      callback might also error out due to a concurrent rollback of
+      that same transaction. In that case, the logical decoding of the
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +850,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1041,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, then committed using the
+    <function>commit_prepared_cb</function> callback or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4324e32..009db5f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -227,6 +241,21 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require
+	 * prepare/commit-prepared/rollback-prepared callbacks. The filter-prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +266,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +813,129 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the commit prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the rollback prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1012,52 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1256,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..032e35a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7e..9b8eced 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -244,6 +245,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -410,6 +414,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -436,6 +460,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -502,6 +532,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -510,6 +544,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fde701b..f4d4703 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1316,9 +1316,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1
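
For readers skimming the wrappers in the patch above: stream_prepare_cb_wrapper follows the usual shape of pushing a frame on the error context stack, checking that the required callback is present, calling it, and popping the frame. Below is a minimal standalone sketch of that shape, not the backend code. DemoErrorContext, demo_error_stack, call_with_error_context and demo_double are hypothetical names made up for illustration, and a -1 return code stands in for ereport(ERROR).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for ErrorContextCallback; not the PostgreSQL type. */
typedef struct DemoErrorContext
{
	struct DemoErrorContext *previous;	/* frame that was on top before us */
	const char *callback_name;			/* e.g. "stream_prepare" */
} DemoErrorContext;

/* Hypothetical stand-in for the global error_context_stack. */
static DemoErrorContext *demo_error_stack = NULL;

typedef int (*demo_callback) (int arg);

/*
 * Push a frame, run the callback if one is registered, pop the frame.
 * Returns 0 on success and -1 when the required callback is missing
 * (the real wrapper raises an ERROR instead of returning a code).
 */
static int
call_with_error_context(const char *name, demo_callback cb, int arg,
						int *result)
{
	DemoErrorContext frame;
	int			rc = 0;

	assert(name != NULL && result != NULL);

	/* Push callback + info on the (demo) error context stack */
	frame.callback_name = name;
	frame.previous = demo_error_stack;
	demo_error_stack = &frame;

	/* the callback is required; report failure if it was not provided */
	if (cb == NULL)
		rc = -1;
	else
		*result = cb(arg);

	/* Pop the (demo) error context stack */
	demo_error_stack = frame.previous;

	return rc;
}

/* Sample callback used only to exercise the wrapper. */
static int
demo_double(int x)
{
	return 2 * x;
}
```

Calling call_with_error_context("stream_prepare", demo_double, 21, &out) stores 42 in out and leaves the stack as it found it. Passing NULL for the callback returns -1, again with the stack restored.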

Attachment: v22-0002-Support-2PC-txn-backend.patch (application/octet-stream)
From fc9eef93a625519f6fb3e7c8596f6f32ef4e4cac Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 18 Nov 2020 22:33:10 -0500
Subject: [PATCH v22] Support 2PC txn backend.

Until now, two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.
---
 src/backend/replication/logical/decode.c        | 213 +++++++++++++---
 src/backend/replication/logical/reorderbuffer.c | 321 ++++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  34 +++
 3 files changed, 497 insertions(+), 71 deletions(-)
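
A note on the decode.c changes below: whether COMMIT PREPARED / ROLLBACK PREPARED treat the transaction as "already decoded" depends on two things, namely ctx->twophase and what the optional filter_prepare callback says about the GID. The following standalone sketch models only that decision. DemoContext, decoded_at_prepare and skip_nodecode_gids are hypothetical names invented for illustration; the real code asks ReorderBufferPrepareNeedSkip() and works on LogicalDecodingContext.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the PostgreSQL types. */
typedef bool (*demo_filter_prepare) (const char *gid);

typedef struct DemoContext
{
	bool		twophase;		/* mirrors ctx->twophase */
	demo_filter_prepare filter_prepare; /* mirrors filter_prepare_cb, may be
										 * NULL */
} DemoContext;

/*
 * Returns true when the transaction's changes are sent at PREPARE time, so
 * that COMMIT PREPARED / ROLLBACK PREPARED only need to send the final
 * record. Returns false when the transaction is decoded as a regular one at
 * commit time.
 */
static bool
decoded_at_prepare(const DemoContext *ctx, const char *gid)
{
	assert(ctx != NULL && gid != NULL);

	/* plugin not capable of (or not asking for) two-phase decoding */
	if (!ctx->twophase)
		return false;

	/* a filter that returns true asks to skip PREPARE-time decoding */
	return !(ctx->filter_prepare != NULL && ctx->filter_prepare(gid));
}

/* Sample filter (an assumption): skip GIDs that start with "nodecode". */
static bool
skip_nodecode_gids(const char *gid)
{
	return strncmp(gid, "nodecode", 8) == 0;
}
```

With twophase off, nothing is decoded at PREPARE. With twophase on and no filter, everything is. With the sample filter installed, only GIDs that do not start with "nodecode" are decoded at PREPARE, and the rest fall back to regular commit-time decoding.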

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..1b65d4a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,9 +67,14 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool already_decoded);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool already_decoded);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -244,6 +249,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +259,19 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction data then
+				 * DecodeCommit doesn't need to decode it again. This is
+				 * possible iff output plugin supports two-phase commits and
+				 * doesn't skip the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+								ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeCommit(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +290,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction during prepare
+				 * then DecodeAbort needs to call rollback prepared.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+						ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeAbort(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -582,10 +629,14 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool already_decoded)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -609,8 +660,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * There can be several reasons we might not be interested in this
 	 * transaction:
 	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
 	 * 2) The transaction happened in another database.
 	 * 3) The output plugin is not interested in the origin.
 	 * 4) We are doing fast-forwarding
@@ -640,7 +691,79 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		return;
 	}
 
-	/* tell the reorderbuffer about the surviving subtransactions */
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (already_decoded)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* tell the reorderbuffer about the surviving subtransactions */
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr, buf->endptr);
+		}
+
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		commit_time = parsed->origin_timestamp;
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeCommit for
+	 * the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit because the
+	 * txn hasn't yet been committed; removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache invalidations
+	 * if there are any for the reasons mentioned in DecodeCommit.
+	 */
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
@@ -648,33 +771,67 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool already_decoded)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool		skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
 	}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	skip_xact = SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id);
+
+	/*
+	 * Send the final rollback record if the transaction data is already
+	 * decoded and we don't need to skip it, otherwise, perform the cleanup of
+	 * the transaction.
+	 */
+	if (already_decoded && !skip_xact)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
+
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 301baff..78d210f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or after a PREPARE. The flag txn_prepared indicates whether this
+ * is called after a PREPARE. If streaming, keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots. If after a PREPARE,
+ * keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1548,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1582,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1769,9 +1802,24 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPARED, so for now
+		 * just truncate the txn by removing changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1899,8 +1947,10 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/*
+	 * Discard the changes that we just streamed.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2007,7 +2057,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2298,7 +2348,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2332,18 +2391,34 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the 4 scenarios:
+		 * 1. Prepare of a two-phase commit.
+		 * 2. Prepare of a two-phase commit and part of streaming an
+		 *    in-progress txn.
+		 * 3. Streaming of an in-progress txn.
+		 * 4. Commit of a transaction.
+		 *
+		 * Scenarios 1 and 2 are handled the same way: pass txn_prepared as
+		 * true to ReorderBufferTruncateTXN, which allows more thorough
+		 * truncation of txn data because the entire transaction has been
+		 * decoded and only the commit is pending. In scenario 3, pass false
+		 * to ReorderBufferTruncateTXN as the txn is not yet completely
+		 * decoded. In scenario 4, the whole txn has been decoded and we can
+		 * fully clean up the txn.
 		 */
-		if (streaming)
+		if (rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, true);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn, false);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2373,17 +2448,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2391,10 +2469,24 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream
+			 * remaining data. Streaming can also be on a prepared txn, handle
+			 * it the same way.
+			 */
+			if (streaming)
+			{
+				elog(LOG, "stopping decoding of %u", txn->xid);
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2416,23 +2508,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2474,6 +2559,120 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyway, two-phase transactions do not contain any reorderbuffers. So
+	 * allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2516,7 +2715,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
@@ -2605,6 +2809,37 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ * Note that this is only allowed to be called when a transaction prepare
+ * has just been read, not otherwise.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9b8eced..a6b43b6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -234,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -619,12 +640,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -642,6 +669,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
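
As a reading aid for the flag handling above, here is a minimal standalone sketch of how the three new RBTXN_* bits combine. The flag values are copied from the reorderbuffer.h hunk; DemoTXN and the demo_* helpers are hypothetical stand-ins for illustration only and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the reorderbuffer.h hunk of the patch. */
#define RBTXN_PREPARE             0x0080
#define RBTXN_COMMIT_PREPARED     0x0100
#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Hypothetical stand-in for ReorderBufferTXN: only the flags field. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
} DemoTXN;

/* Function equivalents of the rbtxn_*() test macros. */
static bool
demo_prepared(const DemoTXN *txn)
{
	return (txn->txn_flags & RBTXN_PREPARE) != 0;
}

static bool
demo_commit_prepared(const DemoTXN *txn)
{
	return (txn->txn_flags & RBTXN_COMMIT_PREPARED) != 0;
}

static bool
demo_rollback_prepared(const DemoTXN *txn)
{
	return (txn->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0;
}

/*
 * Mirrors only the flag handling of ReorderBufferFinishPrepared(): the txn
 * is always marked prepared, plus exactly one of commit/rollback.
 */
static void
demo_finish_prepared(DemoTXN *txn, bool is_commit)
{
	txn->txn_flags |= RBTXN_PREPARE;
	txn->txn_flags |= is_commit ? RBTXN_COMMIT_PREPARED
		: RBTXN_ROLLBACK_PREPARED;

	/* Exactly one of the two outcome bits should be set afterwards. */
	assert(demo_commit_prepared(txn) != demo_rollback_prepared(txn));
}
```

Because each flag is a distinct bit, the prepared bit and one outcome bit can be set on the same transaction, which is what the output plugin later relies on to pick the message flavour.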

v22-0004-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From 567cc6075b4a6ef096dbc4443fa01d8e1b90b8e2 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 18 Nov 2020 23:09:41 -0500
Subject: [PATCH v22] Support 2PC txn - spoolfile.

This patch is a pure refactoring: it isolates the streaming spool-file processing in a separate function.
The two-phase commit logic added later needs to call this common processing from multiple places.
---
 src/backend/replication/logical/worker.c | 58 +++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0468491..9fa816c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -933,30 +935,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +957,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +972,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1041,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1077,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
-- 
1.8.3.1
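
The shape of this refactoring can be sketched in isolation as follows. This is an illustrative toy, not the worker.c code: DemoState, the integer "changes" and all demo_* names are hypothetical. It only shows one replay helper that records the final LSN and returns the number of changes applied, shared by a commit path and a prepare path.

```c
#include <assert.h>

/* Hypothetical apply-worker state, reduced to what the sketch needs. */
typedef struct DemoState
{
	long		remote_final_lsn;	/* LSN the replayed changes belong to */
	int			applied_sum;		/* stand-in for "effects" of changes */
	int			committed;
	int			prepared;
} DemoState;

/*
 * Common spool processing: replay every spooled change and return how many
 * were applied (the role apply_spooled_messages() plays in the patch).
 */
static int
demo_apply_spooled(DemoState *st, const int *spool, int n, long lsn)
{
	int			nchanges = 0;

	st->remote_final_lsn = lsn;
	for (int i = 0; i < n; i++)
	{
		st->applied_sum += spool[i];
		nchanges++;
	}
	return nchanges;
}

/* Commit path: replay, then finish the transaction. */
static int
demo_stream_commit(DemoState *st, const int *spool, int n, long commit_lsn)
{
	int			nchanges = demo_apply_spooled(st, spool, n, commit_lsn);

	st->committed = 1;
	return nchanges;
}

/* Prepare path: same replay, but the transaction is only prepared. */
static int
demo_stream_prepare(DemoState *st, const int *spool, int n, long prepare_lsn)
{
	int			nchanges = demo_apply_spooled(st, spool, n, prepare_lsn);

	st->prepared = 1;
	return nchanges;
}
```

Passing the LSN in as a parameter, rather than reading it from a commit-specific struct, is what lets the prepare path reuse the helper.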

v22-0005-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From bc08bb1cd996b66f9b87a0c8d559cacc4ec50774 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 18 Nov 2020 23:11:29 -0500
Subject: [PATCH v22] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.
---
 src/backend/access/transam/twophase.c       |  33 +++-
 src/backend/replication/logical/proto.c     | 141 ++++++++++++++++-
 src/backend/replication/logical/worker.c    | 236 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c |  74 +++++++++
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++-
 src/tools/pgindent/typedefs.list            |   1 +
 7 files changed, 518 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9b..00b4497 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..cfb94d1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK]
+	 * PREPARED uses non-streaming APIs
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9fa816c..f1e94ad 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -742,6 +742,234 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * A prepared transaction may be aborted concurrently while it is being
+	 * decoded. In that case decoding stops and the transaction is aborted
+	 * immediately, but the ROLLBACK PREPARED is still sent and reaches the
+	 * subscriber. It is therefore OK for the GID to be missing here, and we
+	 * only finish the prepared transaction if it actually exists.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare_txn (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1969,6 +2197,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..71ac431 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +392,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +913,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..0691fc5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -54,10 +54,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +116,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +124,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +153,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +200,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f4d4703..4546572 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,6 +1342,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1
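
The PrepareFlagsAreValid() macro in the header hunk above accepts a prepare_type only when exactly one of the three message types is set. A minimal standalone sketch of that check follows. The three #define values and the macro are copied from the patch; prepare_type_name() is a hypothetical helper added only for illustration and is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values copied from the logicalproto.h hunk of the patch. */
#define LOGICALREP_IS_PREPARE            0x01
#define LOGICALREP_IS_COMMIT_PREPARED    0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED  0x04

/* Copied from the patch: valid only if exactly one type is set. */
#define PrepareFlagsAreValid(flags) \
	(((flags) == LOGICALREP_IS_PREPARE) || \
	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))

/*
 * Hypothetical helper (NOT in the patch): map a prepare_type byte to a
 * readable name, returning "invalid" for anything the macro rejects.
 */
static const char *
prepare_type_name(uint8_t flags)
{
	if (!PrepareFlagsAreValid(flags))
		return "invalid";

	switch (flags)
	{
		case LOGICALREP_IS_PREPARE:
			return "PREPARE";
		case LOGICALREP_IS_COMMIT_PREPARED:
			return "COMMIT PREPARED";
		default:
			return "ROLLBACK PREPARED";
	}
}
```

Because the macro compares for equality rather than testing bits, a combined value such as 0x03 is rejected even though both of its bits are individually legal.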

Attachment: v22-0003-Support-2PC-test-cases-for-test_decoding.patch (application/octet-stream)
From 3414027db5f52b21de9e31535495b58462358080 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 18 Nov 2020 23:01:22 -0500
Subject: [PATCH v22] Support 2PC test cases for test_decoding.

Add SQL and TAP tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                     |   4 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 ++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 contrib/test_decoding/t/001_twophase.pl            | 121 +++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl  | 102 +++++++++
 7 files changed, 813 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl
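
Several of the tests in this patch expect a GID containing "_nodecode" to be skipped at PREPARE time and decoded as an ordinary transaction at COMMIT PREPARED time. The sketch below shows one way such a filter could be written. It is an assumption that the plugin uses a plain substring match, and skip_decode_at_prepare() is a hypothetical name, not the plugin's actual callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for the plugin's prepare filter. Returns true
 * when the transaction should NOT be decoded at PREPARE time.
 * ASSUMPTION: the filter is a plain substring match on the GID.
 */
static bool
skip_decode_at_prepare(const char *gid)
{
	assert(gid != NULL);
	return strstr(gid, "_nodecode") != NULL;
}
```

Under that assumption, 'test_prepared_nodecode' and 'test1_nodecode' would be filtered, while 'test_prepared#1' and 'test1' would be decoded at PREPARE.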

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the xid of the 2PC to the plugin as an option.
+# On receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin waits for that xid to be aborted.
+#
+# We fire off a ROLLBACK from another session while this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" then changes from in-progress to not-committed
+# (hence aborted) and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now; it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..4d64bc4
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,102 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test specifically covers a concurrent abort while logical decoding
+# is in progress. We pass the xid of the 2PC to the plugin as an option.
+# On receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for that xid to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up
+    return 0;
+}
-- 
1.8.3.1
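
Both test files above carry an identical copy of poll_output_until, and both match the expected text with `=~`, so any regex metacharacters in the string would be interpreted. A minimal sketch of a shared helper follows, assuming a hypothetical name poll_file_until and a literal substring match; neither is part of the patch, and the polling interval simply mirrors the 0.1 second wait used above.

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# Hypothetical shared helper (not part of the patch): poll a file until it
# contains $expected as a literal substring, or until $timeout_secs elapse.
sub poll_file_until
{
    my ($path, $expected, $timeout_secs) = @_;

    die "poll_file_until: no expected string given" unless defined $expected;
    $timeout_secs = 180 unless defined $timeout_secs;

    # Check ten times per second, as the patch's poll_output_until does.
    my $max_attempts = $timeout_secs * 10;

    for (my $attempt = 0; $attempt < $max_attempts; $attempt++)
    {
        if (open(my $fh, '<', $path))
        {
            local $/;    # slurp the whole file in one read
            my $contents = <$fh>;
            close($fh);

            # index() matches literally, so regex metacharacters in
            # $expected are not interpreted.
            return 1
              if defined $contents && index($contents, $expected) >= 0;
        }

        # Wait 0.1 second before retrying.
        usleep(100_000);
    }

    # The expected text did not appear in time.
    return 0;
}
```

In the TAP tests this could be called as poll_file_until($node_logical->logfile(), "waiting for $xid2pc to abort"), which would let the two files share one definition instead of two copies.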

Attachment: v22-0006-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From da5b088aae99c030575e26a89fc2ad9c8ec0159b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 18 Nov 2020 23:13:19 -0500
Subject: [PATCH v22] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl
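
The streaming tests in 021_twophase_stream.pl repeatedly expect 3334 rows on the subscriber after COMMIT PREPARED. As a rough cross-check of that number, the sketch below replays the key arithmetic of the streamed transaction in plain Perl: two pre-existing rows, an INSERT of keys 3 to 5000, and a DELETE of every key divisible by 3. The function name remaining_rows is made up for this illustration and does not appear in the patch.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): count the keys left in test_tab
# after the streamed transaction, given the upper bound of generate_series.
sub remaining_rows
{
    my ($max_key) = @_;

    # Pre-existing rows: (1, 'foo') and (2, 'bar').
    my %rows = map { $_ => 1 } (1, 2);

    # INSERT INTO test_tab SELECT i, ... FROM generate_series(3, $max_key)
    $rows{$_} = 1 for 3 .. $max_key;

    # The UPDATE only rewrites column b, so it leaves the row count alone.

    # DELETE FROM test_tab WHERE mod(a,3) = 0
    delete $rows{$_} for grep { $_ % 3 == 0 } keys %rows;

    return scalar keys %rows;
}

print remaining_rows(5000), "\n";    # prints 3334
```

The result is 5000 keys minus the 1666 multiples of 3 up to 5000, which matches the qq(3334|3334|3334) expectation in the tests.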

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After the servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After the servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works when using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#117 Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#116)

On Thu, Nov 19, 2020 at 11:27 AM Ajin Cherian <itsajin@gmail.com> wrote:

>> Why did you make a change in the stream_start API? I think it should be
>> in the *_change and *_truncate APIs because the concurrent abort can
>> happen while decoding any intermediate change. If you agree then you can
>> probably take that code into a separate function and call it from the
>> respective APIs.
>
> Patch 0001:
> Updated this from stream_start to stream_change. I haven't updated
> *_truncate as the test case written for this does not include a
> truncate.

I think the same check should be there in truncate as well to make the
APIs consistent and also one can use it for writing another test that
has a truncate operation.

--
With Regards,
Amit Kapila.

#118 Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#117)
6 attachment(s)

On Thu, Nov 19, 2020 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> I think the same check should be there in truncate as well to make the
> APIs consistent and also one can use it for writing another test that
> has a truncate operation.

Updated the checks in both truncate callbacks (stream and non-stream).
Also added a test case for concurrent aborts while decoding a streamed
TRUNCATE.
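
For anyone skimming the thread, the idea of "one check, called from every
callback" can be sketched roughly as below. This is only an illustration and
not the code in the attached patches: the types are stand-ins rather than the
PostgreSQL definitions, and lookup_xact_state(), wait_for_concurrent_abort()
and the four callback names are hypothetical. A real plugin would presumably
sleep between polls and check for interrupts; the sketch fakes the other
backend's abort with a counter so it can be compiled on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL types; NOT the real definitions. */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

typedef enum { XACT_IN_PROGRESS, XACT_ABORTED, XACT_COMMITTED } XactState;

/* Simulation knob: how many polls still see the xid as in progress. */
static int polls_until_abort = 3;

/* Hypothetical status lookup; simulates another backend aborting the xid. */
static XactState
lookup_xact_state(TransactionId xid)
{
	(void) xid;
	if (polls_until_abort > 0)
	{
		polls_until_abort--;
		return XACT_IN_PROGRESS;
	}
	return XACT_ABORTED;
}

/* Hypothetical plugin-private state carrying the xid given as an option. */
typedef struct DecodingData
{
	TransactionId check_xid_aborted;
} DecodingData;

/*
 * Shared helper: if an xid was supplied, wait until it is no longer in
 * progress and insist that it ended in an abort. Returns true if it waited.
 */
static bool
wait_for_concurrent_abort(const DecodingData *data)
{
	XactState	state;

	if (data->check_xid_aborted == InvalidTransactionId)
		return false;

	/* Assumption: a real plugin would sleep and check interrupts here. */
	do
	{
		state = lookup_xact_state(data->check_xid_aborted);
	} while (state == XACT_IN_PROGRESS);

	assert(state == XACT_ABORTED);
	return true;
}

/* Every callback that decodes an intermediate change runs the same check. */
static bool decode_change(DecodingData *d)   { return wait_for_concurrent_abort(d); }
static bool decode_truncate(DecodingData *d) { return wait_for_concurrent_abort(d); }
static bool stream_change(DecodingData *d)   { return wait_for_concurrent_abort(d); }
static bool stream_truncate(DecodingData *d) { return wait_for_concurrent_abort(d); }
```

Keeping the wait in one helper means the change and truncate paths (streamed
or not) cannot drift apart, which is the consistency Amit asked for above.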

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v23-0004-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From b0e449d38dff2e75ee200def9b3e8667187590d5 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 18 Nov 2020 23:09:41 -0500
Subject: [PATCH v23] Support 2PC txn - spoolfile.

This patch is a pure refactoring: it moves the streaming spool-file processing into a separate function.
Later, the two-phase commit logic will need to call this common processing from multiple places.
---
 src/backend/replication/logical/worker.c | 58 +++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0468491..9fa816c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -933,30 +935,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +957,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +972,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1041,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1077,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
-- 
1.8.3.1

v23-0003-Support-2PC-test-cases-for-test_decoding.patch (application/octet-stream)
From 2f78b96ca370cd94eca907d612b00d610e29ed5b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 18 Nov 2020 23:01:22 -0500
Subject: [PATCH v23] Support 2PC test cases for test_decoding.

Add sql and tap tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                     |   4 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 ++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 contrib/test_decoding/t/001_twophase.pl            | 121 +++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl  | 133 ++++++++++++
 7 files changed, 844 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,11 +4,13 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output didn't appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..8c0410e
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output didn't appear within 180 seconds. Give up.
+    return 0;
+}
-- 
1.8.3.1

Attachment: v23-0001-Support-2PC-txn-base.patch (application/octet-stream)
From 8316de34fcda1550e2edb919177ca4808796cf90 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 18 Nov 2020 22:28:47 -0500
Subject: [PATCH v23] Support 2PC txn base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

Includes logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 208 +++++++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 146 +++++++++++++++++-
 src/backend/replication/logical/logical.c | 242 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  35 +++++
 src/tools/pgindent/typedefs.list          |  11 ++
 7 files changed, 686 insertions(+), 7 deletions(-)
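
The GID-based filtering that the test_decoding changes in this patch demonstrate can be sketched in standalone form. This is only an illustration, not part of the patch: it assumes no PostgreSQL headers, and `gid_is_filtered` is a hypothetical name standing in for the plugin's filter-prepare callback, reduced to the GID argument alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for a filter-prepare callback.
 *
 * Returns true when the two-phase transaction identified by "gid" should be
 * filtered out, i.e. NOT decoded at PREPARE TRANSACTION time. The rule is the
 * one used by the test plugin in this patch: any GID containing the
 * "_nodecode" substring is filtered.
 */
bool
gid_is_filtered(const char *gid)
{
	assert(gid != NULL);

	return strstr(gid, "_nodecode") != NULL;
}
```

With this rule, GIDs used in the regression tests such as 'test_prepared_nodecode' and 'test1_nodecode' are filtered, while 'test_prepared#1' and 'test1' are decoded at prepare time.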

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278b..ef9abdc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -35,6 +39,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -87,6 +92,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -102,6 +110,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -126,10 +146,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -141,6 +166,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -150,6 +176,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -241,6 +268,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+
+				errno = 0;
+				data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +308,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +377,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate
+ * a simple approach based on the GID: if the GID contains the "_nodecode"
+ * substring, the transaction is filtered out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -331,6 +475,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -480,6 +648,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -566,6 +737,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -642,6 +816,7 @@ static void
 pg_output_stream_start(LogicalDecodingContext *ctx, TestDecodingData *data, ReorderBufferTXN *txn, bool last_write)
 {
 	OutputPluginPrepareWrite(ctx, last_write);
+
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
 	else
@@ -702,6 +877,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
@@ -751,6 +953,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -804,6 +1009,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..f5b617d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. The transaction that is currently being decoded
+     may be aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,56 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> of a transaction has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> of a transaction has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +657,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or any other uncommitted transaction, this
+      change callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +740,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding at prepare time should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      parameter is passed separately because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +794,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or any other uncommitted transaction, this message
+      callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +850,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1041,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using
+    the <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, committed using the
+    <function>commit_prepared_cb</function> callback, or aborted using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4324e32..009db5f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -227,6 +241,21 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require
+	 * prepare/commit-prepare/abort-prepare callbacks. The filter-prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +266,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +813,129 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then rollback prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1012,52 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1256,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..032e35a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/rollback_prepared callbacks or wait till COMMIT PREPARED
+ * and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7e..9b8eced 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -244,6 +245,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -410,6 +414,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -436,6 +460,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -502,6 +532,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -510,6 +544,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fde701b..f4d4703 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1316,9 +1316,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1
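
For readers following the two patches, the enablement rule from StartupDecodingContext() above and the already_decoded check that the next patch adds to DecodeXactOp() can be sketched in standalone C. This is only an illustration under simplifying assumptions: every Sketch*/sketch_* name is hypothetical and not part of PostgreSQL, the filter is called directly rather than through ReorderBufferPrepareNeedSkip(), and a missing callback is returned as a name instead of being raised with ereport(ERROR) from the wrapper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the two-phase members of OutputPluginCallbacks. */
typedef void (*SketchCB) (void);
typedef bool (*SketchFilterCB) (unsigned int xid, const char *gid);

typedef struct SketchCallbacks
{
	SketchFilterCB filter_prepare_cb;	/* optional */
	SketchCB	prepare_cb;				/* mandatory once two-phase is on */
	SketchCB	commit_prepared_cb;		/* mandatory once two-phase is on */
	SketchCB	rollback_prepared_cb;	/* mandatory once two-phase is on */
	SketchCB	stream_prepare_cb;		/* checked only when streaming */
} SketchCallbacks;

/* Same rule as the ctx->twophase assignment: any one callback enables it. */
static bool
sketch_twophase_enabled(const SketchCallbacks *cb)
{
	assert(cb != NULL);
	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/*
 * Name of the first missing mandatory callback, or NULL if nothing is
 * missing (or two-phase decoding is not enabled at all). The patch detects
 * the same condition lazily, inside each *_cb_wrapper.
 */
static const char *
sketch_first_missing(const SketchCallbacks *cb)
{
	assert(cb != NULL);
	if (!sketch_twophase_enabled(cb))
		return NULL;
	if (cb->prepare_cb == NULL)
		return "prepare_cb";
	if (cb->commit_prepared_cb == NULL)
		return "commit_prepared_cb";
	if (cb->rollback_prepared_cb == NULL)
		return "rollback_prepared_cb";
	return NULL;
}

/*
 * Mirrors the already_decoded computation for a COMMIT PREPARED or
 * ROLLBACK PREPARED record: the transaction was decoded at PREPARE time
 * unless two-phase is off or the filter asked to skip it.
 */
static bool
sketch_already_decoded(const SketchCallbacks *cb, unsigned int xid,
					   const char *gid)
{
	assert(cb != NULL);
	if (!sketch_twophase_enabled(cb))
		return false;
	return !(cb->filter_prepare_cb != NULL &&
			 cb->filter_prepare_cb(xid, gid));
}

/* Example filter with a made-up policy: skip GIDs starting with "skip_". */
static bool
sketch_filter(unsigned int xid, const char *gid)
{
	(void) xid;
	return strncmp(gid, "skip_", 5) == 0;
}

/* Placeholder callback body used to populate the struct. */
static void
sketch_noop(void)
{
}
```

The sketch makes one consequence of the "at least one callback" rule visible: a plugin that registers only filter_prepare_cb is treated as two-phase capable, so the missing prepare_cb is found only when the first PREPARE record is decoded. The filter also has to be deterministic per (xid, gid), because it is consulted again at COMMIT PREPARED or ROLLBACK PREPARED time to work out what happened at PREPARE time.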

Attachment: v23-0002-Support-2PC-txn-backend.patch (application/octet-stream)
From 646d7ebc84584b0b510cb1e040e31dbbaeae6f1b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 18 Nov 2020 22:33:10 -0500
Subject: [PATCH v23] Support 2PC txn backend.

Until now, two-phase transactions were decoded at COMMIT PREPARED, just
like regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.
---
 src/backend/replication/logical/decode.c        | 213 +++++++++++++---
 src/backend/replication/logical/reorderbuffer.c | 321 ++++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  34 +++
 3 files changed, 497 insertions(+), 71 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..1b65d4a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,9 +67,14 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool already_decoded);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool already_decoded);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -244,6 +249,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +259,19 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction data then
+				 * DecodeCommit doesn't need to decode it again. This is
+				 * possible iff the output plugin supports two-phase commits and
+				 * doesn't skip the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+								ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeCommit(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +290,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction during prepare
+				 * then DecodeAbort needs to call rollback prepared.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+						ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeAbort(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -582,10 +629,14 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool already_decoded)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -609,8 +660,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * There can be several reasons we might not be interested in this
 	 * transaction:
 	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
 	 * 2) The transaction happened in another database.
 	 * 3) The output plugin is not interested in the origin.
 	 * 4) We are doing fast-forwarding
@@ -640,7 +691,79 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		return;
 	}
 
-	/* tell the reorderbuffer about the surviving subtransactions */
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (already_decoded)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* tell the reorderbuffer about the surviving subtransactions */
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr, buf->endptr);
+		}
+
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->xact_time;
+	XLogRecPtr	origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		commit_time = parsed->origin_timestamp;
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeCommit for
+	 * the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed, removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache invalidations
+	 * if there are any for the reasons mentioned in DecodeCommit.
+	 */
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
@@ -648,33 +771,67 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool already_decoded)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	XLogRecPtr	origin_id = XLogRecGetOrigin(buf->record);
+	bool	skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
 	}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	skip_xact = SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id);
+
+	/*
+	 * Send the final rollback record if the transaction data is already
+	 * decoded and we don't need to skip it, otherwise, perform the cleanup of
+	 * the transaction.
+	 */
+	if (already_decoded && !skip_xact)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
+
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 301baff..78d210f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or after a PREPARE. The txn_prepared flag indicates whether
+ * this is called after a PREPARE. If streaming, keep the remaining info:
+ * transactions, tuplecids, invalidations and snapshots. If after a PREPARE,
+ * keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1548,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1582,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1769,9 +1802,24 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen at COMMIT PREPARED, so for now just
+		 * truncate the txn by removing its changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1899,8 +1947,10 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/*
+	 * Discard the changes that we just streamed.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2007,7 +2057,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2298,7 +2348,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2332,18 +2391,34 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the 4 scenarios:
+		 * 1. Prepare of a two-phase commit.
+		 * 2. Prepare of a two-phase commit and part of streaming an
+		 *    in-progress txn.
+		 * 3. Streaming of an in-progress txn.
+		 * 4. Commit of a transaction.
+		 *
+		 * Scenarios 1 and 2 are handled the same way: we pass txn_prepared
+		 * as true to ReorderBufferTruncateTXN, which allows a more thorough
+		 * truncation of txn data because the entire transaction has been
+		 * decoded and only the commit is pending. In scenario 3 we pass
+		 * txn_prepared as false because the txn is not yet completely
+		 * decoded. In scenario 4 the whole txn has been decoded and we can
+		 * fully clean up the ReorderBufferTXN.
 		 */
-		if (streaming)
+		if (rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, true);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn, false);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2373,17 +2448,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2391,10 +2469,24 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream
+			 * remaining data. Streaming can also be on a prepared txn, handle
+			 * it the same way.
+			 */
+			if (streaming)
+			{
+				elog(LOG, "stopping decoding of %u", txn->xid);
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2416,23 +2508,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2474,6 +2559,120 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (after a restart, for example).
+	 * Either way, a prepared transaction holds no buffered changes at this
+	 * point, so allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2516,7 +2715,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
@@ -2605,6 +2809,37 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ * Note that this is only allowed to be called when a transaction prepare
+ * has just been read, not otherwise.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9b8eced..a6b43b6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -234,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -619,12 +640,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -642,6 +669,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
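
For readers following the decode.c hunks in the patch above, the rule that decides whether a COMMIT PREPARED or ROLLBACK PREPARED record treats its transaction as already decoded can be restated as a small standalone C sketch. This is not part of the patch: already_decoded_at_prepare, filter_prepare_fn, skip_prefixed_gids and check_examples are hypothetical names, and the plain bool and string arguments stand in for ctx->twophase, ctx->callbacks.filter_prepare_cb and parsed.twophase_gid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the output plugin's filter_prepare callback:
 * returns true when the plugin wants to skip decoding at PREPARE time.
 */
typedef bool (*filter_prepare_fn) (const char *gid);

/* Sample filter (an assumption for illustration): skip GIDs starting with "skip_". */
static bool
skip_prefixed_gids(const char *gid)
{
	return strncmp(gid, "skip_", 5) == 0;
}

/*
 * Mirrors the test made in DecodeXactOp for COMMIT PREPARED and
 * ABORT PREPARED records: the transaction counts as already decoded only
 * if the record finishes a prepared transaction, the plugin supports
 * two-phase decoding, and the plugin did not filter out the PREPARE.
 */
static bool
already_decoded_at_prepare(bool finishes_prepared_xact, bool twophase,
						   filter_prepare_fn filter, const char *gid)
{
	if (!finishes_prepared_xact || !twophase)
		return false;

	return !(filter != NULL && filter(gid));
}

/* Walks through the possible outcomes. */
void
check_examples(void)
{
	/* two-phase plugin without a filter: decoded at PREPARE */
	assert(already_decoded_at_prepare(true, true, NULL, "gid1"));
	/* filter skipped the PREPARE: decode everything at COMMIT PREPARED */
	assert(!already_decoded_at_prepare(true, true, skip_prefixed_gids, "skip_gid"));
	/* plugin without two-phase support: never decoded at PREPARE */
	assert(!already_decoded_at_prepare(true, false, NULL, "gid1"));
	/* plain COMMIT/ABORT record: flag stays false */
	assert(!already_decoded_at_prepare(false, true, NULL, "gid1"));
}
```

In the patch, DecodeAbort applies one further condition: ReorderBufferFinishPrepared is called only when the transaction also passes the usual snapshot, database, fast-forward and origin checks; otherwise the transaction is cleaned up with ReorderBufferAbort.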

v23-0006-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From fa9bb8ecf4871c11e73e9cf0fe36162b6fe7cd82 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 18 Nov 2020 23:13:19 -0500
Subject: [PATCH v23] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are rolled back on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
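
A note on the "poll_query_until(...) or die" idiom used throughout the test above: each step blocks until the catch-up query returns true and aborts the test on timeout. Below is a minimal, self-contained sketch of that retry pattern. The helper name poll_until and its parameters are hypothetical, chosen only for illustration. This is not how PostgresNode implements poll_query_until, and the closure merely stands in for the real catch-up query.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PostgresNode: retry $check until it
# returns true or $max_attempts is exhausted, sleeping $delay seconds
# between attempts.
sub poll_until {
    my ($check, $max_attempts, $delay) = @_;
    for my $attempt (1 .. $max_attempts) {
        return 1 if $check->();
        select(undef, undef, undef, $delay);    # sub-second sleep
    }
    return 0;
}

# Stand-in for the catch-up query: becomes true on the third call.
my $calls = 0;
poll_until(sub { ++$calls >= 3 }, 10, 0.01)
    or die "Timed out while waiting for condition";
print "condition met after $calls calls\n";
```

Returning 0 on timeout rather than dying inside the helper lets each call site supply its own message, which is what the "or die" clauses in the test do.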
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
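
For anyone wondering where the expected count of 3334 in 021_twophase_stream.pl comes from: rows 1 and 2 already exist, the transaction inserts keys 3..5000, the UPDATE does not change the row count, and the DELETE removes every key divisible by 3. A small sketch that reproduces the arithmetic follows. The function name expected_rows is hypothetical and the code only models the SQL; it does not run it.

```perl
use strict;
use warnings;

# Model of the test transaction: keys 1..$max_key are present after the
# INSERT, and the DELETE removes every key where mod(a,3) = 0.
sub expected_rows {
    my ($max_key) = @_;
    return scalar grep { $_ % 3 != 0 } 1 .. $max_key;
}

print expected_rows(5000), "\n";    # prints 3334
```

The 3335 expected in the "INSERT after the PREPARE" case is this value plus the single row with key 99999, which is inserted outside the 2PC transaction and so is not touched by its DELETE.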
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
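
For reference, the expected count of 3334 in the COMMIT PREPARED checks of 023_twophase_cascade_stream.pl can be derived from the statements in the streamed transaction: test_tab starts with keys 1 and 2, the INSERT adds keys 3..5000, the UPDATE does not change the row count, and the DELETE removes every key divisible by 3. A minimal sketch of that arithmetic in C follows; the function name rows_after_commit is made up for illustration and is not part of the patch.

```c
#include <assert.h>

/*
 * Count the rows left in test_tab once the streamed transaction commits,
 * assuming the table holds the contiguous keys 1..max_key just before the
 * DELETE runs (keys 1 and 2 pre-exist, 3..max_key come from the INSERT).
 * Hypothetical helper, for illustration only.
 */
static int
rows_after_commit(int max_key)
{
	int			count = 0;

	assert(max_key >= 0);

	for (int a = 1; a <= max_key; a++)
	{
		/* DELETE FROM test_tab WHERE mod(a,3) = 0 removes these keys */
		if (a % 3 != 0)
			count++;
	}
	return count;
}
```

Calling rows_after_commit(5000) gives 3334, that is 5000 keys minus the 1666 multiples of 3, which matches the value the test expects on both subscribers.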

v23-0005-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 67e6e30730ee56b4b3ea50a9e0c9fdd816678b4d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 18 Nov 2020 23:11:29 -0500
Subject: [PATCH v23] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.
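
As an illustration of the field order that logicalrep_write_prepare() and logicalrep_read_prepare() in proto.c agree on (flags byte, prepare_lsn, end_lsn, prepare time, then the NUL-terminated GID), here is a standalone sketch. It is not the patch code: the real functions use StringInfo with pq_send*/pq_getmsg*, and the SketchBuf type, the helper names, SKETCH_GIDSIZE and the value of SKETCH_IS_PREPARE are all assumptions made for this example. The leading message-type byte is left out because the reader in the patch starts at the flags byte. The sketch also bounds the GID copy by the destination size, which is a choice of this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE		200		/* assumed GID buffer size */
#define SKETCH_IS_PREPARE	0x01	/* assumed flag value */

typedef struct
{
	uint8_t		data[256];
	size_t		len;			/* bytes written so far */
	size_t		pos;			/* read cursor */
} SketchBuf;

/* Append one byte to the buffer. */
static void
put_u8(SketchBuf *b, uint8_t v)
{
	assert(b->len < sizeof(b->data));
	b->data[b->len++] = v;
}

/* Append a 64-bit value, most significant byte first (network order). */
static void
put_u64(SketchBuf *b, uint64_t v)
{
	for (int shift = 56; shift >= 0; shift -= 8)
		put_u8(b, (uint8_t) (v >> shift));
}

/* Consume one byte from the buffer. */
static uint8_t
get_u8(SketchBuf *b)
{
	assert(b->pos < b->len);
	return b->data[b->pos++];
}

/* Consume a 64-bit value written by put_u64(). */
static uint64_t
get_u64(SketchBuf *b)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | get_u8(b);
	return v;
}

/*
 * Write one PREPARE body, read it back and compare every field.
 * Returns the number of bytes consumed by the reader, or -1 on mismatch
 * or if the GID does not fit the destination buffer.
 */
static int
sketch_prepare_round_trip(const char *gid, uint64_t prepare_lsn,
						  uint64_t end_lsn, int64_t prepare_time)
{
	SketchBuf	b;
	char		gid_out[SKETCH_GIDSIZE];
	size_t		gidlen = strlen(gid);

	if (gidlen >= sizeof(gid_out))
		return -1;
	memset(&b, 0, sizeof(b));

	/* writer side: same field order as the PREPARE message */
	put_u8(&b, SKETCH_IS_PREPARE);
	put_u64(&b, prepare_lsn);
	put_u64(&b, end_lsn);
	put_u64(&b, (uint64_t) prepare_time);
	for (size_t i = 0; i <= gidlen; i++)
		put_u8(&b, (uint8_t) gid[i]);	/* includes the trailing NUL */

	/* reader side: fields come back in the order they were written */
	if (get_u8(&b) != SKETCH_IS_PREPARE ||
		get_u64(&b) != prepare_lsn ||
		get_u64(&b) != end_lsn ||
		(int64_t) get_u64(&b) != prepare_time)
		return -1;

	/* bounded GID copy: refuse input that has no NUL or is too long */
	const char *s = (const char *) (b.data + b.pos);
	const char *nul = memchr(s, '\0', b.len - b.pos);

	if (nul == NULL || (size_t) (nul - s) >= sizeof(gid_out))
		return -1;
	memcpy(gid_out, s, (size_t) (nul - s) + 1);
	b.pos += (size_t) (nul - s) + 1;

	return strcmp(gid_out, gid) == 0 ? (int) b.pos : -1;
}
```

With the GID 'test_prepared_tab' used in the tests, the body in this sketch is 1 + 3 * 8 + 18 = 43 bytes long.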
---
 src/backend/access/transam/twophase.c       |  33 +++-
 src/backend/replication/logical/proto.c     | 141 ++++++++++++++++-
 src/backend/replication/logical/worker.c    | 236 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c |  74 +++++++++
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++-
 src/tools/pgindent/typedefs.list            |   1 +
 7 files changed, 518 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9b..00b4497 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..cfb94d1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK]
+	 * PREPARED uses non-streaming APIs
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9fa816c..f1e94ad 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -742,6 +742,234 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * During logical decoding, on the apply side, it's possible that a
+	 * prepared transaction got aborted while decoding. In that case, we stop
+	 * the decoding and abort the transaction immediately. However the
+	 * ROLLBACK prepared processing still reaches the subscriber. In that case
+	 * it's ok to have a missing gid
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare_txn (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1969,6 +2197,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..71ac431 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +392,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +913,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..0691fc5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -54,10 +54,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +116,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +124,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +153,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +200,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f4d4703..4546572 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,6 +1342,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

#119Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#118)

On Thu, Nov 19, 2020 at 2:52 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Nov 19, 2020 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the same check should be there in truncate as well to make the
APIs consistent and also one can use it for writing another test that
has a truncate operation.

Updated the checks in both truncate callbacks (stream and non-stream).
Also added a test case for testing concurrent aborts while decoding
streaming TRUNCATE.

While reviewing/editing the code in 0002-Support-2PC-txn-backend, I
came across the following code which seems dubious to me.

1.
+ /*
+ * If streaming, reset the TXN so that it is allowed to stream
+ * remaining data. Streaming can also be on a prepared txn, handle
+ * it the same way.
+ */
+ if (streaming)
+ {
+ elog(LOG, "stopping decoding of %u",txn->xid);
+ ReorderBufferResetTXN(rb, txn, snapshot_now,
+   command_id, prev_lsn,
+   specinsert);
+ }
+ else
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid != NULL ? txn->gid : "", txn->xid);
+ ReorderBufferTruncateTXN(rb, txn, true);
+ }

Why do we need to handle the prepared txn case differently here? I
think for both cases we can call ReorderBufferResetTXN as it frees the
memory we should free in exceptions. Sure, there is some code (like
stream_stop and saving the snapshot for next run) in
ReorderBufferResetTXN which needs to be only called when we are
streaming the txn but otherwise, it seems it can be used here. We can
easily identify if the transaction is streamed to differentiate that
code path. Can you think of any other reason for not doing so?

2.
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Anyway, two-phase transactions do not contain any reorderbuffers. So
+ * allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);

Why should we allow creating a new transaction here? In other words,
in which cases will the txn not be present? I guess this was needed
with the earlier version of the patch, where at prepare time we were
cleaning up the ReorderBufferTXN.

--
With Regards,
Amit Kapila.

#120Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#111)

On Wed, Nov 18, 2020 at 12:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 18, 2020 at 7:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Nov 17, 2020 at 9:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 17, 2020 at 5:02 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Doesn't this happen only if you set replication origins? Because
otherwise both PrepareTransaction() and
RecordTransactionCommitPrepared() used the current timestamp.

I was also checking this. Even if you set replication origins, the
prepared transaction will reflect the local prepare time in
pg_prepared_xacts. pg_prepared_xacts fetches this information
from GlobalTransaction data, which does not store the origin_timestamp;
it only stores prepared_at, which is the local timestamp.

Sure, but my question was: does this difference in behavior happen
without replication origins in any way? The reason is that if it
occurs only with replication origins, I don't think we need to bother
about the same because that feature is not properly implemented and
not used as-is. See the discussion [1] [2]. OTOH, if this behavior can
happen without replication origins then we might want to consider
changing it.

Logical replication workers always have replication origins, right? Is
that what you meant 'with replication origins'?

I was thinking with respect to the publisher-side but you are right
that logical apply workers always have replication origins so the
effect will be visible but I think the same should be true on
publisher without this patch as well. Say, the user has set up
replication origin via pg_replication_origin_xact_setup and provided a
value of timestamp then also the same behavior will be there.

Right.

IIUC logical replication workers always set the origin's commit
timestamp as the commit timestamp of the replicated transaction. OTOH,
the timestamp of PREPARE, the ‘prepared’ column of pg_prepared_xacts,
always uses the local timestamp even if the caller of
PrepareTransaction() sets replorigin_session_origin_timestamp. In terms
of user-visible timestamps of transaction operations, I think users
might expect these timestamps to match between the origin and its
subscribers. But pg_xact_commit_timestamp() is a function of the commit
timestamp feature, whereas ‘prepared’ is simply the time at which the
transaction was prepared. So I’m not sure these timestamps really need
to match.

Yeah, I am not sure if it is a good idea for users to rely on this,
especially if the same behavior is visible on the publisher as well.
We might want to think separately about whether there is value in
making the prepare time also rely on replorigin_session_origin_timestamp
and, if so, that can be done as a separate patch. What do you think?

I agree that we can think about it separately. If it's necessary we
can make a patch later.
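
To make the difference discussed above concrete, here is a minimal standalone sketch. It is not PostgreSQL code: SketchSession, sketch_commit_time() and sketch_prepared_at() are hypothetical names, and "0 means not set" is an assumption made only for this illustration. It models the behavior as described in this thread: the commit timestamp follows the origin timestamp when one is set, while the prepare time always comes from the local clock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a timestamp; not the real TimestampTz. */
typedef int64_t SketchTimestamp;

/* Hypothetical session state; 0 means "no origin timestamp set" here. */
typedef struct SketchSession
{
	SketchTimestamp origin_timestamp;
} SketchSession;

/* Commit time: prefer the origin's timestamp when the session has one. */
static SketchTimestamp
sketch_commit_time(const SketchSession *s, SketchTimestamp local_now)
{
	assert(s != NULL);
	return s->origin_timestamp != 0 ? s->origin_timestamp : local_now;
}

/* Prepare time: always the local clock, whatever the origin timestamp is. */
static SketchTimestamp
sketch_prepared_at(const SketchSession *s, SketchTimestamp local_now)
{
	assert(s != NULL);
	return local_now;
}
```

With an origin timestamp set, the two functions return different values for the same transaction, which is the mismatch a user could observe between the commit timestamp and pg_prepared_xacts on a subscriber.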

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#121Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#119)

On Fri, Nov 20, 2020 at 12:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 19, 2020 at 2:52 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Nov 19, 2020 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the same check should be there in truncate as well to make the
APIs consistent and also one can use it for writing another test that
has a truncate operation.

Updated the checks in both truncate callbacks (stream and non-stream).
Also added a test case for testing concurrent aborts while decoding
streaming TRUNCATE.

While reviewing/editing the code in 0002-Support-2PC-txn-backend, I
came across the following code which seems dubious to me.

1.
+ /*
+ * If streaming, reset the TXN so that it is allowed to stream
+ * remaining data. Streaming can also be on a prepared txn, handle
+ * it the same way.
+ */
+ if (streaming)
+ {
+ elog(LOG, "stopping decoding of %u",txn->xid);
+ ReorderBufferResetTXN(rb, txn, snapshot_now,
+   command_id, prev_lsn,
+   specinsert);
+ }
+ else
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid != NULL ? txn->gid : "", txn->xid);
+ ReorderBufferTruncateTXN(rb, txn, true);
+ }

Why do we need to handle the prepared txn case differently here? I
think for both cases we can call ReorderBufferResetTXN as it frees the
memory we should free in exceptions. Sure, there is some code (like
stream_stop and saving the snapshot for next run) in
ReorderBufferResetTXN which needs to be only called when we are
streaming the txn but otherwise, it seems it can be used here. We can
easily identify if the transaction is streamed to differentiate that
code path. Can you think of any other reason for not doing so?

Yes, I agree that ReorderBufferResetTXN needs to be called
to free up memory after an exception.
I will change ReorderBufferResetTXN so that it takes an extra
parameter indicating streaming, so that the stream_stop and the saving
of the snapshot are done only if streaming.

2.
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Anyway, two-phase transactions do not contain any reorderbuffers. So
+ * allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);

Why should we allow creating a new transaction here? In other words,
in which cases will the txn not be present? I guess this was needed
with the earlier version of the patch, where at prepare time we were
cleaning up the ReorderBufferTXN.

Just confirmed this; yes, you are right. Even after a restart, the
transaction does get created again prior to this, so we need not
create it here. I will change this as well.
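
For point 2, the difference between "look up or create" and "look up only" can be sketched as below. This is a simplified standalone model, not the real reorderbuffer.c: SketchBuffer, sketch_txn_by_xid() and sketch_finish_prepared() are hypothetical names, and the fixed-size array only stands in for the real hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_TXNS 8

/* Hypothetical, simplified stand-ins for the reorder buffer structures. */
typedef struct SketchTXN
{
	uint32_t	xid;
	bool		in_use;
} SketchTXN;

typedef struct SketchBuffer
{
	SketchTXN	slots[SKETCH_MAX_TXNS];
} SketchBuffer;

/* Look up a transaction by xid; allocate a slot only when create is true. */
static SketchTXN *
sketch_txn_by_xid(SketchBuffer *rb, uint32_t xid, bool create, bool *is_new)
{
	SketchTXN  *free_slot = NULL;

	assert(rb != NULL);
	if (is_new != NULL)
		*is_new = false;

	for (int i = 0; i < SKETCH_MAX_TXNS; i++)
	{
		if (rb->slots[i].in_use && rb->slots[i].xid == xid)
			return &rb->slots[i];
		if (!rb->slots[i].in_use && free_slot == NULL)
			free_slot = &rb->slots[i];
	}

	if (!create || free_slot == NULL)
		return NULL;

	free_slot->xid = xid;
	free_slot->in_use = true;
	if (is_new != NULL)
		*is_new = true;
	return free_slot;
}

/*
 * Finish a prepared transaction. The entry is expected to exist already,
 * so this only looks it up (create = false) and reports failure otherwise.
 */
static bool
sketch_finish_prepared(SketchBuffer *rb, uint32_t xid)
{
	SketchTXN  *txn = sketch_txn_by_xid(rb, xid, false, NULL);

	if (txn == NULL)
		return false;
	txn->in_use = false;		/* release the entry */
	return true;
}
```

Passing create = false makes a missing entry visible to the caller instead of silently creating an empty one, which is the behavior being asked for here.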

regards,
Ajin Cherian
Fujitsu Australia

#122Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#120)

On Fri, Nov 20, 2020 at 7:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Nov 18, 2020 at 12:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

IIUC logical replication workers always set the origin's commit
timestamp as the commit timestamp of the replicated transaction. OTOH,
the timestamp of PREPARE, the ‘prepared’ column of pg_prepared_xacts,
always uses the local timestamp even if the caller of
PrepareTransaction() sets replorigin_session_origin_timestamp. In terms
of user-visible timestamps of transaction operations, I think users
might expect these timestamps to match between the origin and its
subscribers. But pg_xact_commit_timestamp() is a function of the commit
timestamp feature, whereas ‘prepared’ is simply the time at which the
transaction was prepared. So I’m not sure these timestamps really need
to match.

Yeah, I am not sure if it is a good idea for users to rely on this,
especially if the same behavior is visible on the publisher as well.
We might want to think separately about whether there is value in
making the prepare time also rely on replorigin_session_origin_timestamp
and, if so, that can be done as a separate patch. What do you think?

I agree that we can think about it separately. If it's necessary we
can make a patch later.

Thanks for the confirmation. Your review and suggestions are quite helpful.

--
With Regards,
Amit Kapila.

#123Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#121)

On Fri, Nov 20, 2020 at 9:12 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Fri, Nov 20, 2020 at 12:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 19, 2020 at 2:52 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Nov 19, 2020 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the same check should be there in truncate as well to make the
APIs consistent and also one can use it for writing another test that
has a truncate operation.

Updated the checks in both truncate callbacks (stream and non-stream).
Also added a test case for testing concurrent aborts while decoding
streaming TRUNCATE.

While reviewing/editing the code in 0002-Support-2PC-txn-backend, I
came across the following code which seems dubious to me.

1.
+ /*
+ * If streaming, reset the TXN so that it is allowed to stream
+ * remaining data. Streaming can also be on a prepared txn, handle
+ * it the same way.
+ */
+ if (streaming)
+ {
+ elog(LOG, "stopping decoding of %u",txn->xid);
+ ReorderBufferResetTXN(rb, txn, snapshot_now,
+   command_id, prev_lsn,
+   specinsert);
+ }
+ else
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid != NULL ? txn->gid : "", txn->xid);
+ ReorderBufferTruncateTXN(rb, txn, true);
+ }

Why do we need to handle the prepared txn case differently here? I
think for both cases we can call ReorderBufferResetTXN as it frees the
memory we should free in exceptions. Sure, there is some code (like
stream_stop and saving the snapshot for next run) in
ReorderBufferResetTXN which needs to be only called when we are
streaming the txn but otherwise, it seems it can be used here. We can
easily identify if the transaction is streamed to differentiate that
code path. Can you think of any other reason for not doing so?

Yes, I agree that ReorderBufferResetTXN needs to be called
to free up memory after an exception.
I will change ReorderBufferResetTXN so that it takes an extra
parameter indicating streaming, so that the stream_stop and the saving
of the snapshot are done only if streaming.

I've already made the changes for this in the patch; you can verify
them when I share the new version. We don't need to pass an extra
parameter; rbtxn_prepared()/rbtxn_is_streamed() should serve the
need.

2.
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid, bool is_commit)
+{
+ ReorderBufferTXN *txn;
+
+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Anyway, two-phase transactions do not contain any reorderbuffers. So
+ * allow it to be created below.
+ */
+ txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+ true);

Why should we allow creating a new transaction here? In other words,
in which cases will the txn not be present? I guess this was needed
with the earlier version of the patch, where at prepare time we were
cleaning up the ReorderBufferTXN.

Just confirmed this; yes, you are right. Even after a restart, the
transaction does get created again prior to this, so we need not
create it here. I will change this as well.

I'll take care of it along with other changes.
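
For point 1, the idea of deciding from the transaction's own flags rather than from an extra parameter can be sketched roughly as follows. This is a standalone illustration only: SketchTXN, the SKETCH_* flag values and sketch_reset_txn() are hypothetical, and the real ReorderBufferResetTXN does considerably more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; not the values used in reorderbuffer.h. */
#define SKETCH_TXN_IS_STREAMED	0x0001
#define SKETCH_TXN_PREPARE		0x0002

typedef struct SketchTXN
{
	uint32_t	xid;
	uint32_t	txn_flags;
	int			nentries_mem;	/* changes currently held in memory */
	bool		stream_stopped;	/* set when the stream-only path ran */
	bool		snapshot_saved;	/* set when the stream-only path ran */
} SketchTXN;

#define sketch_txn_is_streamed(txn) \
	(((txn)->txn_flags & SKETCH_TXN_IS_STREAMED) != 0)

/*
 * Reset a transaction after a concurrent abort was detected. The cleanup
 * is shared by both cases; the stream-only steps are chosen by looking at
 * the transaction's flags, so the caller passes no "streaming" argument.
 */
static void
sketch_reset_txn(SketchTXN *txn)
{
	assert(txn != NULL);

	/* Common to streamed and prepared transactions: free what was kept. */
	txn->nentries_mem = 0;

	if (sketch_txn_is_streamed(txn))
	{
		/* Only meaningful when the transaction is being streamed. */
		txn->stream_stopped = true;
		txn->snapshot_saved = true;
	}
}
```

Keeping the decision inside the function means every caller gets the same cleanup, and the function signature does not have to change.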

Thanks for the confirmation.

--
With Regards,
Amit Kapila.

#124Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#123)
7 attachment(s)

On Fri, Nov 20, 2020 at 2:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I'll take care of it along with other changes.

Thanks for the confirmation.

Ok, meanwhile I've just split the patches to move out the
check_xid_aborted test cases as well as the support in the code for
this into a separate patch. New 0007 patch for this.

regards,
Ajin

Attachments:

v24-0001-Support-2PC-txn-base.patch (application/octet-stream)
From 2f3f8d440fe947c30fb35ec0347dbbb2cb204328 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:00:40 -0500
Subject: [PATCH v24] Support 2PC txn base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

Includes logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 172 +++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 146 +++++++++++++++++-
 src/backend/replication/logical/logical.c | 242 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  35 +++++
 src/tools/pgindent/typedefs.list          |  11 ++
 7 files changed, 650 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278b..c42de64 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -35,6 +39,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -87,6 +92,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -102,6 +110,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -126,10 +146,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -141,6 +166,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -150,6 +176,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -241,6 +268,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+
+				errno = 0;
+				data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +308,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +377,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate
+ * simple filtering based on the GID: if the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -642,6 +786,7 @@ static void
 pg_output_stream_start(LogicalDecodingContext *ctx, TestDecodingData *data, ReorderBufferTXN *txn, bool last_write)
 {
 	OutputPluginPrepareWrite(ctx, last_write);
+
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
 	else
@@ -702,6 +847,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..f5b617d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. The transaction currently being decoded may be
+     aborted concurrently by a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,56 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> of a transaction has been decoded. The
+      <parameter>gid</parameter> field, which is part of the <parameter>txn</parameter>
+      parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> of a transaction has been decoded. The
+      <parameter>gid</parameter> field, which is part of the <parameter>txn</parameter>
+      parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr abort_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +657,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or any other uncommitted transaction, this
+      change callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +740,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded at the prepare
+       stage, or later as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time. To signal that
+       decoding at prepare time should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      parameter contains the XID, passed separately because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +794,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or any other uncommitted transaction, this message
+      callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +850,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1041,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, then committed using the
+    <function>commit_prepared_cb</function> callback or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4324e32..009db5f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -227,6 +241,21 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require the prepare,
+	 * commit_prepared and rollback_prepared callbacks. The filter_prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +266,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +813,129 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the commit prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the rollback prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1012,52 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1256,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..032e35a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7e..9b8eced 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -244,6 +245,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -410,6 +414,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -436,6 +460,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -502,6 +532,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -510,6 +544,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fde701b..f4d4703 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1316,9 +1316,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1
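
To make the intent of the new filter_prepare callback easier to follow, here is a minimal stand-alone sketch of how its answer feeds the already_decoded decision that DecodeXactOp() makes for COMMIT PREPARED in the backend patch. This is not PostgreSQL code: DemoXid, DemoFilterPrepareCB, demo_filter_prepare() and demo_already_decoded() are hypothetical names, the "local_" GID prefix is an invented policy, and the sketch assumes that ReorderBufferPrepareNeedSkip() simply reports the filter callback's answer (true meaning "do not decode at PREPARE time").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for PostgreSQL's TransactionId; not the real typedef. */
typedef unsigned int DemoXid;

/*
 * Hypothetical, simplified shape of the filter_prepare callback: the
 * decoding context and ReorderBufferTXN arguments are dropped.
 */
typedef bool (*DemoFilterPrepareCB) (DemoXid xid, const char *gid);

/*
 * Invented example policy: return true (skip PREPARE-time decoding) for
 * any GID that starts with "local_".
 */
static bool
demo_filter_prepare(DemoXid xid, const char *gid)
{
	(void) xid;					/* unused in this policy */
	return gid != NULL && strncmp(gid, "local_", 6) == 0;
}

/*
 * Models the COMMIT PREPARED test: the transaction counts as already
 * decoded when two-phase decoding is enabled and no filter asked to skip it.
 */
static bool
demo_already_decoded(bool twophase, DemoFilterPrepareCB filter,
					 DemoXid xid, const char *gid)
{
	if (!twophase)
		return false;
	return !(filter != NULL && filter(xid, gid));
}
```

Under this assumed policy, a transaction prepared as 'local_tx1' would be decoded at COMMIT PREPARED like a regular transaction, while 'app_tx1' would already have been sent at PREPARE time.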

Attachment: v24-0004-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From 6f26954973ec9d8d24f34e14e3fb66d9fd3aa762 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:43:38 -0500
Subject: [PATCH v24] Support 2PC txn - spoolfile.

This patch is a pure refactoring: it isolates the streaming spool-file processing in a separate function.
The two-phase commit logic added later needs to call this common processing from multiple places.
---
 src/backend/replication/logical/worker.c | 58 +++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0468491..9fa816c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -933,30 +935,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +957,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +972,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1041,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1077,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
-- 
1.8.3.1
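
The point of the spoolfile refactoring can be shown with a small stand-alone sketch: one common replay helper returns the number of applied changes, and both the stream-commit and the stream-prepare handlers call it before doing their own finishing step. Everything here is hypothetical (DemoSpool, DemoTxnState and the demo_* functions); the spool is an in-memory array rather than a BufFile, and "applying" a change only counts it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a spool file: an in-memory list of changes. */
typedef struct DemoSpool
{
	const char *const *changes;
	size_t		nchanges;
} DemoSpool;

typedef enum DemoTxnState
{
	DEMO_IN_PROGRESS,
	DEMO_PREPARED,
	DEMO_COMMITTED
} DemoTxnState;

/*
 * Common replay step shared by both handlers.  It only counts the
 * changes; a real worker would dispatch each spooled message.
 */
static int
demo_apply_spooled_messages(const DemoSpool *spool)
{
	int			nchanges = 0;

	for (size_t i = 0; i < spool->nchanges; i++)
	{
		assert(spool->changes[i] != NULL);
		nchanges++;
	}
	return nchanges;
}

/* Replay, then mark the transaction committed. */
static int
demo_handle_stream_commit(const DemoSpool *spool, DemoTxnState *state)
{
	int			nchanges = demo_apply_spooled_messages(spool);

	*state = DEMO_COMMITTED;
	return nchanges;
}

/* Replay, then mark the transaction prepared instead of committed. */
static int
demo_handle_stream_prepare(const DemoSpool *spool, DemoTxnState *state)
{
	int			nchanges = demo_apply_spooled_messages(spool);

	*state = DEMO_PREPARED;
	return nchanges;
}
```

The two handlers differ only in what happens after the replay, which is why the shared part is worth factoring out.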

Attachment: v24-0005-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From f60a01f69b970663203c8fc37b1917cbd4ba28c1 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:52:45 -0500
Subject: [PATCH v24] Support 2PC txn - pgoutput.

This patch adds support for handling two-phase commits to the pgoutput
plugin and to the subscriber.

Includes pgoutput changes.

Includes subscriber changes.
---
 src/backend/access/transam/twophase.c       |  33 +++-
 src/backend/replication/logical/proto.c     | 141 ++++++++++++++++-
 src/backend/replication/logical/worker.c    | 236 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c |  74 +++++++++
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++-
 src/tools/pgindent/typedefs.list            |   1 +
 7 files changed, 518 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9b..00b4497 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check whether a valid prepared transaction with the given GID exists.
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..cfb94d1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (bounded copy into the pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK]
+	 * PREPARED uses non-streaming APIs
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (bounded copy into the pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9fa816c..f1e94ad 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -742,6 +742,234 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * A prepared transaction can be aborted concurrently while it is still
+	 * being decoded. In that case decoding stops and the transaction is
+	 * aborted immediately, but the ROLLBACK PREPARED record is still
+	 * decoded and reaches the subscriber. It is therefore OK for the GID
+	 * to be missing here.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare_txn (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1969,6 +2197,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..71ac431 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +392,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +913,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..0691fc5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -54,10 +54,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +116,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +124,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* must be exactly one of PREPARE, COMMIT PREPARED or ROLLBACK PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +153,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +200,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f4d4703..4546572 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,6 +1342,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1
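
As a reading aid for the proto.c changes, the sketch below models the body of the PREPARE message as logicalrep_write_prepare() emits it after the message type byte: one flags byte, three 64-bit fields (prepare LSN, end LSN, timestamp) and a NUL-terminated GID. It is a stand-alone illustration, not the patch's code: the Demo* types and demo_* functions are hypothetical, DEMO_GIDSIZE is assumed to mirror GIDSIZE, the 64-bit fields are assumed to travel in network byte order as with pq_sendint64(), and rejecting an over-long GID is this sketch's own choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_IS_PREPARE				0x01
#define DEMO_IS_COMMIT_PREPARED		0x02
#define DEMO_IS_ROLLBACK_PREPARED	0x04
#define DEMO_GIDSIZE				200		/* assumed to mirror GIDSIZE */
#define DEMO_FIXED_LEN				25		/* flags byte + 3 * 8 bytes */

typedef struct DemoPrepareData
{
	uint8_t		prepare_type;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		preparetime;
	char		gid[DEMO_GIDSIZE];
} DemoPrepareData;

/* Exactly one of the three flag values is allowed. */
static bool
demo_flags_valid(uint8_t flags)
{
	return flags == DEMO_IS_PREPARE ||
		flags == DEMO_IS_COMMIT_PREPARED ||
		flags == DEMO_IS_ROLLBACK_PREPARED;
}

/* Store a 64-bit value in big-endian (network) byte order. */
static void
demo_put_u64(uint8_t *buf, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t) (v >> (56 - 8 * i));
}

static uint64_t
demo_get_u64(const uint8_t *buf)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Encode the message body; returns bytes written, or 0 if buf is too small. */
static size_t
demo_write_prepare(uint8_t *buf, size_t cap, const DemoPrepareData *d)
{
	size_t		gidlen = strlen(d->gid) + 1;	/* include the NUL */

	if (cap < DEMO_FIXED_LEN + gidlen)
		return 0;
	buf[0] = d->prepare_type;
	demo_put_u64(buf + 1, d->prepare_lsn);
	demo_put_u64(buf + 9, d->end_lsn);
	demo_put_u64(buf + 17, (uint64_t) d->preparetime);
	memcpy(buf + DEMO_FIXED_LEN, d->gid, gidlen);
	return DEMO_FIXED_LEN + gidlen;
}

/* Decode; fails on bad flags, truncated input or a GID that does not fit. */
static bool
demo_read_prepare(const uint8_t *buf, size_t len, DemoPrepareData *d)
{
	const uint8_t *gid;
	const uint8_t *nul;

	if (len < DEMO_FIXED_LEN + 1 || !demo_flags_valid(buf[0]))
		return false;
	gid = buf + DEMO_FIXED_LEN;
	nul = memchr(gid, 0, len - DEMO_FIXED_LEN);
	if (nul == NULL || (size_t) (nul - gid) >= DEMO_GIDSIZE)
		return false;
	d->prepare_type = buf[0];
	d->prepare_lsn = demo_get_u64(buf + 1);
	d->end_lsn = demo_get_u64(buf + 9);
	d->preparetime = (int64_t) demo_get_u64(buf + 17);
	memcpy(d->gid, gid, (size_t) (nul - gid) + 1);
	return true;
}
```

Checking the GID length before copying is the design point worth noting: the receiving struct holds the GID in a fixed-size array, so an unbounded copy from the wire would be unsafe.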

Attachment: v24-0002-Support-2PC-txn-backend.patch (application/octet-stream)
From 89bc3d49eec22cf731904e93a017f98b3cbe2f4d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:01:35 -0500
Subject: [PATCH v24] Support 2PC txn backend.

Until now, two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.
---
 src/backend/replication/logical/decode.c        | 213 +++++++++++++---
 src/backend/replication/logical/reorderbuffer.c | 321 ++++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  34 +++
 3 files changed, 497 insertions(+), 71 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..1b65d4a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,9 +67,14 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool already_decoded);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool already_decoded);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -244,6 +249,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +259,19 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * If the transaction data was already decoded at PREPARE,
+				 * DecodeCommit doesn't need to decode it again. That is the
+				 * case iff the output plugin supports two-phase commits and
+				 * didn't skip the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+								ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeCommit(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +290,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction at prepare time
+				 * then DecodeAbort needs to call the rollback_prepared callback.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+						ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeAbort(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -582,10 +629,14 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool already_decoded)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -609,8 +660,8 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * There can be several reasons we might not be interested in this
 	 * transaction:
 	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
+	 *    LSN. This can happen because we previously decoded it and now just
+	 *    are restarting or if we haven't assembled a consistent snapshot yet.
 	 * 2) The transaction happened in another database.
 	 * 3) The output plugin is not interested in the origin.
 	 * 4) We are doing fast-forwarding
@@ -640,7 +691,79 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		return;
 	}
 
-	/* tell the reorderbuffer about the surviving subtransactions */
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (already_decoded)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* tell the reorderbuffer about the surviving subtransactions */
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr, buf->endptr);
+		}
+
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		commit_time = parsed->origin_timestamp;
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeCommit for
+	 * the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit because the
+	 * txn hasn't yet been committed; removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache invalidations
+	 * if there are any for the reasons mentioned in DecodeCommit.
+	 */
+
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is set up correctly for the main transaction in case all changes
+	 * happened in subtransactions.
+	 */
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
@@ -648,33 +771,67 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool already_decoded)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool		skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
 	}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	skip_xact = SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id);
+
+	/*
+	 * Send the final rollback record if the transaction data is already
+	 * decoded and we don't need to skip it, otherwise, perform the cleanup of
+	 * the transaction.
+	 */
+	if (already_decoded && !skip_xact)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
+
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 301baff..78d210f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or after a PREPARE. The flag txn_prepared indicates whether this
+ * is called after a PREPARE. If streaming, keep the remaining info:
+ * transactions, tuplecids, invalidations and snapshots. If after a PREPARE,
+ * keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1548,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1582,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1769,9 +1802,24 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPARED, so for
+		 * now just truncate the txn by removing changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
@@ -1899,8 +1947,10 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/*
+	 * Discard the changes that we just streamed.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, false);
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -2007,7 +2057,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			prev_lsn = change->lsn;
 
 			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2298,7 +2348,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2332,18 +2391,34 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the following four scenarios:
+		 * 1. Prepare of a two-phase commit.
+		 * 2. Prepare of a two-phase commit that is also part of streaming
+		 *    an in-progress txn.
+		 * 3. Streaming of an in-progress txn.
+		 * 4. Commit of a transaction.
+		 *
+		 * Scenarios 1 and 2 are handled the same way: pass prepared as true
+		 * to ReorderBufferTruncateTXN, which allows a more thorough
+		 * truncation of txn data because the entire transaction has been
+		 * decoded and only the commit is pending. In scenario 3, pass
+		 * prepared as false because the txn is not yet completely decoded.
+		 * In scenario 4, the whole txn has been decoded and we can fully
+		 * clean up the ReorderBufferTXN.
 		 */
-		if (streaming)
+		if (rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
+			ReorderBufferTruncateTXN(rb, txn, true);
 
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
+		else if (streaming)
+		{
+			ReorderBufferTruncateTXN(rb, txn, false);
+			/* Reset the CheckXidAlive */
+			CheckXidAlive = InvalidTransactionId;
+		}
 		else
 			ReorderBufferCleanupTXN(rb, txn);
 	}
@@ -2373,17 +2448,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2391,10 +2469,24 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			errdata = NULL;
 			curtxn->concurrent_abort = true;
 
-			/* Reset the TXN so that it is allowed to stream remaining data. */
-			ReorderBufferResetTXN(rb, txn, snapshot_now,
-								  command_id, prev_lsn,
-								  specinsert);
+			/*
+			 * If streaming, reset the TXN so that it is allowed to stream
+			 * remaining data. Streaming can also be on a prepared txn, handle
+			 * it the same way.
+			 */
+			if (streaming)
+			{
+				elog(LOG, "stopping decoding of %u", txn->xid);
+				ReorderBufferResetTXN(rb, txn, snapshot_now,
+									  command_id, prev_lsn,
+									  specinsert);
+			}
+			else
+			{
+				elog(LOG, "stopping decoding of %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+				ReorderBufferTruncateTXN(rb, txn, true);
+			}
 		}
 		else
 		{
@@ -2416,23 +2508,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * This interface is called once a toplevel commit is read for both streamed
  * as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
-					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-					TimestampTz commit_time,
-					RepOriginId origin_id, XLogRecPtr origin_lsn)
+static void
+ReorderBufferCommitInternal(ReorderBufferTXN *txn,
+							ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2474,6 +2559,120 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferCommitInternal()
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction. It calls ReorderBufferCommitInternal()
+ * since all prepared transactions need to be decoded at PREPARE time.
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferCommitInternal(txn, rb, xid, commit_lsn, end_lsn,
+								commit_time, origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Either way, a prepared transaction no longer has any changes queued in
+	 * the reorder buffer at this point, so allow it to be created below.
+	 */
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn,
+								true);
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+	else
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+	if (rbtxn_commit_prepared(txn))
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else if (rbtxn_rollback_prepared(txn))
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2516,7 +2715,12 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	/* cosmetic... */
 	txn->final_lsn = lsn;
 
-	/* remove potential on-disk data, and deallocate */
+	/*
+	 * remove potential on-disk data, and deallocate.
+	 *
+	 * We remove it even for prepared transactions (GID is enough to
+	 * commit/abort those later).
+	 */
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
@@ -2605,6 +2809,37 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ * Note that this is only allowed to be called when a transaction prepare
+ * has just been read, not otherwise.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9b8eced..a6b43b6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -234,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -619,12 +640,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -642,6 +669,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
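
Note (not part of the patch series): the three RBTXN_* bits added in the
reorderbuffer.h hunk above drive two choices in reorderbuffer.c: which
cleanup runs at the end of ReorderBufferProcessTXN(), and which callback
ReorderBufferFinishPrepared() invokes. The following is a minimal standalone
sketch, assuming only the flag values shown in that hunk. ToyTXN, the toy_*
macros, CleanupKind, choose_cleanup and finish_callback are hypothetical
names for illustration; the real code operates on ReorderBufferTXN and calls
the functions named in the comments.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as defined in the reorderbuffer.h hunk of this patch. */
#define RBTXN_PREPARE             0x0080
#define RBTXN_COMMIT_PREPARED     0x0100
#define RBTXN_ROLLBACK_PREPARED   0x0200

/* Hypothetical stand-in for ReorderBufferTXN; only the flag word is modeled. */
typedef struct ToyTXN
{
	uint32_t	txn_flags;
} ToyTXN;

/* Same bit tests as rbtxn_prepared() and friends, renamed for the toy type. */
#define toy_prepared(txn)          (((txn)->txn_flags & RBTXN_PREPARE) != 0)
#define toy_commit_prepared(txn)   (((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0)
#define toy_rollback_prepared(txn) (((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0)

typedef enum CleanupKind
{
	CLEANUP_TRUNCATE_PREPARED,	/* ReorderBufferTruncateTXN(rb, txn, true) */
	CLEANUP_TRUNCATE_STREAMED,	/* ReorderBufferTruncateTXN(rb, txn, false) */
	CLEANUP_FULL				/* ReorderBufferCleanupTXN(rb, txn) */
} CleanupKind;

/* Mirrors the if / else if / else chain at the end of ReorderBufferProcessTXN(). */
static CleanupKind
choose_cleanup(const ToyTXN *txn, bool streaming)
{
	if (toy_prepared(txn))
		return CLEANUP_TRUNCATE_PREPARED;
	if (streaming)
		return CLEANUP_TRUNCATE_STREAMED;
	return CLEANUP_FULL;
}

/* Name of the callback ReorderBufferFinishPrepared() would invoke, or NULL. */
static const char *
finish_callback(const ToyTXN *txn)
{
	if (toy_commit_prepared(txn))
		return "commit_prepared";
	if (toy_rollback_prepared(txn))
		return "rollback_prepared";
	return NULL;
}
```

Note that the prepared check comes first, so a transaction that is both
streamed and prepared is truncated with txn_prepared = true, matching
scenarios 1 and 2 in the patch's comment.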

Attachment: v24-0003-Support-2PC-test-cases-for-test_decoding.patch (application/octet-stream)
From 8541fe2e6b55d33498af35f4781ef80d04d7e13b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:32:43 -0500
Subject: [PATCH v24] Support 2PC test cases for test_decoding.

Add sql tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                     |   2 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 ++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 5 files changed, 588 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..2c4acdc 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,7 +4,7 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
-- 
1.8.3.1
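
A note on Test 7 in two_phase.sql and the "filtered gid" case in two_phase_stream.sql above: both rely on GIDs containing "_nodecode" being skipped at PREPARE and decoded as an ordinary transaction at COMMIT PREPARED. The following is only a minimal sketch of that rule, assuming the filter is a plain substring match on the GID. The function name skip_decode_at_prepare is hypothetical and is not the plugin's actual callback.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a prepare filter; NOT the actual test_decoding
 * code. Assumption: a GID is filtered when it contains the substring
 * "_nodecode" anywhere.
 *
 * Returns true when decoding at PREPARE should be skipped, so that the
 * transaction is decoded later as a regular BEGIN ... COMMIT.
 */
bool
skip_decode_at_prepare(const char *gid)
{
    if (gid == NULL)
        return false;           /* no GID, nothing to filter */

    return strstr(gid, "_nodecode") != NULL;
}
```

Under this assumption, 'test_prepared_nodecode' and 'test1_nodecode' would be filtered, while 'test_prepared#1' would be decoded at PREPARE time, which is consistent with the expected output in the .out files above.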

Attachment: v24-0007-2pc-test-cases-for-testing-concurrent-aborts.patch (application/octet-stream)
From 5e650a9522444a785d5e3d5b2b2998c93501c5b6 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 06:15:57 -0500
Subject: [PATCH v24] 2pc test cases for testing concurrent aborts.

Add TAP tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  36 ++++++
 4 files changed, 292 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2c4acdc..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..1555582
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the XID of the 2PC to the plugin as an option.
+# On receipt of a valid "check-xid-aborted" XID, the change callback in the
+# test_decoding plugin waits for that transaction to be aborted.
+#
+# We then fire off a ROLLBACK PREPARED from another session while the decode
+# is waiting.
+#
+# The status of the "check-xid-aborted" XID changes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+ok(looks_like_number($xid2pc), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the XID via the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected pattern did not appear in the log within 180 seconds
+    # (1800 polls, 0.1 second apart). Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..8c0410e
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stopping decoding of $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected pattern did not appear in the log within 180 seconds
+    # (1800 polls, 0.1 second apart). Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index c42de64..ef9abdc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -475,6 +475,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -624,6 +648,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -710,6 +737,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -923,6 +953,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -976,6 +1009,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
-- 
1.8.3.1
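
For readers skimming the patch above: the wait in test_concurrent_aborts() amounts to "poll until the XID is no longer in progress, then confirm it ended by aborting". Below is a minimal standalone sketch of that control flow only. It does not use the PostgreSQL API: XactState, xact_state_fn, wait_for_abort, FakeBackend and fake_state are hypothetical names invented for illustration, the sleep and interrupt check are omitted, and the max_polls bound is an addition of the sketch (the loop in the patch is unbounded).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a transaction's state; not a PostgreSQL type. */
typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } XactState;

/* Hypothetical callback reporting the current state of a transaction. */
typedef XactState (*xact_state_fn)(unsigned int xid, void *arg);

/*
 * Poll until the transaction leaves the in-progress state or max_polls
 * checks have been made. Returns true only if it ended by aborting.
 * The real plugin sleeps and checks for interrupts between polls.
 */
static bool
wait_for_abort(unsigned int xid, xact_state_fn get_state, void *arg,
               int max_polls, int *polls_used)
{
    XactState state = XACT_IN_PROGRESS;
    int       polls = 0;

    while (polls < max_polls)
    {
        state = get_state(xid, arg);
        polls++;
        if (state != XACT_IN_PROGRESS)
            break;
    }

    if (polls_used != NULL)
        *polls_used = polls;

    return state == XACT_ABORTED;
}

/* Simulated backend: reports "aborted" once enough polls have been seen. */
typedef struct
{
    int polls_until_abort;  /* number of polls that still see in-progress */
    int seen;               /* polls observed so far */
} FakeBackend;

static XactState
fake_state(unsigned int xid, void *arg)
{
    FakeBackend *backend = arg;

    (void) xid;             /* unused in the simulation */
    backend->seen++;
    return (backend->seen > backend->polls_until_abort)
        ? XACT_ABORTED
        : XACT_IN_PROGRESS;
}
```

In the TAP tests the role of fake_state is played by a second session issuing ROLLBACK PREPARED while the decoding session is parked in the loop, which is what lets the test avoid a fixed sleep.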

Attachment: v24-0006-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 8e092d49f29aa6ce85bc1a9a3360436f2a55e9c0 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:59:07 -0500
Subject: [PATCH v24] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back via 2PC are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming enabled.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#125Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#124)
7 attachment(s)

On Fri, Nov 20, 2020 at 4:54 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Fri, Nov 20, 2020 at 2:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I'll take care of it along with other changes.

Thanks for the confirmation.

Ok, meanwhile I've just split the patches to move out the
check_xid_aborted test cases as well as the support in the code for
this into a separate patch. New 0007 patch for this.

This makes sense to me but it should have been 0004 in the series. I
have changed the order in the attached. I have updated
0002-Support-2PC-txn-backend and
0007-2pc-test-cases-for-testing-concurrent-aborts. The changes are:
1. As mentioned previously, used ReorderBufferResetTxn to deal with
concurrent aborts both in case of streamed and prepared txns.
2. There was no clear explanation as to why we are not skipping
DecodePrepare in the presence of concurrent aborts. I have added an
explanation atop DecodePrepare() and at various other
places.
3. Added/edited comments at various places in the code and made some
other changes, such as simplifying the code in a few places.
4. Changed the function name ReorderBufferCommitInternal to
ReorderBufferReplay as that seems more appropriate.
5. In ReorderBufferReplay() (which was previously
ReorderBufferCommitInternal), the patch was cleaning up the TXN even
for prepared transactions, which is not consistent with what we do at
other places in the patch, so I changed that.
6. In 2pc-test-cases-for-testing-concurrent-aborts, changed one of the
log messages based on the changes in patch Support-2PC-txn-backend.

I am planning to continue reviewing these patches, but I thought it
better to check on the above changes before proceeding further. Let
me know what you think.
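
For anyone following the check_xid_aborted discussion above, here is a
minimal standalone sketch of how the test_decoding changes in this
patchset appear to validate the 'check-xid-aborted' option value: parse
it with strtoul() and reject it if errno is set or the result is the
invalid xid. The names XidSketch, INVALID_XID_SKETCH and
parse_check_xid_aborted are hypothetical stand-ins for this
illustration, not identifiers from the patch; the real code uses
TransactionId, TransactionIdIsValid() and ereport().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for TransactionId and InvalidTransactionId. */
typedef uint32_t XidSketch;
#define INVALID_XID_SKETCH ((XidSketch) 0)

/*
 * Hypothetical helper: parse the text value given for "check-xid-aborted".
 * Returns true and stores the xid on success. Returns false when the value
 * is missing, strtoul() reports an error, or the result is the invalid xid.
 */
static bool
parse_check_xid_aborted(const char *value, XidSketch *xid_out)
{
	unsigned long parsed;

	assert(xid_out != NULL);

	if (value == NULL)
		return false;		/* the patch raises an ERROR here */

	errno = 0;
	parsed = strtoul(value, NULL, 0);	/* base 0: decimal, hex or octal */
	if (errno != 0)
		return false;

	/* Same narrowing cast as the patch's (TransactionId) cast. */
	*xid_out = (XidSketch) parsed;
	return *xid_out != INVALID_XID_SKETCH;
}
```

Under these assumptions, "739" is accepted, while "0", a non-numeric
string, or a missing value is rejected, which corresponds to the cases
where the plugin would report an invalid parameter value.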

--
With Regards,
Amit Kapila.

Attachments:

v25-0001-Support-2PC-txn-base.patch (application/octet-stream)
From 43e882d4f592cb8f7133e3aaa095b985e7930f10 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:00:40 -0500
Subject: [PATCH v25 1/7] Support 2PC txn base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

Includes logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 172 +++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 146 +++++++++++++++++-
 src/backend/replication/logical/logical.c | 242 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  35 +++++
 src/tools/pgindent/typedefs.list          |  11 ++
 7 files changed, 650 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278b..c42de64 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -35,6 +39,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -87,6 +92,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -102,6 +110,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -126,10 +146,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -141,6 +166,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -150,6 +176,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -241,6 +268,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+
+				errno = 0;
+				data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +308,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +377,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate a
+ * simple logic by checking the GID. If the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -642,6 +786,7 @@ static void
 pg_output_stream_start(LogicalDecodingContext *ctx, TestDecodingData *data, ReorderBufferTXN *txn, bool last_write)
 {
 	OutputPluginPrepareWrite(ctx, last_write);
+
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
 	else
@@ -702,6 +847,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..f5b617d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,56 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> of a transaction has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> of a transaction has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +657,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or an uncommitted transaction, this
+      change callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +740,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decoding
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +794,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or an uncommitted transaction, this message
+      callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +850,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1041,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, then committed using the
+    <function>commit_prepared_cb</function> callback or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4324e32..009db5f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -227,6 +241,21 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require
+	 * prepare/commit-prepared/rollback-prepared callbacks. The filter-prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +266,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +813,129 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the commit prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of rollback prepared record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the rollback prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1012,52 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1256,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming commits requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..032e35a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or wait until COMMIT PREPARED
+ * and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for a PREPARE record unless it was filtered by the filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7e..9b8eced 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -244,6 +245,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -410,6 +414,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -436,6 +460,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -502,6 +532,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -510,6 +544,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fde701b..f4d4703 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1316,9 +1316,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1
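
The callback-registration rule in the patch above (any one two-phase callback switches two-phase decoding on, and the wrappers then raise an error for whichever mandatory callback is missing) can be sketched outside the backend. This is only an illustrative model, not code from the patch: ToyCallbacks, ToyMissing, toy_twophase_enabled() and toy_first_missing() are made-up names, the real wrappers report the problem via ereport(ERROR) only when they are invoked, and stream_prepare_cb (needed only with streaming) is left out of the "missing" check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the 2PC slots of OutputPluginCallbacks. */
typedef void (*ToyCB) (void);

typedef struct ToyCallbacks
{
	ToyCB		filter_prepare_cb;		/* optional */
	ToyCB		prepare_cb;				/* mandatory once 2PC is enabled */
	ToyCB		commit_prepared_cb;		/* mandatory once 2PC is enabled */
	ToyCB		rollback_prepared_cb;	/* mandatory once 2PC is enabled */
	ToyCB		stream_prepare_cb;		/* only relevant with streaming */
} ToyCallbacks;

typedef enum ToyMissing
{
	TOY_MISSING_NONE,
	TOY_MISSING_PREPARE,
	TOY_MISSING_COMMIT_PREPARED,
	TOY_MISSING_ROLLBACK_PREPARED
} ToyMissing;

/* Mirrors the ctx->twophase computation: any one callback enables 2PC. */
static bool
toy_twophase_enabled(const ToyCallbacks *cb)
{
	assert(cb != NULL);
	return cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/* First mandatory callback a wrapper would complain about, if any. */
static ToyMissing
toy_first_missing(const ToyCallbacks *cb)
{
	if (!toy_twophase_enabled(cb))
		return TOY_MISSING_NONE;	/* 2PC is off, nothing is required */
	if (cb->prepare_cb == NULL)
		return TOY_MISSING_PREPARE;
	if (cb->commit_prepared_cb == NULL)
		return TOY_MISSING_COMMIT_PREPARED;
	if (cb->rollback_prepared_cb == NULL)
		return TOY_MISSING_ROLLBACK_PREPARED;
	return TOY_MISSING_NONE;
}

/* Dummy callback used to populate slots in examples. */
static void
toy_noop(void)
{
}
```

In this model a plugin that registers only filter_prepare_cb still turns two-phase decoding on, so the missing prepare callback is detected rather than silently ignored, which appears to be the intent of the "easily identify missing methods" comment.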

v25-0002-Support-2PC-txn-backend.patch (application/octet-stream)
From 282c0dbefb6992653f6b3a3304da140c4f9e53cc Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 17 Nov 2020 18:11:39 +0530
Subject: [PATCH v25 2/7] Support 2PC txn backend.

Until now two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.
---
 src/backend/replication/logical/decode.c        | 215 ++++++++++++--
 src/backend/replication/logical/reorderbuffer.c | 374 ++++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  34 +++
 3 files changed, 532 insertions(+), 91 deletions(-)
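
As a reading aid for the decode.c changes below, here is a standalone sketch of the decision they make at PREPARE and again at COMMIT PREPARED. It is a simplified model, not backend code: ToyDecodingCtx, ToyCommitAction, toy_decode_at_prepare(), toy_commit_prepared_action() and the "nodecode_" GID prefix are all invented for illustration, and the skip checks (snapshot state, database, origin, fast-forward) are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical filter signature: return true to skip PREPARE-time decoding. */
typedef bool (*ToyFilterCB) (const char *gid);

typedef struct ToyDecodingCtx
{
	bool		twophase;			/* models ctx->twophase */
	ToyFilterCB filter_prepare_cb;	/* optional, may be NULL */
} ToyDecodingCtx;

typedef enum ToyCommitAction
{
	TOY_REPLAY_WHOLE_TXN,		/* models the ReorderBufferCommit path */
	TOY_FINISH_PREPARED			/* models the ReorderBufferFinishPrepared path */
} ToyCommitAction;

/* At PREPARE: is the transaction decoded now or deferred to commit time? */
static bool
toy_decode_at_prepare(const ToyDecodingCtx *ctx, const char *gid)
{
	assert(ctx != NULL);
	if (!ctx->twophase)
		return false;			/* plugin has no 2PC support */
	if (ctx->filter_prepare_cb != NULL && ctx->filter_prepare_cb(gid))
		return false;			/* plugin asked to skip this GID */
	return true;
}

/* At COMMIT PREPARED: recompute whether the data was already decoded. */
static ToyCommitAction
toy_commit_prepared_action(const ToyDecodingCtx *ctx, const char *gid)
{
	bool		already_decoded = false;

	assert(ctx != NULL);
	if (ctx->twophase)
		already_decoded = !(ctx->filter_prepare_cb != NULL &&
							ctx->filter_prepare_cb(gid));

	return already_decoded ? TOY_FINISH_PREPARED : TOY_REPLAY_WHOLE_TXN;
}

/* Example filter: skip GIDs carrying a made-up "nodecode_" prefix. */
static bool
toy_skip_nodecode(const char *gid)
{
	return strncmp(gid, "nodecode_", 9) == 0;
}
```

Because the filter is consulted a second time at commit, this model only behaves sensibly if the filter returns the same answer for a given GID on both calls; that is an observation about the sketch, not a guarantee stated in the patch.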

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..dfba406 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,9 +67,14 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool already_decoded);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool already_decoded);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -244,6 +249,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +259,19 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction's data then
+				 * DecodeCommit doesn't need to decode it again. This is
+				 * possible iff the output plugin supports two-phase commits
+				 * and doesn't skip the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+								ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeCommit(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +290,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction during prepare
+				 * then DecodeAbort needs to call rollback prepared.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+						ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeAbort(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -582,10 +629,14 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool already_decoded)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -640,7 +691,85 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		return;
 	}
 
-	/* tell the reorderbuffer about the surviving subtransactions */
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (already_decoded)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* tell the reorderbuffer about the surviving subtransactions */
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr, buf->endptr);
+		}
+
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ *
+ * Note that we don't skip prepare even if we have detected concurrent abort.
+ * The reason is that we may already have sent some changes before detecting
+ * the abort, in which case those changes must be aborted on the subscriber.
+ * To abort such changes, we send the prepare and then the rollback prepared,
+ * which is what happened on the publisher side as well. We could invent a
+ * new abort API that, in such cases, sends an abort and skips the prepare and
+ * rollback prepared, but that is not straightforward because we might have
+ * streamed this transaction by that time, in which case it is handled when
+ * the rollback is encountered. It is not impossible to optimize the
+ * concurrent abort case, but doing so would add design complexity for
+ * handling the different cases, so we leave it for now as it doesn't seem
+ * worth it.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		commit_time = parsed->origin_timestamp;
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeCommit for
+	 * the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed; removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache invalidations
+	 * if there are any for the reasons mentioned in DecodeCommit.
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/* Tell the reorderbuffer about the surviving subtransactions. */
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
@@ -648,33 +777,67 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool already_decoded)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool		skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	skip_xact = SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id);
+
+	/*
+	 * Send the final rollback record if the transaction data is already
+	 * decoded and we don't need to skip it, otherwise, perform the cleanup of
+	 * the transaction.
+	 */
+	if (already_decoded && !skip_xact)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
 	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 301baff..83d15bb 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,18 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or decoding them at PREPARE. Keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots.
+ *
+ * We additionally remove tuplecids after decoding the transaction at prepare
+ * time as we only need to perform invalidation at rollback or commit prepared.
+ *
+ * 'txn_prepared' indicates that we have decoded the transaction at prepare
+ * time; in that case, keep only the invalidations and snapshots.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1552,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1586,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1757,9 +1794,10 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * If the transaction was (partially) streamed, we need to commit it in a
- * 'streamed' way.  That is, we first stream the remaining part of the
- * transaction, and then invoke stream_commit message.
+ * If the transaction was (partially) streamed, we need to prepare or commit
+ * it in a 'streamed' way.  That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_prepare or stream_commit callback,
+ * as the case may be.
  */
 static void
 ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1769,29 +1807,49 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		/*
+		 * Note, we send stream prepare even if a concurrent abort is detected.
+		 * See DecodePrepare for more information.
+		 */
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of COMMIT PREPARED, so for now
+		 * just truncate the txn by removing its changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
  * Set xid to detect concurrent aborts.
  *
- * While streaming an in-progress transaction there is a possibility that the
- * (sub)transaction might get aborted concurrently.  In such case if the
- * (sub)transaction has catalog update then we might decode the tuple using
- * wrong catalog version.  For example, suppose there is one catalog tuple with
- * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
- * and after that we will have two tuples (xmin: 500, xmax: 501) and
- * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
- * say 502 updates the same catalog tuple then the first tuple will be changed
- * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
- * the tuple inserted/updated in 501 after the catalog update, we will see the
- * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
- * consider that the tuple is deleted by xid 502 which is not visible to our
- * snapshot.  And when we will try to decode with that catalog tuple, it can
- * lead to a wrong result or a crash.  So, it is necessary to detect
- * concurrent aborts to allow streaming of in-progress transactions.
+ * While streaming an in-progress transaction or decoding a prepared
+ * transaction there is a possibility that the (sub)transaction might get
+ * aborted concurrently.  In such case if the (sub)transaction has catalog
+ * update then we might decode the tuple using wrong catalog version.  For
+ * example, suppose there is one catalog tuple with (xmin: 500, xmax: 0).  Now,
+ * the transaction 501 updates the catalog tuple and after that we will have
+ * two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).  Now, if 501 is
+ * aborted and some other transaction say 502 updates the same catalog tuple
+ * then the first tuple will be changed to (xmin: 500, xmax: 502).  So, the
+ * problem is that when we try to decode the tuple inserted/updated in 501
+ * after the catalog update, we will see the catalog tuple with (xmin: 500,
+ * xmax: 502) as visible because it will consider that the tuple is deleted by
+ * xid 502 which is not visible to our snapshot.  And when we will try to
+ * decode with that catalog tuple, it can lead to a wrong result or a crash.
+ * So, it is necessary to detect concurrent aborts to allow streaming of
+ * in-progress transactions or decoding of prepared transactions.
  *
  * For detecting the concurrent abort we set CheckXidAlive to the current
  * (sub)transaction's xid for which this change belongs to.  And, during
@@ -1800,7 +1858,10 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * and discard the already streamed changes on such an error.  We might have
  * already streamed some of the changes for the aborted (sub)transaction, but
  * that is fine because when we decode the abort we will stream abort message
- * to truncate the changes in the subscriber.
+ * to truncate the changes in the subscriber. Similarly, for prepared
+ * transactions, we stop decoding if a concurrent abort is detected and then
+ * roll back the changes when ROLLBACK PREPARED is encountered. See
+ * DecodePrepare.
  */
 static inline void
 SetupCheckXidLive(TransactionId xid)
@@ -1899,8 +1960,10 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/*
+	 * Discard the changes that we just streamed.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1912,15 +1975,19 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		specinsert = NULL;
 	}
 
-	/* Stop the stream. */
-	rb->stream_stop(rb, txn, last_lsn);
-
-	/* Remember the command ID and snapshot for the streaming run */
-	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	/*
+	 * For the streaming case, stop the stream and remember the command ID and
+	 * snapshot for the streaming run.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_stop(rb, txn, last_lsn);
+		ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	}
 }
 
 /*
- * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ * Helper function for ReorderBufferReplay and ReorderBufferStreamTXN.
  *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
@@ -2006,8 +2073,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			prev_lsn = change->lsn;
 
-			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			/*
+			 * Set the current xid to detect concurrent aborts. This is
+			 * required for the cases when we decode the changes before the
+			 * COMMIT record is processed.
+			 */
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2298,7 +2369,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2332,15 +2412,22 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the four reasons:
+		 * 1. Decoding an in-progress txn.
+		 * 2. Decoding a prepared txn.
+		 * 3. Decoding of a prepared txn that was (partially) streamed.
+		 * 4. Decoding a committed txn.
+		 *
+		 * For 1, we allow truncation of txn data by removing the changes already
+		 * streamed but still keeping other things like invalidations, snapshot,
+		 * and tuplecids. For 2 and 3, we tell ReorderBufferTruncateTXN to
+		 * do a more elaborate truncation of txn data as the entire transaction has
+		 * been decoded except for commit. For 4, as the entire txn has been
+		 * decoded, we can fully clean up the TXN reorder buffer.
 		 */
-		if (streaming)
+		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
-
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2373,17 +2460,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2392,6 +2482,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
@@ -2413,26 +2508,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
  * transaction with ReorderBufferAssignChild.
  *
- * This interface is called once a toplevel commit is read for both streamed
- * as well as non-streamed transactions.
+ * This interface is called once a prepare or toplevel commit is read for both
+ * streamed as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferReplay(ReorderBufferTXN *txn,
+					ReorderBuffer *rb, TransactionId xid,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2462,7 +2550,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	if (txn->base_snapshot == NULL)
 	{
 		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+
+		/*
+		 * Removing this txn before a commit might result in the computation
+		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
+		 */
+		if (!rbtxn_prepared(txn))
+			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
 
@@ -2474,6 +2568,123 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, so skip preparing it */
+	if (txn == NULL)
+		return true;
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+		rb->commit_prepared(rb, txn, commit_lsn);
+	}
+	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+		rb->rollback_prepared(rb, txn, commit_lsn);
+	}
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2605,6 +2816,39 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ *
+ * Note that this is a special-purpose function for prepared transactions where
+ * we don't want to clean up the TXN even when we decide to skip it. See
+ * DecodePrepare.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9b8eced..a6b43b6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -234,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -619,12 +640,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -642,6 +669,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1

v25-0003-Support-2PC-test-cases-for-test_decoding.patch (application/octet-stream)
From a54245bed747ddefcc586d324dca8259db4ed32c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:32:43 -0500
Subject: [PATCH v25 3/7] Support 2PC test cases for test_decoding.

Add sql tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                     |   2 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 ++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 5 files changed, 588 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..2c4acdc 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,7 +4,7 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
-- 
1.8.3.1
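
The "_nodecode" cases in the tests above rely on the output plugin declining to decode certain prepared transactions at PREPARE time, so that they are decoded as ordinary transactions at COMMIT PREPARED instead. Below is a minimal sketch of such a filter, assuming the rule is a plain substring match on the GID, as the test comments suggest. The function name gid_skips_prepare_decoding is hypothetical and is not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): decide whether a prepared
 * transaction should be skipped at PREPARE time and decoded as an ordinary
 * transaction at COMMIT PREPARED instead.
 *
 * Assumption: the rule is a plain substring match on the GID.
 */
bool
gid_skips_prepare_decoding(const char *gid)
{
	assert(gid != NULL);

	/* Any GID containing "_nodecode" is filtered out at PREPARE time. */
	return strstr(gid, "_nodecode") != NULL;
}
```

Under that assumption, 'test_prepared_nodecode' and 'test1_nodecode' would be filtered, while 'test_prepared#1' and 'test1' would be decoded at PREPARE time, which matches the expected output in the tests.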

v25-0004-2pc-test-cases-for-testing-concurrent-aborts.patch (application/octet-stream)
From e8bc74ce88bc06722f47d4846b9c2293da48b0b1 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 06:15:57 -0500
Subject: [PATCH v25 4/7] 2pc test cases for testing concurrent aborts.

Add TAP tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  36 ++++++
 4 files changed, 292 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2c4acdc..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# Test logical decoding of 2PC transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# Test a concurrent abort while logical decoding is in progress. We pass the
+# xid of the 2PC transaction to the plugin as an option.
+# On receipt of a valid "check-xid-aborted" xid, the change callback in the
+# test_decoding plugin waits for that transaction to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while the decode
+# is waiting.
+#
+# The status of the "check-xid-aborted" xid then changes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# Test logical decoding of streamed 2PC transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# Test a concurrent abort while logical decoding is in progress. We pass the
+# xid of the 2PC transaction to the plugin as an option.
+# On receipt of a valid "check-xid-aborted" xid, the change callback in the
+# test_decoding plugin waits for that transaction to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while the decode
+# is waiting.
+#
+# The status of the "check-xid-aborted" xid then changes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+    PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index c42de64..ef9abdc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -475,6 +475,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, it was passed in as an option so
+	 * that we wait here until the transaction with this xid is aborted. This
+	 * is used to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -624,6 +648,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -710,6 +737,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -923,6 +953,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -976,6 +1009,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
-- 
1.8.3.1
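
Both helpers in the patch above are polling loops: test_concurrent_aborts() in test_decoding.c re-checks the transaction state every 10 ms until it is no longer in progress, and poll_output_until() in the TAP tests re-reads the server log every 0.1 s for up to 180 seconds. Below is a minimal standalone sketch of the bounded variant, written in plain C so that it does not depend on the backend. The names poll_until, poll_predicate and countdown_reached_zero are hypothetical and are not PostgreSQL APIs; the pause between attempts is left to a caller-supplied callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Predicate polled by the loop; returns true once the awaited event happened. */
typedef bool (*poll_predicate) (void *arg);

/*
 * Hypothetical helper: call "done" up to max_attempts times, invoking
 * "pause_fn" (if not NULL) after each unsuccessful attempt.  Returns true as
 * soon as the predicate holds, false once the attempts are exhausted.
 */
bool
poll_until(poll_predicate done, void *arg, int max_attempts,
		   void (*pause_fn) (void))
{
	assert(done != NULL);

	for (int attempt = 0; attempt < max_attempts; attempt++)
	{
		if (done(arg))
			return true;
		if (pause_fn != NULL)
			pause_fn();
	}
	return false;
}

/*
 * Example predicate standing in for "has the transaction finished?":
 * counts down on every call and reports success once it reaches zero.
 */
bool
countdown_reached_zero(void *arg)
{
	int		   *remaining = arg;

	if (*remaining == 0)
		return true;
	(*remaining)--;
	return false;
}
```

The bound matters in the TAP tests, where a missing log line must eventually fail the test rather than hang it. The C helper in the patch loops without a bound and instead relies on CHECK_FOR_INTERRUPTS() to let the query be cancelled.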

v25-0005-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From 588e501929c078cd7d6efb72cfefe01242d63035 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:43:38 -0500
Subject: [PATCH v25 5/7] Support 2PC txn - spoolfile.

This patch is a pure refactoring: it moves the streaming spool-file processing into a separate function.
The two-phase commit logic added by later patches needs to call this common processing from multiple places.
---
 src/backend/replication/logical/worker.c | 58 +++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0468491..9fa816c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -933,30 +935,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +957,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +972,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1041,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1077,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
-- 
1.8.3.1

v25-0006-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 8a280ff8d1b1ac684ea79ed5a61ae7599ea04916 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:52:45 -0500
Subject: [PATCH v25 6/7] Support 2PC txn - pgoutput.

This patch adds support for handling two-phase commits in the pgoutput
plugin and in the subscriber.

It includes both the pgoutput changes and the subscriber changes.
---
 src/backend/access/transam/twophase.c       |  33 +++-
 src/backend/replication/logical/proto.c     | 141 ++++++++++++++++-
 src/backend/replication/logical/worker.c    | 236 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c |  74 +++++++++
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++-
 src/tools/pgindent/typedefs.list            |   1 +
 7 files changed, 518 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9b..00b4497 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check whether a valid prepared transaction with the given GID exists
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..cfb94d1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported; [COMMIT|ROLLBACK]
+	 * PREPARED uses the non-streaming APIs.
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9fa816c..f1e94ad 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -742,6 +742,234 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * A prepared transaction can be aborted on the publisher while it is
+	 * still being decoded. In that case we stop decoding and abort the
+	 * transaction immediately, so the subscriber may never have prepared it.
+	 * However, the ROLLBACK PREPARED still reaches the subscriber, so it is
+	 * OK for the GID to be missing here.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare_txn (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1969,6 +2197,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..71ac431 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +392,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +913,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..0691fc5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -54,10 +54,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +116,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +124,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* A prepare message is exactly one of PREPARE, COMMIT PREPARED or ROLLBACK PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +153,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +200,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f4d4703..4546572 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,6 +1342,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

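For anyone reading the protocol part of 0006 without a PostgreSQL tree at hand, here is a rough standalone sketch of the PREPARE message body that logicalrep_write_prepare() emits after the 'P' type byte: one flags byte, then prepare_lsn, end_lsn and the timestamp as 64-bit integers in network byte order, then the NUL-terminated GID. It deliberately does not use the pqformat API. The names PrepareSketch, SKETCH_GIDSIZE, put_u64_be, get_u64_be, encode_prepare and decode_prepare are made up for illustration, and the buffer size of 200 is an assumption rather than something taken from the patch. The decoder also refuses a GID that would not fit in the fixed buffer, which is the case a plain strcpy() on the read side would not catch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flag values as defined in the patch's logicalproto.h. */
#define LOGICALREP_IS_PREPARE            0x01
#define LOGICALREP_IS_COMMIT_PREPARED    0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED  0x04

/* Hypothetical buffer size for this sketch only. */
#define SKETCH_GIDSIZE 200

/* Hypothetical stand-in for LogicalRepPrepareData. */
typedef struct PrepareSketch
{
	uint8_t		prepare_type;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		preparetime;
	char		gid[SKETCH_GIDSIZE];
} PrepareSketch;

/* Same rule as PrepareFlagsAreValid: exactly one of the three values. */
static int
prepare_flags_are_valid(uint8_t flags)
{
	return flags == LOGICALREP_IS_PREPARE ||
		flags == LOGICALREP_IS_COMMIT_PREPARED ||
		flags == LOGICALREP_IS_ROLLBACK_PREPARED;
}

/* Store a 64-bit value most significant byte first. */
static void
put_u64_be(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--)
	{
		p[i] = (uint8_t) (v & 0xFF);
		v >>= 8;
	}
}

/* Load a 64-bit value stored most significant byte first. */
static uint64_t
get_u64_be(const uint8_t *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Encode the message body; returns bytes written, or 0 on failure. */
static size_t
encode_prepare(uint8_t *buf, size_t buflen, const PrepareSketch *in)
{
	size_t		gidlen = strlen(in->gid) + 1;	/* include the NUL */
	size_t		need = 1 + 8 + 8 + 8 + gidlen;

	if (!prepare_flags_are_valid(in->prepare_type) || need > buflen)
		return 0;

	buf[0] = in->prepare_type;
	put_u64_be(buf + 1, in->prepare_lsn);
	put_u64_be(buf + 9, in->end_lsn);
	put_u64_be(buf + 17, (uint64_t) in->preparetime);
	memcpy(buf + 25, in->gid, gidlen);
	return need;
}

/* Decode the message body; returns 1 on success, 0 if malformed. */
static int
decode_prepare(const uint8_t *buf, size_t len, PrepareSketch *out)
{
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	if (len < 26 || !prepare_flags_are_valid(buf[0]))
		return 0;

	gid = buf + 25;
	nul = memchr(gid, '\0', len - 25);
	if (nul == NULL)
		return 0;				/* GID is not terminated */
	gidlen = (size_t) (nul - gid);
	if (gidlen >= SKETCH_GIDSIZE)
		return 0;				/* GID would overflow the fixed buffer */

	out->prepare_type = buf[0];
	out->prepare_lsn = get_u64_be(buf + 1);
	out->end_lsn = get_u64_be(buf + 9);
	out->preparetime = (int64_t) get_u64_be(buf + 17);
	memcpy(out->gid, gid, gidlen + 1);
	return 1;
}
```

To try it, fill a PrepareSketch with one of the three flag values and a GID such as 'test_prepared_tab_full', pass it through encode_prepare() and feed the resulting bytes to decode_prepare(). A flags byte of 0x03, or a buffer cut off before the terminating NUL, should be rejected.
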
v25-0007-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 201431095c0986f6f419d8514d7e6142048c3cf7 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:59:07 -0500
Subject: [PATCH v25 7/7] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber tests (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are rolled back on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#126Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#125)

On Sun, Nov 22, 2020 at 12:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I am planning to continue review of these patches but I thought it is
better to check about the above changes before proceeding further. Let
me know what you think?

I've had a look at the changes and done a few tests, and have no
comments. However, I did see that the test 002_twophase_streaming.pl
failed once. I've run it at least 30 times after that but haven't seen
it fail again.
Unfortunately, my ulimit was not set up to create core dumps, so I
don't have a dump from when the test case failed. I will continue
testing and reviewing the changes.
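
As an aside on the core-dump limit mentioned above, here is a minimal
sketch (not part of the patch set) of how a process can lift its own
soft core-file limit to the hard limit using the POSIX getrlimit() and
setrlimit() calls. The function name enable_core_dumps is hypothetical
and exists only for this illustration; the limit is inherited by child
processes, so it would have to be raised before the server is started.

```c
#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_CORE limit to the current hard limit so that a
 * crash in this process (or in children started afterwards) can produce
 * a core file.  Returns 0 on success, -1 on failure.
 *
 * Hypothetical helper for illustration only; it is not taken from the
 * PostgreSQL sources.
 */
int
enable_core_dumps(void)
{
	struct rlimit rl;

	/* Read the current soft and hard limits. */
	if (getrlimit(RLIMIT_CORE, &rl) != 0)
		return -1;

	/* An unprivileged process may raise the soft limit up to the hard limit. */
	rl.rlim_cur = rl.rlim_max;

	if (setrlimit(RLIMIT_CORE, &rl) != 0)
		return -1;

	return 0;
}
```

If the hard limit itself is zero, this still succeeds but no core file
will be written; in that case the limit has to be changed by the
administrator. For a manually started server, pg_ctl's -c (--core-files)
option is meant to achieve a similar effect.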

regards,
Ajin Cherian
Fujitsu Australia

#127Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#126)

On Mon, Nov 23, 2020 at 3:41 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Sun, Nov 22, 2020 at 12:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I am planning to continue review of these patches but I thought it is
better to check about the above changes before proceeding further. Let
me know what you think?

I've had a look at the changes and done a few tests, and have no
comments.

Okay, thanks. Additionally, I have analyzed whether we need to call
SnapbuildCommittedTxn in DecodePrepare as was raised earlier for this
patch [1]. As mentioned in that thread SnapbuildCommittedTxn primarily
does three things (a) Decide whether we are interested in tracking the
current txn effects and if we are, mark it as committed. (b) Build and
distribute snapshot to all RBTXNs, if it is important. (c) Set base
snap of our xact if it did DDL, to execute invalidations during
replay.

For the first two, the xact is still not visible to others, so we
don't need to make it behave like a committed txn. To make the (DDL)
changes visible to the current txn, the message
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID copies the snapshot which
fills the subxip array. This will be sufficient to make the changes
visible to the current txn. For the third, I have checked in the code
that whenever we have any change message, the base snapshot gets set
via SnapBuildProcessChange. It is possible that I have missed
something, but I don't want to call SnapbuildCommittedTxn in
DecodePrepare unless we have a clear reason to do so, so I am leaving
it as is for now. Can you or someone else see any reason to call it?

However, I did see that the test 002_twophase_streaming.pl
failed once. I've run it at least 30 times after that but haven't seen
it fail again.

This test is based on waiting to see some message in the log. It is
possible that it failed due to a timeout, which should happen only
rarely. You can check for failure logs in the test_decoding folder
(probably in the tmp_check folder). Even a server or test log would
help us to diagnose the problem.
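
To illustrate why such a test can fail only occasionally, here is a
minimal, self-contained sketch of the "wait for a message in the log"
pattern with a deadline. It is not the test's actual code (the TAP
tests are written in Perl); the function names log_contains and
wait_for_log, the 100ms polling interval, and the 4096-byte line buffer
are all assumptions made for this example. If the server is slow enough
that the message appears after the deadline, the wait reports a failure
even though nothing is wrong with decoding.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L	/* for nanosleep() */
#endif

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Return true if any line of the file at 'path' contains 'needle'.
 * A file that does not exist yet is treated as "message not found".
 * Lines longer than the buffer are read in pieces, so a needle that
 * straddles a piece boundary would be missed (acceptable for a sketch).
 */
bool
log_contains(const char *path, const char *needle)
{
	FILE	   *fp;
	char		line[4096];
	bool		found = false;

	assert(path != NULL && needle != NULL);

	fp = fopen(path, "r");
	if (fp == NULL)
		return false;

	while (fgets(line, sizeof(line), fp) != NULL)
	{
		if (strstr(line, needle) != NULL)
		{
			found = true;
			break;
		}
	}

	fclose(fp);
	return found;
}

/*
 * Poll the log every 100ms until 'needle' appears or 'timeout_secs'
 * seconds have passed.  Returns true if the message was seen, false on
 * timeout.  The log is always checked at least once.
 */
bool
wait_for_log(const char *path, const char *needle, int timeout_secs)
{
	time_t		deadline = time(NULL) + timeout_secs;
	struct timespec nap = {0, 100 * 1000 * 1000L};

	for (;;)
	{
		if (log_contains(path, needle))
			return true;
		if (time(NULL) >= deadline)
			return false;		/* timed out: the rare failure case */
		nanosleep(&nap, NULL);
	}
}
```

A caller would treat a false return as a test failure and keep the log
file around, since the log is the main clue as to whether the message
was merely late or never emitted at all.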

[1]: /messages/by-id/87zhxrwgvh.fsf@ars-thinkpad

--
With Regards,
Amit Kapila.

#128Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#127)
7 attachment(s)

FYI - I have regenerated a new v26 set of patches.

PSA

v26-0001 - no change
v26-0002 - no change
v26-0003 - only filename changed (for consistency)
v26-0004 - only filename changed (for consistency)
v26-0005 - no change
v26-0006 - minor code change to have more consistently located calls
to process_syncing_tables
v26-0007 - no change

---
Kind Regards
Peter Smith.
Fujitsu Australia.

Attachments:

v26-0001-Support-2PC-txn-base.patch (application/octet-stream)
From a2b3f0cd4aa5e79fa68f922251522bcb0efc7450 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:21:54 +1100
Subject: [PATCH v26] Support 2PC txn base.

Until now two-phase transaction commands were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the two-phase semantics
were communicated to the subscriber.

This patch provides infrastructure for logical decoding plugins to be informed of
two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, together with the corresponding GID.

Includes logical decoding plugin API infrastructure changes.

Includes contrib/test_decoding changes.

Includes documentation changes.
---
 contrib/test_decoding/test_decoding.c     | 172 +++++++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 146 +++++++++++++++++-
 src/backend/replication/logical/logical.c | 242 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  46 ++++++
 src/include/replication/reorderbuffer.h   |  35 +++++
 src/tools/pgindent/typedefs.list          |  11 ++
 7 files changed, 650 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278b..c42de64 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,12 +11,16 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
+#include "access/transam.h"
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
 
+#include "storage/procarray.h"
+
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -35,6 +39,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -87,6 +92,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -102,6 +110,18 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 									  ReorderBufferTXN *txn,
 									  int nrelations, Relation relations[],
 									  ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 
 void
 _PG_init(void)
@@ -126,10 +146,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
 	cb->stream_truncate_cb = pg_decode_stream_truncate;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 }
 
 
@@ -141,6 +166,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -150,6 +176,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -241,6 +268,35 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+
+				errno = 0;
+				data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +308,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +377,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate
+ * simple logic that checks the GID: if it contains the "_nodecode"
+ * substring, the transaction is filtered out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -642,6 +786,7 @@ static void
 pg_output_stream_start(LogicalDecodingContext *ctx, TestDecodingData *data, ReorderBufferTXN *txn, bool last_write)
 {
 	OutputPluginPrepareWrite(ctx, last_write);
+
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "opening a streamed block for transaction TXN %u", txn->xid);
 	else
@@ -702,6 +847,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
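
A side note on the "check-xid-aborted" option parsing in pg_decode_startup() above: strtoul() is called with a NULL end pointer, so a value such as "12abc" would be accepted as 12. Below is a minimal standalone sketch of a stricter parse. It is not part of the patch; SketchXid, SKETCH_INVALID_XID and sketch_parse_xid() are hypothetical names that stand in for TransactionId, InvalidTransactionId and the inline parsing code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's TransactionId (a 32-bit unsigned value). */
typedef uint32_t SketchXid;

/* Stand-in for InvalidTransactionId, which is 0 in PostgreSQL. */
#define SKETCH_INVALID_XID ((SketchXid) 0)

/*
 * Hypothetical helper: parse an option string into an XID.
 * Returns true and stores the value in *xid on success, false otherwise.
 */
static bool
sketch_parse_xid(const char *value, SketchXid *xid)
{
	char	   *end;
	unsigned long parsed;

	if (value == NULL || *value == '\0')
		return false;

	errno = 0;
	parsed = strtoul(value, &end, 0);

	/* Reject range errors, trailing garbage, and values wider than 32 bits. */
	if (errno != 0 || *end != '\0' || parsed > UINT32_MAX)
		return false;

	/* Reject the invalid XID, as the patch does via TransactionIdIsValid(). */
	if ((SketchXid) parsed == SKETCH_INVALID_XID)
		return false;

	*xid = (SketchXid) parsed;
	return true;
}
```

Whether the test plugin needs this level of strictness is a judgment call, since the option only exists for the regression tests.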
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..f5b617d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -387,11 +387,16 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeTruncateCB truncate_cb;
     LogicalDecodeCommitCB commit_cb;
     LogicalDecodeMessageCB message_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits, which are
+    decoded on <command>PREPARE TRANSACTION</command>. The <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>, <function>commit_prepared_cb</function>
+    and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,13 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too.
     </para>
 
     <note>
@@ -578,6 +598,56 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callbacks for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called whenever
+      a <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called whenever
+      a <command>ROLLBACK PREPARED</command> has been decoded. The <parameter>gid</parameter> field,
+      which is part of the <parameter>txn</parameter> parameter, can be used in this
+      callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-change">
      <title>Change Callback</title>
 
@@ -587,7 +657,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or an in-progress transaction, this
+      change callback might also error out due to a concurrent rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -664,6 +740,39 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
      </para>
      </sect3>
 
+     <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding at prepare time should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents
+      as for the other callbacks. The <parameter>txn</parameter> parameter
+      contains meta information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some cases.
+      The <parameter>gid</parameter> is the identifier that later identifies this
+      transaction for <command>COMMIT PREPARED</command> or <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-message">
      <title>Generic Message Callback</title>
 
@@ -685,7 +794,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or an in-progress transaction, this message
+      callback might also error out due to a concurrent rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -735,6 +850,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                            ReorderBufferTXN *txn,
+                                            XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1041,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, then committed using the
+    <function>commit_prepared_cb</function> callback or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
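
To illustrate the filter_prepare_cb contract documented above (return true to skip decoding at PREPARE time, and give the same answer every time for a given GID), here is a minimal standalone sketch modelled on the test_decoding filter. sketch_filter_prepare() is a hypothetical name, and the NULL handling is an assumption of this sketch, not something the patch specifies.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical standalone version of a GID-based prepare filter.
 * Returns true when decoding at PREPARE time should be skipped, in which
 * case the transaction is decoded as a regular one at COMMIT PREPARED.
 */
static bool
sketch_filter_prepare(const char *gid)
{
	/* Assumption for this sketch: a missing GID is never filtered. */
	if (gid == NULL)
		return false;

	/* The answer depends only on the GID, so repeated calls agree. */
	return strstr(gid, "_nodecode") != NULL;
}
```

Because the result is a pure function of the GID, the filter gives the same answer whenever it is consulted for the same transaction.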
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4324e32..009db5f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -207,6 +217,10 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->apply_change = change_cb_wrapper;
 	ctx->reorder->apply_truncate = truncate_cb_wrapper;
 	ctx->reorder->commit = commit_cb_wrapper;
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
 	ctx->reorder->message = message_cb_wrapper;
 
 	/*
@@ -227,6 +241,21 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require the prepare,
+	 * commit_prepared and rollback_prepared callbacks. The filter_prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the callbacks is provided so that we can easily identify
+	 * missing callbacks.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +266,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +813,129 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("output plugin did not register prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the commit prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("output plugin did not register commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then the rollback prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("output plugin did not register rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1012,52 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1256,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
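
The way ctx->twophase is derived above (set if any two-phase callback is registered, then ANDed with the plugin's own option in pg_decode_startup) can be modelled outside the backend. The sketch below is standalone and not part of the patch. All Sketch* and sketch_* names are hypothetical, and the eager "first missing" check is only an illustration: the patch itself reports a missing callback lazily, in the wrapper that would have called it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef void (*SketchCallback) (void);

/* Hypothetical stand-in for the two-phase part of OutputPluginCallbacks. */
typedef struct SketchCallbacks
{
	SketchCallback prepare_cb;
	SketchCallback commit_prepared_cb;
	SketchCallback rollback_prepared_cb;
	SketchCallback stream_prepare_cb;
	SketchCallback filter_prepare_cb;
} SketchCallbacks;

/* Two-phase decoding is on if any callback is set AND the option allows it. */
static bool
sketch_twophase_enabled(const SketchCallbacks *cb, bool plugin_option)
{
	bool		any_registered = cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;

	return any_registered && plugin_option;
}

/* Name of the first missing required (non-streaming) callback, or NULL. */
static const char *
sketch_first_missing_required(const SketchCallbacks *cb)
{
	if (cb->prepare_cb == NULL)
		return "prepare_cb";
	if (cb->commit_prepared_cb == NULL)
		return "commit_prepared_cb";
	if (cb->rollback_prepared_cb == NULL)
		return "rollback_prepared_cb";
	return NULL;
}

/* Dummy callback used only to populate the struct in examples. */
static void
sketch_noop(void)
{
}
```

A plugin that registers only prepare_cb therefore gets two-phase decoding enabled, and the gap surfaces as an ERROR the first time a COMMIT PREPARED or ROLLBACK PREPARED is decoded.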
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..032e35a 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -77,6 +77,39 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+/*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/rollback_prepared callbacks or wait till COMMIT PREPARED
+ * and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /*
  * Called for the generic logical decoding messages.
  */
@@ -124,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -171,12 +212,17 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeTruncateCB truncate_cb;
 	LogicalDecodeCommitCB commit_cb;
 	LogicalDecodeMessageCB message_cb;
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7e..9b8eced 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -244,6 +245,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -410,6 +414,26 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 									   ReorderBufferTXN *txn,
 									   XLogRecPtr commit_lsn);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* message callback signature */
 typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -436,6 +460,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -502,6 +532,10 @@ struct ReorderBuffer
 	ReorderBufferApplyChangeCB apply_change;
 	ReorderBufferApplyTruncateCB apply_truncate;
 	ReorderBufferCommitCB commit;
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
 	ReorderBufferMessageCB message;
 
 	/*
@@ -510,6 +544,7 @@ struct ReorderBuffer
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fde701b..f4d4703 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1316,9 +1316,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1
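
The v26-0002 patch that follows computes an already_decoded flag in DecodeXactOp for COMMIT PREPARED and ROLLBACK PREPARED records. As a rough, standalone model of that decision, the logic can be sketched as below. This is not PostgreSQL code: RecordKind, FilterPrepareFn, already_decoded_at_prepare and skip_nodecode_gids are names invented for illustration, and the real check goes through ReorderBufferPrepareNeedSkip, which also treats an unknown transaction as skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the xact record kinds seen by DecodeXactOp. */
typedef enum RecordKind
{
	REC_COMMIT,
	REC_COMMIT_PREPARED,
	REC_ABORT,
	REC_ABORT_PREPARED
} RecordKind;

/*
 * Hypothetical filter callback: returns true if the output plugin wants to
 * skip decoding this GID at PREPARE time.
 */
typedef bool (*FilterPrepareFn) (const char *gid);

/*
 * Model of the already_decoded computation: the transaction data was already
 * sent at PREPARE only if the record finishes a prepared transaction, the
 * plugin supports two-phase decoding, and the filter (if any) did not ask to
 * skip it.
 */
static bool
already_decoded_at_prepare(RecordKind kind, bool twophase,
						   FilterPrepareFn filter, const char *gid)
{
	bool		finishes_prepared = (kind == REC_COMMIT_PREPARED ||
									 kind == REC_ABORT_PREPARED);

	if (!finishes_prepared || !twophase)
		return false;

	return !(filter != NULL && filter(gid));
}

/* Example filter (an assumption): skip any GID starting with "nodecode". */
static bool
skip_nodecode_gids(const char *gid)
{
	return strncmp(gid, "nodecode", 8) == 0;
}
```

When the model returns true, the patch sends only the final commit_prepared or rollback_prepared event; otherwise it falls back to the regular commit or abort path.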

Attachment: v26-0002-Support-2PC-txn-backend.patch (application/octet-stream)
From d570feb9c3248631048ea0df2db07cdfe067f225 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:24:07 +1100
Subject: [PATCH v26] Support 2PC txn backend.

Until now, two-phase transactions were decoded at COMMIT, just like
regular transactions. During replay, two-phase transactions were
translated into regular transactions on the subscriber, and the GID
was not forwarded to it.

This patch allows PREPARE-time decoding of two-phase transactions (if
the output plugin supports this capability), in which case the
transactions are replayed at PREPARE and then committed later when
COMMIT PREPARED arrives.

Includes backend changes to support decoding of PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED.
---
 src/backend/replication/logical/decode.c        | 215 ++++++++++++--
 src/backend/replication/logical/reorderbuffer.c | 374 ++++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  34 +++
 3 files changed, 532 insertions(+), 91 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..dfba406 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,9 +67,14 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool already_decoded);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool already_decoded);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
@@ -244,6 +249,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +259,19 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction data then
+				 * DecodeCommit doesn't need to decode it again. This is
+				 * possible iff output plugin supports two-phase commits and
+				 * doesn't skip the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+								ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeCommit(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +280,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +290,17 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction during prepare
+				 * then DecodeAbort needs to call rollback prepared.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+				{
+					already_decoded = !(ctx->callbacks.filter_prepare_cb &&
+						ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+				}
+
+				DecodeAbort(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,35 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ctx->callbacks.filter_prepare_cb &&
+					ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -582,10 +629,14 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool already_decoded)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -640,7 +691,85 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		return;
 	}
 
-	/* tell the reorderbuffer about the surviving subtransactions */
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (already_decoded)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* tell the reorderbuffer about the surviving subtransactions */
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr, buf->endptr);
+		}
+
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ *
+ * Note that we don't skip prepare even if we have detected concurrent abort.
+ * It is quite possible that we had already sent some changes before we
+ * detected the abort, in which case we need to abort those changes in the
+ * subscriber. To abort such changes, we send the prepare and then the
+ * rollback prepared, which is what happened on the publisher side as well.
+ * We could invent a new abort API so that in such cases we send abort and
+ * skip sending prepare and rollback prepared, but that is not
+ * straightforward because we might have streamed this transaction by that
+ * time, in which case it is handled when the rollback is encountered. It is
+ * not impossible to optimize the concurrent abort case, but it can introduce
+ * design complexity w.r.t. handling different cases, so we leave it for now
+ * as it doesn't seem worth it.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		commit_time = parsed->origin_timestamp;
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeCommit for
+	 * the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed; removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache invalidations
+	 * if there are any for the reasons mentioned in DecodeCommit.
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/* Tell the reorderbuffer about the surviving subtransactions. */
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
@@ -648,33 +777,67 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 commit_time, origin_id, origin_lsn, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool already_decoded)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz commit_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool	skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		commit_time = parsed->origin_timestamp;
+	}
+
+	skip_xact = SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+		ctx->fast_forward || FilterByOrigin(ctx, origin_id);
+
+	/*
+	 * Send the final rollback record if the transaction data is already
+	 * decoded and we don't need to skip it, otherwise, perform the cleanup of
+	 * the transaction.
+	 */
+	if (already_decoded && !skip_xact)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
 	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 301baff..83d15bb 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,18 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or decoding them at PREPARE. Keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots.
+ *
+ * We additionally remove tuplecids after decoding the transaction at prepare
+ * time as we only need to perform invalidation at rollback or commit prepared.
+ *
+ * 'txn_prepared' indicates that we have decoded the transaction at prepare
+ * time, in which case only the invalidations and snapshots are kept.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1552,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1586,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1757,9 +1794,10 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * If the transaction was (partially) streamed, we need to commit it in a
- * 'streamed' way.  That is, we first stream the remaining part of the
- * transaction, and then invoke stream_commit message.
+ * If the transaction was (partially) streamed, we need to prepare or commit
+ * it in a 'streamed' way.  That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_prepare or stream_commit message, as
+ * the case may be.
  */
 static void
 ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1769,29 +1807,49 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		/*
+		 * Note, we send stream prepare even if a concurrent abort is detected.
+		 * See DecodePrepare for more information.
+		 */
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPARED, so for now
+		 * just truncate the txn by removing changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
  * Set xid to detect concurrent aborts.
  *
- * While streaming an in-progress transaction there is a possibility that the
- * (sub)transaction might get aborted concurrently.  In such case if the
- * (sub)transaction has catalog update then we might decode the tuple using
- * wrong catalog version.  For example, suppose there is one catalog tuple with
- * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
- * and after that we will have two tuples (xmin: 500, xmax: 501) and
- * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
- * say 502 updates the same catalog tuple then the first tuple will be changed
- * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
- * the tuple inserted/updated in 501 after the catalog update, we will see the
- * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
- * consider that the tuple is deleted by xid 502 which is not visible to our
- * snapshot.  And when we will try to decode with that catalog tuple, it can
- * lead to a wrong result or a crash.  So, it is necessary to detect
- * concurrent aborts to allow streaming of in-progress transactions.
+ * While streaming an in-progress transaction or decoding a prepared
+ * transaction there is a possibility that the (sub)transaction might get
+ * aborted concurrently.  In such case if the (sub)transaction has catalog
+ * update then we might decode the tuple using wrong catalog version.  For
+ * example, suppose there is one catalog tuple with (xmin: 500, xmax: 0).  Now,
+ * the transaction 501 updates the catalog tuple and after that we will have
+ * two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).  Now, if 501 is
+ * aborted and some other transaction say 502 updates the same catalog tuple
+ * then the first tuple will be changed to (xmin: 500, xmax: 502).  So, the
+ * problem is that when we try to decode the tuple inserted/updated in 501
+ * after the catalog update, we will see the catalog tuple with (xmin: 500,
+ * xmax: 502) as visible because it will consider that the tuple is deleted by
+ * xid 502 which is not visible to our snapshot.  And when we will try to
+ * decode with that catalog tuple, it can lead to a wrong result or a crash.
+ * So, it is necessary to detect concurrent aborts to allow streaming of
+ * in-progress transactions or decoding of prepared transactions.
  *
  * For detecting the concurrent abort we set CheckXidAlive to the current
  * (sub)transaction's xid for which this change belongs to.  And, during
@@ -1800,7 +1858,10 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * and discard the already streamed changes on such an error.  We might have
  * already streamed some of the changes for the aborted (sub)transaction, but
  * that is fine because when we decode the abort we will stream abort message
- * to truncate the changes in the subscriber.
+ * to truncate the changes in the subscriber. Similarly, for prepared
+ * transactions, we stop decoding if concurrent abort is detected and then
+ * roll back the changes when rollback prepared is encountered. See
+ * DecodePrepare.
  */
 static inline void
 SetupCheckXidLive(TransactionId xid)
@@ -1899,8 +1960,10 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  XLogRecPtr last_lsn,
 					  ReorderBufferChange *specinsert)
 {
-	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	/*
+	 * Discard the changes that we just streamed.
+	 */
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1912,15 +1975,19 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		specinsert = NULL;
 	}
 
-	/* Stop the stream. */
-	rb->stream_stop(rb, txn, last_lsn);
-
-	/* Remember the command ID and snapshot for the streaming run */
-	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	/*
+	 * For the streaming case, stop the stream and remember the command ID and
+	 * snapshot for the streaming run.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_stop(rb, txn, last_lsn);
+		ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	}
 }
 
 /*
- * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ * Helper function for ReorderBufferReplay and ReorderBufferStreamTXN.
  *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
@@ -2006,8 +2073,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			prev_lsn = change->lsn;
 
-			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			/*
+			 * Set the current xid to detect concurrent aborts. This is
+			 * required for the cases when we decode the changes before the
+			 * COMMIT record is processed.
+			 */
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2298,7 +2369,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2332,15 +2412,22 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the four reasons:
+		 * 1. Decoding an in-progress txn.
+		 * 2. Decoding a prepared txn.
+		 * 3. Decoding of a prepared txn that was (partially) streamed.
+		 * 4. Decoding a committed txn.
+		 *
+		 * For 1, we allow truncation of txn data by removing the changes already
+		 * streamed but still keeping other things like invalidations, snapshot,
+		 * and tuplecids. For 2 and 3, we tell ReorderBufferTruncateTXN to
+		 * do more elaborate truncation of txn data as the entire transaction has
+		 * been decoded except for commit. For 4, as the entire txn has been
+		 * decoded, we can fully clean up the TXN reorder buffer.
 		 */
-		if (streaming)
+		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
-
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2373,17 +2460,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2392,6 +2482,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
@@ -2413,26 +2508,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
  * transaction with ReorderBufferAssignChild.
  *
- * This interface is called once a toplevel commit is read for both streamed
- * as well as non-streamed transactions.
+ * This interface is called once a prepare or toplevel commit is read for both
+ * streamed as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferReplay(ReorderBufferTXN *txn,
+					ReorderBuffer *rb, TransactionId xid,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2462,7 +2550,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	if (txn->base_snapshot == NULL)
 	{
 		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+
+		/*
+		 * Removing this txn before a commit might result in the computation
+		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
+		 */
+		if (!rbtxn_prepared(txn))
+			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
 
@@ -2474,6 +2568,123 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, so skip preparing it */
+	if (txn == NULL)
+		return true;
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					 TimestampTz commit_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time,
+							RepOriginId origin_id, XLogRecPtr origin_lsn,
+							char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+		rb->commit_prepared(rb, txn, commit_lsn);
+	}
+	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+		rb->rollback_prepared(rb, txn, commit_lsn);
+	}
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2605,6 +2816,39 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ *
+ * Note that this is a special-purpose function for prepared transactions where
+ * we don't want to clean up the TXN even when we decide to skip it. See
+ * DecodePrepare.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9b8eced..a6b43b6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -234,6 +237,24 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -619,12 +640,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -642,6 +669,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+								 TimestampTz commit_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1

Attachment: v26-0004-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From 05b52a1d4d4e7f354981d86a2e3b95ef5594f856 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:34:18 +1100
Subject: [PATCH v26] Support 2PC txn tests for concurrent aborts.

Add TAP tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  36 ++++++
 4 files changed, 292 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2c4acdc..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index c42de64..ef9abdc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -475,6 +475,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -624,6 +648,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -710,6 +737,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -923,6 +953,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -976,6 +1009,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
-- 
1.8.3.1

Attachment: v26-0003-Support-2PC-txn-tests-for-test_decoding.patch (application/octet-stream)
From 04047f9266445ab2d83be9d509b3a49c1be1d234 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:30:34 +1100
Subject: [PATCH v26] Support 2PC txn tests for test_decoding.

Add SQL tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                     |   2 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 ++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 5 files changed, 588 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..2c4acdc 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,7 +4,7 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
-- 
1.8.3.1
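
A note on the "_nodecode" cases in the tests above: the comments say that a GID containing "_nodecode" is not decoded at PREPARE time and instead appears as an ordinary BEGIN/COMMIT once COMMIT PREPARED runs. The plugin callback that implements this is not part of this section, so the following is only a minimal sketch of that rule, assuming a plain substring match. The helper name gid_skips_prepare_decoding is hypothetical and does not appear in the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): returns true when a GID should
 * be skipped at PREPARE time and decoded only at COMMIT PREPARED time.
 * Assumption: the rule is "the GID contains the substring _nodecode",
 * as the test comments describe.
 */
static bool
gid_skips_prepare_decoding(const char *gid)
{
	return gid != NULL && strstr(gid, "_nodecode") != NULL;
}
```

With this rule, 'test_prepared_nodecode' and 'test1_nodecode' are filtered, while 'test_prepared#1' and 'test1' are decoded at PREPARE.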

v26-0005-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From debf625bc1a1897f46055be03eb49d604231a5a2 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:38:15 +1100
Subject: [PATCH v26] Support 2PC txn - spoolfile.

This patch is a refactoring only: it isolates the streaming spool-file processing in a separate function.
Later, two-phase commit logic will need to call this common processing from multiple places.
---
 src/backend/replication/logical/worker.c | 58 +++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0468491..9fa816c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -933,30 +935,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +957,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +972,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1041,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1077,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
-- 
1.8.3.1
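
For readers following the protocol changes in the next patch (0006), logicalrep_write_prepare() emits the fields in this order: message type byte, flags byte, prepare LSN, end LSN, timestamp, then the GID as a NUL-terminated string. The sketch below mirrors that layout in standalone C, assuming 64-bit fields in network byte order as pq_sendint64 produces. The constants SKETCH_MSG_PREPARE, SKETCH_FLAG_PREPARE and SKETCH_GIDSIZE are placeholders chosen for illustration; the real values are defined in headers not shown here. Unlike the patch, the decoder bounds the GID copy, which is a deliberate choice for the sketch rather than a description of the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values for illustration only; not taken from the patch. */
#define SKETCH_MSG_PREPARE	'P'
#define SKETCH_FLAG_PREPARE	0x01
#define SKETCH_GIDSIZE		200
#define SKETCH_FIXED_LEN	26	/* type + flags + three 8-byte fields */

typedef struct
{
	uint8_t		flags;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[SKETCH_GIDSIZE];
} PrepareSketch;

/* Store a 64-bit value in network byte order (most significant byte first). */
static void
put_u64(uint8_t *buf, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t) (v >> (56 - 8 * i));
}

/* Load a 64-bit value stored in network byte order. */
static uint64_t
get_u64(const uint8_t *buf)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[i];
	return v;
}

/*
 * Encode one PREPARE message. The caller must supply a buffer of at least
 * SKETCH_FIXED_LEN + strlen(gid) + 1 bytes. Returns the bytes written.
 */
static size_t
encode_prepare(uint8_t *buf, uint8_t flags, uint64_t prepare_lsn,
			   uint64_t end_lsn, int64_t prepare_time, const char *gid)
{
	size_t		gidlen = strlen(gid) + 1;	/* include the terminating NUL */

	buf[0] = SKETCH_MSG_PREPARE;
	buf[1] = flags;
	put_u64(buf + 2, prepare_lsn);
	put_u64(buf + 10, end_lsn);
	put_u64(buf + 18, (uint64_t) prepare_time);
	memcpy(buf + SKETCH_FIXED_LEN, gid, gidlen);
	return SKETCH_FIXED_LEN + gidlen;
}

/* Decode one PREPARE message. Returns 0 on success, -1 if malformed. */
static int
decode_prepare(const uint8_t *buf, size_t len, PrepareSketch *out)
{
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	if (len <= SKETCH_FIXED_LEN || buf[0] != SKETCH_MSG_PREPARE)
		return -1;

	gid = buf + SKETCH_FIXED_LEN;
	nul = memchr(gid, '\0', len - SKETCH_FIXED_LEN);
	if (nul == NULL)
		return -1;				/* GID is not terminated inside the message */

	gidlen = (size_t) (nul - gid);
	if (gidlen >= SKETCH_GIDSIZE)
		return -1;				/* GID would overflow the fixed buffer */

	out->flags = buf[1];
	out->prepare_lsn = get_u64(buf + 2);
	out->end_lsn = get_u64(buf + 10);
	out->prepare_time = (int64_t) get_u64(buf + 18);
	memcpy(out->gid, gid, gidlen + 1);
	return 0;
}
```

Encoding the GID 'test1' this way yields 26 fixed bytes plus 6 bytes for the string and its terminator. The STREAM PREPARE variant in the same patch differs in that it sends the 32-bit transaction ID before the flags byte.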

v26-0006-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 2a24fc69dd402103b70158b2dbef2aa70a73232f Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:50:27 +1100
Subject: [PATCH v26] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.
---
 src/backend/access/transam/twophase.c       |  33 +++-
 src/backend/replication/logical/proto.c     | 141 +++++++++++++++-
 src/backend/replication/logical/worker.c    | 242 +++++++++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c |  74 +++++++++
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  37 ++++-
 src/tools/pgindent/typedefs.list            |   1 +
 7 files changed, 521 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9b..00b4497 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check whether a prepared transaction with the given GID exists
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..cfb94d1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs, only PREPARE is supported. [COMMIT|ROLLBACK]
+	 * PREPARED uses the non-streaming APIs.
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9fa816c..d8c1342 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -742,6 +742,234 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * A prepared transaction may be aborted concurrently while the publisher
+	 * is still decoding it. Decoding then stops immediately, so the PREPARE
+	 * may never have been applied on the subscriber. The ROLLBACK PREPARED
+	 * is still sent, however, so it is ok if the gid does not exist here;
+	 * only finish the prepared transaction when it is actually present.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare_txn (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1086,12 +1314,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	in_remote_transaction = false;
 
-	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(commit_data.end_lsn);
-
 	/* unlink the files with serialized changes and subxact info */
 	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
 
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
@@ -1969,6 +2197,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..71ac431 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +392,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +913,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..0691fc5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -54,10 +54,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +116,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +124,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +153,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +200,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f4d4703..4546572 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,6 +1342,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1
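
For readers following the protocol part of the patch above: the prepare message carries a single prepare_type byte, and the subscriber dispatches on it. The standalone sketch below is not part of the patch. It copies the three flag values from the logicalproto.h hunk and restates PrepareFlagsAreValid as a function. The names prepare_flags_are_valid, prepare_action_name and check_stream_prepare are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the logicalproto.h hunk in the patch. */
#define LOGICALREP_IS_PREPARE			0x01
#define LOGICALREP_IS_COMMIT_PREPARED	0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04

/* Function form of PrepareFlagsAreValid: exactly one flag may be set. */
bool
prepare_flags_are_valid(uint8_t flags)
{
	return flags == LOGICALREP_IS_PREPARE ||
		flags == LOGICALREP_IS_COMMIT_PREPARED ||
		flags == LOGICALREP_IS_ROLLBACK_PREPARED;
}

/*
 * Hypothetical helper: name the action that the switch in
 * apply_handle_prepare would select for a given prepare_type.
 */
const char *
prepare_action_name(uint8_t flags)
{
	switch (flags)
	{
		case LOGICALREP_IS_PREPARE:
			return "PREPARE";
		case LOGICALREP_IS_COMMIT_PREPARED:
			return "COMMIT PREPARED";
		case LOGICALREP_IS_ROLLBACK_PREPARED:
			return "ROLLBACK PREPARED";
		default:
			return "invalid";	/* the patch raises a protocol violation here */
	}
}

/*
 * Mirrors the Assert in apply_handle_stream_prepare: a streamed prepare
 * message must be a plain PREPARE.
 */
void
check_stream_prepare(uint8_t flags)
{
	assert(flags == LOGICALREP_IS_PREPARE);
}
```

Because the flags are distinct bits but only one may be set at a time, a combined value such as 0x03 is rejected by the validity check and would fall into the default branch of the dispatch.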

v26-0007-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 3840e4d97fef72f180e278ddf7f04bfde23f32c4 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 15:02:24 +1100
Subject: [PATCH v26] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are rolled back on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test (streaming)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming enabled.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#129Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#127)

On Mon, Nov 23, 2020 at 10:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

For the first two, as the xact is still not visible to others so we
don't need to make it behave like a committed txn. To make the (DDL)
changes visible to the current txn, the message
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID copies the snapshot which
fills the subxip array. This will be sufficient to make the changes
visible to the current txn. For the third, I have checked the code
that whenever we have any change message the base snapshot gets set
via SnapBuildProcessChange. It is possible that I have missed
something but I don't want to call SnapbuildCommittedTxn in
DecodePrepare unless we have a clear reason for the same so leaving it
for now. Can you or someone see any reason for the same?

I reviewed and tested this and like you said, SnapBuildProcessChange
sets the base snapshot for every change.
I did various tests using DDL updates and haven't seen any issues so
far. I agree with your analysis.

regards,
Ajin

#130Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#129)

Hi Amit.

IIUC the tablesync worker runs in a single transaction.

Last week I discovered and described [1] a problem: if (by unlucky
timing) the tablesync worker gets to handle the 2PC PREPARE
TRANSACTION, then that whole single tx gets committed, even though
a COMMIT PREPARED has not been executed yet. This means that if the
publisher subsequently does a ROLLBACK PREPARED, then the table
records on the Pub/Sub nodes will no longer match.

AFAIK this is a new problem for the current WIP patch because prior to
this the PREPARE had no decoding.

Please let me know if this issue description is still not clear.

Did you have any thoughts how we might address this issue?

---

[1]: /messages/by-id/CAHut+PuEMk4SO8oGzxc_ftzPkGA8uC-y5qi-KRqHSy_P0i30DA@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

#131Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#130)

On Wed, Nov 25, 2020 at 12:54 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Amit.

IIUC the tablesync worker runs in a single transaction.

Last week I discovered and described [1] a problem: if (by unlucky
timing) the tablesync worker gets to handle the 2PC PREPARE
TRANSACTION, then that whole single tx gets committed, even though
a COMMIT PREPARED has not been executed yet. This means that if the
publisher subsequently does a ROLLBACK PREPARED, then the table
records on the Pub/Sub nodes will no longer match.

AFAIK this is a new problem for the current WIP patch because prior to
this the PREPARE had no decoding.

Please let me know if this issue description is still not clear.

Did you have any thoughts how we might address this issue?

I think we need to disable two_phase_commit for table sync workers. We
anyway wanted to expose a parameter via subscription for that and we can
use that to do it. Also, there were some other comments [1] related to
the tablesync worker w.r.t. prepared transactions which would possibly be
addressed by doing it. Kindly check those comments [1] and let me know
if anything additional is required.
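
To make the proposal above concrete, here is a minimal sketch in C of the decision it implies: two-phase decoding is requested only by the apply worker, and only when the subscription enables it. The names SketchWorkerKind and sketch_request_two_phase are hypothetical and do not come from the patch; the real subscription parameter had not been defined at this point in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical worker kinds; illustration only, not taken from the patch. */
typedef enum SketchWorkerKind
{
	SKETCH_WORKER_APPLY,
	SKETCH_WORKER_TABLESYNC
} SketchWorkerKind;

/*
 * Decide whether a worker should ask the publisher to decode at PREPARE
 * time.  A tablesync worker applies everything inside one local
 * transaction, so in this sketch it never asks for two-phase decoding.
 */
static bool
sketch_request_two_phase(bool subscription_two_phase, SketchWorkerKind kind)
{
	assert(kind == SKETCH_WORKER_APPLY || kind == SKETCH_WORKER_TABLESYNC);

	if (kind == SKETCH_WORKER_TABLESYNC)
		return false;			/* always off for table sync */

	return subscription_two_phase;	/* apply worker follows the option */
}
```

With this shape, a tablesync worker would only see the transaction at COMMIT PREPARED time, which avoids committing a merely prepared transaction on the subscriber.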

[1]: /messages/by-id/87zhxrwgvh.fsf@ars-thinkpad

--
With Regards,
Amit Kapila.

#132Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#129)
7 attachment(s)

On Tue, Nov 24, 2020 at 3:29 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Nov 23, 2020 at 10:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

For the first two, as the xact is still not visible to others so we
don't need to make it behave like a committed txn. To make the (DDL)
changes visible to the current txn, the message
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID copies the snapshot which
fills the subxip array. This will be sufficient to make the changes
visible to the current txn. For the third, I have checked the code
that whenever we have any change message the base snapshot gets set
via SnapBuildProcessChange. It is possible that I have missed
something but I don't want to call SnapbuildCommittedTxn in
DecodePrepare unless we have a clear reason for the same so leaving it
for now. Can you or someone see any reason for the same?

I reviewed and tested this and like you said, SnapBuildProcessChange
sets the base snapshot for every change.
I did various tests using DDL updates and haven't seen any issues so
far. I agree with your analysis.

Thanks, attached is a further revised version of the patch series.

Changes in v27-0001-Extend-the-output-plugin-API-to-allow-decoding-p
a. Removed the includes which are not required by this patch.
b. Moved the 'check_xid_aborted' parameter to 0004.
c. Added Assert(!ctx->fast_forward); in callback wrappers, because we
won't load the output plugin when fast_forward is set so there is no
chance that we call output plugin APIs. This is why we have this
Assert in all the existing APIs.
d. Adjusted the order of various callback APIs to make the code look consistent.
e. Added/Edited comments and doc updates at various places. Changed
error messages to make them consistent with other similar messages.
f. Some other cosmetic changes like the removal of spurious new lines
and fixed white-space issues.
g. Updated commit message.

Changes in v27-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer
a. Moved the check of whether a particular txn can be skipped into a
separate function, as the same code for it was repeated at three
different places.
b. ReorderBufferPrepare had a parameter named commit_lsn whereas it
should be prepare_lsn. Similar changes have been made at various places
in the patch.
c. filter_prepare_cb callback existence is checked in both decode.c
and in filter_prepare_cb_wrapper. Fixed by removing it from decode.c.
d. Fixed miscellaneous comments and some cosmetic changes.
e. Moved the special elog in ReorderBufferProcessTxn to test
concurrent aborts in 0004 patch.
f. Moved the changes related to flags RBTXN_COMMIT_PREPARED and
RBTXN_ROLLBACK_PREPARED to patch 0006 as those are used only in that
patch.
g. Updated commit message.
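
As an illustration of item (c), the following is a simplified sketch of keeping the "is the callback defined?" check in exactly one place, the wrapper, so that callers need not repeat it. The Sketch* types are hypothetical stand-ins for the real LogicalDecodingContext and OutputPluginCallbacks, and the example filter mirrors the "_nodecode" GID convention used by test_decoding in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical callback type: only the GID is passed. */
typedef bool (*SketchFilterPrepareCB) (const char *gid);

typedef struct SketchCallbacks
{
	SketchFilterPrepareCB filter_prepare_cb;	/* optional, may be NULL */
} SketchCallbacks;

/*
 * Wrapper: the only place that checks whether the optional callback
 * exists.  When it is not defined, nothing is filtered (returns false).
 */
static bool
sketch_filter_prepare_wrapper(const SketchCallbacks *cb, const char *gid)
{
	assert(cb != NULL && gid != NULL);

	if (cb->filter_prepare_cb == NULL)
		return false;

	return cb->filter_prepare_cb(gid);
}

/* Example plugin filter: skip GIDs containing the "_nodecode" substring. */
static bool
sketch_filter_nodecode(const char *gid)
{
	return strstr(gid, "_nodecode") != NULL;
}
```

A caller in the decoding path would then simply call the wrapper and act on the result, without looking at the callback pointer itself.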

One problem with this patch is: what if we have assembled a consistent
snapshot after prepare and before commit prepared? In that case, it
will currently just send the commit prepared record, which would be a bad
idea. It should decode the entire transaction for such cases at commit
prepared time. The same problem was noticed by Arseny Sher, see
problem-2 in email [1].

One idea to fix this could be to check if the snapshot is consistent
to decide whether to skip the prepare, and if we skip for that
reason, then during commit we need to decode the entire transaction.
We can do that with a flag in txn->txn_flags: during prepare we set
the flag when we skip the prepare because the snapshot is not yet
consistent, and then use it during commit to see if we need to decode
the entire transaction. But here we need to think about what would
happen after a restart. Basically, if it is possible that after a
restart the snapshot is consistent for the same transaction at prepare
time and it got skipped due to start_decoding_at (which moved ahead
after the restart), then such a solution won't work. Any thoughts on
this?
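
The flag idea above could look roughly like the sketch below. This is only an illustration under stated assumptions: SKETCH_RBTXN_PREPARE_SKIPPED and the SketchTxn struct are hypothetical, and the real ReorderBufferTXN and flag values in the patch may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; the name and value are made up for illustration. */
#define SKETCH_RBTXN_PREPARE_SKIPPED 0x0001u

/* Minimal stand-in for ReorderBufferTXN, holding only the flags. */
typedef struct SketchTxn
{
	uint32_t	txn_flags;
} SketchTxn;

/*
 * At PREPARE: decode only if the snapshot is already consistent.
 * Otherwise remember that the prepare was skipped.  Returns true if the
 * prepare was decoded.
 */
static bool
sketch_on_prepare(SketchTxn *txn, bool snapshot_consistent)
{
	assert(txn != NULL);

	if (!snapshot_consistent)
	{
		txn->txn_flags |= SKETCH_RBTXN_PREPARE_SKIPPED;
		return false;
	}
	return true;
}

/*
 * At COMMIT PREPARED: if the prepare was skipped, the whole transaction
 * has to be decoded now instead of sending only the commit prepared record.
 */
static bool
sketch_commit_needs_full_decode(const SketchTxn *txn)
{
	assert(txn != NULL);

	return (txn->txn_flags & SKETCH_RBTXN_PREPARE_SKIPPED) != 0;
}
```

In this sketch the flag is plain in-memory state, so it is lost when decoding restarts. That is the reason the restart case raised above needs separate thought.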

v27-0004-Support-2PC-txn-tests-for-concurrent-aborts
a. Moved the changes related to testing of concurrent aborts in this
patch from other patches.

v27-0006-Support-2PC-txn-pgoutput
a. Moved the changes related to flags RBTXN_COMMIT_PREPARED and
RBTXN_ROLLBACK_PREPARED from other patch.
b. Included the headers required by this patch; previously it seemed to
depend on other patches for these.

The other patches remain unchanged.

Let me know what you think about these changes.

[1]: /messages/by-id/877el38j56.fsf@ars-thinkpad

--
With Regards,
Amit Kapila.

Attachments:

v27-0001-Extend-the-output-plugin-API-to-allow-decoding-p.patch (application/octet-stream)
From a2c654d7defefc16461bbed6243e93ddde960307 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:00:40 -0500
Subject: [PATCH v27 1/7] Extend the output plugin API to allow decoding
 prepared xacts.

This adds four methods to the output plugin API, adding support for
streaming changes of two-phase transactions at prepare time.

* prepare
* commit_prepared
* rollback_prepared
* stream_prepare

Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction is not yet committed and may be
aborted later.

Until now two-phase transactions were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the
two-phase commands were communicated to the subscriber.

This patch provides the infrastructure for logical decoding plugins to be
informed of two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

This also extends the 'test_decoding' plugin, implementing these new
methods.

This commit simply adds these new APIs and the upcoming patch to "allow
the decoding at prepare time in ReorderBuffer" will use these APIs.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c     | 146 ++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 149 ++++++++++++-
 src/backend/replication/logical/logical.c | 257 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  50 +++++
 src/include/replication/reorderbuffer.h   |  38 ++++
 src/tools/pgindent/typedefs.list          |  11 +
 7 files changed, 649 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278beb5..429a07c004 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -76,6 +76,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 static void pg_decode_stream_start(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn);
 static void pg_output_stream_start(LogicalDecodingContext *ctx,
@@ -87,6 +99,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -123,9 +138,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
@@ -141,6 +161,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -241,6 +262,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +283,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +352,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate a
+ * simple logic by checking the GID. If the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -701,6 +820,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 	OutputPluginWrite(ctx, true);
 }
 
+static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037fac..73673a0312 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,9 +389,14 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits,
+    which allows actions to be decoded on the <command>PREPARE TRANSACTION</command>.
+    The <function>prepare_cb</function>, <function>stream_prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,15 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     them are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too. We will skip all the changes of such a transaction once
+     the abort is detected and abort the transaction when we read WAL for
+     <command>ROLLBACK PREPARED</command>.
     </para>
 
     <note>
@@ -587,7 +609,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but yet
+      uncommitted) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -685,7 +713,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, in case of decoding a prepared (but not yet committed)
+      transaction or decoding of an uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -698,6 +732,90 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be decoded
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents as for the
+      other callbacks. The <parameter>txn</parameter> parameter contains meta
+      information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some
+      cases. The <parameter>gid</parameter> is the identifier that later
+      identifies this transaction for <command>COMMIT PREPARED</command> or
+      <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callback for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr prepare_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called
+      whenever a transaction's <command>COMMIT PREPARED</command> has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                      ReorderBufferTXN *txn,
+                                                      XLogRecPtr commit_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called
+      whenever a transaction's <command>ROLLBACK PREPARED</command> has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                        ReorderBufferTXN *txn,
+                                                        XLogRecPtr rollback_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-start">
      <title>Stream Start Callback</title>
      <para>
@@ -735,6 +853,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1044,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, committed using the
+    <function>commit_prepared_cb</function> callback, or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4324e32656..b0784ffe1e 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -226,6 +236,32 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
+	/*
+	 * To support two-phase logical decoding, we require
+	 * prepare/commit_prepared/rollback_prepared callbacks. The filter_prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
+	 * Callbacks to support decoding at prepare time.
+	 *
+	 * filter_prepare is optional, so its wrapper does not fail with ERROR when
+	 * the callback is missing; it simply reports that nothing is filtered.
+	 */
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
+
 	/*
 	 * streaming callbacks
 	 *
@@ -237,6 +273,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -782,6 +819,135 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then rollback prepared callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
@@ -859,6 +1025,54 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case, all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1056,6 +1270,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7ee02..7f4384b62c 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -84,6 +84,11 @@ typedef struct LogicalDecodingContext
 	 */
 	bool		streaming;
 
+	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796450..14e6105905 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,40 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/rollback_prepared callbacks or wait until COMMIT PREPARED
+ * and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
+
 /*
  * Called when starting to stream a block of changes from in-progress
  * transaction (may be called repeatedly, if it's streamed in multiple
@@ -123,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
@@ -173,10 +215,18 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+
+	/* streaming of changes at prepare time */
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
+
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7ec67..efd19573ac 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -244,6 +244,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -418,6 +421,26 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* start streaming transaction callback signature */
 typedef void (*ReorderBufferStreamStartCB) (
 											ReorderBuffer *rb,
@@ -436,6 +459,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -504,12 +533,21 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction at prepare time.
+	 */
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
+
 	/*
 	 * Callbacks to be called when streaming a transaction.
 	 */
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fde701bfd4..f4d4703735 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1316,9 +1316,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
2.28.0.windows.1
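
The patch above decides in StartupDecodingContext() whether two-phase
decoding is enabled and then lets filter_prepare_cb_wrapper() choose, per
transaction, between decoding at PREPARE and deferring to COMMIT PREPARED.
The following is only a minimal standalone sketch of that decision logic,
not the patch's code: every Mock* type and mock_* function is a hypothetical
stand-in invented for illustration, and the real wrappers additionally set up
the error context and take a ReorderBufferTXN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the PostgreSQL types; not the real definitions. */
typedef unsigned int MockXid;
typedef bool (*MockFilterPrepareCB) (MockXid xid, const char *gid);
typedef void (*MockVoidCB) (void);

typedef struct MockCallbacks
{
	MockFilterPrepareCB filter_prepare_cb;	/* optional */
	MockVoidCB	prepare_cb;
	MockVoidCB	commit_prepared_cb;
	MockVoidCB	rollback_prepared_cb;
	MockVoidCB	stream_prepare_cb;
} MockCallbacks;

typedef struct MockContext
{
	MockCallbacks callbacks;
	bool		twophase;
} MockContext;

/*
 * Mirrors the idea in StartupDecodingContext(): two-phase decoding is
 * enabled as soon as any one of the related callbacks is provided, so that
 * missing mandatory callbacks can be reported later by the wrappers.
 */
void
mock_init_twophase(MockContext *ctx)
{
	assert(ctx != NULL);
	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
		(ctx->callbacks.commit_prepared_cb != NULL) ||
		(ctx->callbacks.rollback_prepared_cb != NULL) ||
		(ctx->callbacks.stream_prepare_cb != NULL) ||
		(ctx->callbacks.filter_prepare_cb != NULL);
}

/*
 * Mirrors the idea in filter_prepare_cb_wrapper(): returning true means
 * "skip decoding at PREPARE, handle it as a regular transaction at
 * COMMIT PREPARED".
 */
bool
mock_filter_prepare(const MockContext *ctx, MockXid xid, const char *gid)
{
	assert(ctx != NULL);

	/* Two-phase decoding disabled: every prepared transaction is filtered. */
	if (!ctx->twophase)
		return true;

	/* The filter is optional; without it nothing is filtered. */
	if (ctx->callbacks.filter_prepare_cb == NULL)
		return false;

	return ctx->callbacks.filter_prepare_cb(xid, gid);
}

/* Example filter (an assumption): skip any GID that starts with "skip_". */
bool
mock_skip_prefixed(MockXid xid, const char *gid)
{
	(void) xid;
	return strncmp(gid, "skip_", 5) == 0;
}

/* Placeholder for a mandatory callback such as prepare_cb. */
void
mock_noop(void)
{
}
```

Note that the example filter depends only on its arguments, which matches
the documented requirement that the callback give the same answer for a
given xid and gid every time it is called.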

v27-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch (application/octet-stream)
From 5a14691f2c1c06625315bbfd4e942e44f472920d Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 24 Nov 2020 18:41:29 +0530
Subject: [PATCH v27 2/7] Allow decoding at prepare time in ReorderBuffer.

This patch allows PREPARE-time decoding of two-phase transactions (if the
output plugin supports this capability), in which case the transactions
are replayed at PREPARE and then committed later when COMMIT PREPARED
arrives.

Now that we decode the changes before the commit, concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We detect such failures with a special sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK introduced by commit 7259736a6e and stop
decoding the remaining changes. Then we roll back the changes when rollback
prepared is encountered.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, Arseny Sher, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/decode.c      | 247 ++++++++++--
 .../replication/logical/reorderbuffer.c       | 358 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  20 +
 3 files changed, 523 insertions(+), 102 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee99b8..1e9522f07a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,13 +67,22 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool already_decoded);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool already_decoded);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 
+static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx,
+							  XLogRecordBuffer *buf, Oid dbId,
+							  RepOriginId origin_id);
+
 /*
  * Take every XLogReadRecord()ed record and perform the actions required to
  * decode it using the output plugin already setup in the logical decoding
@@ -244,6 +253,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +263,16 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction data then
+				 * DecodeCommit doesn't need to decode it again. This is
+				 * possible iff output plugin supports two-phase commits and
+				 * doesn't skip the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+					already_decoded = !(ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+
+				DecodeCommit(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +281,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +291,14 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction during prepare
+				 * then DecodeAbort needs to call rollback prepared.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+					already_decoded = !(ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+
+				DecodeAbort(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +339,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -582,10 +626,14 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool already_decoded)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -606,15 +654,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * the reorderbuffer to forget the content of the (sub-)transactions
 	 * if not.
 	 *
-	 * There can be several reasons we might not be interested in this
-	 * transaction:
-	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
-	 * 2) The transaction happened in another database.
-	 * 3) The output plugin is not interested in the origin.
-	 * 4) We are doing fast-forwarding
-	 *
 	 * We can't just use ReorderBufferAbort() here, because we need to execute
 	 * the transaction's invalidations.  This currently won't be needed if
 	 * we're just skipping over the transaction because currently we only do
@@ -627,9 +666,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * relevant syscaches.
 	 * ---
 	 */
-	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
-		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
-		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
 	{
 		for (i = 0; i < parsed->nsubxacts; i++)
 		{
@@ -640,7 +677,83 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		return;
 	}
 
-	/* tell the reorderbuffer about the surviving subtransactions */
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (already_decoded)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* tell the reorderbuffer about the surviving subtransactions */
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr, buf->endptr);
+		}
+
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ *
+ * Note that we don't skip prepare even if we have detected concurrent abort.
+ * The reason is that it is quite possible that we had already sent some
+ * changes before we detect abort in which case we need to abort those changes
+ * in the subscriber. To abort such changes, we do send the prepare and then
+ * the rollback prepared which is what happened on the publisher-side as well.
+ * Now, we can invent a new abort API wherein in such cases we send abort and
+ * skip sending prepared and rollback prepared but then it is not that
+ * straightforward because we might have streamed this transaction by that time
+ * in which case it is handled when the rollback is encountered. It is not
+ * impossible to optimize the concurrent abort case but it can introduce design
+ * complexity w.r.t handling different cases so leaving it for now as it
+ * doesn't seem worth it.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz prepare_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		prepare_time = parsed->origin_timestamp;
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeTXNNeedSkip
+	 * for the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we do in DecodeCommit because the
+	 * txn hasn't been committed yet; removing it before the commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache invalidations
+	 * if there are any for the reasons mentioned in DecodeCommit.
+	 */
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/* Tell the reorderbuffer about the surviving subtransactions. */
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
@@ -648,33 +761,70 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 prepare_time, origin_id, origin_lsn,
+						 parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool already_decoded)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz abort_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool		skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		abort_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeTXNNeedSkip
+	 * for the reasons why we sometimes want to skip the transaction.
+	 */
+	skip_xact = DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id);
+
+	/*
+	 * Send the final rollback record if the transaction data is already
+	 * decoded and we don't need to skip it, otherwise, perform the cleanup of
+	 * the transaction.
+	 */
+	if (already_decoded && !skip_xact)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									abort_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
 	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
@@ -1080,3 +1230,24 @@ DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tuple)
 	header->t_infomask2 = xlhdr.t_infomask2;
 	header->t_hoff = xlhdr.t_hoff;
 }
+
+/*
+ * Check whether we are interested in this specific transaction.
+ *
+ * There can be several reasons we might not be interested in this
+ * transaction:
+ * 1) We might not be interested in decoding transactions up to this
+ *	  LSN. This can happen because we previously decoded it and now just
+ *	  are restarting or if we haven't assembled a consistent snapshot yet.
+ * 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
+ * 4) We are doing fast-forwarding
+ */
+static bool
+DecodeTXNNeedSkip(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+				  Oid txn_dbid, RepOriginId origin_id)
+{
+	return (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+			(txn_dbid != InvalidOid && txn_dbid != ctx->slot->data.database) ||
+			ctx->fast_forward || FilterByOrigin(ctx, origin_id));
+}
\ No newline at end of file
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 301baff244..d889fd1e81 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,18 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or decoding them at PREPARE. Keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots.
+ *
+ * We additionally remove tuplecids after decoding the transaction at prepare
+ * time as we only need to perform invalidation at rollback or commit prepared.
+ *
+ * 'txn_prepared' indicates that we have decoded the transaction at prepare
+ * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1552,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1586,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1757,9 +1794,10 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * If the transaction was (partially) streamed, we need to commit it in a
- * 'streamed' way.  That is, we first stream the remaining part of the
- * transaction, and then invoke stream_commit message.
+ * If the transaction was (partially) streamed, we need to prepare or commit
+ * it in a 'streamed' way.  That is, we first stream the remaining part of the
+ * transaction, and then invoke the stream_prepare or stream_commit callback,
+ * as appropriate.
  */
 static void
 ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1769,29 +1807,49 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		/*
+		 * Note, we send stream prepare even if a concurrent abort is detected.
+		 * See DecodePrepare for more information.
+		 */
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen at COMMIT PREPARED or ROLLBACK PREPARED,
+		 * so for now just truncate the txn by removing changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
  * Set xid to detect concurrent aborts.
  *
- * While streaming an in-progress transaction there is a possibility that the
- * (sub)transaction might get aborted concurrently.  In such case if the
- * (sub)transaction has catalog update then we might decode the tuple using
- * wrong catalog version.  For example, suppose there is one catalog tuple with
- * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
- * and after that we will have two tuples (xmin: 500, xmax: 501) and
- * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
- * say 502 updates the same catalog tuple then the first tuple will be changed
- * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
- * the tuple inserted/updated in 501 after the catalog update, we will see the
- * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
- * consider that the tuple is deleted by xid 502 which is not visible to our
- * snapshot.  And when we will try to decode with that catalog tuple, it can
- * lead to a wrong result or a crash.  So, it is necessary to detect
- * concurrent aborts to allow streaming of in-progress transactions.
+ * While streaming an in-progress transaction or decoding a prepared
+ * transaction there is a possibility that the (sub)transaction might get
+ * aborted concurrently.  In such case if the (sub)transaction has catalog
+ * update then we might decode the tuple using wrong catalog version.  For
+ * example, suppose there is one catalog tuple with (xmin: 500, xmax: 0).  Now,
+ * the transaction 501 updates the catalog tuple and after that we will have
+ * two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).  Now, if 501 is
+ * aborted and some other transaction say 502 updates the same catalog tuple
+ * then the first tuple will be changed to (xmin: 500, xmax: 502).  So, the
+ * problem is that when we try to decode the tuple inserted/updated in 501
+ * after the catalog update, we will see the catalog tuple with (xmin: 500,
+ * xmax: 502) as visible because it will consider that the tuple is deleted by
+ * xid 502 which is not visible to our snapshot.  And when we will try to
+ * decode with that catalog tuple, it can lead to a wrong result or a crash.
+ * So, it is necessary to detect concurrent aborts to allow streaming of
+ * in-progress transactions or decoding of prepared transactions.
  *
  * For detecting the concurrent abort we set CheckXidAlive to the current
  * (sub)transaction's xid for which this change belongs to.  And, during
@@ -1800,7 +1858,10 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * and discard the already streamed changes on such an error.  We might have
  * already streamed some of the changes for the aborted (sub)transaction, but
  * that is fine because when we decode the abort we will stream abort message
- * to truncate the changes in the subscriber.
+ * to truncate the changes in the subscriber. Similarly, for prepared
+ * transactions, we stop decoding if a concurrent abort is detected and then
+ * roll back the changes when rollback prepared is encountered. See
+ * DecodePrepare.
  */
 static inline void
 SetupCheckXidLive(TransactionId xid)
@@ -1900,7 +1961,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1912,15 +1973,19 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		specinsert = NULL;
 	}
 
-	/* Stop the stream. */
-	rb->stream_stop(rb, txn, last_lsn);
-
-	/* Remember the command ID and snapshot for the streaming run */
-	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	/*
+	 * For the streaming case, stop the stream and remember the command ID and
+	 * snapshot for the streaming run.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_stop(rb, txn, last_lsn);
+		ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	}
 }
 
 /*
- * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ * Helper function for ReorderBufferReplay and ReorderBufferStreamTXN.
  *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
@@ -2006,8 +2071,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			prev_lsn = change->lsn;
 
-			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			/*
+			 * Set the current xid to detect concurrent aborts. This is
+			 * required for the cases when we decode the changes before the
+			 * COMMIT record is processed.
+			 */
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2298,7 +2367,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2332,15 +2410,22 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the four reasons:
+		 * 1. Decoding an in-progress txn.
+		 * 2. Decoding a prepared txn.
+		 * 3. Decoding of a prepared txn that was (partially) streamed.
+		 * 4. Decoding a committed txn.
+		 *
+		 * For 1, we allow truncation of txn data by removing the changes already
+		 * streamed but still keeping other things like invalidations, snapshot,
+		 * and tuplecids. For 2 and 3, we tell ReorderBufferTruncateTXN to do a
+		 * more thorough truncation of txn data, as the entire transaction has
+		 * been decoded except for commit. For 4, as the entire txn has been
+		 * decoded, we can fully clean up the TXN reorder buffer.
 		 */
-		if (streaming)
+		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
-
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2373,17 +2458,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2413,26 +2501,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
  * transaction with ReorderBufferAssignChild.
  *
- * This interface is called once a toplevel commit is read for both streamed
- * as well as non-streamed transactions.
+ * This interface is called once a prepare or toplevel commit is read for both
+ * streamed as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferReplay(ReorderBufferTXN *txn,
+					ReorderBuffer *rb, TransactionId xid,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2462,7 +2543,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	if (txn->base_snapshot == NULL)
 	{
 		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+
+		/*
+		 * Removing this txn before a commit might result in the computation
+		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
+		 */
+		if (!rbtxn_prepared(txn))
+			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
 
@@ -2473,6 +2560,116 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							command_id, false);
 }
 
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, so skip preparing it */
+	if (txn == NULL)
+		return true;
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+					 TimestampTz prepare_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferReplay(txn, rb, xid, prepare_lsn, end_lsn, prepare_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time, RepOriginId origin_id,
+							XLogRecPtr origin_lsn, char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -2604,6 +2801,39 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
+/*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ *
+ * Note that this is a special-purpose function for prepared transactions where
+ * we don't want to clean up the TXN even when we decide to skip it. See
+ * DecodePrepare.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
 /*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index efd19573ac..a56338d495 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -174,6 +174,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +234,12 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -622,12 +629,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -645,6 +658,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+								 TimestampTz prepare_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
2.28.0.windows.1
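
For readers skimming the 0002 patch above, the PREPARE and COMMIT PREPARED
handling can be summarized as a small decision table. The following is only a
rough, standalone sketch and not code from the patch: every `sketch_`/`Sketch`
name is hypothetical, `twophase` stands in for ctx->twophase, and
`filter_skips` stands in for the result of ReorderBufferPrepareNeedSkip(). It
assumes the commit path computes already_decoded the same way the abort path
does, and it leaves out the separate DecodeTXNNeedSkip() checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical copy of the flag bit the patch adds as RBTXN_PREPARE. */
#define SKETCH_RBTXN_PREPARE 0x0080

/* Hypothetical, reduced stand-in for ReorderBufferTXN. */
typedef struct SketchTxn
{
	unsigned int txn_flags;
} SketchTxn;

/* Same bit test as the rbtxn_prepared() macro, written as a function. */
static bool
sketch_txn_prepared(const SketchTxn *txn)
{
	assert(txn != NULL);
	return (txn->txn_flags & SKETCH_RBTXN_PREPARE) != 0;
}

typedef enum
{
	SKETCH_DEFER_TO_COMMIT,		/* ignore PREPARE, decode at commit time */
	SKETCH_DECODE_AT_PREPARE	/* replay the changes now */
} SketchPrepareAction;

typedef enum
{
	SKETCH_REPLAY_WHOLE_TXN,	/* nothing was sent at prepare time */
	SKETCH_SEND_COMMIT_PREPARED /* only the final record is sent */
} SketchCommitAction;

/* What happens when a PREPARE record is read. */
static SketchPrepareAction
sketch_on_prepare(bool twophase, bool filter_skips)
{
	if (!twophase || filter_skips)
		return SKETCH_DEFER_TO_COMMIT;
	return SKETCH_DECODE_AT_PREPARE;
}

/* What happens when the matching COMMIT PREPARED record is read. */
static SketchCommitAction
sketch_on_commit_prepared(bool twophase, bool filter_skips)
{
	bool		already_decoded = twophase && !filter_skips;

	return already_decoded ? SKETCH_SEND_COMMIT_PREPARED
		: SKETCH_REPLAY_WHOLE_TXN;
}
```

The point of the sketch is that both records must reach the same verdict: a
transaction decoded at PREPARE only gets its final record later, while one
that was deferred gets replayed in full at commit time.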

v27-0003-Support-2PC-txn-tests-for-test_decoding.patch (application/octet-stream)
From 02cddf234779762e7562961966704aa04ac2cd60 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:30:34 +1100
Subject: [PATCH v27 3/7] Support 2PC txn tests for test_decoding.

Add sql tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                |   2 +-
 contrib/test_decoding/expected/two_phase.out  | 228 ++++++++++++++++++
 .../expected/two_phase_stream.out             | 177 ++++++++++++++
 contrib/test_decoding/sql/two_phase.sql       | 119 +++++++++
 .../test_decoding/sql/two_phase_stream.sql    |  63 +++++
 5 files changed, 588 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f013..2c4acdc171 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,7 +4,7 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000000..e5e34b485f
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000000..957c198ae4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000000..4ed5266b9a
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000000..01510e49de
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
-- 
2.28.0.windows.1
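
The "_nodecode" cases in the tests above (Test 7 in two_phase.sql and the last case in two_phase_stream.sql) rely on the plugin skipping a prepared transaction at PREPARE time when its GID contains "_nodecode", so that it is decoded as an ordinary transaction at COMMIT PREPARED. A minimal sketch of such a check follows. It assumes a plain substring match; the helper name gid_is_filtered and the NODECODE_MARKER macro are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Assumption: the marker is matched as a plain substring anywhere in the GID.
 * The tests only state that GIDs "containing" it are filtered.
 */
#define NODECODE_MARKER "_nodecode"

/*
 * Hypothetical helper, not part of the patch. Returns true when a prepared
 * transaction with this GID should be skipped at PREPARE time and decoded
 * at COMMIT PREPARED as a regular transaction instead.
 */
bool
gid_is_filtered(const char *gid)
{
	assert(gid != NULL);
	return strstr(gid, NODECODE_MARKER) != NULL;
}
```

With this rule, 'test_prepared_nodecode' and 'test1_nodecode' would be filtered, while 'test_prepared#1' and 'test1' would be decoded at PREPARE, which matches the expected output shown in the .out files.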

v27-0004-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From 76b3a56182983abb3cb0f8ab3e86fa7132010488 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:34:18 +1100
Subject: [PATCH v27 4/7] Support 2PC txn tests for concurrent aborts.

Add TAP tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                |   2 +
 contrib/test_decoding/t/001_twophase.pl       | 121 ++++++++++++++++
 .../test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++
 contrib/test_decoding/test_decoding.c         |  58 ++++++++
 .../replication/logical/reorderbuffer.c       |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2c4acdc171..49523feddf 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..3b3e7b8b4a
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# Test logical decoding of two-phase (2PC) transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# Test a concurrent abort while logical decoding is in progress. We pass the
+# xid of the 2PC transaction to the plugin as the "check-xid-aborted" option.
+# On receiving a valid "check-xid-aborted" xid, the change callback in the
+# test_decoding plugin waits for that transaction to be aborted.
+#
+# We then issue a ROLLBACK PREPARED from another session while the decode
+# is waiting.
+#
+# The status of the "check-xid-aborted" xid changes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing its xid as the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000000..15001c640e
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# Test a concurrent abort while logical decoding is in progress. We pass the
+# XID of the 2PC transaction to the plugin via the "check-xid-aborted" option.
+# On receiving a valid XID, the change callback in the test_decoding plugin
+# waits for that transaction to abort.
+#
+# While the decode is waiting, we issue a ROLLBACK PREPARED from another
+# session.
+#
+# The transaction's status then changes from in-progress to not-committed
+# (hence aborted), and decoding stops because the subsequent system catalog
+# scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 429a07c004..541dc112e0 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -171,6 +174,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -272,6 +276,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -450,6 +472,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -599,6 +645,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -685,6 +734,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -897,6 +949,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -950,6 +1005,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d889fd1e81..2753db9890 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2480,6 +2480,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
2.28.0.windows.1
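
A note on the check-xid-aborted parsing in the test_decoding.c hunk above: strtoul() is called with base 0 and a NULL end pointer, so inputs such as "0x10" or "12abc" are accepted, and where unsigned long is 64 bits wide a value above 2^32 - 1 is silently truncated by the cast. That is probably fine for a test-only option, but a stricter parse could look like the sketch below. It is standalone C, not PostgreSQL code: xid_t, INVALID_XID and parse_xid_option are hypothetical stand-ins (for TransactionId, InvalidTransactionId and the option branch in pg_decode_startup), and restricting the input to plain decimal is my assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for TransactionId / InvalidTransactionId. */
typedef uint32_t xid_t;
#define INVALID_XID ((xid_t) 0)

/*
 * Parse a decimal XID option value.  Returns true and stores the result in
 * *out on success; returns false for empty input, a leading sign or space,
 * trailing garbage, out-of-range values, or the invalid XID 0.
 */
static bool
parse_xid_option(const char *arg, xid_t *out)
{
	char	   *end;
	unsigned long val;

	assert(out != NULL);

	/* strtoul() would accept leading whitespace and a sign; we do not */
	if (arg == NULL || !isdigit((unsigned char) arg[0]))
		return false;

	errno = 0;
	val = strtoul(arg, &end, 10);

	/* reject overflow, trailing characters, and values wider than 32 bits */
	if (errno != 0 || *end != '\0' || val > UINT32_MAX)
		return false;

	if ((xid_t) val == INVALID_XID)
		return false;

	*out = (xid_t) val;
	return true;
}
```

With this, parse_xid_option("735", &xid) succeeds and stores 735, while "0", "735abc", "-1" and "4294967296" are all rejected.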

v27-0005-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From 2ace8f7cdfeaf292e1ba1b25379e92b21caeb5cf Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:38:15 +1100
Subject: [PATCH v27 5/7] Support 2PC txn - spoolfile.

This patch only refactors to isolate the streaming spool-file processing to a separate function.
Later, two-phase commit logic will require this common processing to be called from multiple places.
---
 src/backend/replication/logical/worker.c | 58 ++++++++++++++++--------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 04684912de..9fa816c976 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -933,30 +935,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +957,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +972,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1041,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1077,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
-- 
2.28.0.windows.1
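
For readers following the refactoring in this patch: apply_spooled_messages() takes the xid and the final LSN, replays the spool file, and returns the number of changes applied, so that both the STREAM COMMIT handler and a later STREAM PREPARE handler can call it. The replay loop itself is not visible in the hunks above. The sketch below shows the general shape under the assumption that the spool is a sequence of length-prefixed records. It uses an in-memory buffer instead of a BufFile, and replay_spool, spool_append, count_bytes, apply_fn and the -1 return on a truncated spool are all hypothetical; a real implementation would more likely raise an error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: applies one replayed change. */
typedef void (*apply_fn) (const char *data, int len, void *arg);

/* Append one length-prefixed record at offset off; returns the new offset. */
static size_t
spool_append(char *spool, size_t off, const char *data, int len)
{
	memcpy(spool + off, &len, sizeof(len));
	memcpy(spool + off + sizeof(len), data, (size_t) len);
	return off + sizeof(len) + (size_t) len;
}

/*
 * Replay every record in the spool through the callback.
 * Returns the number of records applied, or -1 if the spool is truncated.
 */
static int
replay_spool(const char *spool, size_t spool_len, apply_fn apply, void *arg)
{
	size_t		off = 0;
	int			nchanges = 0;

	assert(apply != NULL);

	while (off < spool_len)
	{
		int			len;

		/* each record starts with the length of its payload */
		if (spool_len - off < sizeof(len))
			return -1;
		memcpy(&len, spool + off, sizeof(len));
		off += sizeof(len);

		/* the payload must fit in what is left of the spool */
		if (len < 0 || (size_t) len > spool_len - off)
			return -1;

		apply(spool + off, len, arg);
		off += (size_t) len;
		nchanges++;
	}

	return nchanges;
}

/* Example callback: sums the payload bytes it has seen into *arg. */
static void
count_bytes(const char *data, int len, void *arg)
{
	(void) data;
	*(size_t *) arg += (size_t) len;
}
```

Returning the count to the caller, as the patch does, lets each handler log it in its own message without the helper needing to know which protocol message triggered the replay.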

v27-0006-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 4bcf71e267ce9de5a111994b771989dbf13baacc Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:50:27 +1100
Subject: [PATCH v27 6/7] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.
---
 src/backend/access/transam/twophase.c         |  33 ++-
 src/backend/replication/logical/proto.c       | 141 +++++++++-
 .../replication/logical/reorderbuffer.c       |   6 +
 src/backend/replication/logical/worker.c      | 243 +++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c   |  74 ++++++
 src/include/access/twophase.h                 |   1 +
 src/include/replication/logicalproto.h        |  38 ++-
 src/include/replication/reorderbuffer.h       |  14 +
 src/tools/pgindent/typedefs.list              |   1 +
 9 files changed, 543 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9bad9..00b4497c2d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -547,6 +547,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 	ProcArrayAdd(&ProcGlobal->allProcs[gxact->pgprocno]);
 }
 
+/*
+ * LookupGXact
+ *		Check whether a prepared transaction with the given GID exists
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
 /*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb31182d7..cfb94d1d56 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -105,6 +105,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	commit_data->committime = pq_getmsgint64(in);
 }
 
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK]
+	 * PREPARED uses non-streaming APIs
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
 /*
  * Write ORIGIN to the output stream.
  */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2753db9890..5673959d14 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2665,9 +2665,15 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	strcpy(txn->gid, gid);
 
 	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
 		rb->commit_prepared(rb, txn, commit_lsn);
+	}
 	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
 		rb->rollback_prepared(rb, txn, commit_lsn);
+	}
 
 	/* cleanup: make sure there's no cache pollution */
 	ReorderBufferExecuteInvalidations(txn->ninvalidations,
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9fa816c976..7690f133c0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -741,6 +742,234 @@ apply_handle_commit(StringInfo s)
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
+/*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * On the publisher, a prepared transaction can be aborted concurrently
+	 * while it is being decoded. In that case decoding stops and the
+	 * transaction is aborted immediately, so the subscriber may never have
+	 * prepared it. The ROLLBACK PREPARED still reaches the subscriber,
+	 * however, so it is OK for the GID to be missing here.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare_txn (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
 /*
  * Handle ORIGIN message.
  *
@@ -1086,12 +1315,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	in_remote_transaction = false;
 
-	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(commit_data.end_lsn);
-
 	/* unlink the files with serialized changes and subxact info */
 	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
 
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
@@ -1969,6 +2198,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997aed83..71ac43122c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -377,6 +391,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
 /*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
@@ -856,6 +912,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 	cleanup_rel_sync_cache(txn->xid, true);
 }
 
+/*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3445..b2628ea4e2 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535df80..c04d872e13 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +117,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +125,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +154,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +201,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a56338d495..0a00429890 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
 #define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -240,6 +242,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f4d4703735..4546572445 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,6 +1342,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
2.28.0.windows.1
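
The prepare_type field of LogicalRepPrepareData in the patch above carries
exactly one of three values, and PrepareFlagsAreValid rejects anything else,
including OR-ed combinations. Below is a minimal standalone sketch of that
check. The three #defines and the macro are copied from the logicalproto.h
hunk; the helper prepare_type_name() is hypothetical, is not part of the
patch, and exists only to illustrate how a reader of the message might
branch on the type.

```c
#include <stddef.h>
#include <stdint.h>

/* Copied from the logicalproto.h hunk of this patch. */
#define LOGICALREP_IS_PREPARE			0x01
#define LOGICALREP_IS_COMMIT_PREPARED	0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04

/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
#define PrepareFlagsAreValid(flags) \
	(((flags) == LOGICALREP_IS_PREPARE) || \
	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))

/*
 * Hypothetical helper (NOT in the patch): map a prepare_type byte to a
 * printable name, or return NULL when the byte is not exactly one of the
 * three allowed values.
 */
static const char *
prepare_type_name(uint8_t flags)
{
	if (!PrepareFlagsAreValid(flags))
		return NULL;

	switch (flags)
	{
		case LOGICALREP_IS_PREPARE:
			return "PREPARE";
		case LOGICALREP_IS_COMMIT_PREPARED:
			return "COMMIT PREPARED";
		default:
			/* only ROLLBACK PREPARED remains after the validity check */
			return "ROLLBACK PREPARED";
	}
}
```

Because the macro compares for equality rather than testing individual
bits, a value such as 0x03 (PREPARE | COMMIT PREPARED) is treated as
invalid even though both bits are individually defined.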

v27-0007-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 1d52a068992e785512f0bd3b571bb40ad5615e1a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 15:02:24 +1100
Subject: [PATCH v27 7/7] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl       | 338 ++++++++++++
 .../subscription/t/021_twophase_stream.pl     | 517 ++++++++++++++++++
 .../subscription/t/022_twophase_cascade.pl    | 282 ++++++++++
 .../t/023_twophase_cascade_stream.pl          | 319 +++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000000..9c1d681738
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are rolled back on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000000..9ec1e31bd5
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test (with streaming enabled)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (the DELETE at step 3 does nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000000..0f955300eb
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000000..3c6470d184
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
2.28.0.windows.1

#133Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#132)
8 attachment(s)

On Wed, Nov 25, 2020 at 11:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

The other patches remain unchanged.

Let me know what you think about these changes?

Thanks, I will look at the patch and let you know my thoughts on it.
Before that, I am sharing a new patchset with an additional patch that
includes documentation changes for two-phase commit support in logical
decoding. I have also updated the examples section of Logical Decoding
with examples that use two-phase commits.
I have just added the documentation patch as the 8th one and renamed
the other patches; nothing in them has changed.
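
To make the callback rules in 0001 easier to follow, here is a rough
standalone sketch (not code from the patches). It models how two-phase
decoding gets enabled and how the optional filter_prepare callback is
consulted. All demo_* names and the DemoCallbacks struct are hypothetical
stand-ins for the real OutputPluginCallbacks members, and the final
"decode at prepare" rule is my reading of the documentation text, so treat
it as an assumption rather than the actual ReorderBuffer logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two-phase members of OutputPluginCallbacks. */
typedef struct DemoCallbacks
{
    bool        (*filter_prepare_cb) (const char *gid);
    void        (*prepare_cb) (const char *gid);
    void        (*commit_prepared_cb) (const char *gid);
    void        (*rollback_prepared_cb) (const char *gid);
    void        (*stream_prepare_cb) (const char *gid);
} DemoCallbacks;

/* Same rule as test_decoding: filter out GIDs containing "_nodecode". */
static bool
demo_filter_prepare(const char *gid)
{
    assert(gid != NULL);
    return strstr(gid, "_nodecode") != NULL;
}

/* Placeholder for callbacks whose bodies do not matter in this sketch. */
static void
demo_noop(const char *gid)
{
    (void) gid;
}

/*
 * Two-phase decoding is considered available when at least one of the
 * two-phase callbacks is set (as in StartupDecodingContext), and it stays
 * enabled only if the plugin option asked for it (as in pg_decode_startup).
 */
static bool
demo_twophase_enabled(const DemoCallbacks *cb, bool option_requested)
{
    bool        twophase = (cb->prepare_cb != NULL) ||
        (cb->commit_prepared_cb != NULL) ||
        (cb->rollback_prepared_cb != NULL) ||
        (cb->stream_prepare_cb != NULL) ||
        (cb->filter_prepare_cb != NULL);

    return twophase && option_requested;
}

/*
 * Assumption: a transaction is decoded at PREPARE time only when two-phase
 * decoding is enabled and the filter, if one is defined, does not ask to
 * skip this GID. A missing filter means nothing is filtered.
 */
static bool
demo_decode_at_prepare(const DemoCallbacks *cb, bool twophase, const char *gid)
{
    if (!twophase)
        return false;
    if (cb->filter_prepare_cb == NULL)
        return true;
    return !cb->filter_prepare_cb(gid);
}
```

With only prepare_cb set (to demo_noop, say) and the option requested,
demo_twophase_enabled() returns true. With demo_filter_prepare installed, a
GID such as 'test_prepared_tab' would be decoded at prepare time, while one
containing "_nodecode" would be left to COMMIT PREPARED.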

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v28-0001-Extend-the-output-plugin-API-to-allow-decoding-p.patchapplication/octet-stream; name=v28-0001-Extend-the-output-plugin-API-to-allow-decoding-p.patchDownload
From a2c654d7defefc16461bbed6243e93ddde960307 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 20 Nov 2020 05:00:40 -0500
Subject: [PATCH v27 1/7] Extend the output plugin API to allow decoding
 prepared xacts.

This adds four methods to the output plugin API, adding support for
streaming changes of two-phase transactions at prepare time.

* prepare
* commit_prepared
* rollback_prepared
* stream_prepare

Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction is not yet committed and may be
aborted later.

Until now two-phase transactions were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the
two-phase commands were communicated to the subscriber.

This patch provides the infrastructure for logical decoding plugins to be
informed of two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

This also extends the 'test_decoding' plugin, implementing these new
methods.

This commit simply adds these new APIs and the upcoming patch to "allow
the decoding at prepare time in ReorderBuffer" will use these APIs.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c     | 146 ++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 149 ++++++++++++-
 src/backend/replication/logical/logical.c | 257 ++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  50 +++++
 src/include/replication/reorderbuffer.h   |  38 ++++
 src/tools/pgindent/typedefs.list          |  11 +
 7 files changed, 649 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278beb5..429a07c004 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -76,6 +76,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 static void pg_decode_stream_start(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn);
 static void pg_output_stream_start(LogicalDecodingContext *ctx,
@@ -87,6 +99,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -123,9 +138,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
@@ -141,6 +161,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -241,6 +262,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +283,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +352,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate a
+ * simple logic by checking the GID. If the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -701,6 +820,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 	OutputPluginWrite(ctx, true);
 }
 
+static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037fac..73673a0312 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,9 +389,14 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits,
+    which allows actions to be decoded at <command>PREPARE TRANSACTION</command> time.
+    The <function>prepare_cb</function>, <function>stream_prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,15 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too. We will skip all the changes of such a transaction once
+     the abort is detected and abort the transaction when we read WAL for
+     <command>ROLLBACK PREPARED</command>.
     </para>
 
     <note>
@@ -587,7 +609,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but not yet
+      committed) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -685,7 +713,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, in case of decoding a prepared (but not yet committed)
+      transaction or decoding of an uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -698,6 +732,90 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decoding
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents as for the
+      other callbacks. The <parameter>txn</parameter> parameter contains meta
+      information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some
+      cases. The <parameter>gid</parameter> is the identifier that later
+      identifies this transaction for <command>COMMIT PREPARED</command> or
+      <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callback for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr prepare_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called
+      whenever a transaction commit prepared has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                      ReorderBufferTXN *txn,
+                                                      XLogRecPtr commit_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called
+      whenever a transaction rollback prepared has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                        ReorderBufferTXN *txn,
+                                                        XLogRecPtr rollback_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-start">
      <title>Stream Start Callback</title>
      <para>
@@ -735,6 +853,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1044,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, then committed using the
+    <function>commit_prepared_cb</function> callback or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4324e32656..b0784ffe1e 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -226,6 +236,32 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_message_cb != NULL) ||
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
+	/*
+	 * To support two-phase logical decoding, we require
+	 * prepare/commit-prepare/abort-prepare callbacks. The filter_prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
+	 * Callback to support decoding at prepare time.
+	 *
+	 * filter_prepare is optional, so we do not fail with ERROR when missing,
+	 * but the wrappers simply do nothing.
+	 */
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
+
 	/*
 	 * streaming callbacks
 	 *
@@ -237,6 +273,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -782,6 +819,135 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then rollback prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
@@ -859,6 +1025,54 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case, all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1056,6 +1270,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7ee02..7f4384b62c 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -84,6 +84,11 @@ typedef struct LogicalDecodingContext
 	 */
 	bool		streaming;
 
+	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
 	/*
 	 * State for writing output.
 	 */
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796450..14e6105905 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,40 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
+
 /*
  * Called when starting to stream a block of changes from in-progress
  * transaction (may be called repeatedly, if it's streamed in multiple
@@ -123,6 +157,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
@@ -173,10 +215,18 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+
+	/* streaming of changes at prepare time */
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
+
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7ec67..efd19573ac 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -244,6 +244,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -418,6 +421,26 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* start streaming transaction callback signature */
 typedef void (*ReorderBufferStreamStartCB) (
 											ReorderBuffer *rb,
@@ -436,6 +459,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -504,12 +533,21 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction at prepare time.
+	 */
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
+
 	/*
 	 * Callbacks to be called when streaming a transaction.
 	 */
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fde701bfd4..f4d4703735 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1316,9 +1316,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
2.28.0.windows.1
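
For readers skimming the patch above, here is a minimal standalone sketch of the
dispatch rules it sets up. It is not the PostgreSQL code: every Demo* type and
demo_* function is a hypothetical stand-in, the callbacks take only a GID, the
stream_prepare callback is left out, and a status code replaces ereport(ERROR).
It only illustrates three points from the patch: two-phase decoding is switched
on when any of the callbacks is set, the filter wrapper returns true ("skip at
PREPARE") when two-phase is off and false when no filter is supplied, and a
missing mandatory callback is only detected when its wrapper is called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback types; the real ones take a context, a txn and an LSN. */
typedef bool (*DemoFilterPrepareCB) (const char *gid);
typedef void (*DemoTxnCB) (const char *gid);

typedef struct DemoCallbacks
{
	DemoFilterPrepareCB filter_prepare_cb;	/* optional */
	DemoTxnCB	prepare_cb;		/* mandatory once two-phase is enabled */
	DemoTxnCB	commit_prepared_cb;
	DemoTxnCB	rollback_prepared_cb;
} DemoCallbacks;

typedef struct DemoContext
{
	DemoCallbacks callbacks;
	bool		twophase;
} DemoContext;

typedef enum DemoStatus
{
	DEMO_OK,
	DEMO_MISSING_CALLBACK		/* stands in for ereport(ERROR, ...) */
} DemoStatus;

/* Sample callback used only to observe that a wrapper really dispatched. */
int			demo_calls = 0;

void
demo_count_call(const char *gid)
{
	(void) gid;
	demo_calls++;
}

/* Enable two-phase when at least one callback is present (decided once). */
void
demo_startup(DemoContext *ctx)
{
	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
		(ctx->callbacks.commit_prepared_cb != NULL) ||
		(ctx->callbacks.rollback_prepared_cb != NULL) ||
		(ctx->callbacks.filter_prepare_cb != NULL);
}

/* Returns true when the transaction must NOT be decoded at PREPARE time. */
bool
demo_filter_prepare(const DemoContext *ctx, const char *gid)
{
	if (!ctx->twophase)
		return true;			/* everything waits for COMMIT PREPARED */
	if (ctx->callbacks.filter_prepare_cb == NULL)
		return false;			/* no filter: let every prepare through */
	return ctx->callbacks.filter_prepare_cb(gid);
}

/* The missing-callback check happens here, in the wrapper, not at startup. */
DemoStatus
demo_prepare(const DemoContext *ctx, const char *gid)
{
	assert(ctx->twophase);
	if (ctx->callbacks.prepare_cb == NULL)
		return DEMO_MISSING_CALLBACK;
	ctx->callbacks.prepare_cb(gid);
	return DEMO_OK;
}
```

With only commit_prepared_cb set, demo_startup() turns twophase on and
demo_prepare() reports the missing callback, which is the "easily identify
missing methods" behaviour the comment in StartupDecodingContext describes.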

Attachment: v28-0004-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From 76b3a56182983abb3cb0f8ab3e86fa7132010488 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:34:18 +1100
Subject: [PATCH v27 4/7] Support 2PC txn tests for concurrent aborts.

Add TAP tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                |   2 +
 contrib/test_decoding/t/001_twophase.pl       | 121 ++++++++++++++++
 .../test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++
 contrib/test_decoding/test_decoding.c         |  58 ++++++++
 .../replication/logical/reorderbuffer.c       |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2c4acdc171..49523feddf 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000000..3b3e7b8b4a
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000000..15001c640e
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 429a07c004..541dc112e0 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -171,6 +174,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -272,6 +276,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -450,6 +472,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -599,6 +645,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -685,6 +734,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -897,6 +949,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -950,6 +1005,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d889fd1e81..2753db9890 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2480,6 +2480,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
2.28.0.windows.1

Attachment: v28-0005-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From 2ace8f7cdfeaf292e1ba1b25379e92b21caeb5cf Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:38:15 +1100
Subject: [PATCH v27 5/7] Support 2PC txn - spoolfile.

This patch is a pure refactoring: it moves the streaming spool-file processing into a separate function, apply_spooled_messages().
The two-phase commit logic added later will need to call this common processing from multiple places.
---
 src/backend/replication/logical/worker.c | 58 ++++++++++++++++--------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 04684912de..9fa816c976 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -244,6 +244,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -933,30 +935,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -964,7 +957,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -979,7 +972,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1048,6 +1041,35 @@ apply_handle_stream_commit(StringInfo s)
 
 	BufFileClose(fd);
 
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	/*
 	 * Update origin state so we can restart streaming from correct position
 	 * in case of crash.
@@ -1055,16 +1077,12 @@ apply_handle_stream_commit(StringInfo s)
 	replorigin_session_origin_lsn = commit_data.end_lsn;
 	replorigin_session_origin_timestamp = commit_data.committime;
 
-	pfree(buffer);
-	pfree(s2.data);
-
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
 	store_flush_position(commit_data.end_lsn);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
-		 nchanges, path);
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
 
 	in_remote_transaction = false;
 
-- 
2.28.0.windows.1
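
To illustrate the shape of the refactoring in the spoolfile patch above, here is a toy standalone model, not worker.c code. The replay step takes the xid's spooled changes and a final LSN and returns the number of changes applied, so both a commit-time and a prepare-time handler can share it. SpooledXact, replay_spooled() and the two handle_* functions are hypothetical names, and the prepare-time caller is an assumption about how later patches might use the helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for one transaction's spool file. */
typedef struct SpooledXact
{
    unsigned int xid;
    size_t       nchanges;  /* number of spooled change messages */
    int          applied;   /* set to 1 once the changes were replayed */
} SpooledXact;

/* Same name as the worker.c variable, but only a local toy here. */
unsigned long remote_final_lsn;

/*
 * Shared replay step: remember the final LSN, walk every spooled change,
 * and report how many were applied.
 */
int
replay_spooled(SpooledXact *sx, unsigned long lsn)
{
    int nchanges = 0;

    assert(sx != NULL);
    remote_final_lsn = lsn;
    for (size_t i = 0; i < sx->nchanges; i++)
        nchanges++;         /* the real code dispatches each message here */
    sx->applied = 1;
    return nchanges;
}

/* Commit-time caller: passes the commit LSN. */
int
handle_stream_commit(SpooledXact *sx, unsigned long commit_lsn)
{
    return replay_spooled(sx, commit_lsn);
}

/* Prepare-time caller (assumed): same replay, different LSN. */
int
handle_stream_prepare(SpooledXact *sx, unsigned long prepare_lsn)
{
    return replay_spooled(sx, prepare_lsn);
}
```

Passing the LSN in as a parameter, instead of reading it from a commit record inside the helper, is what lets the same replay code serve both callers.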

Attachment: v28-0003-Support-2PC-txn-tests-for-test_decoding.patch (application/octet-stream)
From 02cddf234779762e7562961966704aa04ac2cd60 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:30:34 +1100
Subject: [PATCH v27 3/7] Support 2PC txn tests for test_decoding.

Add sql tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                |   2 +-
 contrib/test_decoding/expected/two_phase.out  | 228 ++++++++++++++++++
 .../expected/two_phase_stream.out             | 177 ++++++++++++++
 contrib/test_decoding/sql/two_phase.sql       | 119 +++++++++
 .../test_decoding/sql/two_phase_stream.sql    |  63 +++++
 5 files changed, 588 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f013..2c4acdc171 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,7 +4,7 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000000..e5e34b485f
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000000..957c198ae4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000000..4ed5266b9a
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000000..01510e49de
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
-- 
2.28.0.windows.1

Attachment: v28-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch (application/octet-stream)
From 5a14691f2c1c06625315bbfd4e942e44f472920d Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 24 Nov 2020 18:41:29 +0530
Subject: [PATCH v27 2/7] Allow decoding at prepare time in ReorderBuffer.

This patch allows PREPARE-time decoding of two-phase transactions (if the
output plugin supports this capability), in which case the transactions
are replayed at PREPARE and then committed later when COMMIT PREPARED
arrives.

Now that we decode the changes before the commit, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We detect such failures with a special sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK introduced by commit 7259736a6e and stop
decoding the remaining changes. Then we roll back the changes when rollback
prepared is encountered.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, Arseny Sher, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
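
The dispatch rule described in the commit message above can be sketched
as a small standalone C model. This is only an illustration, not code
from the patch: the RecKind and Action enums and the decide() and
prepare_need_skip() helpers are hypothetical names, the filter merely
imitates the test_decoding convention that gids containing "_nodecode"
are not decoded as two-phase, and the separate DecodeTXNNeedSkip() check
is left out.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the xact record types seen in DecodeXactOp. */
typedef enum { REC_PREPARE, REC_COMMIT_PREPARED, REC_ABORT_PREPARED } RecKind;

/* What the decoder ends up doing for a record (illustrative names only). */
typedef enum {
    ACT_TRACK_ONLY,        /* just note the xid; replay later at commit */
    ACT_DECODE_AT_PREPARE, /* replay the changes now, then send PREPARE */
    ACT_FINISH_PREPARED,   /* send only COMMIT PREPARED / ROLLBACK PREPARED */
    ACT_REPLAY_AS_COMMIT,  /* replay the whole txn as an ordinary commit */
    ACT_DISCARD            /* throw the buffered changes away */
} Action;

/* Toy filter (assumption): skip gids that contain "_nodecode". */
static bool prepare_need_skip(const char *gid)
{
    return strstr(gid, "_nodecode") != NULL;
}

/*
 * A transaction counts as "already decoded" at COMMIT/ROLLBACK PREPARED
 * only if two-phase decoding is enabled and the filter did not skip it
 * at PREPARE time.
 */
static Action decide(RecKind kind, bool twophase, const char *gid)
{
    bool already_decoded = twophase && !prepare_need_skip(gid);

    switch (kind)
    {
        case REC_PREPARE:
            return already_decoded ? ACT_DECODE_AT_PREPARE : ACT_TRACK_ONLY;
        case REC_COMMIT_PREPARED:
            return already_decoded ? ACT_FINISH_PREPARED : ACT_REPLAY_AS_COMMIT;
        case REC_ABORT_PREPARED:
            return already_decoded ? ACT_FINISH_PREPARED : ACT_DISCARD;
    }
    return ACT_TRACK_ONLY;
}
```

In this model, decide(REC_COMMIT_PREPARED, true, "test1_nodecode")
yields ACT_REPLAY_AS_COMMIT, matching the regression test expectation
that a filtered gid shows up as a plain COMMIT rather than a COMMIT
PREPARED.
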
 src/backend/replication/logical/decode.c      | 247 ++++++++++--
 .../replication/logical/reorderbuffer.c       | 358 ++++++++++++++----
 src/include/replication/reorderbuffer.h       |  20 +
 3 files changed, 523 insertions(+), 102 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee99b8..1e9522f07a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,13 +67,22 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool already_decoded);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool already_decoded);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 
+static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx,
+							  XLogRecordBuffer *buf, Oid dbId,
+							  RepOriginId origin_id);
+
 /*
  * Take every XLogReadRecord()ed record and perform the actions required to
  * decode it using the output plugin already setup in the logical decoding
@@ -244,6 +253,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +263,16 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction data then
+				 * DecodeCommit doesn't need to decode it again. This is
+				 * possible iff output plugin supports two-phase commits and
+				 * doesn't skip the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+					already_decoded = !(ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+
+				DecodeCommit(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +281,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +291,14 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction during prepare
+				 * then DecodeAbort needs to call rollback prepared.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+					already_decoded = !(ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+
+				DecodeAbort(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +339,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -582,10 +626,14 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool already_decoded)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -606,15 +654,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * the reorderbuffer to forget the content of the (sub-)transactions
 	 * if not.
 	 *
-	 * There can be several reasons we might not be interested in this
-	 * transaction:
-	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
-	 * 2) The transaction happened in another database.
-	 * 3) The output plugin is not interested in the origin.
-	 * 4) We are doing fast-forwarding
-	 *
 	 * We can't just use ReorderBufferAbort() here, because we need to execute
 	 * the transaction's invalidations.  This currently won't be needed if
 	 * we're just skipping over the transaction because currently we only do
@@ -627,9 +666,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * relevant syscaches.
 	 * ---
 	 */
-	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
-		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
-		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
 	{
 		for (i = 0; i < parsed->nsubxacts; i++)
 		{
@@ -640,7 +677,83 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		return;
 	}
 
-	/* tell the reorderbuffer about the surviving subtransactions */
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (already_decoded)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* tell the reorderbuffer about the surviving subtransactions */
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr, buf->endptr);
+		}
+
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ *
+ * Note that we don't skip the prepare even if we have detected a concurrent
+ * abort. It is quite possible that we had already sent some changes before
+ * detecting the abort, in which case we need to abort those changes on the
+ * subscriber. To do so, we send the prepare and then the rollback prepared,
+ * which is what happened on the publisher side as well. We could invent a
+ * new abort API that sends an abort in such cases and skips the prepare and
+ * rollback prepared, but that is not straightforward because we might have
+ * streamed this transaction by that time, in which case it is handled when
+ * the rollback is encountered. It is not impossible to optimize the
+ * concurrent abort case, but doing so would add design complexity for
+ * handling the different cases, so we leave it for now as it doesn't seem
+ * worth it.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz prepare_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		prepare_time = parsed->origin_timestamp;
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeTXNNeedSkip
+	 * for the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed, removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache invalidations
+	 * if there are any for the reasons mentioned in DecodeCommit.
+	 */
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/* Tell the reorderbuffer about the surviving subtransactions. */
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
@@ -648,33 +761,70 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 prepare_time, origin_id, origin_lsn,
+						 parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool already_decoded)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz abort_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool	skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		abort_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeTXNNeedSkip
+	 * for the reasons why we sometimes want to skip the transaction.
+	 */
+	skip_xact = DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id);
+
+	/*
+	 * Send the final rollback record if the transaction data is already
+	 * decoded and we don't need to skip it, otherwise, perform the cleanup of
+	 * the transaction.
+	 */
+	if (already_decoded && !skip_xact)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									abort_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
 	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
@@ -1080,3 +1230,24 @@ DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tuple)
 	header->t_infomask2 = xlhdr.t_infomask2;
 	header->t_hoff = xlhdr.t_hoff;
 }
+
+/*
+ * Check whether we are interested in this specific transaction.
+ *
+ * There can be several reasons we might not be interested in this
+ * transaction:
+ * 1) We might not be interested in decoding transactions up to this
+ *	  LSN. This can happen because we previously decoded it and now just
+ *	  are restarting or if we haven't assembled a consistent snapshot yet.
+ * 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
+ * 4) We are doing fast-forwarding
+ */
+static bool
+DecodeTXNNeedSkip(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+				  Oid txn_dbid, RepOriginId origin_id)
+{
+	return (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+			(txn_dbid != InvalidOid && txn_dbid != ctx->slot->data.database) ||
+			ctx->fast_forward || FilterByOrigin(ctx, origin_id));
+}
\ No newline at end of file
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 301baff244..d889fd1e81 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,18 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or decoding them at PREPARE. Keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots.
+ *
+ * We additionally remove tuplecids after decoding the transaction at prepare
+ * time as we only need to perform invalidation at rollback or commit prepared.
+ *
+ * 'txn_prepared' indicates that we have decoded the transaction at prepare
+ * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1552,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1586,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1757,9 +1794,10 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * If the transaction was (partially) streamed, we need to commit it in a
- * 'streamed' way.  That is, we first stream the remaining part of the
- * transaction, and then invoke stream_commit message.
+ * If the transaction was (partially) streamed, we need to prepare or commit
+ * it in a 'streamed' way.  That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_prepare or stream_commit message as per
+ * the case.
  */
 static void
 ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1769,29 +1807,49 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		/*
+		 * Note, we send stream prepare even if a concurrent abort is detected.
+		 * See DecodePrepare for more information.
+		 */
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
+		 * just truncate txn by removing changes and tuple_cids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
  * Set xid to detect concurrent aborts.
  *
- * While streaming an in-progress transaction there is a possibility that the
- * (sub)transaction might get aborted concurrently.  In such case if the
- * (sub)transaction has catalog update then we might decode the tuple using
- * wrong catalog version.  For example, suppose there is one catalog tuple with
- * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
- * and after that we will have two tuples (xmin: 500, xmax: 501) and
- * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
- * say 502 updates the same catalog tuple then the first tuple will be changed
- * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
- * the tuple inserted/updated in 501 after the catalog update, we will see the
- * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
- * consider that the tuple is deleted by xid 502 which is not visible to our
- * snapshot.  And when we will try to decode with that catalog tuple, it can
- * lead to a wrong result or a crash.  So, it is necessary to detect
- * concurrent aborts to allow streaming of in-progress transactions.
+ * While streaming an in-progress transaction or decoding a prepared
+ * transaction there is a possibility that the (sub)transaction might get
+ * aborted concurrently.  In such case if the (sub)transaction has catalog
+ * update then we might decode the tuple using wrong catalog version.  For
+ * example, suppose there is one catalog tuple with (xmin: 500, xmax: 0).  Now,
+ * the transaction 501 updates the catalog tuple and after that we will have
+ * two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).  Now, if 501 is
+ * aborted and some other transaction say 502 updates the same catalog tuple
+ * then the first tuple will be changed to (xmin: 500, xmax: 502).  So, the
+ * problem is that when we try to decode the tuple inserted/updated in 501
+ * after the catalog update, we will see the catalog tuple with (xmin: 500,
+ * xmax: 502) as visible because it will consider that the tuple is deleted by
+ * xid 502 which is not visible to our snapshot.  And when we will try to
+ * decode with that catalog tuple, it can lead to a wrong result or a crash.
+ * So, it is necessary to detect concurrent aborts to allow streaming of
+ * in-progress transactions or decoding of prepared transactions.
  *
  * For detecting the concurrent abort we set CheckXidAlive to the current
  * (sub)transaction's xid for which this change belongs to.  And, during
@@ -1800,7 +1858,10 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * and discard the already streamed changes on such an error.  We might have
  * already streamed some of the changes for the aborted (sub)transaction, but
  * that is fine because when we decode the abort we will stream abort message
- * to truncate the changes in the subscriber.
+ * to truncate the changes in the subscriber. Similarly, for prepared
+ * transactions, we stop decoding if concurrent abort is detected and then
+ * rollback the changes when rollback prepared is encountered. See
+ * DecodePrepare.
  */
 static inline void
 SetupCheckXidLive(TransactionId xid)
@@ -1900,7 +1961,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1912,15 +1973,19 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		specinsert = NULL;
 	}
 
-	/* Stop the stream. */
-	rb->stream_stop(rb, txn, last_lsn);
-
-	/* Remember the command ID and snapshot for the streaming run */
-	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	/*
+	 * For the streaming case, stop the stream and remember the command ID and
+	 * snapshot for the streaming run.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_stop(rb, txn, last_lsn);
+		ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	}
 }
 
 /*
- * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ * Helper function for ReorderBufferReplay and ReorderBufferStreamTXN.
  *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
@@ -2006,8 +2071,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			prev_lsn = change->lsn;
 
-			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			/*
+			 * Set the current xid to detect concurrent aborts. This is
+			 * required for the cases when we decode the changes before the
+			 * COMMIT record is processed.
+			 */
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2298,7 +2367,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2332,15 +2410,22 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the four reasons:
+		 * 1. Decoding an in-progress txn.
+		 * 2. Decoding a prepared txn.
+		 * 3. Decoding of a prepared txn that was (partially) streamed.
+		 * 4. Decoding a committed txn.
+		 *
+		 * For 1, we allow truncation of txn data by removing the changes already
+		 * streamed but still keeping other things like invalidations, snapshot,
+		 * and tuplecids. For 2 and 3, we tell ReorderBufferTruncateTXN to do a
+		 * more elaborate truncation of txn data as the entire transaction has
+		 * been decoded except for commit. For 4, as the entire txn has been
+		 * decoded, we can fully clean up the TXN reorder buffer.
 		 */
-		if (streaming)
+		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
-
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2373,17 +2458,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2413,26 +2501,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
  * transaction with ReorderBufferAssignChild.
  *
- * This interface is called once a toplevel commit is read for both streamed
- * as well as non-streamed transactions.
+ * This interface is called once a prepare or toplevel commit is read for both
+ * streamed as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferReplay(ReorderBufferTXN *txn,
+					ReorderBuffer *rb, TransactionId xid,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2462,7 +2543,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	if (txn->base_snapshot == NULL)
 	{
 		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+
+		/*
+		 * Removing this txn before a commit might result in the computation
+		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
+		 */
+		if (!rbtxn_prepared(txn))
+			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
 
@@ -2473,6 +2560,116 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							command_id, false);
 }
 
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, so skip preparing it */
+	if (txn == NULL)
+		return true;
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+					 TimestampTz prepare_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferReplay(txn, rb, xid, prepare_lsn, end_lsn, prepare_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time, RepOriginId origin_id,
+							XLogRecPtr origin_lsn, char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -2604,6 +2801,39 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
+/*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ *
+ * Note that this is a special-purpose function for prepared transactions where
+ * we don't want to clean up the TXN even when we decide to skip it. See
+ * DecodePrepare.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
 /*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index efd19573ac..a56338d495 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -174,6 +174,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +234,12 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -622,12 +629,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -645,6 +658,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+								 TimestampTz prepare_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
2.28.0.windows.1
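
For readers following the reorderbuffer changes in the patch above, here is a minimal standalone sketch of the empty-transaction path in ReorderBufferReplay(). It is not code from the patch: SketchTXN, SketchOutcome and sketch_replay are hypothetical names, and the struct models only the two fields the decision depends on. It illustrates that a transaction without a base snapshot is cleaned up right away unless it carries RBTXN_PREPARE, in which case it is kept until COMMIT PREPARED or ROLLBACK PREPARED arrives.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag value copied from the patch's reorderbuffer.h. */
#define RBTXN_PREPARE 0x0080

/* Hypothetical stand-in for ReorderBufferTXN, reduced to what matters here. */
typedef struct SketchTXN
{
	uint32_t	txn_flags;
	const void *base_snapshot;	/* NULL means nothing was decoded */
	int			cleaned_up;		/* set where ReorderBufferCleanupTXN would run */
} SketchTXN;

/* Same shape as the rbtxn_prepared() macro added by the patch. */
#define sketch_rbtxn_prepared(txn) \
	(((txn)->txn_flags & RBTXN_PREPARE) != 0)

typedef enum SketchOutcome
{
	SKETCH_REPLAYED,			/* changes would be handed to the plugin */
	SKETCH_EMPTY_CLEANED,		/* empty txn, removed immediately */
	SKETCH_EMPTY_KEPT			/* empty prepared txn, kept for later */
} SketchOutcome;

/*
 * Mirrors the early-exit branch of ReorderBufferReplay(): an empty
 * transaction is dropped unless it was prepared, because removing a
 * prepared transaction before its commit could lead to an incorrect
 * restart_lsn (see the comment in the patch).
 */
static SketchOutcome
sketch_replay(SketchTXN *txn)
{
	if (txn->base_snapshot == NULL)
	{
		if (!sketch_rbtxn_prepared(txn))
		{
			txn->cleaned_up = 1;
			return SKETCH_EMPTY_CLEANED;
		}
		return SKETCH_EMPTY_KEPT;
	}
	return SKETCH_REPLAYED;
}
```

In the patch itself the kept transaction is later removed by ReorderBufferFinishPrepared(), which calls ReorderBufferCleanupTXN() after invoking the commit_prepared or rollback_prepared callback.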

Attachment: v28-0006-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 4bcf71e267ce9de5a111994b771989dbf13baacc Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 14:50:27 +1100
Subject: [PATCH v27 6/7] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.
---
 src/backend/access/transam/twophase.c         |  33 ++-
 src/backend/replication/logical/proto.c       | 141 +++++++++-
 .../replication/logical/reorderbuffer.c       |   6 +
 src/backend/replication/logical/worker.c      | 243 +++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c   |  74 ++++++
 src/include/access/twophase.h                 |   1 +
 src/include/replication/logicalproto.h        |  38 ++-
 src/include/replication/reorderbuffer.h       |  14 +
 src/tools/pgindent/typedefs.list              |   1 +
 9 files changed, 543 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9bad9..00b4497c2d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -547,6 +547,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 	ProcArrayAdd(&ProcGlobal->allProcs[gxact->pgprocno]);
 }
 
+/*
+ * LookupGXact
+ *		Check whether a prepared transaction with the given GID exists.
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
 /*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb31182d7..cfb94d1d56 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -105,6 +105,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	commit_data->committime = pq_getmsgint64(in);
 }
 
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK]
+	 * PREPARED uses the non-streaming APIs.
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
 /*
  * Write ORIGIN to the output stream.
  */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2753db9890..5673959d14 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2665,9 +2665,15 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	strcpy(txn->gid, gid);
 
 	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
 		rb->commit_prepared(rb, txn, commit_lsn);
+	}
 	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
 		rb->rollback_prepared(rb, txn, commit_lsn);
+	}
 
 	/* cleanup: make sure there's no cache pollution */
 	ReorderBufferExecuteInvalidations(txn->ninvalidations,
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9fa816c976..7690f133c0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -741,6 +742,234 @@ apply_handle_commit(StringInfo s)
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
+/*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * The publisher may have stopped decoding a prepared transaction because
+	 * it was aborted concurrently. In that case the subscriber may never have
+	 * prepared it, yet the ROLLBACK PREPARED message still reaches us. It is
+	 * therefore ok for the GID to be missing here; only roll back when the
+	 * prepared transaction actually exists.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare_txn (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
 /*
  * Handle ORIGIN message.
  *
@@ -1086,12 +1315,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	in_remote_transaction = false;
 
-	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(commit_data.end_lsn);
-
 	/* unlink the files with serialized changes and subxact info */
 	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
 
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(commit_data.end_lsn);
+
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
@@ -1969,6 +2198,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997aed83..71ac43122c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -377,6 +391,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
 /*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
@@ -856,6 +912,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 	cleanup_rel_sync_cache(txn->xid, true);
 }
 
+/*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3445..b2628ea4e2 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535df80..c04d872e13 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +117,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +125,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare must be exactly one of PREPARE, COMMIT PREPARED or ROLLBACK PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +154,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +201,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a56338d495..0a00429890 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
 #define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -240,6 +242,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f4d4703735..4546572445 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,6 +1342,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
2.28.0.windows.1
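
To make the message layout produced by logicalrep_write_prepare() in the patch above easier to follow, here is a minimal standalone sketch of the same byte sequence: the 'P' type byte, one flags byte, three 64-bit fields in network byte order, and the NUL-terminated GID. It does not use the backend's pq_* routines. SketchPrepareMsg, the sketch_* functions and the put_u64/get_u64 helpers are hypothetical names, SKETCH_GIDSIZE assumes GIDSIZE is 200, and the bounds checks are additions of this sketch (the patch's reader copies the GID with strcpy into a GIDSIZE buffer).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flag values copied from the patch's logicalproto.h. */
#define LOGICALREP_IS_PREPARE           0x01
#define LOGICALREP_IS_COMMIT_PREPARED   0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED 0x04

#define SKETCH_GIDSIZE 200		/* assumption: matches GIDSIZE in xact.h */
#define SKETCH_FIXED_LEN 26		/* type + flags + 3 * 8 bytes */

typedef struct SketchPrepareMsg
{
	uint8_t		prepare_type;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepareMsg;

/* Store a 64-bit value most significant byte first (network order). */
static void
put_u64(uint8_t *buf, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t) (v >> (56 - 8 * i));
}

static uint64_t
get_u64(const uint8_t *buf)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Same rule as PrepareFlagsAreValid(): exactly one of the three values. */
static int
sketch_flags_valid(uint8_t flags)
{
	return flags == LOGICALREP_IS_PREPARE ||
		flags == LOGICALREP_IS_COMMIT_PREPARED ||
		flags == LOGICALREP_IS_ROLLBACK_PREPARED;
}

/* Returns the number of bytes written, or 0 on invalid flags or short buffer. */
static size_t
sketch_write_prepare(uint8_t *buf, size_t buflen, const SketchPrepareMsg *msg)
{
	size_t		gidlen = strlen(msg->gid) + 1;	/* include the NUL */

	if (!sketch_flags_valid(msg->prepare_type) ||
		SKETCH_FIXED_LEN + gidlen > buflen)
		return 0;

	buf[0] = 'P';				/* LOGICAL_REP_MSG_PREPARE */
	buf[1] = msg->prepare_type;
	put_u64(buf + 2, msg->prepare_lsn);
	put_u64(buf + 10, msg->end_lsn);
	put_u64(buf + 18, (uint64_t) msg->prepare_time);
	memcpy(buf + SKETCH_FIXED_LEN, msg->gid, gidlen);
	return SKETCH_FIXED_LEN + gidlen;
}

/* Returns 1 on success, 0 if the message is malformed. */
static int
sketch_read_prepare(const uint8_t *buf, size_t len, SketchPrepareMsg *msg)
{
	const uint8_t *gid = buf + SKETCH_FIXED_LEN;
	const uint8_t *nul;

	if (len <= SKETCH_FIXED_LEN || buf[0] != 'P' || !sketch_flags_valid(buf[1]))
		return 0;

	nul = memchr(gid, '\0', len - SKETCH_FIXED_LEN);
	if (nul == NULL || (size_t) (nul - gid) + 1 > SKETCH_GIDSIZE)
		return 0;

	msg->prepare_type = buf[1];
	msg->prepare_lsn = get_u64(buf + 2);
	msg->end_lsn = get_u64(buf + 10);
	msg->prepare_time = (int64_t) get_u64(buf + 18);
	memcpy(msg->gid, gid, (size_t) (nul - gid) + 1);
	return 1;
}
```

In the patch the type byte is consumed by apply_dispatch() before logicalrep_read_prepare() runs; the sketch checks it in the reader only to stay self-contained. The STREAM PREPARE message differs by using 'p' as the type byte and sending a 32-bit xid before the flags.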

Attachment: v28-0007-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 1d52a068992e785512f0bd3b571bb40ad5615e1a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 24 Nov 2020 15:02:24 +1100
Subject: [PATCH v27 7/7] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl       | 338 ++++++++++++
 .../subscription/t/021_twophase_stream.pl     | 517 ++++++++++++++++++
 .../subscription/t/022_twophase_cascade.pl    | 282 ++++++++++
 .../t/023_twophase_cascade_stream.pl          | 319 +++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000000..9c1d681738
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are rolled back on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000000..9ec1e31bd5
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000000..0f955300eb
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000000..3c6470d184
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
2.28.0.windows.1
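
A note on the expected counts in the streaming tests above: the value 3334 follows from the statements inside the prepared transaction. The standalone sketch below (plain Perl, no PostgreSQL needed) only models the key column "a" of test_tab; the hash %rows and the variable $expected are illustrative names, not part of the patch. It assumes the table starts with keys 1 and 2, as in the test setup.

```perl
use strict;
use warnings;

# Model test_tab by its key column "a" only.
# Assumption: the table holds keys 1 and 2 before the 2PC transaction starts.
my %rows = map { $_ => 1 } (1, 2);

# INSERT INTO test_tab SELECT i, ... FROM generate_series(3, 5000) s(i);
$rows{$_} = 1 for 3 .. 5000;

# UPDATE ... WHERE mod(a,2) = 0 only rewrites column b, so no keys change.

# DELETE FROM test_tab WHERE mod(a,3) = 0;
delete @rows{ grep { $_ % 3 == 0 } keys %rows };

my $expected = scalar keys %rows;
print "rows after COMMIT PREPARED: $expected\n";

# a = 5 is not a multiple of 3, so it is still present after the commit.
print "row a = 5 present: ", (exists $rows{5} ? "yes" : "no"), "\n";
```

This is why the DELETE of a = 5 issued outside the prepared transaction leaves the final count unchanged: 5000 keys minus the 1666 multiples of 3 gives 3334, and key 5 is among the survivors.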

v28-0008-Support-2PC-documentation.patchapplication/octet-stream; name=v28-0008-Support-2PC-documentation.patchDownload
From 1619b66506f0a559b24b425773ed0f5dfc3b9023 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 25 Nov 2020 21:42:18 -0500
Subject: [PATCH v28] Support-2PC-documentation.

Add documentation about two-phase commit support in Logical Decoding.
---
 doc/src/sgml/logicaldecoding.sgml | 97 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 96 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 73673a0..fcc43fd 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -165,7 +165,57 @@ COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
-  </sect1>
+
+  <para>
+  The following example shows how logical decoding can be used to handle transactions
+  that use a two-phase commit. Before you use two-phase commit commands, you must set
+  <varname>max_prepared_transactions</varname> to at least 1. You must also set the 
+  option 'two-phase-commit' to 1 while calling <function>pg_logical_slot_get_changes</function>.
+  </para>
+<programlisting>
+postgres=# BEGIN;
+postgres=*# INSERT INTO data(data) VALUES('5');
+postgres=*# PREPARE TRANSACTION 'test_prepared1';
+
+postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/1689DC0 | 529 | BEGIN 529
+ 0/1689DC0 | 529 | table public.data: INSERT: id[integer]:3 data[text]:'5'
+ 0/1689FC0 | 529 | PREPARE TRANSACTION 'test_prepared1', txid 529
+(3 rows)
+
+postgres=# COMMIT PREPARED 'test_prepared1';
+COMMIT PREPARED
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                    data                    
+-----------+-----+--------------------------------------------
+ 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
+(1 row)
+
+postgres=# -- You can also roll back a prepared transaction
+postgres=# BEGIN;
+BEGIN
+postgres=*# INSERT INTO data(data) VALUES('6');INSERT 0 1
+postgres=*# PREPARE TRANSACTION 'test_prepared2';PREPARE TRANSACTION
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/168A180 | 530 | BEGIN 530
+ 0/168A1E8 | 530 | table public.data: INSERT: id[integer]:4 data[text]:'6'
+ 0/168A430 | 530 | PREPARE TRANSACTION 'test_prepared2', txid 530
+(3 rows)
+
+postgres=# ROLLBACK PREPARED 'test_prepared1';ERROR:  prepared transaction with identifier "test_prepared1" does not exist
+postgres=# ROLLBACK PREPARED 'test_prepared2';
+ROLLBACK PREPARED
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                     data                     
+-----------+-----+----------------------------------------------
+ 0/168A4B8 | 530 | ROLLBACK PREPARED 'test_prepared2', txid 530
+(1 row)
+</programlisting>
+</sect1>
 
   <sect1 id="logicaldecoding-explanation">
    <title>Logical Decoding Concepts</title>
@@ -1103,4 +1153,49 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
    </para>
 
   </sect1>
+
+  <sect1 id="logicaldecoding-two-phase-commits">
+   <title>Two-phase commit support for Logical Decoding</title>
+
+   <para>
+   With the basic output plugin callbacks (e.g., <function>begin_cb</function>,
+   <function>change_cb</function>, <function>commit_cb</function> and
+   <function>message_cb</function>) two-phase commit commands like
+   <command>PREPARE TRANSACTION</command>, <command>COMMIT PREPARED</command>
+   and <command>ROLLBACK PREPARED</command> are not decoded correctly.
+   While <command>PREPARE TRANSACTION</command> is ignored,
+   <command>COMMIT PREPARED</command> is decoded as a <command>COMMIT</command> and
+   <command>ROLLBACK PREPARED</command> is decoded as a <command>ROLLBACK</command>.
+   </para>
+
+   <para>
+   An output plugin may provide additional callbacks to support two-phase commit commands.
+   There are multiple two-phase commit callbacks that are required,
+   (<function>prepare_cb</function>, <function>commit_prepared_cb</function>, 
+   <function>rollback_prepared_cb</function> and <function>stream_prepare_cb</function>)
+   and an optional callback (<function>filter_prepare_cb</function>).
+   </para>
+
+   <para>
+   If the output plugin callbacks for decoding two-phase commit commands are provided,
+   then on <command>PREPARE TRANSACTION</command>, the changes of that transaction are
+   decoded, passed to the output plugin and the <function>prepare_cb</function>
+   callback is invoked. This differs from the basic decoding setup where changes are
+   only passed to the output plugin when a transaction is committed.
+   </para>
+
+   <para>
+   When a prepared transaction is rolled back using <command>ROLLBACK PREPARED</command>,
+   then the <function>rollback_prepared_cb</function> is invoked and when the
+   prepared transaction is committed using <command>COMMIT PREPARED</command>,
+   then the <function>commit_prepared_cb</function> callback is invoked.
+   </para>
+
+   <para>
+   Optionally the output plugin can specify a name pattern in the 
+   <function>filter_prepare_cb</function> and transactions with gid containing
+   that name pattern will not be decoded as a two-phase commit transaction. 
+   </para>
+
+  </sect1>
  </chapter>
-- 
1.8.3.1

#134Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#132)

On Wed, Nov 25, 2020 at 11:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

One problem with this patch is: What if we have assembled a consistent
snapshot after prepare and before commit prepared. In that case, it
will currently just send commit prepared record which would be a bad
idea. It should decode the entire transaction for such cases at commit
prepared time. This same problem is noticed by Arseny Sher, see
problem-2 in email [1].

I'm not sure I understand how you could assemble a consistent snapshot
after prepare but before commit prepared?
Doesn't a consistent snapshot require that all in-progress
transactions commit? I've tried starting a new subscription after
a prepare on the publisher and I see that the create subscription just
hangs till the transaction on the publisher is either committed or
rolled back.
Even if I try to create a replication slot using
pg_create_logical_replication_slot when a transaction has been
prepared but not yet committed, it just hangs till the transaction is
committed/aborted.

regards,
Ajin Cherian
Fujitsu Australia

#135Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#134)

On Thu, Nov 26, 2020 at 4:24 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Wed, Nov 25, 2020 at 11:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

One problem with this patch is: What if we have assembled a consistent
snapshot after prepare and before commit prepared. In that case, it
will currently just send commit prepared record which would be a bad
idea. It should decode the entire transaction for such cases at commit
prepared time. This same problem is noticed by Arseny Sher, see
problem-2 in email [1].

I'm not sure I understand how you could assemble a consistent snapshot
after prepare but before commit prepared?
Doesn't a consistent snapshot require that all in-progress
transactions commit?

By above, I don't mean that the transaction is not committed. I am
talking about the timing of WAL. It is possible that between WAL of
Prepare and Commit Prepared, we reach a consistent state.

I've tried start a new subscription after
a prepare on the publisher and I see that the create subscription just
hangs till the transaction on the publisher is either committed or
rolled back.

I think what you need to do to reproduce this is to follow the
snapshot machinery in SnapBuildFindSnapshot. Basically, first, start a
transaction (say transaction-id is 500) and do some operations but
don't commit. Here, if you create a slot (via subscription or
otherwise), it will wait for 500 to complete and make the state as
SNAPBUILD_BUILDING_SNAPSHOT. Here, you can commit 500 and then having
debugger in that state, start another transaction (say 501), do some
operations but don't commit. Next time when you reach this function,
it will change the state to SNAPBUILD_FULL_SNAPSHOT and wait for 501,
now you can start another transaction (say 502) which you can prepare
but don't commit. Again start one more transaction 503, do some ops,
commit both 501 and 503. At this stage we somehow need to ensure that
an XLOG_RUNNING_XACTS record gets logged. Then commit prepared 502. Now, I think you
should notice that the consistent point is reached after 502's prepare
and before its commit. Now, this is just a theoretical scenario, you
need something on these lines and probably a way to force
XLOG_RUNNING_XACTS WAL (probably via debugger or some other way) at
the right times to reproduce it.

Thanks for trying to build a test case for this, it is really helpful.

--
With Regards,
Amit Kapila.

#136Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#135)

On Thu, Nov 26, 2020 at 10:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think what you need to do to reproduce this is to follow the
snapshot machinery in SnapBuildFindSnapshot. Basically, first, start a
transaction (say transaction-id is 500) and do some operations but
don't commit. Here, if you create a slot (via subscription or
otherwise), it will wait for 500 to complete and make the state as
SNAPBUILD_BUILDING_SNAPSHOT. Here, you can commit 500 and then having
debugger in that state, start another transaction (say 501), do some
operations but don't commit. Next time when you reach this function,
it will change the state to SNAPBUILD_FULL_SNAPSHOT and wait for 501,
now you can start another transaction (say 502) which you can prepare
but don't commit. Again start one more transaction 503, do some ops,
commit both 501 and 503. At this stage we somehow need to ensure that
an XLOG_RUNNING_XACTS record gets logged. Then commit prepared 502. Now, I think you
should notice that the consistent point is reached after 502's prepare
and before its commit. Now, this is just a theoretical scenario, you
need something on these lines and probably a way to force
XLOG_RUNNING_XACTS WAL (probably via debugger or some other way) at
the right times to reproduce it.

Thanks for trying to build a test case for this, it is really helpful.

I tried the above steps, I was able to get the builder state to
SNAPBUILD_BUILDING_SNAPSHOT but was not able to get into the
SNAPBUILD_FULL_SNAPSHOT state.
Instead the code moves straight from SNAPBUILD_BUILDING_SNAPSHOT to
SNAPBUILD_CONSISTENT state.

In the function SnapBuildFindSnapshot, either the following check fails:

1327: TransactionIdPrecedesOrEquals(SnapBuildNextPhaseAt(builder),
running->oldestRunningXid))

because SnapBuildNextPhaseAt (which is the same as running->nextXid)
is higher than oldestRunningXid, or, when both are the same, the
condition below (which comes earlier in the code) is hit first

1247: if (running->oldestRunningXid == running->nextXid)

and then the builder moves straight into the SNAPBUILD_CONSISTENT
state. At no point will the nextXid be less than oldestRunningXid. In
my sessions, I commit multiple txns, hoping to bump up oldestRunningXid,
I do checkpoints, and I have made sure the XLOG_RUNNING_XACTS records
are being inserted. But while iterating into SnapBuildFindSnapshot with
a new XLOG_RUNNING_XACTS record, the oldestRunningXid is incremented
one xid at a time, which eventually makes it catch up with
running->nextXid and reach the SNAPBUILD_CONSISTENT state without
entering the SNAPBUILD_FULL_SNAPSHOT state.
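
To make the transitions discussed above easier to follow, here is a minimal, simplified model of the state machine in C. It is only a sketch under stated assumptions: the names (ModelBuilder, ModelRunningXacts, model_find_snapshot) are made up for illustration, xid wraparound and serialized snapshots are ignored, and it is not the actual snapbuild.c code. Each call stands for processing one XLOG_RUNNING_XACTS record.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Simplified stand-ins for the SNAPBUILD_* states named in this thread. */
typedef enum
{
    MODEL_START,
    MODEL_BUILDING_SNAPSHOT,
    MODEL_FULL_SNAPSHOT,
    MODEL_CONSISTENT
} ModelState;

typedef struct
{
    ModelState state;
    Xid        next_phase_at;   /* xid that must be finished before the next phase */
} ModelBuilder;

/* The two fields of a running-xacts record that matter for this model. */
typedef struct
{
    Xid oldest_running_xid;
    Xid next_xid;
} ModelRunningXacts;

/* Process one running-xacts record (assumption: plain <= instead of
 * wraparound-aware xid comparison). */
static void
model_find_snapshot(ModelBuilder *b, const ModelRunningXacts *r)
{
    if (b->state == MODEL_CONSISTENT)
        return;

    if (r->oldest_running_xid == r->next_xid)
    {
        /* Nothing is running: jump straight to consistent. */
        b->state = MODEL_CONSISTENT;
        b->next_phase_at = 0;
    }
    else if (b->state == MODEL_START)
    {
        b->state = MODEL_BUILDING_SNAPSHOT;
        b->next_phase_at = r->next_xid;
    }
    else if (b->state == MODEL_BUILDING_SNAPSHOT &&
             b->next_phase_at <= r->oldest_running_xid)
    {
        b->state = MODEL_FULL_SNAPSHOT;
        b->next_phase_at = r->next_xid;
    }
    else if (b->state == MODEL_FULL_SNAPSHOT &&
             b->next_phase_at <= r->oldest_running_xid)
    {
        b->state = MODEL_CONSISTENT;
    }
}
```

In this model, feeding the records {500, 501}, {501, 502} and {502, 504} (oldest running xid, next xid) walks through BUILDING_SNAPSHOT and FULL_SNAPSHOT and ends in CONSISTENT while 502 is still prepared, whereas any record with both fields equal takes the shortcut described above.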

regards,
Ajin Cherian
Fujitsu Australia

#137Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#136)

On Fri, Nov 27, 2020 at 6:35 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Nov 26, 2020 at 10:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think what you need to do to reproduce this is to follow the
snapshot machinery in SnapBuildFindSnapshot. Basically, first, start a
transaction (say transaction-id is 500) and do some operations but
don't commit. Here, if you create a slot (via subscription or
otherwise), it will wait for 500 to complete and make the state as
SNAPBUILD_BUILDING_SNAPSHOT. Here, you can commit 500 and then having
debugger in that state, start another transaction (say 501), do some
operations but don't commit. Next time when you reach this function,
it will change the state to SNAPBUILD_FULL_SNAPSHOT and wait for 501,
now you can start another transaction (say 502) which you can prepare
but don't commit. Again start one more transaction 503, do some ops,
commit both 501 and 503. At this stage we somehow need to ensure that
an XLOG_RUNNING_XACTS record gets logged. Then commit prepared 502. Now, I think you
should notice that the consistent point is reached after 502's prepare
and before its commit. Now, this is just a theoretical scenario, you
need something on these lines and probably a way to force
XLOG_RUNNING_XACTS WAL (probably via debugger or some other way) at
the right times to reproduce it.

Thanks for trying to build a test case for this, it is really helpful.

I tried the above steps, I was able to get the builder state to
SNAPBUILD_BUILDING_SNAPSHOT but was not able to get into the
SNAPBUILD_FULL_SNAPSHOT state.
Instead the code moves straight from SNAPBUILD_BUILDING_SNAPSHOT to
SNAPBUILD_CONSISTENT state.

I see the code coverage report and it appears that part of the code
(get the snapshot machinery in SNAPBUILD_FULL_SNAPSHOT state) is
covered by existing tests [1]. So, another idea you can try is to put
a break (say while (1)) in that part of code and run regression tests
(most probably the test_decoding or subscription tests should be
sufficient to hit). Then once you found which existing test covers
that, you can try to generate prepared transaction behavior as
mentioned above.

[1]: https://coverage.postgresql.org/src/backend/replication/logical/snapbuild.c.gcov.html

--
With Regards,
Amit Kapila.

#138Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#137)

On Sun, Nov 29, 2020 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Then once you found which existing test covers
that, you can try to generate prepared transaction behavior as
mentioned above.

I was able to find out the test case that exercises that code, it is
the ondisk_startup spec in test_decoding. Using that, I was able to
create the problem with the following setup:
Using 4 sessions (this could be optimized to 3, but just sharing what
I've tested):

s1(session 1):
begin;
postgres=# begin;
BEGIN
postgres=*# SELECT pg_current_xact_id();
pg_current_xact_id
--------------------
546
(1 row)
--------------------the above commands leave a transaction running
s2:
CREATE TABLE do_write(id serial primary key);
SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot', 'test_decoding');

---------------------this will hang because txn 546 is pending

s3:
postgres=# begin;
BEGIN
postgres=*# SELECT pg_current_xact_id();
pg_current_xact_id
--------------------
547
(1 row)
-------------------------------- leave another txn pending---

s1:
postgres=*# ALTER TABLE do_write ADD COLUMN addedbys2 int;
ALTER TABLE
postgres=*# commit;
------------------------------commit the first txn; this will cause
state to move to SNAPBUILD_FULL_SNAPSHOT state
2020-11-30 03:31:07.354 EST [16312] LOG: logical decoding found
initial consistent point at 0/1730A18
2020-11-30 03:31:07.354 EST [16312] DETAIL: Waiting for transactions
(approximately 1) older than 553 to end.

s4:
postgres=# begin;
BEGIN
postgres=*# INSERT INTO do_write DEFAULT VALUES;
INSERT 0 1
postgres=*# prepare transaction 'test1';
PREPARE TRANSACTION
-------------- leave this transaction prepared

s3:
postgres=*# commit;
COMMIT
----------------- this will cause s2 call to return and a consistent
point has been reached.
2020-11-30 03:31:34.200 EST [16312] LOG: logical decoding found
consistent point at 0/1730D58

s4:
commit prepared 'test1';

s2:
postgres=# SELECT * FROM pg_logical_slot_get_changes('isolation_slot',
NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1');
lsn | xid | data
-----------+-----+-------------------------
0/1730FC8 | 553 | COMMIT PREPARED 'test1'
(1 row)

In pg_logical_slot_get_changes() we see only the Commit Prepared but
no insert and no prepare command. I debugged this and I see that in
DecodePrepare, the
prepare is skipped because the prepare lsn is prior to the
start_decoding_at point and is skipped in SnapBuildXactNeedsSkip. So,
the reason for skipping
the PREPARE is similar to the reason why it would have been skipped on
a restart after a previous decode run.
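
The skip described above boils down to an LSN comparison. The following is a minimal sketch of that idea, not the real SnapBuildXactNeedsSkip code: make_lsn and needs_skip are hypothetical helper names, the consistent point from the log (0/1730D58) is used as a stand-in for start_decoding_at, and the prepare LSN used in the usage note is an invented value, since the real one was not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;

/* Build an LSN from the "hi/lo" hex notation used in the logs,
 * e.g. 0/1730D58 becomes make_lsn(0, 0x1730D58). */
static Lsn
make_lsn(uint32_t hi, uint32_t lo)
{
    return ((Lsn) hi << 32) | lo;
}

/* Records located before start_decoding_at are not decoded. */
static bool
needs_skip(Lsn start_decoding_at, Lsn record_lsn)
{
    return record_lsn < start_decoding_at;
}
```

With start_decoding_at at 0/1730D58, the COMMIT PREPARED at 0/1730FC8 is decoded, while a PREPARE at any earlier position (say a hypothetical 0/1730C00) is skipped, which matches the output seen in s2.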

One possible fix would be similar to what you suggested: in
DecodePrepare, add the check DecodingContextReady(ctx), which if
false would indicate that the PREPARE was prior to a consistent
snapshot, and if so, set a flag value in txn accordingly (say
RBTXN_PREPARE_NOT_DECODED?). If this flag is detected while handling
the COMMIT PREPARED, then handle it like you would handle a COMMIT.
This would ensure that all the changes of the transaction are sent out
and, at the same time, the subscriber side does not need to try and
handle a prepared transaction that does not exist on its side.

Let me know what you think of this?

regards,
Ajin Cherian
Fujitsu Australia

#139Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#138)

On Mon, Nov 30, 2020 at 2:36 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Sun, Nov 29, 2020 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Then once you found which existing test covers
that, you can try to generate prepared transaction behavior as
mentioned above.

I was able to find out the test case that exercises that code, it is
the ondisk_startup spec in test_decoding. Using that, I was able to
create the problem with the following setup:
Using 4 sessions (this could be optimized to 3, but just sharing what
I've tested):

s1(session 1):
begin;
postgres=# begin;
BEGIN
postgres=*# SELECT pg_current_xact_id();
pg_current_xact_id
--------------------
546
(1 row)
--------------------the above commands leave a transaction running
s2:
CREATE TABLE do_write(id serial primary key);
SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot', 'test_decoding');

---------------------this will hang because of 546 txn is pending

s3:
postgres=# begin;
BEGIN
postgres=*# SELECT pg_current_xact_id();
pg_current_xact_id
--------------------
547
(1 row)
-------------------------------- leave another txn pending---

s1:
postgres=*# ALTER TABLE do_write ADD COLUMN addedbys2 int;
ALTER TABLE
postgres=*# commit;
------------------------------commit the first txn; this will cause
state to move to SNAPBUILD_FULL_SNAPSHOT state
2020-11-30 03:31:07.354 EST [16312] LOG: logical decoding found
initial consistent point at 0/1730A18
2020-11-30 03:31:07.354 EST [16312] DETAIL: Waiting for transactions
(approximately 1) older than 553 to end.

s4:
postgres=# begin;
BEGIN
postgres=*# INSERT INTO do_write DEFAULT VALUES;
INSERT 0 1
postgres=*# prepare transaction 'test1';
PREPARE TRANSACTION
-------------- leave this transaction prepared

s3:
postgres=*# commit;
COMMIT
----------------- this will cause s2 call to return and a consistent
point has been reached.
2020-11-30 03:31:34.200 EST [16312] LOG: logical decoding found
consistent point at 0/1730D58

s4:
commit prepared 'test1';

s2:
postgres=# SELECT * FROM pg_logical_slot_get_changes('isolation_slot',
NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0',
'skip-empty-xacts', '1');
lsn | xid | data
-----------+-----+-------------------------
0/1730FC8 | 553 | COMMIT PREPARED 'test1'
(1 row)

In pg_logical_slot_get_changes() we see only the Commit Prepared but
no insert and no prepare command. I debugged this and I see that in
DecodePrepare, the
prepare is skipped because the prepare lsn is prior to the
start_decoding_at point and is skipped in SnapBuildXactNeedsSkip.

So what caused it to skip due to start_decoding_at? Because the commit
where the snapshot became consistent is after Prepare. Does it happen
due to the below code in SnapBuildFindSnapshot() where we bump
start_decoding_at?

{
...
if (running->oldestRunningXid == running->nextXid)
{
if (builder->start_decoding_at == InvalidXLogRecPtr ||
builder->start_decoding_at <= lsn)
/* can decode everything after this */
builder->start_decoding_at = lsn + 1;

So,
the reason for skipping
the PREPARE is similar to the reason why it would have been skipped on
a restart after a previous decode run.

One possible fix would be similar to what you suggested, in
DecodePrepare , add the check DecodingContextReady(ctx), which if
false would indicate that the
PREPARE was prior to a consistent snapshot and if so, set a flag value
in txn accordingly

Sure, but you can see in your example above it got skipped due to
start_decoding_at not due to DecodingContextReady. So, the problem as
mentioned by me previously was how we distinguish those cases because
it can skip due to start_decoding_at during restart as well when we
would have already sent the prepare to the subscriber.

One idea could be that the subscriber skips the transaction if it sees
the transaction is already prepared. We already skip changes in apply
worker (subscriber) if they are performed via tablesync worker, see
should_apply_changes_for_rel. This will be a different thing but I am
trying to indicate that something similar is already done in
subscriber. I am not sure if we can detect this in publisher, if so,
that would be also worth considering and might be better.

Thoughts?

--
With Regards,
Amit Kapila.

#140Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#139)

On Tue, Dec 1, 2020 at 12:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

So what caused it to skip due to start_decoding_at? Because the commit
where the snapshot became consistent is after Prepare. Does it happen
due to the below code in SnapBuildFindSnapshot() where we bump
start_decoding_at.

{
...
if (running->oldestRunningXid == running->nextXid)
{
if (builder->start_decoding_at == InvalidXLogRecPtr ||
builder->start_decoding_at <= lsn)
/* can decode everything after this */
builder->start_decoding_at = lsn + 1;

I think the reason is that in the function
DecodingContextFindStartpoint(), the code
loops till it finds the consistent snapshot. Then once consistent
snapshot is found, it sets
slot->data.confirmed_flush = ctx->reader->EndRecPtr; This will be used
as the start_decoding_at when the slot is
restarted for decoding.

Sure, but you can see in your example above it got skipped due to
start_decoding_at not due to DecodingContextReady. So, the problem as
mentioned by me previously was how we distinguish those cases because
it can skip due to start_decoding_at during restart as well when we
would have already sent the prepare to the subscriber.

The distinguishing factor is that at restart, the Prepare does satisfy
DecodingContextReady (because the snapshot is consistent then).
In both cases, the prepare is prior to start_decoding_at, but when the
prepare is before a consistent point,
it does not satisfy DecodingContextReady. Which is why I suggested
using the check DecodingContextReady to mark the prepare as 'Not
decoded'.

regards,
Ajin Cherian
Fujitsu Australia

#141Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#140)

On Tue, Dec 1, 2020 at 7:55 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Dec 1, 2020 at 12:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Sure, but you can see in your example above it got skipped due to
start_decoding_at not due to DecodingContextReady. So, the problem as
mentioned by me previously was how we distinguish those cases because
it can skip due to start_decoding_at during restart as well when we
would have already sent the prepare to the subscriber.

The distinguishing factor is that at restart, the Prepare does satisfy
DecodingContextReady (because the snapshot is consistent then).
In both cases, the prepare is prior to start_decoding_at, but when the
prepare is before a consistent point,
it does not satisfy DecodingContextReady.

I think it won't be true when we reuse some already serialized
snapshot from some other slot. It is possible that we wouldn't have
encountered such a serialized snapshot while creating a slot but later
during replication, we might use it because by that time some other
slot has serialized the one at that point.

--
With Regards,
Amit Kapila.

#142Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#139)

On Mon, Nov 30, 2020 at 7:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 30, 2020 at 2:36 PM Ajin Cherian <itsajin@gmail.com> wrote:

Sure, but you can see in your example above it got skipped due to
start_decoding_at not due to DecodingContextReady. So, the problem as
mentioned by me previously was how we distinguish those cases because
it can skip due to start_decoding_at during restart as well when we
would have already sent the prepare to the subscriber.

One idea could be that the subscriber skips the transaction if it sees
the transaction is already prepared.

To skip it, we need to send GID in begin message and then on
subscriber-side, check if the prepared xact already exists, if so then
set a flag. The flag needs to be set in begin/start_stream and reset
in stop_stream/commit/abort. Using the flag, we can skip the entire
contents of the prepared xact. In ReorderBuffer-side also, we need to
get and set GID in txn even when we skip it because we need to send
the same at commit time. In this solution, we won't be able to send it
during normal start_stream because by that time we won't know GID and
I think that won't be required. Note that this is only required when
we skipped sending prepare, otherwise, we just need to send
Commit-Prepared at commit time.

Another way to solve this problem via publisher-side is to maintain in
some file at slot level whether we have sent prepare for a particular
txn? Basically, after sending prepare, we need to update the slot
information on disk to indicate that the particular GID is sent (we
can probably store GID and LSN of Prepare). Then next time whenever we
have to skip prepare due to whatever reason, we can check the
existence of persistent information on disk for that GID, if it exists
then we need to send just Commit Prepared, otherwise, the entire
transaction. We can remove this information during or after
CheckPointSnapBuild, basically, we can remove the information of all
GID's that are after cutoff LSN computed via
ReplicationSlotsComputeLogicalRestartLSN. Now, we can even think of
removing this information after Commit Prepared but not sure if that
is correct because we can't lose this information unless
start_decoding_at (or restart_lsn) is moved past the commit lsn.
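
The publisher-side bookkeeping described above could look roughly like the sketch below. Everything in it is an assumption for illustration: SentPrepare, CommitPreparedAction and action_for_skipped_prepare are invented names, and the per-slot information is modelled as a plain in-memory array rather than a file on disk. It only shows the decision taken when a PREPARE had to be skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One remembered "prepare was already sent" entry for a slot (hypothetical). */
typedef struct
{
    const char *gid;
    uint64_t    prepare_lsn;
} SentPrepare;

typedef enum
{
    SEND_COMMIT_PREPARED_ONLY,  /* prepare already went out earlier */
    SEND_ENTIRE_XACT            /* prepare was never sent: replay everything */
} CommitPreparedAction;

/* Decide what to send at COMMIT PREPARED time for a skipped PREPARE. */
static CommitPreparedAction
action_for_skipped_prepare(const SentPrepare *sent, size_t nsent, const char *gid)
{
    for (size_t i = 0; i < nsent; i++)
    {
        if (strcmp(sent[i].gid, gid) == 0)
            return SEND_COMMIT_PREPARED_ONLY;
    }
    return SEND_ENTIRE_XACT;
}
```

A real implementation would also have to persist and prune these entries, which is exactly the part whose timing is still an open question in the paragraph above.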

Now, to persist this information, there could be multiple
possibilities (a) maintain the flexible array for GID's at the end of
ReplicationSlotPersistentData, (b) have a separate state file per-slot
for prepared xacts, (c) have a separate state file for each prepared
xact per-slot.

With (a) during upgrade from the previous version there could be a
problem because the previous data won't match new data but I am not
sure if we maintain slots info intact after upgrade. I think (c) would
be simplest but OTOH, having many such files (in case there are more
prepared xacts) per-slot might not be a good idea.

One more thing that needs to be thought about is when we are sending
the entire xact at commit time whether we will send prepare
separately? Because, if we don't send it separately, then later
allowing the PREPARE on the master to wait for prepare via subscribers
won't be possible?

Thoughts?

--
With Regards,
Amit Kapila.

#143Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#142)

On Tue, Dec 1, 2020 at 6:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

One idea could be that the subscriber skips the transaction if it sees
the transaction is already prepared.

To skip it, we need to send GID in begin message and then on
subscriber-side, check if the prepared xact already exists, if so then
set a flag. The flag needs to be set in begin/start_stream and reset
in stop_stream/commit/abort. Using the flag, we can skip the entire
contents of the prepared xact. In ReorderBuffer-side also, we need to
get and set GID in txn even when we skip it because we need to send
the same at commit time. In this solution, we won't be able to send it
during normal start_stream because by that time we won't know GID and
I think that won't be required. Note that this is only required when
we skipped sending prepare, otherwise, we just need to send
Commit-Prepared at commit time.

After going through both the solutions, I think the above one is a better idea.
I also think, rather than change the protocol for the regular begin,
we could have
a special begin_prepare for prepared txns specifically. This way we won't affect
non-prepared transactions. We will need to add in a begin_prepare callback
as well, which has the gid as one of the parameters. Other than this,
in ReorderBufferFinishPrepared, if the txn hasn't already been
prepared (because it was skipped in DecodePrepare), then we set
prepared flag and call
ReorderBufferReplay before calling commit-prepared callback.

At the subscriber side, on receipt of the special begin-prepare, we
first check if the gid is of an already
prepared txn. If yes, then we set a flag such that the rest of the
transaction's changes are not applied but skipped. If it's not
a gid that has already been prepared, then continue to apply changes
as you would otherwise. So, this is the
approach I'd pick. The drawback is probably that we send extra
prepares after a restart, which might be quite common
while using test_decoding but not so common when using the pgoutput
and real world scenarios of pub/sub.
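
The subscriber-side behaviour proposed above can be sketched as follows. This is a toy model, not apply-worker code: ApplyState, on_begin_prepare, on_change and on_end are hypothetical names, the set of locally prepared GIDs is a plain array, and "applying" a change is reduced to bumping a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct
{
    const char **prepared_gids;   /* GIDs already prepared on the subscriber */
    size_t       nprepared;
    bool         skip_current;    /* set at begin-prepare, reset at end of xact */
    int          applied_changes; /* stand-in for actually applying a change */
} ApplyState;

static bool
gid_is_prepared(const ApplyState *s, const char *gid)
{
    for (size_t i = 0; i < s->nprepared; i++)
    {
        if (strcmp(s->prepared_gids[i], gid) == 0)
            return true;
    }
    return false;
}

/* Special begin-prepare message: it carries the GID, so we can decide here. */
static void
on_begin_prepare(ApplyState *s, const char *gid)
{
    s->skip_current = gid_is_prepared(s, gid);
}

/* Changes of an already prepared transaction are skipped, not applied. */
static void
on_change(ApplyState *s)
{
    if (!s->skip_current)
        s->applied_changes++;
}

/* Prepare/commit/abort of the transaction: always clear the flag. */
static void
on_end(ApplyState *s)
{
    s->skip_current = false;
}
```

Keeping the decision in the begin-prepare handler means regular (non-prepared) transactions never consult the flag, which is the point of using a separate message instead of changing the regular begin.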

The second approach is a bit more involved requiring file creation and
manipulation as well as the overhead of having to
write to a file on every prepare which might be a performance bottleneck.

Let me know what you think.

regards,
Ajin Cherian
Fujitsu Australia

#144Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#143)

On Wed, Dec 2, 2020 at 12:47 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Dec 1, 2020 at 6:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

One idea could be that the subscriber skips the transaction if it sees
the transaction is already prepared.

To skip it, we need to send GID in begin message and then on
subscriber-side, check if the prepared xact already exists, if so then
set a flag. The flag needs to be set in begin/start_stream and reset
in stop_stream/commit/abort. Using the flag, we can skip the entire
contents of the prepared xact. In ReorderBuffer-side also, we need to
get and set GID in txn even when we skip it because we need to send
the same at commit time. In this solution, we won't be able to send it
during normal start_stream because by that time we won't know GID and
I think that won't be required. Note that this is only required when
we skipped sending prepare, otherwise, we just need to send
Commit-Prepared at commit time.

After going through both the solutions, I think the above one is a better idea.
I also think, rather than change the protocol for the regular begin,
we could have
a special begin_prepare for prepared txns specifically. This way we won't affect
non-prepared transactions. We will need to add in a begin_prepare callback
as well, which has the gid as one of the parameters. Other than this,
in ReorderBufferFinishPrepared, if the txn hasn't already been
prepared (because it was skipped in DecodePrepare), then we set
prepared flag and call
ReorderBufferReplay before calling commit-prepared callback.

At the subscriber side, on receipt of the special begin-prepare, we
first check if the gid is of an already
prepared txn. If yes, then we set a flag such that the rest of the
transaction's changes are not applied but skipped. If it's not
a gid that has already been prepared, then continue to apply changes
as you would otherwise.

The above sketch sounds good to me and additionally you might want to
add Asserts in streaming APIs on the subscriber-side to ensure that we
should never reach the already prepared case there. We should never
need to stream the changes when we are skipping prepare either because
the snapshot was not consistent by that time or we have already sent
those changes before restart.

So, this is the
approach I'd pick. The drawback is probably that we send extra
prepares after a restart, which might be quite common
while using test_decoding but not so common when using the pgoutput
and real world scenarios of pub/sub.

The restarts would be rare. It depends on how one uses test_decoding
module, this is primarily for testing and if you write a test in way
that it tries to perform wal decoding again and again for the same WAL
(aka simulating restarts) then probably you would see it again but
otherwise, one shouldn't see it.

--
With Regards,
Amit Kapila.

#145Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#144)
9 attachment(s)

I have rebased the v28 patch set (made necessary due to recent commit [1]).

[1]: https://github.com/postgres/postgres/commit/0926e96c493443644ba8e96b5d96d013a9ffaf64

And at the same time I have added patch 0009 to this set - This is for
the new SUBSCRIPTION option "two_phase" (0009 is still WIP but
stable).

PSA new patch set with version bumped to v29.

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v29-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch (application/octet-stream)
From 9aaf7bdd15e7ea3e9c7b4176f2a1f7e048c947d2 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 2 Dec 2020 17:25:59 +1100
Subject: [PATCH v29] Allow decoding at prepare time in ReorderBuffer.

This patch allows PREPARE-time decoding of two-phase transactions (if the
output plugin supports this capability), in which case the transactions
are replayed at PREPARE and then committed later when COMMIT PREPARED
arrives.

Now that we decode the changes before the commit, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We detect such failures with a special sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK introduced by commit 7259736a6e and stop
decoding the remaining changes. Then we rollback the changes when rollback
prepared is encountered.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, Arseny Sher, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/decode.c        | 247 +++++++++++++---
 src/backend/replication/logical/reorderbuffer.c | 358 +++++++++++++++++++-----
 src/include/replication/reorderbuffer.h         |  20 ++
 3 files changed, 523 insertions(+), 102 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..1e9522f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,13 +67,22 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool already_decoded);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool already_decoded);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 
+static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx,
+							  XLogRecordBuffer *buf, Oid dbId,
+							  RepOriginId origin_id);
+
 /*
  * Take every XLogReadRecord()ed record and perform the actions required to
  * decode it using the output plugin already setup in the logical decoding
@@ -244,6 +253,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +263,16 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction data then
+				 * DecodeCommit doesn't need to decode it again. This is
+				 * possible iff output plugin supports two-phase commits and
+				 * doesn't skip the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+					already_decoded = !(ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+
+				DecodeCommit(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +281,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +291,14 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction during prepare
+				 * then DecodeAbort needs to call rollback prepared.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+					already_decoded = !(ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+
+				DecodeAbort(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +339,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -582,10 +626,14 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool already_decoded)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -606,15 +654,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * the reorderbuffer to forget the content of the (sub-)transactions
 	 * if not.
 	 *
-	 * There can be several reasons we might not be interested in this
-	 * transaction:
-	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
-	 * 2) The transaction happened in another database.
-	 * 3) The output plugin is not interested in the origin.
-	 * 4) We are doing fast-forwarding
-	 *
 	 * We can't just use ReorderBufferAbort() here, because we need to execute
 	 * the transaction's invalidations.  This currently won't be needed if
 	 * we're just skipping over the transaction because currently we only do
@@ -627,9 +666,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * relevant syscaches.
 	 * ---
 	 */
-	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
-		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
-		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
 	{
 		for (i = 0; i < parsed->nsubxacts; i++)
 		{
@@ -640,7 +677,83 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		return;
 	}
 
-	/* tell the reorderbuffer about the surviving subtransactions */
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (already_decoded)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* tell the reorderbuffer about the surviving subtransactions */
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr, buf->endptr);
+		}
+
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ *
+ * Note that we don't skip prepare even if we have detected concurrent abort.
+ * The reason is that it is quite possible that we had already sent some
+ * changes before we detect abort in which case we need to abort those changes
+ * in the subscriber. To abort such changes, we do send the prepare and then
+ * the rollback prepared which is what happened on the publisher-side as well.
+ * Now, we can invent a new abort API wherein in such cases we send abort and
+ * skip sending prepared and rollback prepared but then it is not that
+ * straightforward because we might have streamed this transaction by that time
+ * in which case it is handled when the rollback is encountered. It is not
+ * impossible to optimize the concurrent abort case but it can introduce design
+ * complexity w.r.t handling different cases so leaving it for now as it
+ * doesn't seem worth it.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz prepare_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		prepare_time = parsed->origin_timestamp;
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeTXNNeedSkip
+	 * for the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed, removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache invalidations
+	 * if there are any for the reasons mentioned in DecodeCommit.
+	 */
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/* Tell the reorderbuffer about the surviving subtransactions. */
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
@@ -648,33 +761,70 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 prepare_time, origin_id, origin_lsn,
+						 parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool already_decoded)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz abort_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool	skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		abort_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeTXNNeedSkip
+	 * for the reasons why we sometimes want to skip the transaction.
+	 */
+	skip_xact = DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id);
+
+	/*
+	 * Send the final rollback record if the transaction data is already
+	 * decoded and we don't need to skip it, otherwise, perform the cleanup of
+	 * the transaction.
+	 */
+	if (already_decoded && !skip_xact)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									abort_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
 	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
@@ -1080,3 +1230,24 @@ DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tuple)
 	header->t_infomask2 = xlhdr.t_infomask2;
 	header->t_hoff = xlhdr.t_hoff;
 }
+
+/*
+ * Check whether we are interested in this specific transaction.
+ *
+ * There can be several reasons we might not be interested in this
+ * transaction:
+ * 1) We might not be interested in decoding transactions up to this
+ *	  LSN. This can happen because we previously decoded it and now just
+ *	  are restarting or if we haven't assembled a consistent snapshot yet.
+ * 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
+ * 4) We are doing fast-forwarding
+ */
+static bool
+DecodeTXNNeedSkip(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+				  Oid txn_dbid, RepOriginId origin_id)
+{
+	return (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+			(txn_dbid != InvalidOid && txn_dbid != ctx->slot->data.database) ||
+			ctx->fast_forward || FilterByOrigin(ctx, origin_id));
+}
\ No newline at end of file
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 301baff..d889fd1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,18 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or decoding them at PREPARE. Keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots.
+ *
+ * We additionally remove tuplecids after decoding the transaction at prepare
+ * time as we only need to perform invalidation at rollback or commit prepared.
+ *
+ * 'txn_prepared' indicates that we have decoded the transaction at prepare
+ * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1552,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1586,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1757,9 +1794,10 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * If the transaction was (partially) streamed, we need to commit it in a
- * 'streamed' way.  That is, we first stream the remaining part of the
- * transaction, and then invoke stream_commit message.
+ * If the transaction was (partially) streamed, we need to prepare or commit
+ * it in a 'streamed' way.  That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_prepare or stream_commit message as per
+ * the case.
  */
 static void
 ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1769,29 +1807,49 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		/*
+		 * Note, we send stream prepare even if a concurrent abort is detected.
+		 * See DecodePrepare for more information.
+		 */
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
+		 * just truncate txn by removing changes and tuple_cids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
  * Set xid to detect concurrent aborts.
  *
- * While streaming an in-progress transaction there is a possibility that the
- * (sub)transaction might get aborted concurrently.  In such case if the
- * (sub)transaction has catalog update then we might decode the tuple using
- * wrong catalog version.  For example, suppose there is one catalog tuple with
- * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
- * and after that we will have two tuples (xmin: 500, xmax: 501) and
- * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
- * say 502 updates the same catalog tuple then the first tuple will be changed
- * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
- * the tuple inserted/updated in 501 after the catalog update, we will see the
- * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
- * consider that the tuple is deleted by xid 502 which is not visible to our
- * snapshot.  And when we will try to decode with that catalog tuple, it can
- * lead to a wrong result or a crash.  So, it is necessary to detect
- * concurrent aborts to allow streaming of in-progress transactions.
+ * While streaming an in-progress transaction or decoding a prepared
+ * transaction there is a possibility that the (sub)transaction might get
+ * aborted concurrently.  In such case if the (sub)transaction has catalog
+ * update then we might decode the tuple using wrong catalog version.  For
+ * example, suppose there is one catalog tuple with (xmin: 500, xmax: 0).  Now,
+ * the transaction 501 updates the catalog tuple and after that we will have
+ * two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).  Now, if 501 is
+ * aborted and some other transaction say 502 updates the same catalog tuple
+ * then the first tuple will be changed to (xmin: 500, xmax: 502).  So, the
+ * problem is that when we try to decode the tuple inserted/updated in 501
+ * after the catalog update, we will see the catalog tuple with (xmin: 500,
+ * xmax: 502) as visible because it will consider that the tuple is deleted by
+ * xid 502 which is not visible to our snapshot.  And when we will try to
+ * decode with that catalog tuple, it can lead to a wrong result or a crash.
+ * So, it is necessary to detect concurrent aborts to allow streaming of
+ * in-progress transactions or decoding of prepared transactions.
  *
  * For detecting the concurrent abort we set CheckXidAlive to the current
  * (sub)transaction's xid for which this change belongs to.  And, during
@@ -1800,7 +1858,10 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * and discard the already streamed changes on such an error.  We might have
  * already streamed some of the changes for the aborted (sub)transaction, but
  * that is fine because when we decode the abort we will stream abort message
- * to truncate the changes in the subscriber.
+ * to truncate the changes in the subscriber. Similarly, for prepared
+ * transactions, we stop decoding if concurrent abort is detected and then
+ * rollback the changes when rollback prepared is encountered. See
+ * DecodePrepare.
  */
 static inline void
 SetupCheckXidLive(TransactionId xid)
@@ -1900,7 +1961,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1912,15 +1973,19 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		specinsert = NULL;
 	}
 
-	/* Stop the stream. */
-	rb->stream_stop(rb, txn, last_lsn);
-
-	/* Remember the command ID and snapshot for the streaming run */
-	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	/*
+	 * For the streaming case, stop the stream and remember the command ID and
+	 * snapshot for the streaming run.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_stop(rb, txn, last_lsn);
+		ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	}
 }
 
 /*
- * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ * Helper function for ReorderBufferReplay and ReorderBufferStreamTXN.
  *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
@@ -2006,8 +2071,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			prev_lsn = change->lsn;
 
-			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			/*
+			 * Set the current xid to detect concurrent aborts. This is
+			 * required for the cases when we decode the changes before the
+			 * COMMIT record is processed.
+			 */
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2298,7 +2367,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2332,15 +2410,22 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the four reasons:
+		 * 1. Decoding an in-progress txn.
+		 * 2. Decoding a prepared txn.
+		 * 3. Decoding of a prepared txn that was (partially) streamed.
+		 * 4. Decoding a committed txn.
+		 *
+		 * For 1, we allow truncation of txn data by removing the changes already
+		 * streamed but still keeping other things like invalidations, snapshot,
+		 * and tuplecids. For 2 and 3, we indicate ReorderBufferTruncateTXN to
+		 * do more elaborate truncation of txn data as the entire transaction has
+		 * been decoded except for commit. For 4, as the entire txn has been
+		 * decoded, we can fully clean up the TXN reorder buffer.
 		 */
-		if (streaming)
+		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
-
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2373,17 +2458,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2413,26 +2501,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
  * transaction with ReorderBufferAssignChild.
  *
- * This interface is called once a toplevel commit is read for both streamed
- * as well as non-streamed transactions.
+ * This interface is called once a prepare or toplevel commit is read for both
+ * streamed as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferReplay(ReorderBufferTXN *txn,
+					ReorderBuffer *rb, TransactionId xid,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2462,7 +2543,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	if (txn->base_snapshot == NULL)
 	{
 		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+
+		/*
+		 * Removing this txn before a commit might result in the computation
+		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
+		 */
+		if (!rbtxn_prepared(txn))
+			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
 
@@ -2474,6 +2561,116 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, so skip preparing it */
+	if (txn == NULL)
+		return true;
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+					 TimestampTz prepare_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferReplay(txn, rb, xid, prepare_lsn, end_lsn, prepare_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time, RepOriginId origin_id,
+							XLogRecPtr origin_lsn, char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	/* this txn is obviously prepared */
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	if (is_commit)
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2605,6 +2802,39 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ *
+ * Note that this is a special-purpose function for prepared transactions where
+ * we don't want to clean up the TXN even when we decide to skip it. See
+ * DecodePrepare.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index efd1957..a56338d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -174,6 +174,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +234,12 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -622,12 +629,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -645,6 +658,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+								 TimestampTz prepare_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
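
As a reading aid for the reorderbuffer changes above, here is a minimal standalone sketch in plain C of two ideas from the patch: the RBTXN_PREPARE flag test and the skip decision made by ReorderBufferPrepareNeedSkip(). It is not PostgreSQL code. SketchTXN, sketch_mark_prepared(), sketch_prepare_need_skip() and sketch_filter_nodecode() are hypothetical names, malloc() stands in for palloc(), and the "_nodecode" prefix rule is only an assumed example of what a plugin's filter callback might do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Flag value taken from the patch's reorderbuffer.h. */
#define RBTXN_PREPARE 0x0080

/* Hypothetical, heavily reduced stand-in for ReorderBufferTXN. */
typedef struct SketchTXN
{
	uint32_t	txn_flags;
	char	   *gid;
} SketchTXN;

/* Same shape as the patch's rbtxn_prepared() macro. */
#define sketch_rbtxn_prepared(txn) \
	(((txn)->txn_flags & RBTXN_PREPARE) != 0)

/* Hypothetical signature for a plugin's filter callback. */
typedef int (*sketch_filter_cb) (const SketchTXN *txn, const char *gid);

/*
 * Mark the transaction as prepared and keep a private copy of the GID.
 * Returns 0 on success, -1 if the allocation fails (palloc would error
 * out instead; malloc is an assumption of this sketch).
 */
int
sketch_mark_prepared(SketchTXN *txn, const char *gid)
{
	size_t		len;
	char	   *copy;

	assert(txn != NULL && gid != NULL);

	len = strlen(gid) + 1;		/* include the trailing '\0' */
	copy = malloc(len);
	if (copy == NULL)
		return -1;
	memcpy(copy, gid, len);

	free(txn->gid);
	txn->gid = copy;
	txn->txn_flags |= RBTXN_PREPARE;
	return 0;
}

/*
 * Mirrors the control flow of ReorderBufferPrepareNeedSkip(): an unknown
 * transaction is always skipped, otherwise the plugin's filter decides.
 */
int
sketch_prepare_need_skip(const SketchTXN *txn, const char *gid,
						 sketch_filter_cb filter)
{
	if (txn == NULL)
		return 1;
	return filter(txn, gid);
}

/* Assumed example policy: skip GIDs that start with "_nodecode". */
int
sketch_filter_nodecode(const SketchTXN *txn, const char *gid)
{
	(void) txn;
	return strncmp(gid, "_nodecode", 9) == 0;
}
```

With these definitions, a NULL transaction or a GID such as "_nodecode_1" is skipped at PREPARE time, while "test_prepared_tab" is not skipped and would go on to be marked with sketch_mark_prepared().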

v29-0006-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 548f21a0b1a601922cdda44e4e86f4883ffda2c4 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 2 Dec 2020 19:27:41 +1100
Subject: [PATCH v29] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.
---
 src/backend/access/transam/twophase.c           |  33 +++-
 src/backend/replication/logical/proto.c         | 141 +++++++++++++-
 src/backend/replication/logical/reorderbuffer.c |   6 +
 src/backend/replication/logical/worker.c        | 237 ++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c     |  74 ++++++++
 src/include/access/twophase.h                   |   1 +
 src/include/replication/logicalproto.h          |  38 +++-
 src/include/replication/reorderbuffer.h         |  14 ++
 src/tools/pgindent/typedefs.list                |   1 +
 9 files changed, 540 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9b..00b4497 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if a valid prepared transaction with the given GID exists.
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..cfb94d1 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,145 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported. [COMMIT|ROLLBACK]
+	 * PREPARED uses the non-streaming APIs.
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2753db9..5673959 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2665,9 +2665,15 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	strcpy(txn->gid, gid);
 
 	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
 		rb->commit_prepared(rb, txn, commit_lsn);
+	}
 	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
 		rb->rollback_prepared(rb, txn, commit_lsn);
+	}
 
 	/* cleanup: make sure there's no cache pollution */
 	ReorderBufferExecuteInvalidations(txn->ninvalidations,
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a4ec883..2416b85 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -722,6 +723,234 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * A prepared transaction may be aborted concurrently while it is being
+	 * decoded. In that case, decoding stops and the transaction is aborted
+	 * immediately, but the ROLLBACK PREPARED message still reaches the
+	 * subscriber. It is therefore OK for the GID to be missing here, so
+	 * only finish the prepared transaction if it actually exists.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare_txn (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1979,6 +2208,14 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..71ac431 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +151,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +165,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +392,48 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +913,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..c04d872 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,12 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +117,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +125,28 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, COMMIT PREPARED or ROLLBACK PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +154,10 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +201,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a56338d..0a00429 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
 #define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -240,6 +242,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7828d8e..5f97c01 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,6 +1341,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1
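
For anyone reviewing the protocol part of the pgoutput patch above, here is a rough standalone sketch of the byte layout that logicalrep_write_prepare() appears to produce: a 'P' type byte, one flags byte, three 64-bit fields (prepare LSN, end LSN, timestamp) and then the GID. It assumes that pq_sendint64() emits network byte order and that pq_sendstring() emits a NUL-terminated string. All sketch_* names and SKETCH_GIDSIZE are hypothetical; the value 200 is my assumption for GIDSIZE. Unlike logicalrep_read_prepare(), the reader here consumes the leading type byte itself and rejects a GID that does not fit the destination buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flag values taken from the patch's logicalproto.h. */
#define LOGICALREP_IS_PREPARE			0x01
#define LOGICALREP_IS_COMMIT_PREPARED	0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04

/* Assumption: stand-in for GIDSIZE. */
#define SKETCH_GIDSIZE 200

/* Hypothetical mirror of LogicalRepPrepareData. */
typedef struct SketchPrepareData
{
	uint8_t		prepare_type;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		preparetime;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepareData;

/* Exactly one of the three flag values is allowed. */
int
sketch_flags_are_valid(uint8_t flags)
{
	return flags == LOGICALREP_IS_PREPARE ||
		flags == LOGICALREP_IS_COMMIT_PREPARED ||
		flags == LOGICALREP_IS_ROLLBACK_PREPARED;
}

/* Store a 64-bit value big-endian at buf[off]; return the new offset. */
size_t
sketch_put_u64(unsigned char *buf, size_t off, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[off + i] = (unsigned char) (v >> (56 - 8 * i));
	return off + 8;
}

/* Load a big-endian 64-bit value from buf[*off] and advance *off. */
uint64_t
sketch_get_u64(const unsigned char *buf, size_t *off)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[*off + i];
	*off += 8;
	return v;
}

/* Encode one message; returns bytes written, or 0 if buf is too small. */
size_t
sketch_write_prepare(unsigned char *buf, size_t cap, uint8_t flags,
					 uint64_t prepare_lsn, uint64_t end_lsn,
					 int64_t prepare_time, const char *gid)
{
	size_t		gidlen = strlen(gid) + 1;	/* include trailing '\0' */
	size_t		off = 0;

	assert(sketch_flags_are_valid(flags));
	if (cap < 2 + 3 * 8 + gidlen)
		return 0;

	buf[off++] = 'P';
	buf[off++] = flags;
	off = sketch_put_u64(buf, off, prepare_lsn);
	off = sketch_put_u64(buf, off, end_lsn);
	off = sketch_put_u64(buf, off, (uint64_t) prepare_time);
	memcpy(buf + off, gid, gidlen);
	return off + gidlen;
}

/* Decode one message; returns 0 on success, -1 on malformed input. */
int
sketch_read_prepare(const unsigned char *buf, size_t len,
					SketchPrepareData *out)
{
	size_t		off = 0;
	size_t		gidlen;
	const unsigned char *nul;

	if (len < 2 + 3 * 8 + 1 || buf[off++] != 'P')
		return -1;

	out->prepare_type = buf[off++];
	if (!sketch_flags_are_valid(out->prepare_type))
		return -1;

	out->prepare_lsn = sketch_get_u64(buf, &off);
	out->end_lsn = sketch_get_u64(buf, &off);
	out->preparetime = (int64_t) sketch_get_u64(buf, &off);

	/* The GID must be NUL-terminated and must fit the fixed buffer. */
	nul = memchr(buf + off, '\0', len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off));
	if (gidlen >= sizeof(out->gid))
		return -1;
	memcpy(out->gid, buf + off, gidlen + 1);
	return 0;
}
```

Because the three message kinds share one layout and differ only in the flags byte, a single reader can serve PREPARE, COMMIT PREPARED and ROLLBACK PREPARED, which seems to be why apply_handle_prepare() switches on prepare_type after reading.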

v29-0004-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From 07de7c697dca23ed06c8c79e70da1a56bddf2f15 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 2 Dec 2020 17:52:15 +1100
Subject: [PATCH v29] Support 2PC txn tests for concurrent aborts.

Add tap tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  58 ++++++++++
 src/backend/replication/logical/reorderbuffer.c   |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2c4acdc..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected pattern did not appear in the log within 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected pattern did not appear in the log within 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 429a07c..541dc11 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -171,6 +174,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -272,6 +276,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -450,6 +472,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -599,6 +645,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -685,6 +734,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -897,6 +949,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -950,6 +1005,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d889fd1..2753db9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2480,6 +2480,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
1.8.3.1
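
The `check-xid-aborted` handling in pg_decode_startup() above converts the option string with strtoul() and a NULL end pointer, then casts the result to TransactionId. As a rough illustration only, here is a standalone sketch of the same parsing step with somewhat stricter checks (trailing characters and values wider than 32 bits are rejected). It is not part of the patch: `xid_t`, `INVALID_XID` and `parse_xid_option` are hypothetical names, used here as stand-ins for TransactionId and InvalidTransactionId so that the snippet compiles without PostgreSQL headers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for TransactionId / InvalidTransactionId. */
typedef uint32_t xid_t;
#define INVALID_XID ((xid_t) 0)

/*
 * Hypothetical helper: parse an option value such as "735" into an xid.
 * Returns true and stores the xid in *out on success. Returns false if
 * the string is NULL or empty, has trailing characters, does not fit in
 * 32 bits, or names the invalid xid 0.
 */
static bool
parse_xid_option(const char *value, xid_t *out)
{
	char	   *end = NULL;
	unsigned long parsed;

	assert(out != NULL);

	if (value == NULL || *value == '\0')
		return false;

	errno = 0;
	parsed = strtoul(value, &end, 0);	/* base 0, as in the patch */

	/* Reject range errors, "no digits", and trailing garbage. */
	if (errno != 0 || end == value || *end != '\0')
		return false;

	/* A plain cast would silently truncate values wider than 32 bits. */
	if (parsed > UINT32_MAX)
		return false;

	if ((xid_t) parsed == INVALID_XID)
		return false;

	*out = (xid_t) parsed;
	return true;
}
```

With this sketch, parse_xid_option("735", &xid) succeeds and sets xid to 735, while "0", "735abc" and an empty string are all rejected. Whether the test plugin needs this level of strictness is a judgment call, since the option only ever receives values produced by the TAP tests.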

v29-0005-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From d254d5526512d699c724c1a5af57145be8028ac3 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 2 Dec 2020 18:55:47 +1100
Subject: [PATCH v29] Support 2PC txn - spoolfile.

This patch is a pure refactoring: it moves the streaming spool-file processing into a separate function.
Two-phase commit logic added later will need to call this common processing from multiple places.
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 8c7fad8..a4ec883 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -924,30 +926,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -955,7 +948,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -970,7 +963,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1045,6 +1038,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v29-0003-Support-2PC-txn-tests-for-test_decoding.patch (application/octet-stream)
From c955f35d84ad1830b43489c59fe04c3d6f63734d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 2 Dec 2020 17:36:45 +1100
Subject: [PATCH v29] Support 2PC txn tests for test_decoding.

Add sql tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                     |   2 +-
 contrib/test_decoding/expected/two_phase.out       | 228 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 177 ++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 +++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 5 files changed, 588 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..2c4acdc 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,7 +4,7 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..e5e34b4
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,228 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#1'
+(1 row)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+               data                
+-----------------------------------
+ COMMIT PREPARED 'test_prepared#3'
+(1 row)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                 data                 
+--------------------------------------
+ COMMIT PREPARED 'test_prepared_lock'
+(1 row)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                   data                    
+-------------------------------------------
+ COMMIT PREPARED 'test_prepared_savepoint'
+(1 row)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..957c198
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,177 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+          data           
+-------------------------
+ COMMIT PREPARED 'test1'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
-- 
1.8.3.1

v29-0007-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 84549eadf44ab47fd7a04734db79704146a0b3cf Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 2 Dec 2020 20:35:08 +1100
Subject: [PATCH v29] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are rolled back on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test (streaming)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After the servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After the servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction.
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (the delete at step 3 does nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming enabled.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v29-0008-Support-2PC-documentation.patch (application/octet-stream)
From 8433e52ba2952f776ed85919b5956168d094470c Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 2 Dec 2020 20:41:20 +1100
Subject: [PATCH v29] Support-2PC-documentation.

Add documentation about two-phase commit support in Logical Decoding.
---
 doc/src/sgml/logicaldecoding.sgml | 97 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 96 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 73673a0..fcc43fd 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -165,7 +165,57 @@ COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
-  </sect1>
+
+  <para>
+  The following example shows how logical decoding can be used to handle transactions
+  that use a two-phase commit. Before you use two-phase commit commands, you must set
+  <varname>max_prepared_transactions</varname> to at least 1. You must also set the
+  option 'two-phase-commit' to 1 when calling <function>pg_logical_slot_get_changes</function>.
+  </para>
+<programlisting>
+postgres=# BEGIN;
+postgres=*# INSERT INTO data(data) VALUES('5');
+postgres=*# PREPARE TRANSACTION 'test_prepared1';
+
+postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/1689DC0 | 529 | BEGIN 529
+ 0/1689DC0 | 529 | table public.data: INSERT: id[integer]:3 data[text]:'5'
+ 0/1689FC0 | 529 | PREPARE TRANSACTION 'test_prepared1', txid 529
+(3 rows)
+
+postgres=# COMMIT PREPARED 'test_prepared1';
+COMMIT PREPARED
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                    data                    
+-----------+-----+--------------------------------------------
+ 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
+(1 row)
+
+postgres=# -- you can also roll back a prepared transaction
+postgres=# BEGIN;
+BEGIN
+postgres=*# INSERT INTO data(data) VALUES('6');INSERT 0 1
+postgres=*# PREPARE TRANSACTION 'test_prepared2';PREPARE TRANSACTION
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/168A180 | 530 | BEGIN 530
+ 0/168A1E8 | 530 | table public.data: INSERT: id[integer]:4 data[text]:'6'
+ 0/168A430 | 530 | PREPARE TRANSACTION 'test_prepared2', txid 530
+(3 rows)
+
+postgres=# ROLLBACK PREPARED 'test_prepared1';ERROR:  prepared transaction with identifier "test_prepared1" does not exist
+postgres=# ROLLBACK PREPARED 'test_prepared2';
+ROLLBACK PREPARED
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                     data                     
+-----------+-----+----------------------------------------------
+ 0/168A4B8 | 530 | ROLLBACK PREPARED 'test_prepared2', txid 530
+(1 row)
+</programlisting>
+</sect1>
 
   <sect1 id="logicaldecoding-explanation">
    <title>Logical Decoding Concepts</title>
@@ -1103,4 +1153,49 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
    </para>
 
   </sect1>
+
+  <sect1 id="logicaldecoding-two-phase-commits">
+   <title>Two-phase commit support for Logical Decoding</title>
+
+   <para>
+   With the basic output plugin callbacks (e.g., <function>begin_cb</function>,
+   <function>change_cb</function>, <function>commit_cb</function> and
+   <function>message_cb</function>), two-phase commit commands like
+   <command>PREPARE TRANSACTION</command>, <command>COMMIT PREPARED</command>
+   and <command>ROLLBACK PREPARED</command> are not decoded correctly.
+   <command>PREPARE TRANSACTION</command> is ignored, while
+   <command>COMMIT PREPARED</command> is decoded as a <command>COMMIT</command> and
+   <command>ROLLBACK PREPARED</command> is decoded as a <command>ROLLBACK</command>.
+   </para>
+
+   <para>
+   An output plugin may provide additional callbacks to support two-phase commit commands.
+   Four of these callbacks are required
+   (<function>prepare_cb</function>, <function>commit_prepared_cb</function>,
+   <function>rollback_prepared_cb</function> and <function>stream_prepare_cb</function>)
+   and one is optional (<function>filter_prepare_cb</function>).
+   </para>
+
+   <para>
+   If the output plugin callbacks for decoding two-phase commit commands are provided,
+   then on <command>PREPARE TRANSACTION</command>, the changes of that transaction are
+   decoded, passed to the output plugin and the <function>prepare_cb</function>
+   callback is invoked. This differs from the basic decoding setup where changes are
+   only passed to the output plugin when a transaction is committed.
+   </para>
+
+   <para>
+   When a prepared transaction is rolled back using <command>ROLLBACK PREPARED</command>,
+   the <function>rollback_prepared_cb</function> callback is invoked; when the
+   prepared transaction is committed using <command>COMMIT PREPARED</command>,
+   the <function>commit_prepared_cb</function> callback is invoked.
+   </para>
+
+   <para>
+   Optionally, the output plugin can specify a name pattern in
+   <function>filter_prepare_cb</function>; transactions whose GID contains
+   that name pattern will not be decoded as two-phase commit transactions.
+   </para>
+
+  </sect1>
  </chapter>
-- 
1.8.3.1

v29-0009-Support-2PC-txn-WIP-Subscription-option.patch (application/octet-stream)
From c42763b9112d8060cec0a03405eadb0d73eafbed Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 2 Dec 2020 22:00:34 +1100
Subject: [PATCH v29] Support 2PC txn - WIP Subscription option.

This is a WIP patch for new SUBSCRIPTION option "two_phase".
The catalog and syntax changes are done, including syntax tests, but
otherwise the feature is not yet implemented.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 13 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/catversion.h                   |  2 +-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 15 files changed, 156 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index db5e59f..dbe2a43 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -166,8 +166,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..0d233f4e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,19 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION. When two-phase commit is not
+          enabled, PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED are not
+          decoded on the publisher.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index ca78d39..886839e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -67,6 +67,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c21..5f4e191 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1149,7 +1149,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 1696454..b0745d5 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -64,7 +64,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -105,6 +106,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -210,6 +216,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -355,6 +370,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -379,7 +396,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -447,6 +465,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -720,6 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -730,7 +751,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -769,6 +791,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -787,7 +816,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -832,7 +862,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -875,7 +906,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 24f8b3e..1f404cd 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index dc1d41d..97f0dd5 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4221,6 +4221,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4264,9 +4265,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4287,6 +4293,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4312,6 +4319,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4380,6 +4389,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 317bb83..22e4e6c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -629,6 +629,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 14150d0..47306a2 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5997,7 +5997,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6023,13 +6023,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index a16cc38..bc1d8fd 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202011251
+#define CATALOG_VERSION_NO	202011271
 
 #endif
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3fa02af..e07eed0 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -53,6 +53,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -90,6 +92,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index c04d872..4b0afac 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39..f96c891 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 2fa9bce..23d876e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,42 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 14fa0b2..2a0b366 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -147,6 +147,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
-- 
1.8.3.1
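
For anyone reading the test_decoding changes in the next attachment (v29-0001): its filter_prepare callback skips decoding at prepare time when the GID contains the substring "_nodecode". Below is a minimal standalone sketch of just that check, so it can be tried outside the server. The helper name gid_is_filtered is hypothetical and not part of the patch; only the strstr() test is taken from pg_decode_filter_prepare().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical standalone helper (not part of the patch): returns true when
 * a prepared transaction with this GID would be filtered out at PREPARE
 * time, using the same substring check as pg_decode_filter_prepare().
 */
static bool
gid_is_filtered(const char *gid)
{
	/* The sketch assumes a valid, NUL-terminated GID. */
	assert(gid != NULL);

	return strstr(gid, "_nodecode") != NULL;
}
```

With this check, an illustrative GID such as 'test_prepared_nodecode' is filtered, while 'test_prepared1' is decoded at prepare time. Both GIDs are made-up examples.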

Attachment: v29-0001-Extend-the-output-plugin-API-to-allow-decoding-p.patch (application/octet-stream)
From d9c0ae99f60f4290f9e8dd2f8b0283d40e5d0898 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 2 Dec 2020 17:14:23 +1100
Subject: [PATCH v29] Extend the output plugin API to allow decoding prepared
 xacts.

This adds four methods to the output plugin API, adding support for
streaming changes of two-phase transactions at prepare time.

* prepare
* commit_prepared
* rollback_prepared
* stream_prepare

Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction is not yet committed and may be
aborted later.

Until now two-phase transactions were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the
two-phase commands were communicated to the subscriber.

This patch provides the infrastructure for logical decoding plugins to be
informed of two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

This also extends the 'test_decoding' plugin, implementing these new
methods.

This commit simply adds these new APIs and the upcoming patch to "allow
the decoding at prepare time in ReorderBuffer" will use these APIs.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c     | 146 +++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 149 ++++++++++++++++-
 src/backend/replication/logical/logical.c | 257 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  50 ++++++
 src/include/replication/reorderbuffer.h   |  38 +++++
 src/tools/pgindent/typedefs.list          |  11 ++
 7 files changed, 649 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278b..429a07c 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -76,6 +76,18 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 static void pg_decode_stream_start(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn);
 static void pg_output_stream_start(LogicalDecodingContext *ctx,
@@ -87,6 +99,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -123,9 +138,14 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
@@ -141,6 +161,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -241,6 +262,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +283,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +352,93 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate a
+ * simple logic by checking the GID. If the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -702,6 +821,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..73673a0 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,9 +389,14 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +418,19 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits,
+    which allows actions to be decoded at <command>PREPARE TRANSACTION</command> time.
+    The <function>prepare_cb</function>, <function>stream_prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +491,15 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too. We will skip all the changes of such a transaction once
+     the abort is detected and abort the transaction when we read WAL for
+     <command>ROLLBACK PREPARED</command>.
     </para>
 
     <note>
@@ -587,7 +609,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but not yet
+      committed) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -685,7 +713,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, in case of decoding a prepared (but not yet committed)
+      transaction or decoding of an uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -698,6 +732,90 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decoding
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents as for the
+      other callbacks. The <parameter>txn</parameter> parameter contains meta
+      information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some
+      cases. The <parameter>gid</parameter> is the identifier that later
+      identifies this transaction for <command>COMMIT PREPARED</command> or
+      <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+     </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callback for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr prepare_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called
+      whenever a transaction <command>COMMIT PREPARED</command> has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                      ReorderBufferTXN *txn,
+                                                      XLogRecPtr commit_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called
+      whenever a transaction <command>ROLLBACK PREPARED</command> has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                        ReorderBufferTXN *txn,
+                                                        XLogRecPtr rollback_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-start">
      <title>Stream Start Callback</title>
      <para>
@@ -735,6 +853,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1044,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, committed using the
+    <function>commit_prepared_cb</function> callback, or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f1f4df7..1a6e844 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,14 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +82,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -227,6 +237,32 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require
+	 * prepare/commit-prepare/abort-prepare callbacks. The filter_prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
+	 * Callback to support decoding at prepare time.
+	 *
+	 * filter_prepare is optional, so we do not fail with ERROR when missing,
+	 * but the wrappers simply do nothing.
+	 */
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +273,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +820,135 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of rollback record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then rollback prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1025,54 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case, all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1271,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..14e6105 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,40 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
+
+/*
  * Called when starting to stream a block of changes from in-progress
  * transaction (may be called repeatedly, if it's streamed in multiple
  * chunks).
@@ -124,6 +158,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -173,10 +215,18 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+
+	/* streaming of changes at prepare time */
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
+
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7e..efd1957 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -244,6 +244,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -418,6 +421,26 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* start streaming transaction callback signature */
 typedef void (*ReorderBufferStreamStartCB) (
 											ReorderBuffer *rb,
@@ -436,6 +459,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -505,11 +534,20 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction at prepare time.
+	 */
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
+
+	/*
 	 * Callbacks to be called when streaming a transaction.
 	 */
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 04464c2..7828d8e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1315,9 +1315,20 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1
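
For readers skimming the logical.c changes in the patch above, the way
two-phase decoding gets enabled and the way the filter_prepare wrapper
answers can be modelled roughly as below. This is only a standalone
sketch, not PostgreSQL code: ModelCallbacks, ModelContext,
model_startup, model_filter_prepare, model_noop and
model_filter_by_gid_prefix are hypothetical names, the callbacks are
reduced to plain function pointers, and the "skip_" GID prefix is an
invented filtering rule used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the output plugin callback slots. */
typedef bool (*ModelFilterPrepareCB) (unsigned int xid, const char *gid);
typedef void (*ModelVoidCB) (void);

typedef struct ModelCallbacks
{
	ModelFilterPrepareCB filter_prepare_cb;	/* optional */
	ModelVoidCB prepare_cb;
	ModelVoidCB commit_prepared_cb;
	ModelVoidCB rollback_prepared_cb;
	ModelVoidCB stream_prepare_cb;
} ModelCallbacks;

typedef struct ModelContext
{
	ModelCallbacks callbacks;
	bool		twophase;
} ModelContext;

/*
 * Two-phase decoding is switched on as soon as any one of the related
 * callbacks is defined, so that missing ones can be reported later.
 */
void
model_startup(ModelContext *ctx)
{
	ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
		(ctx->callbacks.commit_prepared_cb != NULL) ||
		(ctx->callbacks.rollback_prepared_cb != NULL) ||
		(ctx->callbacks.stream_prepare_cb != NULL) ||
		(ctx->callbacks.filter_prepare_cb != NULL);
}

/*
 * Returns true when the transaction should NOT be decoded at PREPARE
 * time, i.e. it is left to be sent as a regular transaction at
 * COMMIT PREPARED.
 */
bool
model_filter_prepare(const ModelContext *ctx, unsigned int xid, const char *gid)
{
	/* Two-phase decoding disabled: everything is filtered out. */
	if (!ctx->twophase)
		return true;

	/* No filter supplied: every prepared transaction goes through. */
	if (ctx->callbacks.filter_prepare_cb == NULL)
		return false;

	return ctx->callbacks.filter_prepare_cb(xid, gid);
}

/* Placeholder callback that does nothing. */
void
model_noop(void)
{
}

/* Invented rule: skip any GID that starts with "skip_". */
bool
model_filter_by_gid_prefix(unsigned int xid, const char *gid)
{
	(void) xid;
	return strncmp(gid, "skip_", 5) == 0;
}
```

Note that the invented filter depends only on its arguments, so it gives
the same answer every time for a given xid and gid, which is what the
documentation in the patch asks of filter_prepare_cb.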

#146Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#145)

On Wed, Dec 2, 2020 at 8:24 PM Peter Smith <smithpb2250@gmail.com> wrote:

I have rebased the v28 patch set (made necessary due to recent commit [1])
[1] https://github.com/postgres/postgres/commit/0926e96c493443644ba8e96b5d96d013a9ffaf64

And at the same time I have added patch 0009 to this set - This is for
the new SUBSCRIPTION option "two_phase" (0009 is still WIP but
stable).

PSA new patch set with version bumped to v29.

Thank you for updating the patch!

While looking at the patch set I found that the tests in
src/test/subscription don't work with this patch. I got the following
error:

2020-12-03 15:18:12.666 JST [44771] tap_sub ERROR: unrecognized
pgoutput option: two_phase
2020-12-03 15:18:12.666 JST [44771] tap_sub CONTEXT: slot "tap_sub",
output plugin "pgoutput", in the startup callback
2020-12-03 15:18:12.666 JST [44771] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
two_phase 'on', publication_names '"tap_pub","tap_pub_ins_only"')

In v29-0009 patch "two_phase" option is added on the subscription side
(i.e., libpqwalreceiver) but it seems not on the publisher side
(pgoutput).
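
The failure mode can be illustrated with a toy option parser. This is a
sketch under assumptions, not the pgoutput code: ModelOutputOptions,
model_parse_bool and model_parse_option are hypothetical names, and the
publisher_knows_two_phase flag merely stands in for "the publisher side
has been taught about the option".

```c
#include <stdbool.h>
#include <string.h>

typedef struct ModelOutputOptions
{
	bool		streaming;
	bool		two_phase;
} ModelOutputOptions;

/* Hypothetical helper: accepts only 'on' or 'off'. */
static bool
model_parse_bool(const char *value, bool *result)
{
	if (strcmp(value, "on") == 0)
	{
		*result = true;
		return true;
	}
	if (strcmp(value, "off") == 0)
	{
		*result = false;
		return true;
	}
	return false;
}

/*
 * Returns false for an option the publisher does not recognize; the
 * real startup callback would raise an ERROR at that point instead.
 */
bool
model_parse_option(ModelOutputOptions *opts, const char *name,
				   const char *value, bool publisher_knows_two_phase)
{
	if (strcmp(name, "streaming") == 0)
		return model_parse_bool(value, &opts->streaming);

	if (publisher_knows_two_phase && strcmp(name, "two_phase") == 0)
		return model_parse_bool(value, &opts->two_phase);

	return false;				/* unrecognized option */
}
```

With publisher_knows_two_phase set to false, a subscriber that sends
two_phase 'on' is rejected, which is the shape of the error in the log
above.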

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#147Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#146)

On Thu, Dec 3, 2020 at 5:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

While looking at the patch set I found that the tests in
src/test/subscription don't work with this patch. I got the following
error:

2020-12-03 15:18:12.666 JST [44771] tap_sub ERROR: unrecognized
pgoutput option: two_phase
2020-12-03 15:18:12.666 JST [44771] tap_sub CONTEXT: slot "tap_sub",
output plugin "pgoutput", in the startup callback
2020-12-03 15:18:12.666 JST [44771] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2',
two_phase 'on', publication_names '"tap_pub","tap_pub_ins_only"')

In v29-0009 patch "two_phase" option is added on the subscription side
(i.e., libpqwalreceiver) but it seems not on the publisher side
(pgoutput).

The v29-0009 patch is still a WIP for a new SUBSCRIPTION "two_phase"
option so it is not yet fully implemented. I did run the following prior
to uploading but somehow did not see those failures yesterday:
cd src/test/subscription
make check

Anyway, as 0009 is the last of the set please just git apply
--reverse that one if it is causing a problem.

Sorry for any inconvenience. I will add the missing functionality to
0009 as soon as I can.

Kind Regards,
Peter Smith.
Fujitsu Australia

#148Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#147)
1 attachment(s)

On Thu, Dec 3, 2020 at 6:21 PM Peter Smith <smithpb2250@gmail.com> wrote:

Sorry for any inconvenience. I will add the missing functionality to
0009 as soon as I can.

PSA a **replacement** patch for the previous v29-0009.

This should correct the recently reported trouble [1].
[1] /messages/by-id/CAD21AoBnZ6dYffVjOCdSvSohR_1ZNedqmb=6P9w_H6W0bK1s6g@mail.gmail.com

I observed after this patch:
make check is all OK.
cd src/test/subscription, then make check is all OK.

~

Note that the tablesync worker's (temporary) slot always uses
two_phase *off*, regardless of the user setting.

e.g.

CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub
application_name=tap_sub' PUBLICATION tap_pub WITH (streaming = on,
two_phase = on);

will show in the logs that only the apply worker's slot enables two_phase.

STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
(proto_version '2', streaming 'on', two_phase 'on', publication_names
'"tap_pub"')
STATEMENT: START_REPLICATION SLOT "tap_sub_16395_sync_16385" LOGICAL
0/16076D8 (proto_version '2', streaming 'on', publication_names
'"tap_pub"')
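
The difference between the two statements above can be sketched as
follows. This is an illustrative model only, not the libpqwalreceiver
code: model_build_options and its parameters are hypothetical, and the
only behaviour taken from this mail is that the tablesync slot never
requests two_phase, whatever the user asked for.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Build the option list of a START_REPLICATION command into buf.
 * Returns the number of characters snprintf() would have written.
 */
int
model_build_options(char *buf, size_t buflen, bool streaming,
					bool two_phase, bool is_tablesync,
					const char *publication_names)
{
	/* The tablesync slot always runs with two_phase off. */
	bool		send_two_phase = two_phase && !is_tablesync;

	return snprintf(buf, buflen,
					"proto_version '2'%s%s, publication_names '%s'",
					streaming ? ", streaming 'on'" : "",
					send_two_phase ? ", two_phase 'on'" : "",
					publication_names);
}
```

Calling it with is_tablesync set to false and then true, with the other
arguments unchanged, yields the two option lists seen in the log lines.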

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v29-0009-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 5e6681da48ecf045b2227502de5581bf65cc6f8a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 4 Dec 2020 11:15:04 +1100
Subject: [PATCH v29] Support 2PC txn - Subscription option.

This patch implements new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.

Note: The tablesync worker slot always has two_phase disabled, regardless of the option.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 13 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/catversion.h                   |  2 +-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 21 files changed, 201 insertions(+), 52 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index db5e59f..dbe2a43 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -166,8 +166,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..0d233f4e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,19 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. When two-phase commit is
+          not enabled, PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED are not
+          decoded on the publisher.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index ca78d39..886839e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -67,6 +67,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c21..5f4e191 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1149,7 +1149,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 1696454..b0745d5 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -64,7 +64,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -105,6 +106,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -210,6 +216,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -355,6 +370,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -379,7 +396,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -447,6 +465,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -720,6 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -730,7 +751,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -769,6 +791,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -787,7 +816,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -832,7 +862,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -875,7 +906,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 24f8b3e..1f404cd 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2416b85..7a6c594 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2702,6 +2702,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		(!am_tablesync_worker() && newsub->twophase != MySubscription->twophase) ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3348,6 +3349,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase && !am_tablesync_worker();
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 71ac431..7651a62 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -172,13 +172,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -246,6 +248,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -259,6 +271,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -283,7 +296,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -324,6 +338,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default,
+		 * in which case we just clear the flag in the decoding context.
+		 * Otherwise we allow it only with a sufficient protocol version, and
+		 * only when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index dc1d41d..97f0dd5 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4221,6 +4221,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4264,9 +4265,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4287,6 +4293,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4312,6 +4319,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4380,6 +4389,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 317bb83..22e4e6c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -629,6 +629,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 14150d0..47306a2 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5997,7 +5997,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6023,13 +6023,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index a16cc38..bc1d8fd 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202011251
+#define CATALOG_VERSION_NO	202011271
 
 #endif
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3fa02af..e07eed0 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -53,6 +53,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -90,6 +92,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index c04d872..4b0afac 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39..f96c891 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 2fa9bce..23d876e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,42 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 14fa0b2..2a0b366 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -147,6 +147,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

#149Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#142)
9 attachment(s)

On Tue, Dec 1, 2020 at 6:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

To skip it, we need to send GID in begin message and then on
subscriber-side, check if the prepared xact already exists, if so then
set a flag. The flag needs to be set in begin/start_stream and reset
in stop_stream/commit/abort. Using the flag, we can skip the entire
contents of the prepared xact. In ReorderBuffer-side also, we need to
get and set GID in txn even when we skip it because we need to send
the same at commit time. In this solution, we won't be able to send it
during normal start_stream because by that time we won't know GID and
I think that won't be required. Note that this is only required when
we skipped sending prepare, otherwise, we just need to send
Commit-Prepared at commit time.
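
The flag handling described above could be sketched roughly as follows.
This is only a minimal, standalone illustration and not code from the
patch: the in-memory GID table and the names prepared_xact_exists(),
handle_begin(), handle_change(), handle_prepare() and skip_prepared_txn
are all hypothetical. It also assumes that the prepare message ends the
contents of the transaction, so the flag is reset there in the same way
it would be at stop_stream/commit/abort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PREPARED 8
#define GID_LEN 64

/* Hypothetical stand-in for the subscriber's set of prepared xacts. */
static char prepared_gids[MAX_PREPARED][GID_LEN];
static int	n_prepared = 0;

/* Set at begin/start_stream, reset at stop_stream/commit/abort. */
static bool skip_prepared_txn = false;

/* Does a prepared transaction with this GID already exist locally? */
static bool
prepared_xact_exists(const char *gid)
{
	for (int i = 0; i < n_prepared; i++)
		if (strcmp(prepared_gids[i], gid) == 0)
			return true;
	return false;
}

/* Begin message carrying the GID: decide whether to skip this xact. */
static void
handle_begin(const char *gid)
{
	assert(gid != NULL);
	skip_prepared_txn = prepared_xact_exists(gid);
}

/* A change inside the xact; returns true if applied, false if skipped. */
static bool
handle_change(int *applied_changes)
{
	if (skip_prepared_txn)
		return false;
	(*applied_changes)++;
	return true;
}

/*
 * Prepare message: remember the GID unless this was a duplicate, then
 * reset the flag (assumption: prepare ends the xact's contents).
 * Returns true if the prepare was applied, false if it was skipped.
 */
static bool
handle_prepare(const char *gid)
{
	bool		skipped = skip_prepared_txn;

	if (!skipped && n_prepared < MAX_PREPARED)
	{
		strncpy(prepared_gids[n_prepared], gid, GID_LEN - 1);
		prepared_gids[n_prepared][GID_LEN - 1] = '\0';
		n_prepared++;
	}
	skip_prepared_txn = false;
	return !skipped;
}
```

With this sketch, replaying begin/change/prepare for the same GID a
second time (as could happen after a restart) skips both the changes and
the prepare, while the first delivery is applied normally.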

I have implemented these changes and tested the fix using the test
setup I had shared above and it seems to be working fine.
I have also tested restarts that simulate duplicate prepares being
sent by the publisher and verified that they are handled correctly by
the subscriber.
Do have a look at the changes and let me know if you have any comments.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v30-0003-Support-2PC-txn-tests-for-test_decoding.patch (application/octet-stream)
From acbbd9991b837f40a32a01d22cbe92988b1a72fb Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 01:42:59 -0500
Subject: [PATCH v30] Support 2PC txn tests for test_decoding.

Add sql tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                     |   2 +-
 contrib/test_decoding/expected/two_phase.out       | 242 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 199 +++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 ++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 5 files changed, 624 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..2c4acdc 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,7 +4,7 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..9d29e6e
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,242 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+ COMMIT PREPARED 'test_prepared#1'
+(5 rows)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+ COMMIT PREPARED 'test_prepared#3'
+(4 rows)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ COMMIT PREPARED 'test_prepared_lock'
+(5 rows)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+ COMMIT PREPARED 'test_prepared_savepoint'
+(4 rows)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..a21fbc7
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,199 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ PREPARE TRANSACTION 'test1'
+ COMMIT PREPARED 'test1'
+(23 rows)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
-- 
1.8.3.1

v30-0005-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From 3e15403643a033b4ae2dedfc8f0c94de08515dbd Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:19:27 -0500
Subject: [PATCH v30] Support 2PC txn - spoolfile.

This patch only refactors to isolate the streaming spool-file processing to a separate function.
Later, two-phase commit logic will require this common processing to be called from multiple places.
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 8c7fad8..a4ec883 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -924,30 +926,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -955,7 +948,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -970,7 +963,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1045,6 +1038,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1
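
The shape of this refactoring can be sketched in a self-contained form. Everything below is hypothetical: the Spool type, replay_spool(), and both handlers are stand-ins rather than worker.c code. The sketch only illustrates a single replay routine that returns the number of applied changes and is shared by a commit path and a prepare path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a spool file: a list of queued changes. */
typedef struct Spool
{
	const int  *changes;
	size_t		nchanges;
} Spool;

/* Hypothetical transaction states set by the two callers. */
typedef enum TxnState
{
	TXN_OPEN,
	TXN_COMMITTED,
	TXN_PREPARED
} TxnState;

/* Running total, standing in for the side effect of applying a change. */
static long applied_sum = 0;

/* Common replay step: apply every queued change, return how many. */
static int
replay_spool(const Spool *spool)
{
	int			nchanges = 0;
	size_t		i;

	for (i = 0; i < spool->nchanges; i++)
	{
		applied_sum += spool->changes[i];
		nchanges++;
	}

	return nchanges;
}

/* Caller 1: replay the spool, then finish the transaction as a commit. */
static int
handle_stream_commit(const Spool *spool, TxnState *state)
{
	int			nchanges = replay_spool(spool);

	*state = TXN_COMMITTED;
	return nchanges;
}

/* Caller 2: replay the same way, but leave the transaction prepared. */
static int
handle_stream_prepare(const Spool *spool, TxnState *state)
{
	int			nchanges = replay_spool(spool);

	*state = TXN_PREPARED;
	return nchanges;
}
```

Returning the count from the shared routine lets each caller log or assert on it without duplicating the replay loop.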

Attachment: v30-0004-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From 75e306eace493ab5b398d573fc07760e1b57886e Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:13:14 -0500
Subject: [PATCH v30] Support 2PC txn tests for concurrent aborts.

Add TAP tests to test_decoding that exercise concurrent aborts while a two-phase transaction is being decoded.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  58 ++++++++++
 src/backend/replication/logical/reorderbuffer.c   |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2c4acdc..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the xid of the 2PC to the plugin as an option.
+# On receipt of a valid "check-xid-aborted", the change callback in the
+# test_decoding plugin waits for that transaction to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while the decode
+# is waiting.
+#
+# The status of the "check-xid-aborted" xid changes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now; it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the xid of the 2PC to the plugin as an option.
+# On receipt of a valid "check-xid-aborted", the change callback in the
+# test_decoding plugin waits for that transaction to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while the decode
+# is waiting.
+#
+# The status of the "check-xid-aborted" xid changes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above, passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now; it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now; it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 6330661..da1369f 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -174,6 +177,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -275,6 +279,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -471,6 +493,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -706,6 +755,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -918,6 +970,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -971,6 +1026,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index da54560..efce11b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2487,6 +2487,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
1.8.3.1
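
The "check-xid-aborted" option above is parsed with strtoul() and an errno check. A stricter variant of that validation can be sketched as standalone C. This is not what the patch does: parse_xid_option() is a hypothetical helper, it parses base 10 only (the patch passes base 0), and it additionally rejects trailing garbage via the end pointer. Treating 0 as invalid mirrors InvalidTransactionId.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal transaction id from an option
 * string. Returns true and stores the value in *xid on success; returns
 * false and leaves *xid untouched on any malformed or invalid input.
 */
static bool
parse_xid_option(const char *value, uint32_t *xid)
{
	char	   *end;
	unsigned long parsed;

	/* Require a leading digit: rejects NULL, "", signs and leading blanks. */
	if (value == NULL || !isdigit((unsigned char) value[0]))
		return false;

	errno = 0;
	parsed = strtoul(value, &end, 10);

	/* Reject overflow, trailing garbage, and values beyond 32 bits. */
	if (errno != 0 || *end != '\0' || parsed > UINT32_MAX)
		return false;

	/* Assumption: 0 is never a valid xid, as with InvalidTransactionId. */
	if (parsed == 0)
		return false;

	*xid = (uint32_t) parsed;
	return true;
}
```

Checking the end pointer means a value such as "12abc" is rejected instead of being silently read as 12.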

Attachment: v30-0001-Extend-the-output-plugin-API-to-allow-decoding-p.patch (application/octet-stream)
From e8904a9dc19f4bcb429cfcaf3bbdb3c4c56d4b2c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 01:38:31 -0500
Subject: [PATCH v30] Extend the output plugin API to allow decoding prepared
 xacts.

This adds six methods to the output plugin API, adding support for
streaming changes of two-phase transactions at prepare time.

* filter_prepare
* begin_prepare
* prepare
* commit_prepared
* rollback_prepared
* stream_prepare

Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction is not yet committed and may be
aborted later.

Until now two-phase transactions were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the
two-phase commands were communicated to the subscriber.

This patch provides the infrastructure for logical decoding plugins to be
informed of two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED,
and ROLLBACK PREPARED, along with the corresponding GID.

This also extends the 'test_decoding' plugin, implementing these new
methods.

This commit only adds the new APIs; the upcoming patch to "allow
the decoding at prepare time in ReorderBuffer" will use them.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c     | 167 ++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 167 +++++++++++++++-
 src/backend/replication/logical/logical.c | 303 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   5 +
 src/include/replication/output_plugin.h   |  58 ++++++
 src/include/replication/reorderbuffer.h   |  43 +++++
 src/tools/pgindent/typedefs.list          |  12 ++
 7 files changed, 748 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278b..6330661 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -76,6 +76,20 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 TransactionId xid, const char *gid);
+static void pg_decode_begin_prepare_txn(LogicalDecodingContext *ctx,
+								ReorderBufferTXN *txn);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 static void pg_decode_stream_start(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn);
 static void pg_output_stream_start(LogicalDecodingContext *ctx,
@@ -87,6 +101,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -123,9 +140,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->begin_prepare_cb = pg_decode_begin_prepare_txn;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
@@ -141,6 +164,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -241,6 +265,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +286,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +355,111 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* BEGIN PREPARE callback */
+static void
+pg_decode_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata =
+	MemoryContextAllocZero(ctx->context, sizeof(TestDecodingTxnData));
+
+	txndata->xact_wrote_changes = false;
+	txn->output_plugin_private = txndata;
+
+	if (data->skip_empty_xacts)
+		return;
+
+	pg_output_begin(ctx, data, txn, true);
+}
+
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate
+ * simple logic that checks the GID: if the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						 TransactionId xid, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -702,6 +842,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..6262273 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,9 +389,15 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodeBeginPrepareCB begin_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +419,20 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits,
+    which allows changes to be decoded at <command>PREPARE TRANSACTION</command> time.
+    The <function>begin_prepare_cb</function>, <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +493,15 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. It is possible that the transaction currently
+     being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too. All changes of such a transaction are skipped once
+     the abort is detected, and the transaction is aborted when the WAL for
+     <command>ROLLBACK PREPARED</command> is read.
     </para>
 
     <note>
@@ -587,7 +611,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or any other uncommitted transaction, this
+      change callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -685,7 +715,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or any other uncommitted transaction, this message
+      callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -698,6 +734,106 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-begin-prepare">
+     <title>Transaction Begin Prepare Callback</title>
+
+     <para>
+      The required <function>begin_prepare_cb</function> callback is called whenever
+      the start of a prepared transaction has been decoded.
+<programlisting>
+typedef void (*LogicalDecodeBeginPrepareCB) (struct LogicalDecodingContext *ctx,
+                                             ReorderBufferTXN *txn);
+</programlisting>
+      The <parameter>txn</parameter> parameter contains meta information about
+      the transaction, like the GID of the transaction and its XID.
+     </para>
+    </sect3>
+
+
+    <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+      The optional <function>filter_prepare_cb</function> callback
+      is called to determine whether data that is part of the current
+      two-phase commit transaction should be considered for decoding
+      at this prepare stage or later as a regular one-phase transaction at
+      <command>COMMIT PREPARED</command> time. To signal that
+      decoding should be skipped, return <literal>true</literal>;
+      <literal>false</literal> otherwise. When the callback is not
+      defined, <literal>false</literal> is assumed (i.e. nothing is
+      filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              TransactionId xid,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents as for the
+      other callbacks. The <parameter>txn</parameter> parameter contains meta
+      information about the transaction. The <parameter>xid</parameter>
+      contains the XID because <parameter>txn</parameter> can be NULL in some
+      cases. The <parameter>gid</parameter> is the identifier that later
+      identifies this transaction for <command>COMMIT PREPARED</command> or
+      <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given combination of
+      <parameter>xid</parameter> and <parameter>gid</parameter> every time it is
+      called.
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callback for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called
+      whenever a transaction's <command>COMMIT PREPARED</command> has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr commit_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called
+      whenever a transaction's <command>ROLLBACK PREPARED</command> has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+<programlisting>
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                 ReorderBufferTXN *txn,
+                                                 XLogRecPtr rollback_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-start">
      <title>Stream Start Callback</title>
      <para>
@@ -735,6 +871,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1062,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, committed using the
+    <function>commit_prepared_cb</function> callback, or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f1f4df7..d54e051 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,15 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static bool filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  TransactionId xid, const char *gid);
+static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +83,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -227,6 +238,34 @@ StartupDecodingContext(List *output_plugin_options,
 		(ctx->callbacks.stream_truncate_cb != NULL);
 
 	/*
+	 * To support two-phase logical decoding, we require
+	 * prepare/commit-prepare/abort-prepare callbacks. The filter_prepare
+	 * callback is optional. We however enable two-phase logical decoding when
+	 * at least one of the methods is enabled so that we can easily identify
+	 * missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.begin_prepare_cb != NULL) ||
+		(ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
+	 * Callbacks to support decoding at prepare time.
+	 *
+	 * filter_prepare is optional, so we do not fail with ERROR when it is
+	 * missing; its wrapper simply skips the call.
+	 */
+	ctx->reorder->filter_prepare = filter_prepare_cb_wrapper;
+	ctx->reorder->begin_prepare = begin_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
+
+	/*
 	 * streaming callbacks
 	 *
 	 * stream_message and stream_truncate callbacks are optional, so we do not
@@ -237,6 +276,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
@@ -783,6 +823,178 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "begin_prepare";
+	state.report_location = txn->first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->first_lsn;
+
+	/*
+	 * If the plugin supports two-phase commits then begin prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.begin_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires begin_prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.begin_prepare_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then abort prepared callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
 {
@@ -859,6 +1071,54 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  TransactionId xid, const char *gid)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case, all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (!ctx->callbacks.filter_prepare_cb)
+		return false;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, txn, xid, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1057,6 +1317,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..7f4384b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..c832399 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,47 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/*
+ * Callback called for every BEGIN of a prepared transaction.
+ */
+typedef void (*LogicalDecodeBeginPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn);
+
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
+
+/*
  * Called when starting to stream a block of changes from in-progress
  * transaction (may be called repeatedly, if it's streamed in multiple
  * chunks).
@@ -124,6 +165,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -173,10 +222,19 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+
+	/* streaming of changes at prepare time */
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodeBeginPrepareCB begin_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
+
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7e..5fd6049 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -244,6 +244,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -418,6 +421,30 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* begin prepare callback signature */
+typedef void (*ReorderBufferBeginPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* start streaming transaction callback signature */
 typedef void (*ReorderBufferStreamStartCB) (
 											ReorderBuffer *rb,
@@ -436,6 +463,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -505,11 +538,21 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction at prepare time.
+	 */
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferBeginPrepareCB begin_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
+
+	/*
 	 * Callbacks to be called when streaming a transaction.
 	 */
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cf63acb..1759095 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1315,9 +1315,21 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodeBeginPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1
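
For anyone reviewing the enablement rules in this patch, here is a minimal standalone sketch of the two decisions made in StartupDecodingContext() and filter_prepare_cb_wrapper(): two-phase decoding is switched on as soon as any one of the prepare-time callbacks is set, and the filter wrapper returns true (skip at PREPARE) when two-phase is off, false when no filter callback is supplied, and the plugin's answer otherwise. It is a mock, not PostgreSQL code: MockCallbacks, mock_twophase_enabled(), mock_filter_prepare(), mock_noop() and mock_skip_temp_gids() are hypothetical names, and the callback signatures are simplified.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the two-phase callback pointers. */
typedef bool (*MockFilterPrepareFn) (unsigned int xid, const char *gid);
typedef void (*MockVoidFn) (void);

typedef struct MockCallbacks
{
	MockFilterPrepareFn filter_prepare_cb;
	MockVoidFn	begin_prepare_cb;
	MockVoidFn	prepare_cb;
	MockVoidFn	commit_prepared_cb;
	MockVoidFn	rollback_prepared_cb;
	MockVoidFn	stream_prepare_cb;
} MockCallbacks;

/* Two-phase decoding counts as enabled if at least one callback is set. */
static bool
mock_twophase_enabled(const MockCallbacks *cb)
{
	return cb->begin_prepare_cb != NULL ||
		cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/*
 * Same decision order as the filter wrapper in the patch.
 * Returns true when the transaction should NOT be decoded at PREPARE.
 */
static bool
mock_filter_prepare(const MockCallbacks *cb, unsigned int xid, const char *gid)
{
	/* two-phase disabled: everything is decoded at COMMIT PREPARED */
	if (!mock_twophase_enabled(cb))
		return true;

	/* the filter is optional: without it nothing is filtered */
	if (cb->filter_prepare_cb == NULL)
		return false;

	return cb->filter_prepare_cb(xid, gid);
}

/* Placeholder callback body, only used to make a pointer non-NULL. */
static void
mock_noop(void)
{
}

/* Example filter (an assumption): skip GIDs that start with "tmp_". */
static bool
mock_skip_temp_gids(unsigned int xid, const char *gid)
{
	(void) xid;
	return gid != NULL && strncmp(gid, "tmp_", 4) == 0;
}
```

Note that in this model a plugin that sets only some of the required callbacks still has two-phase decoding enabled, which is what lets the real wrappers report the missing callback with an ERROR instead of silently falling back.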

Attachment: v30-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch (application/octet-stream)
From 21204f2e9ca26dd1ec00963cb3f9dca11c8c0a15 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 01:40:11 -0500
Subject: [PATCH v30] Allow decoding at prepare time in ReorderBuffer.

This patch allows PREPARE-time decoding of two-phase transactions (if the
output plugin supports this capability), in which case the transactions
are replayed at PREPARE and then committed later when COMMIT PREPARED
arrives.

Now that we decode the changes before the commit, concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We detect such failures with a special sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK introduced by commit 7259736a6e and stop
decoding the remaining changes. Then we roll back the changes when
ROLLBACK PREPARED is encountered.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, Arseny Sher, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/decode.c        | 247 ++++++++++++---
 src/backend/replication/logical/reorderbuffer.c | 385 ++++++++++++++++++++----
 src/include/replication/reorderbuffer.h         |  20 ++
 3 files changed, 548 insertions(+), 104 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..1e9522f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,13 +67,22 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool already_decoded);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool already_decoded);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 
+static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx,
+							  XLogRecordBuffer *buf, Oid dbId,
+							  RepOriginId origin_id);
+
 /*
  * Take every XLogReadRecord()ed record and perform the actions required to
  * decode it using the output plugin already setup in the logical decoding
@@ -244,6 +253,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +263,16 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction data then
+				 * DecodeCommit doesn't need to decode it again. This is
+				 * possible iff output plugin supports two-phase commits and
+				 * doesn't skip the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+					already_decoded = !(ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+
+				DecodeCommit(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +281,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		already_decoded = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +291,14 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * If we have already decoded this transaction during prepare
+				 * then DecodeAbort needs to call rollback prepared.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+					already_decoded = !(ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid));
+
+				DecodeAbort(ctx, buf, &parsed, xid, already_decoded);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +339,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/* does output plugin want this particular transaction? */
+				if (ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
+												 parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -582,10 +626,14 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool already_decoded)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -606,15 +654,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * the reorderbuffer to forget the content of the (sub-)transactions
 	 * if not.
 	 *
-	 * There can be several reasons we might not be interested in this
-	 * transaction:
-	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
-	 * 2) The transaction happened in another database.
-	 * 3) The output plugin is not interested in the origin.
-	 * 4) We are doing fast-forwarding
-	 *
 	 * We can't just use ReorderBufferAbort() here, because we need to execute
 	 * the transaction's invalidations.  This currently won't be needed if
 	 * we're just skipping over the transaction because currently we only do
@@ -627,9 +666,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * relevant syscaches.
 	 * ---
 	 */
-	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
-		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
-		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
 	{
 		for (i = 0; i < parsed->nsubxacts; i++)
 		{
@@ -640,7 +677,83 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 		return;
 	}
 
-	/* tell the reorderbuffer about the surviving subtransactions */
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (already_decoded)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		/* tell the reorderbuffer about the surviving subtransactions */
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+									 buf->origptr, buf->endptr);
+		}
+
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ *
+ * Note that we don't skip prepare even if we have detected concurrent abort.
+ * The reason is that we may already have sent some changes before we
+ * detected the abort, in which case those changes must be aborted on the
+ * subscriber. To abort them, we send the prepare and then the rollback
+ * prepared, which is what happened on the publisher side as well. We could
+ * invent a new abort API that, in such cases, sends an abort and skips the
+ * prepare and rollback prepared, but that is not straightforward because
+ * the transaction might already have been streamed by then, in which case
+ * it is handled when the rollback is encountered. Optimizing the
+ * concurrent abort case is not impossible, but it would add design
+ * complexity for handling the different cases, so we leave it for now as
+ * it doesn't seem worth it.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz prepare_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		prepare_time = parsed->origin_timestamp;
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeTXNNeedSkip
+	 * for the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed; removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache invalidations
+	 * if there are any for the reasons mentioned in DecodeCommit.
+	 */
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
+	{
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/* Tell the reorderbuffer about the surviving subtransactions. */
 	for (i = 0; i < parsed->nsubxacts; i++)
 	{
 		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
@@ -648,33 +761,70 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, buf->origptr, buf->endptr,
+						 prepare_time, origin_id, origin_lsn,
+						 parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'already_decoded' indicates that the transaction data is already decoded
+ * at prepare time.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool already_decoded)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz abort_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool	skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		abort_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeTXNNeedSkip
+	 * for the reasons why we sometimes want to skip the transaction.
+	 */
+	skip_xact = DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id);
+
+	/*
+	 * Send the final rollback record if the transaction data is already
+	 * decoded and we don't need to skip it, otherwise, perform the cleanup of
+	 * the transaction.
+	 */
+	if (already_decoded && !skip_xact)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									abort_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
 	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
@@ -1080,3 +1230,24 @@ DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tuple)
 	header->t_infomask2 = xlhdr.t_infomask2;
 	header->t_hoff = xlhdr.t_hoff;
 }
+
+/*
+ * Check whether we are interested in this specific transaction.
+ *
+ * There can be several reasons we might not be interested in this
+ * transaction:
+ * 1) We might not be interested in decoding transactions up to this
+ *	  LSN. This can happen because we previously decoded it and now just
+ *	  are restarting or if we haven't assembled a consistent snapshot yet.
+ * 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
+ * 4) We are doing fast-forwarding
+ */
+static bool
+DecodeTXNNeedSkip(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+				  Oid txn_dbid, RepOriginId origin_id)
+{
+	return (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+			(txn_dbid != InvalidOid && txn_dbid != ctx->slot->data.database) ||
+			ctx->fast_forward || FilterByOrigin(ctx, origin_id));
+}
\ No newline at end of file
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 15dc51a..da54560 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,18 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or decoding them at PREPARE. Keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots.
+ *
+ * We additionally remove tuplecids after decoding the transaction at prepare
+ * time as we only need to perform invalidation at rollback or commit prepared.
+ *
+ * 'txn_prepared' indicates that we have decoded the transaction at prepare
+ * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1552,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1586,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1757,9 +1794,10 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * If the transaction was (partially) streamed, we need to commit it in a
- * 'streamed' way.  That is, we first stream the remaining part of the
- * transaction, and then invoke stream_commit message.
+ * If the transaction was (partially) streamed, we need to prepare or commit
+ * it in a 'streamed' way.  That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_prepare or stream_commit message as per
+ * the case.
  */
 static void
 ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1769,29 +1807,49 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		/*
+		 * Note, we send stream prepare even if a concurrent abort is detected.
+		 * See DecodePrepare for more information.
+		 */
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of COMMIT PREPARED, so for now
+		 * just truncate the txn by removing changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
  * Set xid to detect concurrent aborts.
  *
- * While streaming an in-progress transaction there is a possibility that the
- * (sub)transaction might get aborted concurrently.  In such case if the
- * (sub)transaction has catalog update then we might decode the tuple using
- * wrong catalog version.  For example, suppose there is one catalog tuple with
- * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
- * and after that we will have two tuples (xmin: 500, xmax: 501) and
- * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
- * say 502 updates the same catalog tuple then the first tuple will be changed
- * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
- * the tuple inserted/updated in 501 after the catalog update, we will see the
- * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
- * consider that the tuple is deleted by xid 502 which is not visible to our
- * snapshot.  And when we will try to decode with that catalog tuple, it can
- * lead to a wrong result or a crash.  So, it is necessary to detect
- * concurrent aborts to allow streaming of in-progress transactions.
+ * While streaming an in-progress transaction or decoding a prepared
+ * transaction there is a possibility that the (sub)transaction might get
+ * aborted concurrently.  In such case if the (sub)transaction has catalog
+ * update then we might decode the tuple using wrong catalog version.  For
+ * example, suppose there is one catalog tuple with (xmin: 500, xmax: 0).  Now,
+ * the transaction 501 updates the catalog tuple and after that we will have
+ * two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).  Now, if 501 is
+ * aborted and some other transaction say 502 updates the same catalog tuple
+ * then the first tuple will be changed to (xmin: 500, xmax: 502).  So, the
+ * problem is that when we try to decode the tuple inserted/updated in 501
+ * after the catalog update, we will see the catalog tuple with (xmin: 500,
+ * xmax: 502) as visible because it will consider that the tuple is deleted by
+ * xid 502 which is not visible to our snapshot.  And when we will try to
+ * decode with that catalog tuple, it can lead to a wrong result or a crash.
+ * So, it is necessary to detect concurrent aborts to allow streaming of
+ * in-progress transactions or decoding of prepared transactions.
  *
  * For detecting the concurrent abort we set CheckXidAlive to the current
  * (sub)transaction's xid for which this change belongs to.  And, during
@@ -1800,7 +1858,10 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * and discard the already streamed changes on such an error.  We might have
  * already streamed some of the changes for the aborted (sub)transaction, but
  * that is fine because when we decode the abort we will stream abort message
- * to truncate the changes in the subscriber.
+ * to truncate the changes in the subscriber. Similarly, for prepared
+ * transactions, we stop decoding if concurrent abort is detected and then
+ * roll back the changes when rollback prepared is encountered. See
+ * DecodePrepare.
  */
 static inline void
 SetupCheckXidLive(TransactionId xid)
@@ -1902,7 +1963,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1914,15 +1975,19 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		specinsert = NULL;
 	}
 
-	/* Stop the stream. */
-	rb->stream_stop(rb, txn, last_lsn);
-
-	/* Remember the command ID and snapshot for the streaming run */
-	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	/*
+	 * For the streaming case, stop the stream and remember the command ID and
+	 * snapshot for the streaming run.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_stop(rb, txn, last_lsn);
+		ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	}
 }
 
 /*
- * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ * Helper function for ReorderBufferReplay and ReorderBufferStreamTXN.
  *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
@@ -1975,9 +2040,14 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		else
 			StartTransactionCommand();
 
-		/* We only need to send begin/commit for non-streamed transactions. */
+		/* We only need to send begin/begin-prepare for non-streamed transactions. */
 		if (!streaming)
-			rb->begin(rb, txn);
+		{
+			if (rbtxn_prepared(txn))
+				rb->begin_prepare(rb, txn);
+			else
+				rb->begin(rb, txn);
+		}
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -2008,8 +2078,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			prev_lsn = change->lsn;
 
-			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			/*
+			 * Set the current xid to detect concurrent aborts. This is
+			 * required for the cases when we decode the changes before the
+			 * COMMIT record is processed.
+			 */
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2300,7 +2374,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2334,15 +2417,22 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the four reasons:
+		 * 1. Decoding an in-progress txn.
+		 * 2. Decoding a prepared txn.
+		 * 3. Decoding of a prepared txn that was (partially) streamed.
+		 * 4. Decoding a committed txn.
+		 *
+		 * For 1, we allow truncation of txn data by removing the changes already
+		 * streamed but still keeping other things like invalidations, snapshot,
+		 * and tuplecids. For 2 and 3, we instruct ReorderBufferTruncateTXN to
+		 * do more elaborate truncation of txn data as the entire transaction has
+		 * been decoded except for commit. For 4, as the entire txn has been
+		 * decoded, we can fully clean up the TXN reorder buffer.
 		 */
-		if (streaming)
+		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
-
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2375,17 +2465,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2415,26 +2508,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
  * transaction with ReorderBufferAssignChild.
  *
- * This interface is called once a toplevel commit is read for both streamed
- * as well as non-streamed transactions.
+ * This interface is called once a prepare or toplevel commit is read for both
+ * streamed as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferReplay(ReorderBufferTXN *txn,
+					ReorderBuffer *rb, TransactionId xid,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2464,7 +2550,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	if (txn->base_snapshot == NULL)
 	{
 		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+
+		/*
+		 * Removing this txn before a commit might result in the computation
+		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
+		 */
+		if (!rbtxn_prepared(txn))
+			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
 
@@ -2476,6 +2568,134 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+bool
+ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, const char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, so skip preparing it */
+	if (txn == NULL)
+		return true;
+
+	return rb->filter_prepare(rb, txn, xid, gid);
+}
+
+
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Prepare a two-phase transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+					 TimestampTz prepare_time,
+					 RepOriginId origin_id, XLogRecPtr origin_lsn,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	ReorderBufferReplay(txn, rb, xid, prepare_lsn, end_lsn, prepare_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Send standalone xact event. This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time, RepOriginId origin_id,
+							XLogRecPtr origin_lsn, char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/* add the gid in the txn */
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	/*
+	 * Check if this txn had been prepared earlier. If the txn is not marked
+	 * as prepared, it could be that it was decoded earlier but we restarted,
+	 * or it could be because the prepare was prior to having assembled
+	 * a consistent snapshot. Either way, replay the transaction like you
+	 * would on a PREPARE. But decode only if this is for a COMMIT PREPARED.
+	 * Let the remote handle duplicates.
+	 */
+	if (!rbtxn_prepared(txn) && is_commit)
+	{
+		txn->txn_flags |= RBTXN_PREPARE;
+
+		ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+							origin_id, origin_lsn);
+	}
+
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+	txn->txn_flags |= RBTXN_PREPARE;
+
+	if (is_commit)
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2607,6 +2827,39 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ *
+ * Note that this is a special-purpose function for prepared transactions where
+ * we don't want to clean up the TXN even when we decide to skip it. See
+ * DecodePrepare.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5fd6049..b27ebf5 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -174,6 +174,7 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +234,12 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -627,12 +634,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -650,6 +663,13 @@ void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+										 const char *gid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+								 TimestampTz prepare_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn,
+								 char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
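
For anyone following the flag handling in ReorderBufferFinishPrepared() above, here is a minimal standalone sketch of the replay decision. It is not PostgreSQL code: MockTxn and finish_prepared_needs_replay() are hypothetical names made up for illustration, and only the RBTXN_PREPARE value and the shape of rbtxn_prepared() are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag value as defined in the patch's reorderbuffer.h hunk. */
#define RBTXN_PREPARE 0x0080

/* Hypothetical stand-in for ReorderBufferTXN; only the flags matter here. */
typedef struct MockTxn
{
	uint32_t	txn_flags;
} MockTxn;

/* Same shape as the rbtxn_prepared() macro added by the patch. */
#define rbtxn_prepared(txn) (((txn)->txn_flags & RBTXN_PREPARE) != 0)

/*
 * Mirrors the condition in ReorderBufferFinishPrepared(): the changes are
 * replayed at COMMIT PREPARED time only if the PREPARE was never decoded
 * (for example after a restart, or before a consistent snapshot existed).
 * A ROLLBACK PREPARED never triggers a replay.
 */
static bool
finish_prepared_needs_replay(MockTxn *txn, bool is_commit)
{
	bool		replay;

	assert(txn != NULL);
	replay = !rbtxn_prepared(txn) && is_commit;

	/* The patch marks the txn as prepared in either case. */
	txn->txn_flags |= RBTXN_PREPARE;
	return replay;
}
```

With this model, the first COMMIT PREPARED of an undecoded transaction asks for a replay, and any later call for the same transaction does not, which matches the "let the remote handle duplicates" comment in the patch.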

Attachment: v30-0006-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From 14c152211049ecd89ed2d8ebdaf62855f8388b32 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:31:17 -0500
Subject: [PATCH v30] Support 2PC txn - pgoutput.

This patch adds support for two-phase commits to the pgoutput plugin and
to the subscriber. It includes both the pgoutput changes and the
subscriber changes.
---
 src/backend/access/transam/twophase.c           |  33 ++-
 src/backend/replication/logical/proto.c         | 176 +++++++++++++-
 src/backend/replication/logical/reorderbuffer.c |   6 +
 src/backend/replication/logical/worker.c        | 294 ++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c     | 114 +++++++++
 src/include/access/twophase.h                   |   1 +
 src/include/replication/logicalproto.h          |  51 +++-
 src/include/replication/reorderbuffer.h         |  14 ++
 src/tools/pgindent/typedefs.list                |   1 +
 9 files changed, 685 insertions(+), 5 deletions(-)
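
As a reading aid for the proto.c hunks below, here is a rough standalone sketch of the PREPARE message layout that logicalrep_write_prepare() and logicalrep_read_prepare() imply: a type byte 'P', one flags byte, three 64-bit fields in network byte order, then the NUL-terminated gid. The names PrepareMsg, encode_prepare() and decode_prepare() are hypothetical, and GIDSIZE = 200 is my assumption about the value in access/xact.h. Unlike the strcpy() calls in the patch, this decoder bounds the gid copy, which is a deliberate difference and not what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GIDSIZE 200				/* assumption: matches access/xact.h */

/* Flag values as defined in the patch's logicalproto.h hunk. */
#define LOGICALREP_IS_PREPARE			0x01
#define LOGICALREP_IS_COMMIT_PREPARED	0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04

/* Hypothetical stand-in for LogicalRepPrepareData. */
typedef struct PrepareMsg
{
	uint8_t		flags;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[GIDSIZE];
} PrepareMsg;

/* Store/load a 64-bit value in big-endian (network) order. */
static void
put_u64(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--)
	{
		p[i] = (uint8_t) (v & 0xff);
		v >>= 8;
	}
}

static uint64_t
get_u64(const uint8_t *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Exactly one of the three flag values is allowed. */
static bool
flags_valid(uint8_t f)
{
	return f == LOGICALREP_IS_PREPARE ||
		f == LOGICALREP_IS_COMMIT_PREPARED ||
		f == LOGICALREP_IS_ROLLBACK_PREPARED;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
static size_t
encode_prepare(uint8_t *buf, size_t cap, const PrepareMsg *m)
{
	size_t		gidlen;
	size_t		need;

	assert(buf != NULL && m != NULL);
	gidlen = strlen(m->gid) + 1;	/* include trailing NUL */
	need = 1 + 1 + 3 * 8 + gidlen;
	if (cap < need)
		return 0;
	buf[0] = 'P';
	buf[1] = m->flags;
	put_u64(buf + 2, m->prepare_lsn);
	put_u64(buf + 10, m->end_lsn);
	put_u64(buf + 18, (uint64_t) m->prepare_time);
	memcpy(buf + 26, m->gid, gidlen);
	return need;
}

/* Decodes the body that follows the type byte; false on malformed input. */
static bool
decode_prepare(const uint8_t *body, size_t len, PrepareMsg *out)
{
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	if (len < 1 + 3 * 8 + 1 || !flags_valid(body[0]))
		return false;
	out->flags = body[0];
	out->prepare_lsn = get_u64(body + 1);
	out->end_lsn = get_u64(body + 9);
	out->prepare_time = (int64_t) get_u64(body + 17);

	gid = body + 25;
	nul = memchr(gid, '\0', len - 25);
	if (nul == NULL)
		return false;			/* gid is not terminated */
	gidlen = (size_t) (nul - gid);
	if (gidlen >= GIDSIZE)
		return false;			/* gid would overflow the fixed buffer */
	memcpy(out->gid, gid, gidlen + 1);
	return true;
}
```

The STREAM PREPARE variant in the patch differs only in that a 32-bit xid precedes the flags byte.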

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9b..00b4497 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 }
 
 /*
+ * LookupGXact
+ *		Check if a prepared transaction with the given GID exists
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
+/*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
  */
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..2772a4f 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,180 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * Flags are determined from the state of the transaction. We know we
+	 * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+	 * it's already marked as committed then it has to be COMMIT PREPARED (and
+	 * likewise for abort / ROLLBACK PREPARED).
+	 */
+	if (rbtxn_commit_prepared(txn))
+		flags = LOGICALREP_IS_COMMIT_PREPARED;
+	else if (rbtxn_rollback_prepared(txn))
+		flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+	else
+		flags = LOGICALREP_IS_PREPARE;
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (!PrepareFlagsAreValid(flags))
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(rbtxn_prepared(txn));
+	Assert(txn->gid != NULL);
+
+	/*
+	 * For streaming APIs only PREPARE is supported; [COMMIT|ROLLBACK]
+	 * PREPARED uses the non-streaming APIs.
+	 */
+	flags = LOGICALREP_IS_PREPARE;
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != LOGICALREP_IS_PREPARE)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* set the action (reuse the constants used for the flags) */
+	prepare_data->prepare_type = flags;
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index efce11b..fd5b1f3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2690,9 +2690,15 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->txn_flags |= RBTXN_PREPARE;
 
 	if (is_commit)
+	{
+		txn->txn_flags |= RBTXN_COMMIT_PREPARED;
 		rb->commit_prepared(rb, txn, commit_lsn);
+	}
 	else
+	{
+		txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
 		rb->rollback_prepared(rb, txn, commit_lsn);
+	}
 
 	/* cleanup: make sure there's no cache pollution */
 	ReorderBufferExecuteInvalidations(txn->ninvalidations,
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a4ec883..c5d531b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* true while skipping a transaction that has already been prepared */
+bool        skipping_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -722,6 +726,272 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+    LogicalRepBeginPrepareData begin_data;
+
+    logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to
+		 * apply this txn again. Don't update remote_final_lsn.
+		 */
+		skipping_prepared_txn = true;
+		return;
+	}
+
+    remote_final_lsn = begin_data.final_lsn;
+
+    in_remote_transaction = true;
+
+    pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a PREPARE TRANSACTION.
+ */
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+	if (skipping_prepared_txn)
+	{
+		/* 
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data->gid));
+		skipping_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data->end_lsn;
+		replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+		PrepareTransactionBlock(prepare_data->gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data->end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a COMMIT PREPARED of a previously
+ * PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	FinishPreparedTransaction(prepare_data->gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Called from apply_handle_prepare to handle a ROLLBACK PREPARED of a previously
+ * PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared_txn(LogicalRepPrepareData *prepare_data)
+{
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->preparetime;
+
+	/*
+	 * On the publisher, a prepared transaction can be aborted while it is
+	 * still being decoded. In that case decoding stops and the transaction
+	 * is aborted immediately, but the ROLLBACK PREPARED still reaches the
+	 * subscriber, which may never have seen the PREPARE. So it is OK for
+	 * the gid to be missing here.
+	 */
+	if (LookupGXact(prepare_data->gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data->gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data->end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data->end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPrepareData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	switch (prepare_data.prepare_type)
+	{
+		case LOGICALREP_IS_PREPARE:
+			apply_handle_prepare_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_COMMIT_PREPARED:
+			apply_handle_commit_prepared_txn(&prepare_data);
+			break;
+
+		case LOGICALREP_IS_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared_txn(&prepare_data);
+			break;
+
+		default:
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("unexpected type of prepare message: %d",
+							prepare_data.prepare_type)));
+	}
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPrepareData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * This should be a PREPARE only. The COMMIT PREPARED and ROLLBACK
+	 * PREPARED for streaming are handled by the non-streaming APIs.
+	 */
+	Assert(prepare_data.prepare_type == LOGICALREP_IS_PREPARE);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare_txn (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1176,6 +1446,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skipping_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1297,6 +1570,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skipping_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1454,6 +1730,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skipping_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1823,6 +2102,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skipping_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1979,6 +2261,18 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997ae..3f1e71a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,14 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+							           ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +65,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -143,6 +153,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +168,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -378,6 +395,85 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+    bool        send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+    OutputPluginPrepareWrite(ctx, !send_replication_origin);
+    logicalrep_write_begin_prepare(ctx->out, txn);
+
+    if (send_replication_origin)
+    {
+        char       *origin;
+
+        /* Message boundary */
+        OutputPluginWrite(ctx, false);
+        OutputPluginPrepareWrite(ctx, true);
+
+        /*----------
+         * XXX: which behaviour do we want here?
+         *
+         * Alternatives:
+         *  - don't send origin message if origin name not found
+         *    (that's what we do now)
+         *  - throw error - that will break replication, not good
+         *  - send some special "unknown" origin
+         *----------
+         */
+        if (replorigin_by_oid(txn->origin_id, true, &origin))
+            logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
+    }
+
+    OutputPluginWrite(ctx, true);
+}
+
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -857,6 +953,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..b2628ea 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..2edc76e 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,13 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +118,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +126,37 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/* Prepare protocol information */
+typedef struct LogicalRepPrepareData
+{
+	uint8		prepare_type;
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPrepareData;
+
+/* types of the prepare protocol message */
+#define LOGICALREP_IS_PREPARE			0x01
+#define LOGICALREP_IS_COMMIT_PREPARED	0x02
+#define LOGICALREP_IS_ROLLBACK_PREPARED	0x04
+
+/* prepare can be exactly one of PREPARE, [COMMIT|ROLLBACK] PREPARED */
+#define PrepareFlagsAreValid(flags) \
+	(((flags) == LOGICALREP_IS_PREPARE) || \
+	 ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
+	 ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +164,13 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+								  		  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPrepareData *prepare_data);
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +214,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPrepareData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index b27ebf5..86061c8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -175,6 +175,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
 #define RBTXN_PREPARE             0x0080
+#define RBTXN_COMMIT_PREPARED     0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -240,6 +242,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1759095..9d33442 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,6 +1342,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPrepareData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1
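
As a reading aid for the logicalproto.h hunk above, here is a minimal standalone sketch of how the prepare flag check behaves. It assumes only the three flag values and the macro shown in the patch; prepare_kind_name() is a hypothetical helper added for illustration and is not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag values and validity macro copied from the patch above. */
#define LOGICALREP_IS_PREPARE            0x01
#define LOGICALREP_IS_COMMIT_PREPARED    0x02
#define LOGICALREP_IS_ROLLBACK_PREPARED  0x04

#define PrepareFlagsAreValid(flags) \
    (((flags) == LOGICALREP_IS_PREPARE) || \
     ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \
     ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED))

/*
 * Hypothetical helper (not in the patch): map a flags byte to a
 * human-readable message kind, or NULL when the byte is not valid.
 */
const char *
prepare_kind_name(uint8_t flags)
{
    if (!PrepareFlagsAreValid(flags))
        return NULL;

    switch (flags)
    {
        case LOGICALREP_IS_PREPARE:
            return "PREPARE";
        case LOGICALREP_IS_COMMIT_PREPARED:
            return "COMMIT PREPARED";
        default:
            return "ROLLBACK PREPARED";
    }
}
```

Because the macro compares for equality rather than testing individual bits, a combined value such as 0x03 or an empty value of 0 is rejected, so a message can carry exactly one of the three kinds.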

v30-0008-Support-2PC-documentation.patch
From e0bd07e4b0745751f10bd7071bd3600e6b5c4605 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:43:40 -0500
Subject: [PATCH v30] Support-2PC-documentation.

Add documentation about two-phase commit support in Logical Decoding.
---
 doc/src/sgml/logicaldecoding.sgml | 99 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 98 insertions(+), 1 deletion(-)
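
The documentation added by this patch describes filter_prepare_cb as matching a name pattern against the GID of a prepared transaction. A rough sketch of such a check, assuming a plain substring match: toy_filter_prepare() and the "_nodecode" pattern used in the usage note are illustrative names only, and the real callback signature is deliberately not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical filter: return true when the prepared transaction should
 * NOT be decoded as a two-phase transaction, i.e. when its GID contains
 * the given pattern. An empty or missing pattern filters nothing.
 */
bool
toy_filter_prepare(const char *gid, const char *pattern)
{
    if (gid == NULL || pattern == NULL || pattern[0] == '\0')
        return false;

    return strstr(gid, pattern) != NULL;
}
```

With a pattern of "_nodecode", a GID such as "test_prepared_nodecode" would be filtered while "test_prepared1" would still be decoded at PREPARE time.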

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 6262273..59d9dcc 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -165,7 +165,57 @@ COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
-  </sect1>
+
+  <para>
+  The following example shows how logical decoding can be used to handle transactions
+  that use a two-phase commit. Before you use two-phase commit commands, you must set
+  <varname>max_prepared_transactions</varname> to at least 1. You must also set the
+  option <literal>two-phase-commit</literal> to 1 when calling <function>pg_logical_slot_get_changes</function>.
+  </para>
+<programlisting>
+postgres=# BEGIN;
+postgres=*# INSERT INTO data(data) VALUES('5');
+postgres=*# PREPARE TRANSACTION 'test_prepared1';
+
+postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/1689DC0 | 529 | BEGIN 529
+ 0/1689DC0 | 529 | table public.data: INSERT: id[integer]:3 data[text]:'5'
+ 0/1689FC0 | 529 | PREPARE TRANSACTION 'test_prepared1', txid 529
+(3 rows)
+
+postgres=# COMMIT PREPARED 'test_prepared1';
+COMMIT PREPARED
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                    data                    
+-----------+-----+--------------------------------------------
+ 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
+(1 row)
+
+postgres=# -- you can also roll back a prepared transaction
+postgres=# BEGIN;
+BEGIN
+postgres=*# INSERT INTO data(data) VALUES('6');
+INSERT 0 1
+postgres=*# PREPARE TRANSACTION 'test_prepared2';
+PREPARE TRANSACTION
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/168A180 | 530 | BEGIN 530
+ 0/168A1E8 | 530 | table public.data: INSERT: id[integer]:4 data[text]:'6'
+ 0/168A430 | 530 | PREPARE TRANSACTION 'test_prepared2', txid 530
+(3 rows)
+
+postgres=# ROLLBACK PREPARED 'test_prepared1';
+ERROR:  prepared transaction with identifier "test_prepared1" does not exist
+postgres=# ROLLBACK PREPARED 'test_prepared2';
+ROLLBACK PREPARED
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                     data                     
+-----------+-----+----------------------------------------------
+ 0/168A4B8 | 530 | ROLLBACK PREPARED 'test_prepared2', txid 530
+(1 row)
+</programlisting>
+</sect1>
 
   <sect1 id="logicaldecoding-explanation">
    <title>Logical Decoding Concepts</title>
@@ -1121,4 +1171,51 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
    </para>
 
   </sect1>
+
+  <sect1 id="logicaldecoding-two-phase-commits">
+   <title>Two-phase commit support for Logical Decoding</title>
+
+   <para>
+   With the basic output plugin callbacks (e.g., <function>begin_cb</function>,
+   <function>change_cb</function>, <function>commit_cb</function> and
+   <function>message_cb</function>) two-phase commit commands like
+   <command>PREPARE TRANSACTION</command>, <command>COMMIT PREPARED</command>
+   and <command>ROLLBACK PREPARED</command> are not decoded correctly.
+   <command>PREPARE TRANSACTION</command> is ignored, while
+   <command>COMMIT PREPARED</command> is decoded as a <command>COMMIT</command> and
+   <command>ROLLBACK PREPARED</command> is decoded as a <command>ROLLBACK</command>.
+   </para>
+
+   <para>
+   An output plugin may provide additional callbacks to support two-phase commit commands.
+   Several two-phase commit callbacks are required
+   (<function>begin_prepare_cb</function>, <function>prepare_cb</function>,
+   <function>commit_prepared_cb</function>,
+   <function>rollback_prepared_cb</function> and <function>stream_prepare_cb</function>),
+   and one is optional (<function>filter_prepare_cb</function>).
+   </para>
+
+   <para>
+   If the output plugin callbacks for decoding two-phase commit commands are provided,
+   then on <command>PREPARE TRANSACTION</command>, the changes of that transaction are
+   decoded, passed to the output plugin and the <function>prepare_cb</function>
+   callback is invoked. This differs from the basic decoding setup where changes are
+   only passed to the output plugin when a transaction is committed. The start of a
+   prepared transaction is indicated by the <function>begin_prepare_cb</function> callback.
+   </para>
+
+   <para>
+   When a prepared transaction is rolled back using <command>ROLLBACK PREPARED</command>,
+   the <function>rollback_prepared_cb</function> callback is invoked, and when the
+   prepared transaction is committed using <command>COMMIT PREPARED</command>,
+   the <function>commit_prepared_cb</function> callback is invoked.
+   </para>
+
+   <para>
+   Optionally, the output plugin can specify a name pattern in
+   <function>filter_prepare_cb</function>; transactions whose GID contains
+   that pattern are not decoded as two-phase commit transactions.
+   </para>
+
+  </sect1>
  </chapter>
-- 
1.8.3.1
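
The callback order described in the documentation patch above can be modelled with a toy trace. This is only a sketch under stated assumptions: toy_decode_prepared(), ToyOutcome and the comma-separated trace buffer are hypothetical names, and nothing here uses the real output plugin API or its signatures. It models the case where the two-phase callbacks are provided.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome of the second phase. */
typedef enum
{
    TOY_COMMIT_PREPARED,
    TOY_ROLLBACK_PREPARED
} ToyOutcome;

/* Append one callback name to a comma-separated trace, truncating if full. */
static void
trace_append(char *buf, size_t size, const char *name)
{
    size_t      used = strlen(buf);

    if (used > 0 && used + 1 < size)
    {
        buf[used++] = ',';
        buf[used] = '\0';
    }
    strncat(buf, name, size - used - 1);
}

/*
 * Record the order in which the callbacks named in the documentation
 * would fire for one prepared transaction with nchanges row changes.
 */
void
toy_decode_prepared(int nchanges, ToyOutcome outcome, char *buf, size_t size)
{
    int         i;

    if (size == 0)
        return;
    buf[0] = '\0';

    /* At PREPARE TRANSACTION: start marker, the changes, then prepare_cb. */
    trace_append(buf, size, "begin_prepare_cb");
    for (i = 0; i < nchanges; i++)
        trace_append(buf, size, "change_cb");
    trace_append(buf, size, "prepare_cb");

    /* Later, when the second phase is decoded. */
    if (outcome == TOY_COMMIT_PREPARED)
        trace_append(buf, size, "commit_prepared_cb");
    else
        trace_append(buf, size, "rollback_prepared_cb");
}
```

The point of the model is that the changes reach the plugin before prepare_cb, that is, at PREPARE time, and that the later COMMIT PREPARED or ROLLBACK PREPARED produces a single callback with no further changes.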

v30-0009-Support-2PC-txn-Subscription-option.patch
From 9a8195ffab38e081c4c7339dcd871f644ebc9d81 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:48:56 -0500
Subject: [PATCH v30] Support 2PC txn - Subscription option.

This patch implements new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.

Note: The tablesync worker slot always has two_phase disabled, regardless of the option.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 13 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/catversion.h                   |  2 +-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 21 files changed, 201 insertions(+), 52 deletions(-)
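
The startup check this patch adds to pgoutput.c decides whether two-phase decoding is used for a connection. The decision order can be sketched as a small pure function. The names toy_decide_twophase(), Toy2PCDecision and TOY_PROTO_2PC_VERSION_NUM are hypothetical stand-ins; the constant mirrors the value 2 that the patch gives LOGICALREP_PROTO_2PC_VERSION_NUM, and the two error results stand in for the ereport(ERROR) calls.

```c
#include <stdbool.h>

/* Mirrors LOGICALREP_PROTO_2PC_VERSION_NUM as defined in this patch. */
#define TOY_PROTO_2PC_VERSION_NUM 2

/* Hypothetical result type; the patch raises errors instead of returning. */
typedef enum
{
    TOY_2PC_DISABLED,
    TOY_2PC_ENABLED,
    TOY_2PC_ERR_PROTO_TOO_OLD,
    TOY_2PC_ERR_PLUGIN_UNSUPPORTED
} Toy2PCDecision;

/*
 * requested:     the subscriber passed two_phase 'on'
 * proto_version: the protocol version the subscriber asked for
 * ctx_twophase:  the decoding context reports two-phase support
 */
Toy2PCDecision
toy_decide_twophase(bool requested, unsigned int proto_version,
                    bool ctx_twophase)
{
    /* Off by default: nothing else needs to be checked. */
    if (!requested)
        return TOY_2PC_DISABLED;

    /* Requested, but the protocol version is too old. */
    if (proto_version < TOY_PROTO_2PC_VERSION_NUM)
        return TOY_2PC_ERR_PROTO_TOO_OLD;

    /* Requested, but the output plugin side does not support it. */
    if (!ctx_twophase)
        return TOY_2PC_ERR_PLUGIN_UNSUPPORTED;

    return TOY_2PC_ENABLED;
}
```

Note that the protocol version is checked before plugin support, so a subscriber asking for two_phase with proto_version 1 would see the version error even if the plugin lacks support as well.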

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index db5e59f..dbe2a43 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -166,8 +166,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..0d233f4e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,19 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. When two-phase commit is not
+          enabled, PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED are not
+          decoded on the publisher.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index ca78d39..886839e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -67,6 +67,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c21..5f4e191 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1149,7 +1149,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 1696454..b0745d5 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -64,7 +64,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -105,6 +106,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -210,6 +216,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -355,6 +370,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -379,7 +396,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -447,6 +465,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -720,6 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -730,7 +751,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -769,6 +791,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -787,7 +816,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -832,7 +862,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -875,7 +906,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 24f8b3e..1f404cd 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c5d531b..e14fe62 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2759,6 +2759,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		(!am_tablesync_worker() && newsub->twophase != MySubscription->twophase) ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3405,6 +3406,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase && !am_tablesync_worker();
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 3f1e71a..7a7b369 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -175,13 +175,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -249,6 +251,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -262,6 +274,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -286,7 +299,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -327,6 +341,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 3b36335..db68386 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4221,6 +4221,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4264,9 +4265,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4287,6 +4293,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4312,6 +4319,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4380,6 +4389,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 317bb83..22e4e6c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -629,6 +629,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 14150d0..47306a2 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5997,7 +5997,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6023,13 +6023,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index a16cc38..bc1d8fd 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202011251
+#define CATALOG_VERSION_NO	202011271
 
 #endif
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3fa02af..e07eed0 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -53,6 +53,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -90,6 +92,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 2edc76e..b0b57d9 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39..f96c891 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 2fa9bce..23d876e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,42 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 14fa0b2..2a0b366 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -147,6 +147,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

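A quick sanity check on the row counts expected by 021_twophase_stream.pl in the attachment that follows: the streamed transaction inserts keys 3..5000 on top of the 2 initial rows, updates the even keys, and deletes every key divisible by 3, and the test then expects 3334 rows on the subscriber. The sketch below only models that arithmetic in Python; it is not part of the patch, and the helper name expected_rows is made up for illustration.

```python
# Hypothetical helper (not in the patch): model the table as a set of
# primary-key values and replay the statements from the streamed transaction.
def expected_rows(first=1, last=5000):
    # 2 pre-existing rows (a = 1, 2) plus INSERT ... generate_series(3, 5000)
    rows = set(range(first, last + 1))
    # UPDATE ... WHERE mod(a,2) = 0 rewrites column b only; row count unchanged.
    # DELETE ... WHERE mod(a,3) = 0 removes every key divisible by 3.
    rows = {a for a in rows if a % 3 != 0}
    return len(rows)


print(expected_rows())
```

The same figure falls out by hand: 5000 keys minus floor(5000 / 3) = 1666 multiples of 3 leaves 3334. The subscriber-only columns c and d are filled from local defaults, so they are non-NULL for each of those rows.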
Attachment: v30-0007-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From bc332342c43ede664ab004a9b3332fe7c2e8ad4f Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:32:53 -0500
Subject: [PATCH v30] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test (streaming enabled)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#150Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#149)
9 attachment(s)

On Tue, Dec 8, 2020 at 2:01 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Dec 1, 2020 at 6:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

To skip it, we need to send GID in begin message and then on
subscriber-side, check if the prepared xact already exists, if so then
set a flag. The flag needs to be set in begin/start_stream and reset
in stop_stream/commit/abort. Using the flag, we can skip the entire
contents of the prepared xact. In ReorderBuffer-side also, we need to
get and set GID in txn even when we skip it because we need to send
the same at commit time. In this solution, we won't be able to send it
during normal start_stream because by that time we won't know GID and
I think that won't be required. Note that this is only required when
we skipped sending prepare, otherwise, we just need to send
Commit-Prepared at commit time.

I have implemented these changes and tested the fix using the test
setup I had shared above and it seems to be working fine.
I have also tested restarts that simulate duplicate prepares being
sent by the publisher and verified that it is handled correctly by the
subscriber.
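
The skip-flag scheme from the quoted paragraph can be sketched roughly
as below. This is only a toy model, not the patch's code: ApplyState,
on_begin_prepare, on_change and on_prepare are hypothetical names, the
list of locally prepared GIDs stands in for a lookup of existing
prepared xacts, and the flag is reset at the prepare message for
brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PREPARED 8

/* Hypothetical subscriber-side apply state (illustration only). */
typedef struct ApplyState
{
	const char *prepared_gids[MAX_PREPARED];	/* GIDs already prepared locally */
	int			nprepared;
	bool		skipping;		/* set at begin, reset when the xact ends */
	int			changes_applied;
} ApplyState;

/* Does a prepared transaction with this GID already exist locally? */
static bool
gid_is_prepared(const ApplyState *st, const char *gid)
{
	for (int i = 0; i < st->nprepared; i++)
		if (strcmp(st->prepared_gids[i], gid) == 0)
			return true;
	return false;
}

/* Begin message carries the GID, so the flag can be decided up front. */
static void
on_begin_prepare(ApplyState *st, const char *gid)
{
	st->skipping = gid_is_prepared(st, gid);
}

/* Changes of an already prepared xact are dropped while the flag is set. */
static void
on_change(ApplyState *st)
{
	if (!st->skipping)
		st->changes_applied++;
}

/* Prepare message: record the GID unless it is a duplicate, reset the flag. */
static void
on_prepare(ApplyState *st, const char *gid)
{
	if (!st->skipping)
	{
		assert(st->nprepared < MAX_PREPARED);
		st->prepared_gids[st->nprepared++] = gid;
	}
	st->skipping = false;
}
```

Feeding the same begin/change/prepare sequence twice for one GID
leaves the change count and the prepared list unchanged after the
second pass, which is the duplicate-prepare case the restart tests
exercise.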

This implementation has a flaw in that it has used commit_lsn for the
prepare when we are sending prepare just before commit prepared. We
can't send the commit LSN with prepare because if the subscriber
crashes after prepare then it would update
replorigin_session_origin_lsn with that commit_lsn. Now, after the
restart, because we will use that LSN to start decoding, the Commit
Prepared will get skipped. To fix this, we need to remember the
prepare LSN and other information even when we skip prepare and then
use it while sending the prepare during commit prepared.
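
To make the failure mode concrete, here is a small toy model of it.
It is a simplification under stated assumptions: Lsn stands in for
XLogRecPtr, Subscriber.origin_lsn plays the role of
replorigin_session_origin_lsn, and the "deliver only what lies beyond
the restart point" rule is reduced to a plain comparison. All names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */

/* Toy subscriber: remembers the LSN attached to the last applied message. */
typedef struct Subscriber
{
	Lsn			origin_lsn;		/* models replorigin_session_origin_lsn */
	bool		prepared;
	bool		committed;
} Subscriber;

/* Apply PREPARE and remember whatever LSN the publisher sent with it. */
static void
apply_prepare(Subscriber *sub, Lsn msg_lsn)
{
	sub->prepared = true;
	sub->origin_lsn = msg_lsn;
}

/*
 * After a restart, decoding resumes from the subscriber's origin LSN.
 * In this model COMMIT PREPARED is delivered only if it lies beyond it.
 */
static void
replay_commit_prepared(Subscriber *sub, Lsn commit_lsn)
{
	if (commit_lsn <= sub->origin_lsn)
		return;					/* treated as already applied: skipped */
	assert(sub->prepared);
	sub->committed = true;
	sub->origin_lsn = commit_lsn;
}

/* Crash right after PREPARE, then restart: does the commit still arrive? */
static bool
commit_survives_restart(Lsn lsn_sent_with_prepare, Lsn commit_lsn)
{
	Subscriber	sub = {0, false, false};

	apply_prepare(&sub, lsn_sent_with_prepare);
	/* subscriber crashes and restarts here */
	replay_commit_prepared(&sub, commit_lsn);
	return sub.committed;
}
```

With the prepare tagged with the commit LSN,
commit_survives_restart(200, 200) is false; with the remembered
prepare LSN, commit_survives_restart(100, 200) is true.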

Now, after fixing this, I discovered another issue which is that we
allow adding a new snapshot to prepared transactions via
SnapBuildDistributeNewCatalogSnapshot. We can only allow it to get
added to in-progress transactions. If you comment out the changes
added in SnapBuildDistributeNewCatalogSnapshot then you will notice
one test failure which indicates this problem. This problem was not
evident before the bug-fix in the previous paragraph because you were
using commit-lsn even for the prepare and newly added snapshot change
appears to be before the end_lsn.
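
The intended rule, in a deliberately simplified form, looks something
like the sketch below. ToyTxn and distribute_snapshot are made-up
names for illustration; the real change lives in
SnapBuildDistributeNewCatalogSnapshot and works on ReorderBuffer
transactions, which this does not attempt to reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal view of a transaction tracked during decoding. */
typedef struct ToyTxn
{
	bool		prepared;		/* already prepared, no longer in progress */
	int			nsnapshots;		/* snapshot changes queued so far */
} ToyTxn;

/*
 * Queue a new catalog snapshot to in-progress transactions only.
 * Returns how many transactions received it.
 */
static int
distribute_snapshot(ToyTxn *txns, size_t ntxns)
{
	int			queued = 0;

	assert(txns != NULL || ntxns == 0);

	for (size_t i = 0; i < ntxns; i++)
	{
		if (txns[i].prepared)
			continue;			/* prepared xacts must not get new snapshots */
		txns[i].nsnapshots++;
		queued++;
	}
	return queued;
}
```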

Some other assorted changes in various patches:
v31-0001-Extend-the-output-plugin-API-to-allow-decoding-o
1. I have changed the filter_prepare API to match the signature with
FilterByOrigin. I don't see the need for ReorderBufferTxn or xid in
the API.
2. I have expanded the documentation of 'Begin Prepare Callback' to
explain how a user can use it to detect already prepared transactions
and in which scenarios that can happen.
3. Added a few comments in the code atop begin_prepare_cb_wrapper to
explain why we are adding this new API.
4. Move the check whether the filter_prepare callback is defined from
filter_prepare_cb_wrapper to caller. This is similar to how
FilterByOrigin works.
5. Fixed various whitespace and cosmetic issues.
6. Update commit message to include two of the newly added APIs

v31-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer
1. Changed the variable names and comments in DecodeXactOp.
2. A new API for FilterPrepare similar to FilterByOrigin and use that
instead of ReorderBufferPrepareNeedSkip.
3. In DecodeCommit, we need to update the reorderbuffer about the
surviving subtransactions for both ReorderBufferFinishPrepared and
ReorderBufferCommit because now both can process the transaction.
4. Because we now need to remember the prepare info even when we skip
it, I have simplified the ReorderBufferPrepare API by removing the
extra parameters, as that information is now available via
ReorderBufferTxn.
5. Updated comments at various places.

v31-0006-Support-2PC-txn-pgoutput
1. Added Asserts in streaming APIs on the subscriber-side to ensure
that we should never reach there for the already prepared transaction
case. We never need to stream the changes when we are skipping prepare
either because the snapshot was not consistent by that time or we have
already sent those changes before restart. Added the same Assert in
Begin and Commit routines because while skipping prepared txn, we must
not receive the changes of any other xact.
2.
+ /*
+ * Flags are determined from the state of the transaction. We know we
+ * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+ * it's already marked as committed then it has to be COMMIT PREPARED (and
+ * likewise for abort / ROLLBACK PREPARED).
+ */
+ if (rbtxn_commit_prepared(txn))
+ flags = LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+ else
+ flags = LOGICALREP_IS_PREPARE;

I don't like clubbing three different operations under one message
LOGICAL_REP_MSG_PREPARE. It looks awkward to use new flags
RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED in ReorderBuffer so
that we can recognize these operations in corresponding callbacks. I
think setting any flag in ReorderBuffer should not dictate the
behavior in callbacks. Also, there are a few things that are not
common to those APIs: the patch has an Assert that the txn is marked
with the prepare flag for all three operations, which I think is not
true for Rollback Prepared after a restart, because we don't ensure
that the Prepare flag is set if the Rollback Prepared happens after
the restart. On top of that, we have to introduce separate flags to
distinguish prepare/commit prepared/rollback prepared when they are
sent as protocol messages. All these operations are mutually
exclusive, so it is better to send a separate message for each of
them, and I have changed it accordingly in the attached patch.

3. The patch has a duplicate code to send replication origins. I have
moved the common code to a separate function.
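
As a rough illustration of the point in item 2 (one message per
operation instead of one message plus flags), consider the sketch
below. The enum names and the state machine are hypothetical and do
not reflect the actual protocol byte values or the apply worker code;
the only thing it tries to show is that each operation can carry its
own preconditions once it has its own message.

```c
#include <assert.h>

/* Hypothetical message tags, one per mutually exclusive operation. */
typedef enum TwoPhaseMsg
{
	MSG_PREPARE,
	MSG_COMMIT_PREPARED,
	MSG_ROLLBACK_PREPARED
} TwoPhaseMsg;

/* Toy state of one GID as seen by the receiver. */
typedef enum GidState
{
	GID_UNKNOWN,
	GID_PREPARED,
	GID_COMMITTED,
	GID_ROLLED_BACK
} GidState;

/* Dispatch on the message type; no shared "is prepared" assertion. */
static GidState
apply_twophase_msg(GidState current, TwoPhaseMsg msg)
{
	assert(msg >= MSG_PREPARE && msg <= MSG_ROLLBACK_PREPARED);

	switch (msg)
	{
		case MSG_PREPARE:
			return GID_PREPARED;
		case MSG_COMMIT_PREPARED:
			/* in this model, only meaningful once the GID is prepared */
			return (current == GID_PREPARED) ? GID_COMMITTED : current;
		case MSG_ROLLBACK_PREPARED:
			/* accepted even if the prepare was never seen, e.g. after restart */
			return GID_ROLLED_BACK;
	}
	return current;
}
```

With a single combined message, the handler would first have to
decode a flag and could not easily state different preconditions for
commit and rollback.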

v31-0009-Support-2PC-txn-Subscription-option
1.
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 /* yyyymmddN */
-#define CATALOG_VERSION_NO 202011251
+#define CATALOG_VERSION_NO 202011271

No need to change catversion as this gets changed frequently and that
leads to conflict in the patch. We can change it either in the final
version or normally committers take care of this. If you want to
remember it, maybe adding a line for it in the commit message should
be okay. For now, I have removed this from the patch.

--
With Regards,
Amit Kapila.

Attachments:

v31-0009-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 4041c01338b36e3d53114a5da855e1b3322972e4 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Mon, 14 Dec 2020 12:07:14 +0530
Subject: [PATCH v31 9/9] Support 2PC txn - Subscription option.

This patch implements new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.

Note: The tablesync worker slot always has two_phase disabled, regardless of the option.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 13 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 200 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index db5e59f..dbe2a43 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -166,8 +166,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..0d233f4e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,19 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. When two-phase commit is not
+          enabled then PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED are not
+          decoded on the publisher.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index ca78d39..886839e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -67,6 +67,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c21..5f4e191 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1149,7 +1149,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 1696454..b0745d5 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -64,7 +64,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -105,6 +106,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -210,6 +216,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -355,6 +370,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -379,7 +396,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -447,6 +465,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -720,6 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -730,7 +751,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -769,6 +791,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -787,7 +816,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -832,7 +862,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -875,7 +906,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 24f8b3e..1f404cd 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 103e5f0..0c39ce6 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2795,6 +2795,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		(!am_tablesync_worker() && newsub->twophase != MySubscription->twophase) ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3441,6 +3442,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase && !am_tablesync_worker();
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 95d0948..dcad69f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 673a670..cb707bf 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4221,6 +4221,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4264,9 +4265,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4287,6 +4293,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4312,6 +4319,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4380,6 +4389,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 317bb83..22e4e6c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -629,6 +629,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 14150d0..47306a2 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5997,7 +5997,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6023,13 +6023,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3fa02af..e07eed0 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -53,6 +53,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -90,6 +92,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 915c921..29f4eaf 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39..f96c891 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 2fa9bce..23d876e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,42 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 14fa0b2..2a0b366 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -147,6 +147,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
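
Not part of the patch: the pgoutput startup check at the top of this diff (proto_version too old for two-phase, or an output plugin without two-phase support) can be sketched standalone as below. This is only an illustration under stated assumptions. The function name, the result enum, and the "requested" flag are hypothetical, and the condition guarding the first error is not visible in the quoted hunk, so it is assumed here to be "two-phase was requested". The real code raises ereport(ERROR, ...) rather than returning a code.

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirror LOGICALREP_PROTO_MIN_VERSION_NUM and
 * LOGICALREP_PROTO_2PC_VERSION_NUM as defined in this patch. */
#define PROTO_MIN_VERSION_NUM 1
#define PROTO_2PC_VERSION_NUM 2

/* Hypothetical result codes standing in for the ereport(ERROR) calls. */
typedef enum TwoPhaseCheck
{
	TWOPHASE_OK,
	TWOPHASE_NOT_REQUESTED,
	TWOPHASE_ERR_PROTO_TOO_OLD,
	TWOPHASE_ERR_PLUGIN_UNSUPPORTED
} TwoPhaseCheck;

/*
 * Decide whether two-phase decoding can be enabled.
 *
 * requested       - subscriber asked for two_phase (assumed guard condition)
 * proto_version   - protocol version requested by the subscriber
 * plugin_supports - stands in for ctx->twophase in the patch
 */
TwoPhaseCheck
check_two_phase(bool requested, int proto_version, bool plugin_supports)
{
	/* The caller is assumed to have range-checked the version already. */
	assert(proto_version >= PROTO_MIN_VERSION_NUM);

	if (!requested)
		return TWOPHASE_NOT_REQUESTED;
	if (proto_version < PROTO_2PC_VERSION_NUM)
		return TWOPHASE_ERR_PROTO_TOO_OLD;
	if (!plugin_supports)
		return TWOPHASE_ERR_PLUGIN_UNSUPPORTED;
	return TWOPHASE_OK;
}
```

The version check comes first, matching the order of the two errors in the hunk, so a subscriber on an old protocol version sees the version message even if the plugin also lacks two-phase support.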

Attachment: v31-0001-Extend-the-output-plugin-API-to-allow-decoding-o.patch (application/octet-stream)
From 316bc63e3a05d66a5fff59e0aed2d584bac9c020 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 12 Dec 2020 16:41:33 +0530
Subject: [PATCH v31 1/4] Extend the output plugin API to allow decoding of
 prepared xacts.

This adds six methods to the output plugin API, adding support for
streaming changes of two-phase transactions at prepare time.

* begin_prepare
* filter_prepare
* prepare
* commit_prepared
* rollback_prepared
* stream_prepare

Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction is not yet committed and may be
aborted later.

Until now two-phase transactions were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the
two-phase commands were communicated to the subscriber.

This patch provides the infrastructure for logical decoding plugins to be
informed of two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

This also extends the 'test_decoding' plugin, implementing these new
methods.

This commit only adds the new APIs; the upcoming patch to "allow
the decoding at prepare time in ReorderBuffer" will use them.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
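
For illustration only (not part of the patch): the GID filter that this patch adds to test_decoding, pg_decode_filter_prepare, reduces to a substring test on the GID. Below is a minimal standalone sketch. The name gid_is_filtered is hypothetical, and the LogicalDecodingContext argument is dropped so the snippet compiles without the PostgreSQL headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true when a prepared transaction with this GID should be filtered
 * out, i.e. not decoded at PREPARE time. As in the test_decoding example,
 * any GID containing the substring "_nodecode" is filtered.
 */
bool
gid_is_filtered(const char *gid)
{
	assert(gid != NULL);

	return strstr(gid, "_nodecode") != NULL;
}
```

A GID such as 'test_prepared_nodecode' would be filtered, while 'test_prepared1' would be decoded at prepare time. A real plugin is free to apply any other rule here, as the comment in the patch notes.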
 contrib/test_decoding/test_decoding.c     | 164 ++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 165 +++++++++++-
 src/backend/replication/logical/logical.c | 295 ++++++++++++++++++++++
 src/include/replication/logical.h         |   6 +
 src/include/replication/output_plugin.h   |  55 ++++
 src/include/replication/reorderbuffer.h   |  43 ++++
 src/tools/pgindent/typedefs.list          |  12 +
 7 files changed, 733 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278beb5..cffa71635c 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -76,6 +76,19 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 const char *gid);
+static void pg_decode_begin_prepare_txn(LogicalDecodingContext *ctx,
+								ReorderBufferTXN *txn);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 static void pg_decode_stream_start(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn);
 static void pg_output_stream_start(LogicalDecodingContext *ctx,
@@ -87,6 +100,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -123,9 +139,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->begin_prepare_cb = pg_decode_begin_prepare_txn;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
@@ -141,6 +163,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -241,6 +264,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +285,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +354,109 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* BEGIN PREPARE callback */
+static void
+pg_decode_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata =
+	MemoryContextAllocZero(ctx->context, sizeof(TestDecodingTxnData));
+
+	txndata->xact_wrote_changes = false;
+	txn->output_plugin_private = txndata;
+
+	if (data->skip_empty_xacts)
+		return;
+
+	pg_output_begin(ctx, data, txn, true);
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate a
+ * simple logic by checking the GID. If the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -701,6 +838,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 	OutputPluginWrite(ctx, true);
 }
 
+static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
 static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037fac..180699afbc 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,9 +389,15 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodeBeginPrepareCB begin_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +419,20 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits,
+    which allows changes to be decoded at <command>PREPARE TRANSACTION</command> time.
+    The <function>begin_prepare_cb</function>, <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +493,15 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too. We will skip all the changes of such a transaction once
+     the abort is detected and abort the transaction when we read WAL for
+     <command>ROLLBACK PREPARED</command>.
     </para>
 
     <note>
@@ -587,7 +611,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or another uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -685,7 +715,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or another uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -698,6 +734,104 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decoding
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents as for the
+      other callbacks. The <parameter>gid</parameter> is the identifier that later
+      identifies this transaction for <command>COMMIT PREPARED</command> or
+      <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given
+      <parameter>gid</parameter> every time it is called.
+     </para>
+     </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-begin-prepare">
+     <title>Transaction Begin Prepare Callback</title>
+
+     <para>
+      The required <function>begin_prepare_cb</function> callback is called
+      whenever the start of a prepared transaction has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback to
+      check if the plugin has already received this prepare, in which case it
+      can skip the remaining changes of the transaction. This can only happen
+      if the user restarts the decoding after receiving the prepare for a
+      transaction but before receiving the commit prepared, say because of some
+      error.
+      <programlisting>
+       typedef void (*LogicalDecodeBeginPrepareCB) (struct LogicalDecodingContext *ctx,
+                                                    ReorderBufferTXN *txn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callback for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr prepare_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called
+      whenever a transaction commit prepared has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                      ReorderBufferTXN *txn,
+                                                      XLogRecPtr commit_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called
+      whenever a transaction rollback prepared has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                        ReorderBufferTXN *txn,
+                                                        XLogRecPtr rollback_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-start">
      <title>Stream Start Callback</title>
      <para>
@@ -735,6 +869,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1060,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using
+    the <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, then committed using the
+    <function>commit_prepared_cb</function> callback or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f1f4df7d70..ffc0a45b0e 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,13 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +81,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -237,11 +246,37 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
 	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
 
+
+	/*
+	 * To support two-phase logical decoding, we require the
+	 * begin_prepare/prepare/commit_prepared/rollback_prepared callbacks. The
+	 * filter_prepare callback is optional. However, we enable two-phase
+	 * logical decoding when at least one of these callbacks is set, so that
+	 * we can easily identify missing callbacks.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.begin_prepare_cb != NULL) ||
+		(ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
+	 * Callbacks to support decoding at prepare time.
+	 */
+	ctx->reorder->begin_prepare = begin_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -782,6 +817,184 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+/*
+ * The functionality of begin_prepare is quite similar to begin, with the
+ * exception that this will have gid (global transaction id) information which
+ * can be used by the plugin. We considered extending the existing begin
+ * callback, but that would break the replication protocol; a separate
+ * callback also looks cleaner.
+ */
+static void
+begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "begin_prepare";
+	state.report_location = txn->first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->first_lsn;
+
+	/*
+	 * If the plugin supports two-phase commits then begin prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.begin_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires begin_prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.begin_prepare_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then rollback prepared
+	 * callback is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
@@ -859,6 +1072,45 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+bool
+filter_prepare_cb_wrapper(LogicalDecodingContext *ctx, const char *gid)
+{
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case, all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
 bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -1056,6 +1308,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7ee02..56e3e0e803 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -84,6 +84,11 @@ typedef struct LogicalDecodingContext
 	 */
 	bool		streaming;
 
+	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
 	/*
 	 * State for writing output.
 	 */
@@ -120,6 +125,7 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 												  XLogRecPtr restart_lsn);
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
+extern bool filter_prepare_cb_wrapper(LogicalDecodingContext *ctx, const char *gid);
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
 extern void ResetLogicalStreamingState(void);
 extern void UpdateDecodingStats(LogicalDecodingContext *ctx);
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796450..72410156ec 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -99,6 +99,44 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
  */
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
+/*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/rollback_prepared callbacks or wait till COMMIT PREPARED
+ * and be sent as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  const char *gid);
+
+/*
+ * Callback called for every BEGIN of a prepared transaction.
+ */
+typedef void (*LogicalDecodeBeginPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
+
 /*
  * Called when starting to stream a block of changes from in-progress
  * transaction (may be called repeatedly, if it's streamed in multiple
@@ -123,6 +161,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
@@ -173,10 +219,19 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+
+	/* streaming of changes at prepare time */
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodeBeginPrepareCB begin_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
+
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7ec67..5fd604919e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -244,6 +244,9 @@ typedef struct ReorderBufferTXN
 	/* Xid of top-level transaction, if known */
 	TransactionId toplevel_xid;
 
+	/* In case of two-phase commit we need to pass GID to output plugin */
+	char	   *gid;
+
 	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
@@ -418,6 +421,30 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* begin prepare callback signature */
+typedef void (*ReorderBufferBeginPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn);
+
+typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  TransactionId xid,
+											  const char *gid);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* start streaming transaction callback signature */
 typedef void (*ReorderBufferStreamStartCB) (
 											ReorderBuffer *rb,
@@ -436,6 +463,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -504,12 +537,22 @@ struct ReorderBuffer
 	ReorderBufferCommitCB commit;
 	ReorderBufferMessageCB message;
 
+	/*
+	 * Callbacks to be called when streaming a transaction at prepare time.
+	 */
+	ReorderBufferFilterPrepareCB filter_prepare;
+	ReorderBufferBeginPrepareCB begin_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
+
 	/*
 	 * Callbacks to be called when streaming a transaction.
 	 */
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a9dca717a6..e82b4f7fe0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1315,9 +1315,21 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodeBeginPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
2.28.0.windows.1
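
For readers following along, the enablement rule in StartupDecodingContext() above (two-phase decoding is switched on as soon as any one of the callbacks is set, and a missing required callback is only reported later by its wrapper) can be sketched in standalone C. This is only an illustration under stated assumptions: the Sketch* types and sketch_* functions below are hypothetical stand-ins, not the real OutputPluginCallbacks API, and the eager "first missing" check is a simplification of the lazy per-wrapper ereport() checks in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PostgreSQL callback types (not the real API). */
typedef void (*SketchVoidCB) (void);
typedef bool (*SketchFilterCB) (const char *gid);

typedef struct SketchCallbacks
{
	SketchFilterCB filter_prepare_cb;	/* optional */
	SketchVoidCB begin_prepare_cb;		/* required once two-phase is on */
	SketchVoidCB prepare_cb;			/* required once two-phase is on */
	SketchVoidCB commit_prepared_cb;	/* required once two-phase is on */
	SketchVoidCB rollback_prepared_cb;	/* required once two-phase is on */
	SketchVoidCB stream_prepare_cb;		/* needed only when also streaming */
} SketchCallbacks;

/* Same shape as the ctx->twophase computation: any one callback enables it. */
bool
sketch_twophase_enabled(const SketchCallbacks *cb)
{
	return cb->begin_prepare_cb != NULL ||
		cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/*
 * Name of the first required callback that is missing, or NULL if two-phase
 * is disabled or all four non-streaming callbacks are present. The patch
 * performs this check lazily, inside each wrapper, rather than up front.
 */
const char *
sketch_first_missing(const SketchCallbacks *cb)
{
	if (!sketch_twophase_enabled(cb))
		return NULL;
	if (cb->begin_prepare_cb == NULL)
		return "begin_prepare_cb";
	if (cb->prepare_cb == NULL)
		return "prepare_cb";
	if (cb->commit_prepared_cb == NULL)
		return "commit_prepared_cb";
	if (cb->rollback_prepared_cb == NULL)
		return "rollback_prepared_cb";
	return NULL;
}

/* Trivial callbacks used to populate the struct in an example. */
void
sketch_noop(void)
{
}

bool
sketch_filter_none(const char *gid)
{
	(void) gid;
	return false;				/* false means "do not filter this gid" */
}
```

Setting only filter_prepare_cb already flips the flag, and sketch_first_missing() then reports begin_prepare_cb. That is the "easily identify missing methods" behaviour the comment in the patch describes.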

v31-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch (application/octet-stream)
From d7a86a96f16662d966c8b1348b87f5a235adbb42 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 11 Dec 2020 08:59:47 +0530
Subject: [PATCH v31 2/4] Allow decoding at prepare time in ReorderBuffer.

This patch allows PREPARE-time decoding of two-phase transactions (if the
output plugin supports this capability), in which case the transactions
are replayed at PREPARE and then committed later when COMMIT PREPARED
arrives.

Now that we decode the changes before the commit, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We detect such failures with a special sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK introduced by commit 7259736a6e and stop
decoding the remaining changes. Then we roll back the changes when rollback
prepared is encountered.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, Arseny Sher, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/decode.c      | 282 ++++++++++--
 .../replication/logical/reorderbuffer.c       | 419 +++++++++++++++---
 src/backend/replication/logical/snapbuild.c   |   7 +
 src/include/replication/reorderbuffer.h       |  39 +-
 4 files changed, 636 insertions(+), 111 deletions(-)
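
As a rough illustration of the control flow described in the commit message above (stop decoding the remaining changes once a concurrent abort is detected, but treat any other failure as a real error), here is a standalone C sketch. It is an assumption-laden model, not the ReorderBuffer code: the patch relies on PostgreSQL's error machinery and ERRCODE_TRANSACTION_ROLLBACK, whereas the SketchResult codes, SketchReplay struct, and sketch_* functions below are hypothetical names that stand in for that machinery using plain return values.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes standing in for PostgreSQL error codes. */
typedef enum SketchResult
{
	SKETCH_OK,
	SKETCH_TXN_ROLLBACK,		/* models ERRCODE_TRANSACTION_ROLLBACK */
	SKETCH_OTHER_ERROR
} SketchResult;

/* Hypothetical per-change callback; arg is opaque caller state. */
typedef SketchResult (*SketchChangeCB) (int change_no, void *arg);

typedef struct SketchReplay
{
	int			applied;			/* changes passed to the plugin without error */
	bool		concurrent_abort;	/* a concurrent rollback was detected */
	bool		failed;				/* some unrelated error occurred */
} SketchReplay;

/*
 * Replay nchanges changes through cb. On a concurrent abort, stop quietly and
 * skip the rest; on any other error, stop and flag the failure.
 */
SketchReplay
sketch_replay_changes(int nchanges, SketchChangeCB cb, void *arg)
{
	SketchReplay r = {0, false, false};

	for (int i = 0; i < nchanges; i++)
	{
		SketchResult res = cb(i, arg);

		if (res == SKETCH_TXN_ROLLBACK)
		{
			r.concurrent_abort = true;
			break;
		}
		if (res != SKETCH_OK)
		{
			r.failed = true;
			break;
		}
		r.applied++;
	}
	return r;
}

/* Example callback: reports a concurrent rollback at the index stored in *arg. */
SketchResult
sketch_abort_at(int change_no, void *arg)
{
	int			abort_at = *(int *) arg;

	return change_no == abort_at ? SKETCH_TXN_ROLLBACK : SKETCH_OK;
}
```

With five changes and an abort at index 2, the first two changes are delivered and the rest are skipped. In the patch, the cleanup of the transaction itself is deferred until the ROLLBACK PREPARED record is read from WAL.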

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee99b8..d7f480ad8f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,13 +67,24 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool two_phase);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool two_phase);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 
+/* helper functions for decoding transactions */
+static inline bool FilterPrepare(LogicalDecodingContext *ctx, const char *gid);
+static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx,
+							  XLogRecordBuffer *buf, Oid dbId,
+							  RepOriginId origin_id);
+
 /*
  * Take every XLogReadRecord()ed record and perform the actions required to
  * decode it using the output plugin already setup in the logical decoding
@@ -244,6 +255,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		two_phase = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +265,15 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+					two_phase = !(FilterPrepare(ctx, parsed.twophase_gid));
+
+				DecodeCommit(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +282,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		two_phase = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +292,15 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+					two_phase = !(FilterPrepare(ctx, parsed.twophase_gid));
+
+				DecodeAbort(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,37 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (FilterPrepare(ctx, parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -520,6 +569,23 @@ DecodeHeapOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	}
 }
 
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+static inline bool
+FilterPrepare(LogicalDecodingContext *ctx, const char *gid)
+{
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (ctx->callbacks.filter_prepare_cb == NULL)
+		return false;
+
+	return filter_prepare_cb_wrapper(ctx, gid);
+}
+
 static inline bool
 FilterByOrigin(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -582,10 +648,15 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'two_phase' indicates that caller wants to process the transaction in two
+ * phases, first process prepare if not already done and then process
+ * commit_prepared.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool two_phase)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -606,15 +677,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * the reorderbuffer to forget the content of the (sub-)transactions
 	 * if not.
 	 *
-	 * There can be several reasons we might not be interested in this
-	 * transaction:
-	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
-	 * 2) The transaction happened in another database.
-	 * 3) The output plugin is not interested in the origin.
-	 * 4) We are doing fast-forwarding
-	 *
 	 * We can't just use ReorderBufferAbort() here, because we need to execute
 	 * the transaction's invalidations.  This currently won't be needed if
 	 * we're just skipping over the transaction because currently we only do
@@ -627,9 +689,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * relevant syscaches.
 	 * ---
 	 */
-	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
-		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
-		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
 	{
 		for (i = 0; i < parsed->nsubxacts; i++)
 		{
@@ -647,34 +707,161 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Send the final commit record if the transaction data is already decoded,
+	 * otherwise, process the entire transaction.
+	 */
+	if (two_phase)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ *
+ * Note that we don't skip prepare even if we have detected concurrent abort.
+ * The reason is that we may have already sent some changes before we
+ * detected the abort, in which case we need to abort those changes in the
+ * subscriber. To abort such changes, we send the prepare and then the
+ * rollback prepared, which is what happened on the publisher side as well.
+ * We could invent a new abort API wherein in such cases we send abort and
+ * skip sending prepared and rollback prepared, but that is not
+ * straightforward because we might have streamed this transaction by that
+ * time, in which case it is handled when the rollback is encountered. It is
+ * not impossible to optimize the concurrent abort case, but it can introduce
+ * design complexity w.r.t. handling different cases, so we leave it for now
+ * as it doesn't seem worth it.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	SnapBuild *builder = ctx->snapshot_builder;
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz prepare_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		prepare_time = parsed->origin_timestamp;
+
+	/*
+	 * Remember the prepare info for a txn so that it can be used later in
+	 * commit prepared if required. See ReorderBufferFinishPrepared.
+	 */
+	if (!ReorderBufferRememberPrepareInfo(ctx->reorder, xid, buf->origptr,
+										  buf->endptr, prepare_time, origin_id,
+										  origin_lsn))
+		return;
+
+	/* We can't start streaming unless a consistent state is reached. */
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
+	{
+		ReorderBufferSkipPrepare(ctx->reorder, xid);
+		return;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See DecodeTXNNeedSkip
+	 * for the reasons why we sometimes want to skip the transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed, removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache invalidations
+	 * if there are any for the reasons mentioned in DecodeCommit.
+	 */
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
+	{
+		ReorderBufferSkipPrepare(ctx->reorder, xid);
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/* Tell the reorderbuffer about the surviving subtransactions. */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'two_phase' indicates that we should finish a prepared transaction.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool two_phase)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz abort_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool		skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		origin_lsn = parsed->origin_lsn;
+		abort_time = parsed->origin_timestamp;
 	}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	/*
+	 * Check whether we need to process this transaction. See DecodeTXNNeedSkip
+	 * for the reasons why we sometimes want to skip the transaction.
+	 */
+	skip_xact = DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id);
+
+	/*
+	 * Send the final rollback record for a prepared transaction unless we need to
+	 * skip it. For non-two-phase xacts, simply forget the xact.
+	 */
+	if (two_phase && !skip_xact)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									abort_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
+	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
+
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
@@ -1080,3 +1267,24 @@ DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tuple)
 	header->t_infomask2 = xlhdr.t_infomask2;
 	header->t_hoff = xlhdr.t_hoff;
 }
+
+/*
+ * Check whether we are interested in this specific transaction.
+ *
+ * There can be several reasons we might not be interested in this
+ * transaction:
+ * 1) We might not be interested in decoding transactions up to this
+ *	  LSN. This can happen because we previously decoded it and now just
+ *	  are restarting or if we haven't assembled a consistent snapshot yet.
+ * 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
+ * 4) We are doing fast-forwarding
+ */
+static bool
+DecodeTXNNeedSkip(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+				  Oid txn_dbid, RepOriginId origin_id)
+{
+	return (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+			(txn_dbid != InvalidOid && txn_dbid != ctx->slot->data.database) ||
+			ctx->fast_forward || FilterByOrigin(ctx, origin_id));
+}
\ No newline at end of file
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 15dc51a94d..7db89d5593 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,18 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or decoding them at PREPARE. Keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots.
+ *
+ * We additionally remove tuplecids after decoding the transaction at prepare
+ * time as we only need to perform invalidation at rollback or commit prepared.
+ *
+ * 'txn_prepared' indicates that we have decoded the transaction at prepare
+ * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1552,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1586,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1757,9 +1794,10 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * If the transaction was (partially) streamed, we need to commit it in a
- * 'streamed' way.  That is, we first stream the remaining part of the
- * transaction, and then invoke stream_commit message.
+ * If the transaction was (partially) streamed, we need to prepare or commit
+ * it in a 'streamed' way.  That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_prepare or stream_commit message as per
+ * the case.
  */
 static void
 ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1769,29 +1807,49 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		/*
+		 * Note, we send stream prepare even if a concurrent abort is detected.
+		 * See DecodePrepare for more information.
+		 */
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPARED, so now
+		 * just truncate txn by removing changes and tuple_cids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
  * Set xid to detect concurrent aborts.
  *
- * While streaming an in-progress transaction there is a possibility that the
- * (sub)transaction might get aborted concurrently.  In such case if the
- * (sub)transaction has catalog update then we might decode the tuple using
- * wrong catalog version.  For example, suppose there is one catalog tuple with
- * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
- * and after that we will have two tuples (xmin: 500, xmax: 501) and
- * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
- * say 502 updates the same catalog tuple then the first tuple will be changed
- * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
- * the tuple inserted/updated in 501 after the catalog update, we will see the
- * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
- * consider that the tuple is deleted by xid 502 which is not visible to our
- * snapshot.  And when we will try to decode with that catalog tuple, it can
- * lead to a wrong result or a crash.  So, it is necessary to detect
- * concurrent aborts to allow streaming of in-progress transactions.
+ * While streaming an in-progress transaction or decoding a prepared
+ * transaction there is a possibility that the (sub)transaction might get
+ * aborted concurrently.  In such case if the (sub)transaction has catalog
+ * update then we might decode the tuple using wrong catalog version.  For
+ * example, suppose there is one catalog tuple with (xmin: 500, xmax: 0).  Now,
+ * the transaction 501 updates the catalog tuple and after that we will have
+ * two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).  Now, if 501 is
+ * aborted and some other transaction say 502 updates the same catalog tuple
+ * then the first tuple will be changed to (xmin: 500, xmax: 502).  So, the
+ * problem is that when we try to decode the tuple inserted/updated in 501
+ * after the catalog update, we will see the catalog tuple with (xmin: 500,
+ * xmax: 502) as visible because it will consider that the tuple is deleted by
+ * xid 502 which is not visible to our snapshot.  And when we will try to
+ * decode with that catalog tuple, it can lead to a wrong result or a crash.
+ * So, it is necessary to detect concurrent aborts to allow streaming of
+ * in-progress transactions or decoding of prepared transactions.
  *
  * For detecting the concurrent abort we set CheckXidAlive to the current
  * (sub)transaction's xid for which this change belongs to.  And, during
@@ -1800,7 +1858,10 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * and discard the already streamed changes on such an error.  We might have
  * already streamed some of the changes for the aborted (sub)transaction, but
  * that is fine because when we decode the abort we will stream abort message
- * to truncate the changes in the subscriber.
+ * to truncate the changes in the subscriber. Similarly, for prepared
+ * transactions, we stop decoding if concurrent abort is detected and then
+ * rollback the changes when rollback prepared is encountered. See
+ * DecodePrepare.
  */
 static inline void
 SetupCheckXidLive(TransactionId xid)
@@ -1902,7 +1963,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1914,15 +1975,19 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		specinsert = NULL;
 	}
 
-	/* Stop the stream. */
-	rb->stream_stop(rb, txn, last_lsn);
-
-	/* Remember the command ID and snapshot for the streaming run */
-	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	/*
+	 * For the streaming case, stop the stream and remember the command ID and
+	 * snapshot for the streaming run.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_stop(rb, txn, last_lsn);
+		ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	}
 }
 
 /*
- * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ * Helper function for ReorderBufferReplay and ReorderBufferStreamTXN.
  *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
@@ -1975,9 +2040,14 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		else
 			StartTransactionCommand();
 
-		/* We only need to send begin/commit for non-streamed transactions. */
+		/* We only need to send begin/begin-prepare for non-streamed transactions. */
 		if (!streaming)
-			rb->begin(rb, txn);
+		{
+			if (rbtxn_prepared(txn))
+				rb->begin_prepare(rb, txn);
+			else
+				rb->begin(rb, txn);
+		}
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -2008,8 +2078,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			prev_lsn = change->lsn;
 
-			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			/*
+			 * Set the current xid to detect concurrent aborts. This is
+			 * required for the cases when we decode the changes before the
+			 * COMMIT record is processed.
+			 */
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2300,7 +2374,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2334,15 +2417,22 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the four reasons:
+		 * 1. Decoding an in-progress txn.
+		 * 2. Decoding a prepared txn.
+		 * 3. Decoding of a prepared txn that was (partially) streamed.
+		 * 4. Decoding a committed txn.
+		 *
+		 * For 1, we allow truncation of txn data by removing the changes already
+		 * streamed but still keeping other things like invalidations, snapshot,
+		 * and tuplecids. For 2 and 3, we tell ReorderBufferTruncateTXN to do a
+		 * more elaborate truncation of txn data as the entire transaction has
+		 * been decoded except for commit. For 4, as the entire txn has been
+		 * decoded, we can fully clean up the TXN reorder buffer.
 		 */
-		if (streaming)
+		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
-
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2375,17 +2465,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2415,26 +2508,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
  * transaction with ReorderBufferAssignChild.
  *
- * This interface is called once a toplevel commit is read for both streamed
- * as well as non-streamed transactions.
+ * This interface is called once a prepare or toplevel commit is read for both
+ * streamed as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferReplay(ReorderBufferTXN *txn,
+					ReorderBuffer *rb, TransactionId xid,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2464,7 +2550,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	if (txn->base_snapshot == NULL)
 	{
 		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+
+		/*
+		 * Removing this txn before a commit might result in the computation
+		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
+		 */
+		if (!rbtxn_prepared(txn))
+			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
 
@@ -2475,6 +2567,168 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 							command_id, false);
 }
 
+/*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Record the prepare information for a transaction.
+ */
+bool
+ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+								 TimestampTz prepare_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return false;
+
+	/*
+	 * Remember the prepare information to be later used by commit prepared in
+	 * case we skip doing prepare.
+	 */
+	txn->final_lsn = prepare_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = prepare_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	return true;
+}
+
+/* Remember that we have skipped prepare */
+void
+ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
+}
+
+/*
+ * Prepare a two-phase transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	/* The prepare info must have been updated in txn by now. */
+	Assert(txn->final_lsn != InvalidXLogRecPtr);
+
+	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
+						txn->commit_time, txn->origin_id, txn->origin_lsn);
+}
+
+/*
+ * This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time, RepOriginId origin_id,
+							XLogRecPtr origin_lsn, char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/* add the gid in the txn */
+	txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+	strcpy(txn->gid, gid);
+
+	/*
+	 * It is possible that this transaction was not decoded at prepare time,
+	 * either because we didn't have a consistent snapshot by then or because
+	 * it was decoded earlier but we have since restarted. We can't distinguish
+	 * between those two cases, so we send the prepare in both and let the
+	 * downstream decide whether to process or skip it. For aborts, we don't
+	 * need to decode the xact if that has not been done already.
+	 */
+	if (!rbtxn_prepared(txn) && is_commit)
+	{
+		txn->txn_flags |= RBTXN_PREPARE;
+
+		/* The prepare info must have been updated in txn even if we skip prepare. */
+		Assert(txn->final_lsn != InvalidXLogRecPtr);
+
+		/*
+		 * By this time the txn has the prepare record information, and it is
+		 * important to use that so that the downstream gets accurate
+		 * information. If we passed the commit information here instead, the
+		 * downstream could behave as if it had already replayed the commit
+		 * prepared after the restart.
+		 */
+		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
+							txn->commit_time, txn->origin_id, txn->origin_lsn);
+	}
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	if (is_commit)
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
 /*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
@@ -2606,6 +2860,39 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 	ReorderBufferCleanupTXN(rb, txn);
 }
 
+/*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ *
+ * Note that this is a special-purpose function for prepared transactions where
+ * we don't want to clean up the TXN even when we decide to skip it. See
+ * DecodePrepare.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
 /*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 9d5d68f3fa..dc3ef7426a 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -834,6 +834,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
 			continue;
 
+		/*
+		 * We don't need to add snapshot to prepared transactions as they
+		 * should not see the new catalog contents.
+		 */
+		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+			continue;
+
 		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
 			 txn->xid, (uint32) (lsn >> 32), (uint32) lsn);
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5fd604919e..345534113a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -174,6 +174,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_SKIPPED_PREPARE     0x0100
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +235,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has the prepare for this transaction been skipped? */
+#define rbtxn_skip_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -255,10 +269,11 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	first_lsn;
 
 	/* ----
-	 * LSN of the record that lead to this xact to be committed or
+	 * LSN of the record that led to this xact to be prepared or committed or
 	 * aborted. This can be a
 	 * * plain commit record
 	 * * plain commit record, of a parent transaction
+	 * * prepared transaction
 	 * * prepared transaction commit
 	 * * plain abort record
 	 * * prepared transaction abort
@@ -290,7 +305,8 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	origin_lsn;
 
 	/*
-	 * Commit time, only known when we read the actual commit record.
+	 * Commit or Prepare time, only known when we read the actual commit or
+	 * prepare record.
 	 */
 	TimestampTz commit_time;
 
@@ -425,11 +441,6 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 typedef void (*ReorderBufferBeginPrepareCB) (ReorderBuffer *rb,
 											  ReorderBufferTXN *txn);
 
-typedef bool (*ReorderBufferFilterPrepareCB) (ReorderBuffer *rb,
-											  ReorderBufferTXN *txn,
-											  TransactionId xid,
-											  const char *gid);
-
 /* prepare callback signature */
 typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 										ReorderBufferTXN *txn,
@@ -540,7 +551,6 @@ struct ReorderBuffer
 	/*
 	 * Callbacks to be called when streaming a transaction at prepare time.
 	 */
-	ReorderBufferFilterPrepareCB filter_prepare;
 	ReorderBufferBeginCB begin_prepare;
 	ReorderBufferPrepareCB prepare;
 	ReorderBufferCommitPreparedCB commit_prepared;
@@ -627,12 +637,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -646,10 +662,17 @@ void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr l
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
 											   SharedInvalidationMessage *invalidations);
 void		ReorderBufferProcessXid(ReorderBuffer *, TransactionId xid, XLogRecPtr lsn);
+
 void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLogRecPtr lsn);
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
+											 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+											 TimestampTz prepare_time,
+											 RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
2.28.0.windows.1
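
For anyone skimming 0002, the interplay between the two new flag bits and the replay decision in ReorderBufferFinishPrepared() can be sketched in isolation roughly as below. This is only an illustration under stated assumptions: SketchTxn and the sketch_* functions are hypothetical stand-ins and not part of the patch, only the two flag values are copied from the reorderbuffer.h hunk, and 0 stands in for InvalidXLogRecPtr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as they appear in the reorderbuffer.h hunk of the patch. */
#define RBTXN_PREPARE          0x0080
#define RBTXN_SKIPPED_PREPARE  0x0100

/* Hypothetical, trimmed-down stand-in for ReorderBufferTXN. */
typedef struct SketchTxn
{
	uint32_t	txn_flags;
	uint64_t	final_lsn;		/* prepare LSN until the commit overwrites it */
} SketchTxn;

/* Mirrors the rbtxn_prepared() macro. */
bool
sketch_prepared(const SketchTxn *txn)
{
	return (txn->txn_flags & RBTXN_PREPARE) != 0;
}

/* Mirrors the rbtxn_skip_prepared() macro. */
bool
sketch_skip_prepared(const SketchTxn *txn)
{
	return (txn->txn_flags & RBTXN_SKIPPED_PREPARE) != 0;
}

/*
 * Returns true when the transaction contents would have to be replayed at
 * COMMIT PREPARED time, i.e. the prepare was not decoded earlier and this is
 * a commit rather than a rollback. Marks the txn as prepared in that case,
 * as the patch does before calling ReorderBufferReplay().
 */
bool
sketch_finish_needs_replay(SketchTxn *txn, bool is_commit)
{
	if (!sketch_prepared(txn) && is_commit)
	{
		/* The prepare info must have been remembered even if it was skipped. */
		assert(txn->final_lsn != 0);
		txn->txn_flags |= RBTXN_PREPARE;
		return true;
	}
	return false;
}
```

With this model, a transaction whose prepare was skipped is replayed once at COMMIT PREPARED, while a ROLLBACK PREPARED of an undecoded transaction replays nothing, which matches the comment in the patch that aborts need no decoding if it was not done already.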

Attachment: v31-0003-Support-2PC-txn-tests-for-test_decoding.patch (application/octet-stream)
From acbbd9991b837f40a32a01d22cbe92988b1a72fb Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 01:42:59 -0500
Subject: [PATCH v30] Support 2PC txn tests for test_decoding.

Add sql tests to test_decoding for 2PC.
---
 contrib/test_decoding/Makefile                     |   2 +-
 contrib/test_decoding/expected/two_phase.out       | 242 +++++++++++++++++++++
 .../test_decoding/expected/two_phase_stream.out    | 199 +++++++++++++++++
 contrib/test_decoding/sql/two_phase.sql            | 119 ++++++++++
 contrib/test_decoding/sql/two_phase_stream.sql     |  63 ++++++
 5 files changed, 624 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/two_phase.out
 create mode 100644 contrib/test_decoding/expected/two_phase_stream.out
 create mode 100644 contrib/test_decoding/sql/two_phase.sql
 create mode 100644 contrib/test_decoding/sql/two_phase_stream.sql
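
Test 7 in two_phase.sql and the last case in two_phase_stream.sql rely on GIDs containing "_nodecode" not being decoded at PREPARE time. A minimal sketch of such a check, assuming it is a plain substring match on the GID, might look like the following. The function name sketch_filter_prepare is hypothetical and is not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true when decoding at PREPARE time should be
 * skipped for this GID, so that the transaction is decoded as a regular
 * commit at COMMIT PREPARED time instead.
 */
bool
sketch_filter_prepare(const char *gid)
{
	assert(gid != NULL);

	/* Assumption: any GID containing "_nodecode" is filtered out. */
	return strstr(gid, "_nodecode") != NULL;
}
```

Under that assumption, 'test_prepared_nodecode' and 'test1_nodecode' are filtered, while 'test_prepared#1' is decoded at PREPARE as usual.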

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..2c4acdc 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -4,7 +4,7 @@ MODULES = test_decoding
 PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
-	decoding_into_rel binary prepared replorigin time messages \
+	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
diff --git a/contrib/test_decoding/expected/two_phase.out b/contrib/test_decoding/expected/two_phase.out
new file mode 100644
index 0000000..9d29e6e
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase.out
@@ -0,0 +1,242 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+ COMMIT PREPARED 'test_prepared#1'
+(5 rows)
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+ COMMIT PREPARED 'test_prepared#3'
+(4 rows)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ COMMIT PREPARED 'test_prepared_lock'
+(5 rows)
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+ COMMIT PREPARED 'test_prepared_savepoint'
+(4 rows)
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/two_phase_stream.out b/contrib/test_decoding/expected/two_phase_stream.out
new file mode 100644
index 0000000..a21fbc7
--- /dev/null
+++ b/contrib/test_decoding/expected/two_phase_stream.out
@@ -0,0 +1,199 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ PREPARE TRANSACTION 'test1'
+ COMMIT PREPARED 'test1'
+(23 rows)
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test2'
+(24 rows)
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+           data            
+---------------------------
+ ROLLBACK PREPARED 'test2'
+(1 row)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/two_phase.sql b/contrib/test_decoding/sql/two_phase.sql
new file mode 100644
index 0000000..4ed5266
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase.sql
@@ -0,0 +1,119 @@
+-- Test two-phased transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test 1:
+-- Test that commands in a two phase xact are only decoded at PREPARE.
+-- Decoding after COMMIT PREPARED should only have the COMMIT PREPARED command and not the
+-- rest of the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 2:
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 3:
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 4:
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 5:
+-- Check `CLUSTER` (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The call should return
+-- within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 6:
+-- Test savepoints and sub-xacts. Creating savepoints will create sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 7:
+-- test that a GID containing "_nodecode" gets decoded at commit prepared time
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/two_phase_stream.sql b/contrib/test_decoding/sql/two_phase_stream.sql
new file mode 100644
index 0000000..01510e4
--- /dev/null
+++ b/contrib/test_decoding/sql/two_phase_stream.sql
@@ -0,0 +1,63 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/ROLLBACK PREPARED
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test2';
+-- should show the inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+ROLLBACK PREPARED 'test2';
+-- should show the ROLLBACK PREPARED
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with filtered gid
+-- gids with '_nodecode' should not be handled as a two-phase commit.
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a PREPARE
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
-- 
1.8.3.1

v31-0004-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From 75e306eace493ab5b398d573fc07760e1b57886e Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:13:14 -0500
Subject: [PATCH v30] Support 2PC txn tests for concurrent aborts.

Add tap tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  58 ++++++++++
 src/backend/replication/logical/reorderbuffer.c   |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2c4acdc..49523fe 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the xid of the 2PC to the plugin as an option.
+# On receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for that transaction to be aborted.
+#
+# We will fire off a ROLLBACK from another session while this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the xid of the 2PC to the plugin as an option.
+# On receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for that transaction to be aborted.
+#
+# We will fire off a ROLLBACK from another session while this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 6330661..da1369f 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -174,6 +177,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -275,6 +279,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -471,6 +493,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -706,6 +755,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -918,6 +970,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -971,6 +1026,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index da54560..efce11b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2487,6 +2487,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
1.8.3.1
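
A side note on the "check-xid-aborted" option handling in pg_decode_startup() above: the patch calls strtoul() with a NULL end pointer, so a value such as '12abc' would be accepted as 12. That is probably acceptable for a test-only option, but a stricter parse could look like the standalone sketch below. This is not part of the patch: parse_xid_option is a hypothetical helper, and the TransactionId typedef and InvalidTransactionId macro are local stand-ins so the snippet compiles outside the PostgreSQL tree.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Local stand-ins for the PostgreSQL definitions (assumption of this sketch). */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/*
 * Hypothetical helper, not in the patch: parse an option string into an xid.
 * Returns true and stores the xid in *xid on success, false otherwise.
 */
static bool
parse_xid_option(const char *value, TransactionId *xid)
{
	char	   *end;
	unsigned long parsed;

	assert(xid != NULL);

	/* A missing or empty value is never a valid xid. */
	if (value == NULL || *value == '\0')
		return false;

	errno = 0;
	parsed = strtoul(value, &end, 0);

	/* Reject range errors, trailing garbage, and values beyond 32 bits. */
	if (errno != 0 || *end != '\0' || parsed > UINT32_MAX)
		return false;

	/* Zero is InvalidTransactionId, so it cannot name a real transaction. */
	if ((TransactionId) parsed == InvalidTransactionId)
		return false;

	*xid = (TransactionId) parsed;
	return true;
}
```

The only behavioural difference from the patch is the end-pointer and 32-bit range check; the errno test and the rejection of xid 0 follow what the patch already does.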

v31-0005-Support-2PC-txn-spoolfile.patch (application/octet-stream)
From 76c49f5c6ba7eb5e7c99feabd3c8a4e3473272ab Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 11 Dec 2020 14:31:11 +0530
Subject: [PATCH v31 3/4] Support 2PC txn - spoolfile.

This patch is a pure refactoring: it isolates the streaming spool-file processing in a separate function.
Later, the two-phase commit logic will need to call this common processing from multiple places.
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++--------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 8c7fad8f74..a4ec883c01 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -924,30 +926,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -955,7 +948,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -970,7 +963,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1045,6 +1038,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
2.28.0.windows.1
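
For anyone skimming the spoolfile patch above, the shape of the refactoring can be sketched in isolation: the replay loop moves into one helper that returns the number of changes applied, and each message handler calls it before doing its own handler-specific work. The snippet below is only an illustration under stated assumptions: the in-memory array stands in for the BufFile-backed spool file, and replay_spooled, handle_stream_commit and handle_stream_prepare are hypothetical names, not the functions in worker.c.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for one spooled change; worker.c reads these from a BufFile. */
typedef struct SpooledChange
{
	int			value;
} SpooledChange;

/* Running total standing in for the state modified by applying changes. */
static long applied_total = 0;

/*
 * Shared helper (hypothetical): replay every spooled change and return how
 * many were applied, mirroring the int return value of the refactored code.
 */
static int
replay_spooled(const SpooledChange *changes, size_t count)
{
	int			nchanges = 0;

	assert(changes != NULL || count == 0);

	for (size_t i = 0; i < count; i++)
	{
		applied_total += changes[i].value;
		nchanges++;
	}
	return nchanges;
}

/* Caller 1: commit path; commit-specific work would follow the replay. */
static int
handle_stream_commit(const SpooledChange *changes, size_t count)
{
	return replay_spooled(changes, count);
}

/* Caller 2 (hypothetical): a prepare path reusing the same replay step. */
static int
handle_stream_prepare(const SpooledChange *changes, size_t count)
{
	return replay_spooled(changes, count);
}
```

Returning the count, rather than logging it inside the helper, lets each caller emit its own debug message, which is what apply_handle_stream_commit does with nchanges in the patch.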

v31-0006-Support-2PC-txn-pgoutput.patch (application/octet-stream)
From dbd8b8268da7c295879fda1dda2b6d8bac3c5519 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 12 Dec 2020 16:52:28 +0530
Subject: [PATCH v31 4/4] Support 2PC txn - pgoutput.

This patch adds support in the pgoutput plugin and subscriber for handling
of two-phase commits.

Includes pgoutput changes.

Includes subscriber changes.
---
 src/backend/access/transam/twophase.c       |  33 +-
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 252 ++++++++++++++-
 src/backend/replication/logical/worker.c    | 330 ++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 161 ++++++++--
 src/include/access/twophase.h               |   1 +
 src/include/replication/logicalproto.h      |  57 +++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   1 +
 9 files changed, 815 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9bad9..00b4497c2d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -547,6 +547,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held)
 	ProcArrayAdd(&ProcGlobal->allProcs[gxact->pgprocno]);
 }
 
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID is around
+ */
+bool
+LookupGXact(const char *gid)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			found = true;
+			break;
+		}
+
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
+
 /*
  * LockGXact
  *		Locate the prepared transaction and mark it busy for COMMIT or PREPARE.
@@ -1133,9 +1160,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 15ab8e7204..dd33469645 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -957,8 +957,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb31182d7..6a99ade42f 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -105,6 +105,256 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	commit_data->committime = pq_getmsgint64(in);
 }
 
+/*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr rollback_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, rollback_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
 /*
  * Write ORIGIN to the output stream.
  */
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a4ec883c01..66bca8685f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* Set while skipping a transaction that was already prepared. */
+bool		skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -721,6 +737,264 @@ apply_handle_commit(StringInfo s)
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
+/*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after a restart, when upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &prepare_data);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	/*
+	 * A prepared transaction can be aborted concurrently while it is being
+	 * decoded. In that case we stop decoding and abort the transaction
+	 * immediately, but the ROLLBACK PREPARED message still reaches the
+	 * subscriber. It is therefore ok for the gid to be missing here; only
+	 * roll back if it exists.
+	 */
+	if (LookupGXact(prepare_data.gid))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations.
+	 * Similar code as for apply_handle_stream_commit (i.e. the non-two-phase
+	 * stream commit).
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared.
+	 * Similar code as for apply_handle_prepare (i.e. the two-phase
+	 * non-streamed prepare).
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
 /*
  * Handle ORIGIN message.
  *
@@ -752,6 +1026,12 @@ apply_handle_stream_start(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it will
@@ -799,6 +1079,12 @@ apply_handle_stream_stop(StringInfo s)
 {
 	Assert(in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
@@ -835,6 +1121,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1053,6 +1345,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1176,6 +1474,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1297,6 +1598,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1454,6 +1758,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1823,6 +2130,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1979,6 +2289,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997aed83..95d0948d85 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,14 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +65,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -338,27 +358,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -377,6 +378,65 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
 /*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
@@ -766,17 +826,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -856,6 +907,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 	cleanup_rel_sync_cache(txn->xid, true);
 }
 
+/*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
@@ -1175,3 +1244,31 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 	while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
 		entry->replicate_valid = false;
 }
+
+/* Send the replication origin, if requested. */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3445..b2628ea4e2 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -44,6 +44,7 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 extern void StartPrepare(GlobalTransaction gxact);
 extern void EndPrepare(GlobalTransaction gxact);
 extern bool StandbyTransactionIdIsPrepared(TransactionId xid);
+extern bool LookupGXact(const char *gid);
 
 extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p,
 												 int *nxids_p);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535df80..3fe88c8280 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,31 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure holds the
+ * data for PREPARE, COMMIT PREPARED and ROLLBACK PREPARED messages. For
+ * COMMIT PREPARED, prepare_lsn and preparetime store the commit lsn and
+ * commit time; for ROLLBACK PREPARED, they store the rollback lsn and
+ * rollback time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +160,23 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr rollback_lsn);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepPreparedTxnData *prepare_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +220,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 345534113a..79df04f5f8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e82b4f7fe0..cba9d8452f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,6 +1342,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
2.28.0.windows.1
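
The rbtxn_commit_prepared() and rbtxn_rollback_prepared() macros added to reorderbuffer.h in the patch above are plain bit tests on txn_flags. The following is a minimal standalone sketch of that pattern, not the patch's code: the Demo* names, the struct, and the flag values are all hypothetical and chosen only for illustration (the real RBTXN_* bits live in reorderbuffer.h and may differ).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical flag bits, for illustration only. The real
 * RBTXN_COMMIT_PREPARED / RBTXN_ROLLBACK_PREPARED values are defined in
 * reorderbuffer.h and are not reproduced here.
 */
#define DEMO_RBTXN_COMMIT_PREPARED		0x0040
#define DEMO_RBTXN_ROLLBACK_PREPARED	0x0080

/* Hypothetical stand-in for ReorderBufferTXN; only the flags field matters. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
} DemoTXN;

/* Same shape as the macros in the patch: mask the bit, compare against 0. */
#define demo_rbtxn_commit_prepared(txn) \
( \
	((txn)->txn_flags & DEMO_RBTXN_COMMIT_PREPARED) != 0 \
)

#define demo_rbtxn_rollback_prepared(txn) \
( \
	((txn)->txn_flags & DEMO_RBTXN_ROLLBACK_PREPARED) != 0 \
)

/*
 * Map the two flags to a human-readable outcome. A prepared transaction
 * is finished either by COMMIT PREPARED or by ROLLBACK PREPARED, never
 * both, so this sketch asserts that the bits are mutually exclusive.
 */
static const char *
demo_prepared_outcome(const DemoTXN *txn)
{
	assert(!(demo_rbtxn_commit_prepared(txn) &&
			 demo_rbtxn_rollback_prepared(txn)));

	if (demo_rbtxn_commit_prepared(txn))
		return "commit prepared";
	if (demo_rbtxn_rollback_prepared(txn))
		return "rollback prepared";
	return "pending";
}
```

A caller would set one of the bits when the corresponding record is decoded (for example `txn.txn_flags |= DEMO_RBTXN_COMMIT_PREPARED;`) and later branch on the predicate instead of open-coding the mask.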

v31-0007-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From bc332342c43ede664ab004a9b3332fe7c2e8ad4f Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:32:53 -0500
Subject: [PATCH v30] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and not streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming enabled.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
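
A note for readers checking the row counts asserted in the streaming tests above (3334, and 3335 where the extra row 99999 is inserted outside the 2PC transaction): the numbers follow from the generate_series/DELETE pattern used in each test. Below is a minimal Python sketch of that arithmetic. It is not part of the patch, and the function name expected_rows and its parameters are made up purely for illustration.

```python
def expected_rows(initial_keys, first, last, extra_keys=()):
    """Count rows left in test_tab after the test's 2PC transaction commits.

    Mirrors the SQL in the tests (illustrative model only):
      INSERT ... generate_series(first, last)
      UPDATE ... WHERE mod(a,2) = 0   -- changes values, not the row count
      DELETE ... WHERE mod(a,3) = 0
    extra_keys are rows inserted outside the 2PC transaction, so the
    DELETE inside the transaction never sees them.
    """
    rows = set(initial_keys)
    rows.update(range(first, last + 1))      # the INSERT
    rows = {a for a in rows if a % 3 != 0}   # the DELETE
    rows.update(extra_keys)                  # rows added by other transactions
    return len(rows)


if __name__ == "__main__":
    # Two pre-existing rows (1 and 2), then the 2PC transaction.
    print(expected_rows({1, 2}, 3, 5000))                      # 3334
    # Same, plus the outside INSERT of key 99999.
    print(expected_rows({1, 2}, 3, 5000, extra_keys={99999}))  # 3335
```

The outside DELETE of a = 5 in the "DELETE after PREPARE" test does not change the count, because that row was not yet committed when the delete ran, so 3334 is still expected there.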

v31-0008-Support-2PC-documentation.patch (application/octet-stream)
From e0bd07e4b0745751f10bd7071bd3600e6b5c4605 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:43:40 -0500
Subject: [PATCH v30] Support-2PC-documentation.

Add documentation about two-phase commit support in Logical Decoding.
---
 doc/src/sgml/logicaldecoding.sgml | 99 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 98 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 6262273..59d9dcc 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -165,7 +165,57 @@ COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
-  </sect1>
+
+  <para>
+  The following example shows how logical decoding can be used to handle transactions
+  that use two-phase commit. Before you use two-phase commit commands, you must set
+  <varname>max_prepared_transactions</varname> to at least 1. You must also set the
+  option 'two-phase-commit' to 1 when calling <function>pg_logical_slot_get_changes</function>.
+  </para>
+<programlisting>
+postgres=# BEGIN;
+postgres=*# INSERT INTO data(data) VALUES('5');
+postgres=*# PREPARE TRANSACTION 'test_prepared1';
+
+postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/1689DC0 | 529 | BEGIN 529
+ 0/1689DC0 | 529 | table public.data: INSERT: id[integer]:3 data[text]:'5'
+ 0/1689FC0 | 529 | PREPARE TRANSACTION 'test_prepared1', txid 529
+(3 rows)
+
+postgres=# COMMIT PREPARED 'test_prepared1';
+COMMIT PREPARED
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                    data                    
+-----------+-----+--------------------------------------------
+ 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
+(1 row)
+
+postgres=#-- you can also rollback a prepared transaction
+postgres=# BEGIN;
+BEGIN
+postgres=*# INSERT INTO data(data) VALUES('6');INSERT 0 1
+postgres=*# PREPARE TRANSACTION 'test_prepared2';PREPARE TRANSACTION
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/168A180 | 530 | BEGIN 530
+ 0/168A1E8 | 530 | table public.data: INSERT: id[integer]:4 data[text]:'6'
+ 0/168A430 | 530 | PREPARE TRANSACTION 'test_prepared2', txid 530
+(3 rows)
+
+postgres=# ROLLBACK PREPARED 'test_prepared1';ERROR:  prepared transaction with identifier "test_prepared1" does not exist
+postgres=# ROLLBACK PREPARED 'test_prepared2';
+ROLLBACK PREPARED
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                     data                     
+-----------+-----+----------------------------------------------
+ 0/168A4B8 | 530 | ROLLBACK PREPARED 'test_prepared2', txid 530
+(1 row)
+</programlisting>
+</sect1>
 
   <sect1 id="logicaldecoding-explanation">
    <title>Logical Decoding Concepts</title>
@@ -1121,4 +1171,51 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
    </para>
 
   </sect1>
+
+  <sect1 id="logicaldecoding-two-phase-commits">
+   <title>Two-phase commit support for Logical Decoding</title>
+
+   <para>
+   With the basic output plugin callbacks (eg., <function>begin_cb</function>,
+   <function>change_cb</function>, <function>commit_cb</function> and
+   <function>message_cb</function>) two-phase commit commands like
+   <command>PREPARE TRANSACTION</command>, <command>COMMIT PREPARED</command>
+   and <command>ROLLBACK PREPARED</command> are not decoded correctly.
+   While <command>PREPARE TRANSACTION</command> is ignored,
+   <command>COMMIT PREPARED</command> is decoded as a <command>COMMIT</command> and 
+   <command>ROLLBACK PREPARED</command> is decoded as a <command>ROLLBACK</command>.
+   </para>
+
+   <para>
+   An output plugin may provide additional callbacks to support two-phase commit commands.
+   There are multiple two-phase commit callbacks that are required,
+   (<function>begin_prepare_cb</function>, <function>prepare_cb</function>, 
+   <function>commit_prepared_cb</function>, 
+   <function>rollback_prepared_cb</function> and <function>stream_prepare_cb</function>)
+   and an optional callback (<function>filter_prepare_cb</function>).
+   </para>
+
+   <para>
+   If the output plugin callbacks for decoding two-phase commit commands are provided,
+   then on <command>PREPARE TRANSACTION</command>, the changes of that transaction are
+   decoded, passed to the output plugin and the <function>prepare_cb</function>
+   callback is invoked. This differs from the basic decoding setup where changes are
+   only passed to the output plugin when a transaction is committed. The start of a
+   prepared transaction is indicated by the <function>begin_prepare_cb</function> callback.
+   </para>
+
+   <para>
+   When a prepared transaction is rolled back using <command>ROLLBACK PREPARED</command>,
+   then the <function>rollback_prepared_cb</function> callback is invoked and when the
+   prepared transaction is committed using <command>COMMIT PREPARED</command>,
+   then the <function>commit_prepared_cb</function> callback is invoked.
+   </para>
+
+   <para>
+   Optionally the output plugin can specify a name pattern in the 
+   <function>filter_prepare_cb</function> and transactions with gid containing
+   that name pattern will not be decoded as a two-phase commit transaction. 
+   </para>
+
+  </sect1>
  </chapter>
-- 
1.8.3.1

#151Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#150)

On Mon, Dec 14, 2020 at 2:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Today, I looked at one of the issues discussed earlier in this thread
[1] which is that decoding can block (or deadlock can happen) when the
user explicitly locks the catalog relation (like Lock pg_class) or
performs Cluster on non-relmapped catalog relations (like Cluster
pg_trigger using pg_class_oid_index; and the user_table on which we
have performed any operation has a trigger) in the prepared xact. As
discussed previously, we don't have a problem when user tables are
exclusively locked because during decoding we don't acquire any lock
on those and in fact, we have a test case for the same in the patch.

In the previous discussion, most people seem to be of the opinion that we
should document it in a category "don't do that", or prohibit preparing
transactions that lock system tables in exclusive mode, since that can
block the entire system anyway. The other possibility could be that the
plugin enables lock_timeout when it wants to decode two-phase xacts and,
if the timeout occurs, retries the fetch with the two-phase option
provided by the patch disabled.

I think it is better to document this as there is no realistic
scenario where it can happen. I also think separately (not as part of
this patch) we can investigate whether it is a good idea to prohibit
prepare for transactions that acquire exclusive locks on catalog
relations.

Thoughts?

[1]: /messages/by-id/CAMGcDxf83P5SGnGH52=_0wRP9pO6uRWCMRwAA0nxKtZvir2_vQ@mail.gmail.com

--
With Regards,
Amit Kapila.

#152Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#150)

On Mon, Dec 14, 2020 at 6:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 8, 2020 at 2:01 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Dec 1, 2020 at 6:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

To skip it, we need to send GID in begin message and then on
subscriber-side, check if the prepared xact already exists, if so then
set a flag. The flag needs to be set in begin/start_stream and reset
in stop_stream/commit/abort. Using the flag, we can skip the entire
contents of the prepared xact. On the ReorderBuffer side also, we need to
get and set GID in txn even when we skip it because we need to send
the same at commit time. In this solution, we won't be able to send it
during normal start_stream because by that time we won't know GID and
I think that won't be required. Note that this is only required when
we skipped sending prepare, otherwise, we just need to send
Commit-Prepared at commit time.

I have implemented these changes and tested the fix using the test
setup I had shared above and it seems to be working fine.
I have also tested restarts that simulate duplicate prepares being
sent by the publisher and verified that it is handled correctly by the
subscriber.

This implementation has a flaw in that it has used commit_lsn for the
prepare when we are sending prepare just before commit prepared. We
can't send the commit LSN with prepare because if the subscriber
crashes after prepare then it would update
replorigin_session_origin_lsn with that commit_lsn. Now, after the
restart, because we will use that LSN to start decoding, the Commit
Prepared will get skipped. To fix this, we need to remember the
prepare LSN and other information even when we skip prepare and then
use it while sending the prepare during commit prepared.

Now, after fixing this, I discovered another issue which is that we
allow adding a new snapshot to prepared transactions via
SnapBuildDistributeNewCatalogSnapshot. We can only allow it to get
added to in-progress transactions. If you comment out the changes
added in SnapBuildDistributeNewCatalogSnapshot then you will notice
one test failure which indicates this problem. This problem was not
evident before the bug-fix in the previous paragraph because you were
using commit-lsn even for the prepare and newly added snapshot change
appears to be before the end_lsn.

Some other assorted changes in various patches:
v31-0001-Extend-the-output-plugin-API-to-allow-decoding-o
1. I have changed the filter_prepare API to match the signature with
FilterByOrigin. I don't see the need for ReorderBufferTxn or xid in
the API.
2. I have expanded the documentation of 'Begin Prepare Callback' to
explain how a user can use it to detect already prepared transactions
and in which scenarios that can happen.
3. Added a few comments in the code atop begin_prepare_cb_wrapper to
explain why we are adding this new API.
4. Move the check whether the filter_prepare callback is defined from
filter_prepare_cb_wrapper to caller. This is similar to how
FilterByOrigin works.
5. Fixed various whitespace and cosmetic issues.
6. Update commit message to include two of the newly added APIs

v31-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer
1. Changed the variable names and comments in DecodeXactOp.
2. Added a new API, FilterPrepare, similar to FilterByOrigin, and used that
instead of ReorderBufferPrepareNeedSkip.
3. In DecodeCommit, we need to update the reorderbuffer about the
surviving subtransactions for both ReorderBufferFinishPrepared and
ReorderBufferCommit because now both can process the transaction.
4. Because we now need to remember the prepare info even when we skip
it, I have simplified the ReorderBufferPrepare API by removing the extra
parameters, as that information will now be available via
ReorderBufferTxn.
5. Updated comments at various places.

v31-0006-Support-2PC-txn-pgoutput
1. Added Asserts in streaming APIs on the subscriber-side to ensure
that we should never reach there for the already prepared transaction
case. We never need to stream the changes when we are skipping prepare
either because the snapshot was not consistent by that time or we have
already sent those changes before restart. Added the same Assert in
Begin and Commit routines because while skipping prepared txn, we must
not receive the changes of any other xact.
2.
+ /*
+ * Flags are determined from the state of the transaction. We know we
+ * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+ * it's already marked as committed then it has to be COMMIT PREPARED (and
+ * likewise for abort / ROLLBACK PREPARED).
+ */
+ if (rbtxn_commit_prepared(txn))
+ flags = LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+ else
+ flags = LOGICALREP_IS_PREPARE;

I don't like clubbing three different operations under one message
LOGICAL_REP_MSG_PREPARE. It looks awkward to use new flags
RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED in ReorderBuffer so
that we can recognize these operations in corresponding callbacks. I
think setting any flag in ReorderBuffer should not dictate the
behavior in callbacks. There are also a few things that are not
common to those APIs: for example, the patch has an Assert that the txn
is marked with the prepare flag for all three operations, which I think is
not true for Rollback Prepared after a restart, because we don't ensure
that the Prepare flag is set if the Rollback Prepared happens after the
restart. Then, we have to introduce separate flags to distinguish
prepare/commit prepared/rollback prepared when they are sent as
protocol messages. Also, all these operations are
mutually exclusive so it will be better to send separate messages for
each of these and I have changed it accordingly in the attached patch.

3. The patch has duplicate code to send replication origins. I have
moved the common code to a separate function.
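
As an illustration of point 2 above (one protocol message per operation
instead of one message plus flags), a receiver can then dispatch on the
message type alone. This is a sketch only: the enum name, the type bytes and
two_phase_msg_name() are hypothetical and are not taken from the attached
patch.

```c
#include <stddef.h>

/*
 * Hypothetical message types, one per two-phase operation. The byte values
 * are made up for this example.
 */
typedef enum TwoPhaseMsgType
{
    TWO_PHASE_MSG_PREPARE = 'P',
    TWO_PHASE_MSG_COMMIT_PREPARED = 'K',
    TWO_PHASE_MSG_ROLLBACK_PREPARED = 'r'
} TwoPhaseMsgType;

/*
 * Map a message type byte to the operation it denotes. No flags need to be
 * inspected; unknown bytes yield NULL.
 */
const char *
two_phase_msg_name(char type_byte)
{
    switch (type_byte)
    {
        case TWO_PHASE_MSG_PREPARE:
            return "PREPARE";
        case TWO_PHASE_MSG_COMMIT_PREPARED:
            return "COMMIT PREPARED";
        case TWO_PHASE_MSG_ROLLBACK_PREPARED:
            return "ROLLBACK PREPARED";
        default:
            return NULL;
    }
}
```

Because the three operations are mutually exclusive, the switch has exactly
one arm per operation and no combination of flags has to be validated.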

v31-0009-Support-2PC-txn-Subscription-option
1.
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 202011251
+#define CATALOG_VERSION_NO 202011271

No need to change catversion as this gets changed frequently and that
leads to conflict in the patch. We can change it either in the final
version or normally committers take care of this. If you want to
remember it, maybe adding a line for it in the commit message should
be okay. For now, I have removed this from the patch.

Thank you for updating the patch. I have two questions:

-----
@@ -239,6 +239,19 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+          When two-phase commit is enabled then the decoded
transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. When
two-phase commit is not
+          enabled then PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED are not
+          decoded on the publisher.
+         </para>
+        </listitem>
+       </varlistentry>

The user will need to specify the 'two_phase' option on CREATE
SUBSCRIPTION. It would mean the user will need to control what data is
streamed both on publication side for INSERT/UPDATE/DELETE/TRUNCATE
and on subscriber side for PREPARE. Looking at the implementation of
the 'two_phase' option of CREATE SUBSCRIPTION, it ultimately just
passes the 'two_phase' option to the publisher. Why don't we set it on
the publisher side? Also, I guess we can improve the description of
the 'two_phase' option of CREATE SUBSCRIPTION in the doc by adding the
fact that when this option is not enabled the transaction prepared on
the publisher is decoded as a normal transaction.

------
+   if (LookupGXact(begin_data.gid))
+   {
+       /*
+        * If this gid has already been prepared then we dont want to apply
+        * this txn again. This can happen after restart where upstream can
+        * send the prepared transaction again. See
+        * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+        */
+       skip_prepared_txn = true;
+       return;
+   }

When PREPARE arrives at the subscriber node but there is already a prepared
transaction with the same transaction identifier, the apply worker
skips the whole transaction. So if a user prepared a transaction
with the same identifier on the subscriber, the prepared transaction
that came from the publisher would be ignored without any message. On
the other hand, if applying other operations such as HEAP_INSERT
conflicts (such as when violating a unique constraint), the apply
worker raises an ERROR and stops logical replication until the
conflict is resolved. IIUC, since we can know that the prepared
transaction came from the same publisher again by checking origin_lsn
in TwoPhaseFileHeader, I guess we can skip the PREPARE message only
when the existing prepared transaction has the same LSN and the same
identifier. To be exact, it's still possible that the subscriber gets
two PREPARE messages having the same LSN and name from two different
publishers, but it's unlikely to happen in practice.
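
The check suggested above could look roughly like this. It is a sketch under
assumptions: ExistingPrepared, PrepareAction and decide_prepare_action() are
hypothetical names, the caller is assumed to have already looked up any
prepared transaction with the incoming gid, and reporting a conflict in the
"same gid, different LSN" case is my reading of the paragraph above rather
than something the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* What the apply worker would do with an incoming PREPARE. */
typedef enum PrepareAction
{
    PREPARE_APPLY,              /* no matching prepared xact: apply normally */
    PREPARE_SKIP_DUPLICATE,     /* same gid and same origin LSN: resend */
    PREPARE_REPORT_CONFLICT     /* same gid, different LSN: unrelated xact */
} PrepareAction;

/* Hypothetical summary of a prepared xact already present on the subscriber. */
typedef struct ExistingPrepared
{
    const char *gid;
    uint64_t    origin_lsn;     /* origin_lsn recorded with the prepared xact */
} ExistingPrepared;

/*
 * Skip only when both the identifier and the origin LSN match; a match on
 * the identifier alone is treated as a conflict instead of a silent skip.
 */
PrepareAction
decide_prepare_action(const ExistingPrepared *existing,
                      const char *gid, uint64_t prepare_lsn)
{
    assert(gid != NULL);

    if (existing == NULL || strcmp(existing->gid, gid) != 0)
        return PREPARE_APPLY;
    if (existing->origin_lsn == prepare_lsn)
        return PREPARE_SKIP_DUPLICATE;
    return PREPARE_REPORT_CONFLICT;
}
```

This keeps the restart case (publisher resends the same PREPARE) silent,
while a locally prepared transaction that merely reuses the identifier is
surfaced like other apply conflicts.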

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#153Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#152)

On Wed, Dec 16, 2020 at 1:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thank you for updating the patch. I have two questions:

-----
@@ -239,6 +239,19 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
</para>
</listitem>
</varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+          When two-phase commit is enabled then the decoded
transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. When
two-phase commit is not
+          enabled then PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED are not
+          decoded on the publisher.
+         </para>
+        </listitem>
+       </varlistentry>

The user will need to specify the 'two_phase' option on CREATE
SUBSCRIPTION. It would mean the user will need to control what data is
streamed both on publication side for INSERT/UPDATE/DELETE/TRUNCATE
and on subscriber side for PREPARE. Looking at the implementation of
the 'two_phase' option of CREATE SUBSCRIPTION, it ultimately just
passes the 'two_phase' option to the publisher. Why don't we set it on
the publisher side?

There could be multiple subscriptions for the same publication, some
want to decode the transaction at prepare time and others might want
to decode at commit time only. Also, one subscription could subscribe
to multiple publications, so not sure if it is even feasible to set at
publication level (consider one txn has changes belonging to multiple
publications). This option controls how the data is streamed from a
publication similar to other options like 'streaming'. Why do you
think this should be any different?

Also, I guess we can improve the description of
the 'two_phase' option of CREATE SUBSCRIPTION in the doc by adding the
fact that when this option is not enabled the transaction prepared on
the publisher is decoded as a normal transaction.

Sounds reasonable.

------
+   if (LookupGXact(begin_data.gid))
+   {
+       /*
+        * If this gid has already been prepared then we dont want to apply
+        * this txn again. This can happen after restart where upstream can
+        * send the prepared transaction again. See
+        * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+        */
+       skip_prepared_txn = true;
+       return;
+   }

When PREPARE arrives at the subscriber node but there is already a prepared
transaction with the same transaction identifier, the apply worker
skips the whole transaction. So if a user prepared a transaction
with the same identifier on the subscriber, the prepared transaction
that came from the publisher would be ignored without any message. On
the other hand, if applying other operations such as HEAP_INSERT
conflicts (such as when violating a unique constraint), the apply
worker raises an ERROR and stops logical replication until the
conflict is resolved. IIUC, since we can know that the prepared
transaction came from the same publisher again by checking origin_lsn
in TwoPhaseFileHeader, I guess we can skip the PREPARE message only
when the existing prepared transaction has the same LSN and the same
identifier. To be exact, it's still possible that the subscriber gets
two PREPARE messages having the same LSN and name from two different
publishers, but it's unlikely to happen in practice.

The idea sounds reasonable. I'll try and see if this works.

Thanks.

--
With Regards,
Amit Kapila.

#154Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#150)
v31-0009-Support-2PC-txn-Subscription-option
1.
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 202011251
+#define CATALOG_VERSION_NO 202011271

No need to change catversion as this gets changed frequently and that
leads to conflict in the patch. We can change it either in the final
version or normally committers take care of this. If you want to
remember it, maybe adding a line for it in the commit message should
be okay. For now, I have removed this from the patch.

--
With Regards,
Amit Kapila.

I have reviewed the changes and do not have any new comments.
While testing, I found an issue in this patch. During initialisation,
pgoutput is not initialised fully and the subscription parameters
are not all read. As a result, ctx->twophase could be
set to true, even if the subscription does not specify so. To fix this,
we need to make the following change in pgoutput_startup() in
pgoutput.c, similar to how streaming is handled.

/*
* This is replication start and not slot initialization.
*
* Parse and validate options passed by the client.
*/
if (!is_init)
{
:
:
}
else
{
/* Disable the streaming during the slot initialization mode. */
ctx->streaming = false;
+ ctx->twophase = false;
}

regards,
Ajin

#155Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#154)

On Thu, Dec 17, 2020 at 7:02 AM Ajin Cherian <itsajin@gmail.com> wrote:

v31-0009-Support-2PC-txn-Subscription-option
1.
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 202011251
+#define CATALOG_VERSION_NO 202011271

No need to change catversion as this gets changed frequently and that
leads to conflict in the patch. We can change it either in the final
version or normally committers take care of this. If you want to
remember it, maybe adding a line for it in the commit message should
be okay. For now, I have removed this from the patch.

--
With Regards,
Amit Kapila.

I have reviewed the changes and do not have any new comments.
While testing, I found an issue in this patch. During initialisation,
pgoutput is not initialised fully and the subscription parameters
are not all read. As a result, ctx->twophase could be
set to true, even if the subscription does not specify so. To fix this,
we need to make the following change in pgoutput_startup() in
pgoutput.c, similar to how streaming is handled.

/*
* This is replication start and not slot initialization.
*
* Parse and validate options passed by the client.
*/
if (!is_init)
{
:
:
}
else
{
/* Disable the streaming during the slot initialization mode. */
ctx->streaming = false;
+ ctx->twophase = false;
}

Makes sense. I can take care of this in the next version, where I am
planning to address Sawada-San's comments and a few other cleanup items.

--
With Regards,
Amit Kapila.

#156Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#155)

On Thu, Dec 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 17, 2020 at 7:02 AM Ajin Cherian <itsajin@gmail.com> wrote:

I have reviewed the changes and do not have any new comments.
While testing, I found an issue in this patch. During initialisation,
pgoutput is not initialised fully and the subscription parameters
are not all read. As a result, ctx->twophase could be
set to true, even if the subscription does not specify so. To fix this,
we need to make the following change in pgoutput_startup() in
pgoutput.c, similar to how streaming is handled.

/*
* This is replication start and not slot initialization.
*
* Parse and validate options passed by the client.
*/
if (!is_init)
{
:
:
}
else
{
/* Disable the streaming during the slot initialization mode. */
ctx->streaming = false;
+ ctx->twophase = false;
}

makes sense.

On thinking about this again, I think it is good to disable it during
slot initialization, but would leaving it enabled create any problem,
given that during slot initialization we don't stream any xact and stop
processing WAL as soon as we reach CONSISTENT_STATE? Did you observe any
problem with this?

--
With Regards,
Amit Kapila.

#157Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#156)

On Thu, Dec 17, 2020 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On thinking about this again, I think it is good to disable it during
slot initialization, but would leaving it enabled create any problem,
given that during slot initialization we don't stream any xact and stop
processing WAL as soon as we reach CONSISTENT_STATE? Did you observe any
problem with this?

Yes, it did not stream any xact during initialization but I was
surprised that the DecodePrepare code was invoked even though
I hadn't created the subscription with twophase enabled. No problem
was observed.

regards,
Ajin Cherian
Fujitsu Australia

#158Ajin Cherian
itsajin@gmail.com
In reply to: Ajin Cherian (#157)
1 attachment(s)

Adding a test case that covers the scenario where a consistent snapshot
is formed after a transaction has been prepared but before it has been
committed with COMMIT PREPARED.
This test makes sure that in this case, the entire transaction is
decoded on a COMMIT PREPARED. This patch applies on top of v31.
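
The behaviour the test checks can be summarised with a small model. This is
a sketch only: the Lsn typedef, the enum and both functions are hypothetical
names, and reducing "snapshot was consistent at prepare time" to a single
LSN comparison is a simplifying assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN; real code would use XLogRecPtr. */
typedef uint64_t Lsn;

/* What is emitted when the decoder reaches COMMIT PREPARED. */
typedef enum CommitPreparedOutput
{
    EMIT_COMMIT_PREPARED_ONLY,  /* changes and PREPARE were already sent */
    EMIT_WHOLE_TRANSACTION      /* BEGIN, changes, PREPARE, COMMIT PREPARED */
} CommitPreparedOutput;

/*
 * The prepare can only be decoded at prepare time if the snapshot was
 * already consistent when the PREPARE record was reached.
 */
bool
decoded_at_prepare_time(Lsn prepare_lsn, Lsn consistent_lsn)
{
    return prepare_lsn >= consistent_lsn;
}

/* If the prepare was skipped, everything is sent at COMMIT PREPARED. */
CommitPreparedOutput
output_at_commit_prepared(Lsn prepare_lsn, Lsn consistent_lsn)
{
    if (decoded_at_prepare_time(prepare_lsn, consistent_lsn))
        return EMIT_COMMIT_PREPARED_ONLY;
    return EMIT_WHOLE_TRANSACTION;
}
```

In the attached spec the slot becomes consistent only after s4prepare, which
corresponds to the EMIT_WHOLE_TRANSACTION case: the second s1start returns
BEGIN, the INSERT, PREPARE TRANSACTION and COMMIT PREPARED together.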

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v31-0010-Support-2PC-consistent-snapshot-isolation-tests.patch (application/octet-stream)
From f9c1e899cbcec536fe23dffc7b1010503841169c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 16 Dec 2020 23:12:58 -0500
Subject: [PATCH] Support 2PC consistent snapshot isolation tests

Added isolation test-case to test that if a consistent snapshot is created
between a PREPARE and a COMMIT PREPARED, then the whole transaction is decoded
on COMMIT PREPARED.
---
 contrib/test_decoding/Makefile                     |  2 +-
 .../test_decoding/expected/twophase_snapshot.out   | 43 +++++++++++++++++++
 contrib/test_decoding/specs/twophase_snapshot.spec | 49 ++++++++++++++++++++++
 3 files changed, 93 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/twophase_snapshot.out
 create mode 100644 contrib/test_decoding/specs/twophase_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 49523fe..ba6a2c2 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time two_phase two_phase_stream messages \
 	spill slot truncate stream stats
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
+	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream twophase_snapshot
 
 TAP_TESTS = 1
 
diff --git a/contrib/test_decoding/expected/twophase_snapshot.out b/contrib/test_decoding/expected/twophase_snapshot.out
new file mode 100644
index 0000000..53aaf01
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase_snapshot.out
@@ -0,0 +1,43 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s2b s2txid s1init s3b s3txid s2alter s2c s4b s4insert s4prepare s3c s1insert s1checkpoint s1start s4commit s1start
+step s2b: BEGIN;
+step s2txid: SELECT pg_current_xact_id() IS NULL;
+?column?       
+
+f              
+step s1init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); <waiting ...>
+step s3b: BEGIN;
+step s3txid: SELECT pg_current_xact_id() IS NULL;
+?column?       
+
+f              
+step s2alter: ALTER TABLE do_write ADD COLUMN addedbys2 int;
+step s2c: COMMIT;
+step s4b: BEGIN;
+step s4insert: INSERT INTO do_write DEFAULT VALUES;
+step s4prepare: PREPARE TRANSACTION 'test1';
+step s3c: COMMIT;
+step s1init: <... completed>
+?column?       
+
+init           
+step s1insert: INSERT INTO do_write DEFAULT VALUES;
+step s1checkpoint: CHECKPOINT;
+step s1start: SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');
+data           
+
+BEGIN          
+table public.do_write: INSERT: id[integer]:2 addedbys2[integer]:null
+COMMIT         
+step s4commit: COMMIT PREPARED 'test1';
+step s1start: SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');
+data           
+
+BEGIN          
+table public.do_write: INSERT: id[integer]:1 addedbys2[integer]:null
+PREPARE TRANSACTION 'test1'
+COMMIT PREPARED 'test1'
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/twophase_snapshot.spec b/contrib/test_decoding/specs/twophase_snapshot.spec
new file mode 100644
index 0000000..505e5e3
--- /dev/null
+++ b/contrib/test_decoding/specs/twophase_snapshot.spec
@@ -0,0 +1,49 @@
+# Test decoding of two-phase transactions during the build of a consistent snapshot.
+setup
+{
+    DROP TABLE IF EXISTS do_write;
+    CREATE TABLE do_write(id serial primary key);
+}
+
+teardown
+{
+    DROP TABLE do_write;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1init" {SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');}
+step "s1start" {SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');}
+step "s1insert" { INSERT INTO do_write DEFAULT VALUES; }
+step "s1checkpoint" { CHECKPOINT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2b" { BEGIN; }
+step "s2txid" { SELECT pg_current_xact_id() IS NULL; }
+step "s2alter" { ALTER TABLE do_write ADD COLUMN addedbys2 int; }
+step "s2c" { COMMIT; }
+
+
+session "s3"
+setup { SET synchronous_commit=on; }
+
+step "s3b" { BEGIN; }
+step "s3txid" { SELECT pg_current_xact_id() IS NULL; }
+step "s3c" { COMMIT; }
+
+session "s4"
+setup { SET synchronous_commit=on; }
+
+step "s4b" { BEGIN; }
+step "s4insert" { INSERT INTO do_write DEFAULT VALUES; }
+step "s4prepare" { PREPARE TRANSACTION 'test1'; }
+step "s4commit" { COMMIT PREPARED 'test1'; }
+
+# Force building of a consistent snapshot between a PREPARE and COMMIT PREPARED.
+# Ensure that the whole transaction is decoded fresh at the time of COMMIT PREPARED.
+permutation "s2b" "s2txid" "s1init" "s3b" "s3txid" "s2alter" "s2c" "s4b" "s4insert" "s4prepare" "s3c""s1insert" "s1checkpoint" "s1start" "s4commit" "s1start"
-- 
1.8.3.1

#159Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#151)

On Tue, Dec 15, 2020 at 11:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 14, 2020 at 2:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Today, I looked at one of the issues discussed earlier in this thread
[1] which is that decoding can block (or deadlock can happen) when the
user explicitly locks the catalog relation (like Lock pg_class) or
performs Cluster on non-relmapped catalog relations (like Cluster
pg_trigger using pg_class_oid_index; and the user_table on which we
have performed any operation has a trigger) in the prepared xact. As
discussed previously, we don't have a problem when user tables are
exclusively locked because during decoding we don't acquire any lock
on those and in fact, we have a test case for the same in the patch.

Yes, and as described in that mail, the current code explicitly denies
preparation of a 2PC transaction under some circumstances:

postgres=# BEGIN;
postgres=# CLUSTER pg_class using pg_class_oid_index ;
postgres=# PREPARE TRANSACTION 'test_prepared_lock';
ERROR: cannot PREPARE a transaction that modified relation mapping

In the previous discussion, most people seem to be of the opinion that we
should document it in a category "don't do that", or prohibit preparing
transactions that lock system tables in exclusive mode, since that can
block the entire system anyway. The other possibility could be that the
plugin enables lock_timeout when it wants to decode two-phase xacts and,
if the timeout occurs, retries the fetch with the two-phase option
provided by the patch disabled.

I think it is better to document this as there is no realistic
scenario where it can happen. I also think separately (not as part of
this patch) we can investigate whether it is a good idea to prohibit
prepare for transactions that acquire exclusive locks on catalog
relations.

Thoughts?

I agree with the documentation option. If we choose to disable
two-phase on timeout, we still need to decide what to
do with already prepared transactions.

regards,
Ajin Cherian
Fujitsu Australia

#160Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#153)

On Wed, Dec 16, 2020 at 6:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 16, 2020 at 1:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thank you for updating the patch. I have two questions:

-----
@@ -239,6 +239,19 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
</para>
</listitem>
</varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. When two-phase commit is not
+          enabled then PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED are not
+          decoded on the publisher.
+         </para>
+        </listitem>
+       </varlistentry>

The user will need to specify the 'two_phase' option on CREATE
SUBSCRIPTION. It would mean the user will need to control what data is
streamed both on the publication side for INSERT/UPDATE/DELETE/TRUNCATE
and on the subscriber side for PREPARE. Looking at the implementation of
the 'two_phase' option of CREATE SUBSCRIPTION, it ultimately just
passes the 'two_phase' option to the publisher. Why don't we set it on
the publisher side?

There could be multiple subscriptions for the same publication, some
want to decode the transaction at prepare time and others might want
to decode at commit time only. Also, one subscription could subscribe
to multiple publications, so not sure if it is even feasible to set at
publication level (consider one txn has changes belonging to multiple
publications). This option controls how the data is streamed from a
publication similar to other options like 'streaming'. Why do you
think this should be any different?

Oh, I was thinking that the option controls what data is streamed,
similar to the 'publish' option. But I agree with you. As you
mentioned, it might be a problem if a subscription subscribes to multiple
publications that set different 'two_phase' options. Also, in terms of
changing this option while streaming changes, it's better to control
it on the subscriber side.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#161Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#153)
9 attachment(s)

On Wed, Dec 16, 2020 at 2:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 16, 2020 at 1:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Also, I guess we can improve the description of
'two_phase' option of CREATE SUBSCRIPTION in the doc by adding the
fact that when this option is not enabled the transaction prepared on
the publisher is decoded as a normal transaction:

Sounds reasonable.

Fixed in the attached.

------
+   if (LookupGXact(begin_data.gid))
+   {
+       /*
+        * If this gid has already been prepared then we dont want to apply
+        * this txn again. This can happen after restart where upstream can
+        * send the prepared transaction again. See
+        * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+        */
+       skip_prepared_txn = true;
+       return;
+   }

When PREPARE arrives at the subscriber node but there is already a
prepared transaction with the same transaction identifier, the apply
worker skips the whole transaction. So if the user prepared a transaction
with the same identifier on the subscriber, the prepared transaction
that came from the publisher would be ignored without any message. On
the other hand, if applying other operations such as HEAP_INSERT
conflicts (such as when violating a unique constraint), the apply
worker raises an ERROR and stops logical replication until the
conflict is resolved. IIUC, since we can know that the prepared
transaction came from the same publisher again by checking origin_lsn
in TwoPhaseFileHeader, I guess we can skip the PREPARE message only
when the existing prepared transaction has the same LSN and the same
identifier. To be exact, it's still possible that the subscriber gets
two PREPARE messages having the same LSN and name from two different
publishers, but that's unlikely to happen in practice.

The idea sounds reasonable. I'll try and see if this works.

I went ahead and used both origin_lsn and origin_timestamp to avoid
the possibility of matching prepared xacts from two different nodes.
We can handle this at begin_prepare and prepare time, but we don't have
prepare_lsn and prepare_timestamp at rollback_prepared time, so what
do we do about that? As of now, I am using just the GID at
rollback_prepared time, and that would have been sufficient if we always
received the prepare before the rollback, because at prepare time we
would have checked origin_lsn and origin_timestamp. But it is possible
that we get a rollback prepared without a prepare if the prepare happened
before consistent_snapshot is reached and the rollback happens after
that. For the commit case, we do send the prepare and all the data at
commit time in such a case, but doing so for the rollback case doesn't
sound like a good idea. Another possibility is that we send prepare_lsn
and prepare_time in the rollback_prepared API to deal with this. I am not
sure if it is a good idea to just rely on the GID in rollback_prepared.
What do you think?
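
For illustration only (this is not code from the patch), the matching rule
described above can be sketched in standalone C. The type and function
names below (SketchLsn, SketchTimestamp, SketchPreparedXact,
sketch_prepare_already_applied) are hypothetical stand-ins and are not
PostgreSQL APIs; the real check would go through the two-phase state
kept by the server.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for PostgreSQL's XLogRecPtr and TimestampTz. */
typedef uint64_t SketchLsn;
typedef int64_t SketchTimestamp;

/* Hypothetical record describing one prepared xact on the subscriber. */
typedef struct SketchPreparedXact
{
    const char     *gid;
    SketchLsn       origin_lsn;
    SketchTimestamp origin_timestamp;
} SketchPreparedXact;

/*
 * Return true only if some existing prepared transaction matches the
 * incoming PREPARE on all three of GID, origin LSN and origin timestamp.
 * A GID-only match (e.g. a locally prepared xact) returns false.
 */
static bool
sketch_prepare_already_applied(const SketchPreparedXact *existing, size_t n,
                               const char *gid, SketchLsn lsn,
                               SketchTimestamp ts)
{
    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(existing[i].gid, gid) == 0 &&
            existing[i].origin_lsn == lsn &&
            existing[i].origin_timestamp == ts)
            return true;
    }
    return false;
}
```

With such a rule, a resent PREPARE is skipped only when all three fields
agree. At rollback_prepared time only the GID is available, which is
exactly the gap raised above.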

I have done some additional changes in the patch-series.
1. Removed some declarations from
0001-Extend-the-output-plugin-API-to-allow-decoding-o which were not
required.
2. In 0002-Allow-decoding-at-prepare-time-in-ReorderBuffer,
+       txn->txn_flags |= RBTXN_PREPARE;
+       txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+       strcpy(txn->gid, gid);

Changed the above code to use pstrdup.

3. Merged the test-code from 0003 to 0002. I have yet to merge the
latest test case posted by Ajin [1].
4. Removed the test for Rollback Prepared from two_phase_streaming.sql
because I think a similar test exists for non-streaming case in
two_phase.sql and it doesn't make sense to repeat it.
5. Comments update and minor cosmetic changes for test cases merged
from 0003 to 0002.
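
Regarding item 2 above, a minimal standalone sketch of why the
palloc + strcpy pair collapses into a single duplicate call. This is an
illustration, not the patch code: sketch_strdup is a hypothetical name and
uses malloc, whereas pstrdup allocates in the current memory context and
raises an error on out-of-memory instead of returning NULL.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for pstrdup(): allocate strlen(in) + 1 bytes and
 * copy the string including its trailing '\0' in one step.
 */
static char *
sketch_strdup(const char *in)
{
    size_t  len = strlen(in) + 1;   /* include trailing '\0' */
    char   *out = malloc(len);

    if (out != NULL)
        memcpy(out, in, len);
    return out;
}
```

The caller then writes a single assignment of the returned copy instead of
sizing the buffer by hand.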

[1]: /messages/by-id/CAFPTHDYWj99+ysDjCH_z8BfN8hG2FoxtJg+EU8_MpJe5owXg4A@mail.gmail.com

--
With Regards,
Amit Kapila.

Attachments:

v32-0001-Extend-the-output-plugin-API-to-allow-decoding-o.patch (application/octet-stream)
From 87cf7b69c4ee059997b0391aa3ee5fe1080315c4 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 12 Dec 2020 16:41:33 +0530
Subject: [PATCH v32 1/9] Extend the output plugin API to allow decoding of
 prepared xacts.

This adds six methods to the output plugin API, adding support for
streaming changes of two-phase transactions at prepare time.

* begin_prepare
* filter_prepare
* prepare
* commit_prepared
* rollback_prepared
* stream_prepare

Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction is not yet committed and may be
aborted later.

Until now two-phase transactions were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the
two-phase commands were communicated to the subscriber.

This patch provides the infrastructure for logical decoding plugins to be
informed of two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

This also extends the 'test_decoding' plugin, implementing these new
methods.

This commit simply adds these new APIs and the upcoming patch to "allow
the decoding at prepare time in ReorderBuffer" will use these APIs.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c     | 164 +++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 165 ++++++++++++++++-
 src/backend/replication/logical/logical.c | 295 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   6 +
 src/include/replication/output_plugin.h   |  55 ++++++
 src/include/replication/reorderbuffer.h   |  40 ++++
 src/tools/pgindent/typedefs.list          |  12 ++
 7 files changed, 730 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278b..94ba227 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -76,6 +76,19 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 const char *gid);
+static void pg_decode_begin_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr abort_lsn);
 static void pg_decode_stream_start(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn);
 static void pg_output_stream_start(LogicalDecodingContext *ctx,
@@ -87,6 +100,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -123,9 +139,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->begin_prepare_cb = pg_decode_begin_prepare_txn;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
@@ -141,6 +163,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -241,6 +264,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +285,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +354,109 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* BEGIN PREPARE callback */
+static void
+pg_decode_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata =
+	MemoryContextAllocZero(ctx->context, sizeof(TestDecodingTxnData));
+
+	txndata->xact_wrote_changes = false;
+	txn->output_plugin_private = txndata;
+
+	if (data->skip_empty_xacts)
+		return;
+
+	pg_output_begin(ctx, data, txn, true);
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+								XLogRecPtr abort_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate a
+ * simple logic by checking the GID. If the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -702,6 +839,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..180699a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,9 +389,15 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodeBeginPrepareCB begin_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +419,20 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits,
+    which allows actions to be decoded on the <command>PREPARE TRANSACTION</command>.
+    The <function>begin_prepare_cb</function>, <function>prepare_cb</function>, 
+    <function>stream_prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +493,15 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     them are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too. We will skip all the changes of such a transaction once
+     the abort is detected and abort the transaction when we read WAL for
+     <command>ROLLBACK PREPARED</command>.
     </para>
 
     <note>
@@ -587,7 +611,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but yet
+      uncommitted) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -685,7 +715,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, in case of decoding a prepared (but yet uncommitted)
+      transaction or decoding of an uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -698,6 +734,104 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decode
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents as for the
+      other callbacks. The <parameter>gid</parameter> is the identifier that later
+      identifies this transaction for <command>COMMIT PREPARED</command> or
+      <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given
+      <parameter>gid</parameter> every time it is called.
+     </para>
+     </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-begin-prepare">
+     <title>Transaction Begin Prepare Callback</title>
+
+     <para>
+      The required <function>begin_prepare_cb</function> callback is called
+      whenever the start of a prepared transaction has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter can be used in this callback to
+      check if the plugin has already received this prepare in which case it
+      can skip the remaining changes of the transaction. This can only happen
+      if the user restarts the decoding after receiving the prepare for a
+      transaction but before receiving the commit prepared say because of some
+      error.
+      <programlisting>
+       typedef void (*LogicalDecodeBeginPrepareCB) (struct LogicalDecodingContext *ctx,
+                                                    ReorderBufferTXN *txn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callback for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr prepare_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called
+      whenever a transaction commit prepared has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                      ReorderBufferTXN *txn,
+                                                      XLogRecPtr commit_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called
+      whenever a transaction rollback prepared has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                        ReorderBufferTXN *txn,
+                                                        XLogRecPtr rollback_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-start">
      <title>Stream Start Callback</title>
      <para>
@@ -735,6 +869,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1060,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, commit prepared using the
+    <function>commit_prepared_cb</function> callback or aborted using the
+    <function>rollback_prepared_cb</function>.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f1f4df7..2a8f591 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,13 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr abort_lsn);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +81,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -237,11 +246,37 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
 	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
 
+
+	/*
+	 * To support two-phase logical decoding, we require
+	 * begin_prepare/prepare/commit-prepare/abort-prepare callbacks. The
+	 * filter_prepare callback is optional. We however enable two-phase
+	 * logical decoding when at least one of the methods is enabled so that we
+	 * can easily identify missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.begin_prepare_cb != NULL) ||
+		(ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
+	 * Callback to support decoding at prepare time.
+	 */
+	ctx->reorder->begin_prepare = begin_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -782,6 +817,184 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+/*
+ * The functionality of begin_prepare is quite similar to begin, with the
+ * exception that this will have gid (global transaction id) information which
+ * can be used by the plugin. We considered extending the existing begin
+ * callback, but that would break the replication protocol, and a separate
+ * callback is also cleaner.
+ */
+static void
+begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "begin_prepare";
+	state.report_location = txn->first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->first_lsn;
+
+	/*
+	 * If the plugin supports two-phase commits then begin prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.begin_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires begin_prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.begin_prepare_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr abort_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of abort record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then rollback prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, abort_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
@@ -860,6 +1073,45 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 bool
+filter_prepare_cb_wrapper(LogicalDecodingContext *ctx, const char *gid)
+{
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case, all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
+bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
 	LogicalErrorCallbackState state;
@@ -1057,6 +1309,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..28c9c1f 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
@@ -120,6 +125,7 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 												  XLogRecPtr restart_lsn);
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
+extern bool filter_prepare_cb_wrapper(LogicalDecodingContext *ctx, const char *gid);
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
 extern void ResetLogicalStreamingState(void);
 extern void UpdateDecodingStats(LogicalDecodingContext *ctx);
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..7241015 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,44 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called before decoding of PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/rollback_prepared callbacks, or wait till COMMIT PREPARED
+ * and be sent as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  const char *gid);
+
+/*
+ * Callback called for every BEGIN of a prepared transaction.
+ */
+typedef void (*LogicalDecodeBeginPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
+
+/*
  * Called when starting to stream a block of changes from in-progress
  * transaction (may be called repeatedly, if it's streamed in multiple
  * chunks).
@@ -124,6 +162,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -173,10 +219,19 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+
+	/* streaming of changes at prepare time */
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodeBeginPrepareCB begin_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
+
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7e..9eaba6a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -245,6 +245,12 @@ typedef struct ReorderBufferTXN
 	TransactionId toplevel_xid;
 
 	/*
+	 * Global transaction id required for identification of prepared
+	 * transactions.
+	 */
+	char	   *gid;
+
+	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
 	 * the previous records aren't relevant for logical decoding.
@@ -418,6 +424,25 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* begin prepare callback signature */
+typedef void (*ReorderBufferBeginPrepareCB) (ReorderBuffer *rb,
+											 ReorderBufferTXN *txn);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr abort_lsn);
+
 /* start streaming transaction callback signature */
 typedef void (*ReorderBufferStreamStartCB) (
 											ReorderBuffer *rb,
@@ -436,6 +461,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -505,11 +536,20 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction at prepare time.
+	 */
+	ReorderBufferBeginPrepareCB begin_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
+
+	/*
 	 * Callbacks to be called when streaming a transaction.
 	 */
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a9dca71..e82b4f7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1315,9 +1315,21 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodeBeginPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1
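
As a reading aid for the 0001 patch above (not part of the patch): a minimal standalone sketch of how the two-phase flag is derived from the callback set and how a missing mandatory callback can be identified. All names here (SketchCB, SketchCallbacks, sketch_twophase_enabled, sketch_missing_callback, sketch_noop) are hypothetical and only model the logic; they are not PostgreSQL APIs. One simplification is assumed: the patch reports a missing callback lazily, inside the wrapper that needs it, while this sketch checks all four mandatory callbacks in one place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a plugin callback pointer. */
typedef void (*SketchCB) (void);

/* Hypothetical model of the two-phase members of OutputPluginCallbacks. */
typedef struct SketchCallbacks
{
	SketchCB	filter_prepare_cb;	/* optional */
	SketchCB	begin_prepare_cb;	/* mandatory once two-phase is on */
	SketchCB	prepare_cb;			/* mandatory once two-phase is on */
	SketchCB	commit_prepared_cb; /* mandatory once two-phase is on */
	SketchCB	rollback_prepared_cb;	/* mandatory once two-phase is on */
	SketchCB	stream_prepare_cb;	/* only required when streaming */
} SketchCallbacks;

/* Two-phase decoding is switched on as soon as ANY of the callbacks is set. */
static bool
sketch_twophase_enabled(const SketchCallbacks *cbs)
{
	assert(cbs != NULL);

	return cbs->begin_prepare_cb != NULL ||
		cbs->prepare_cb != NULL ||
		cbs->commit_prepared_cb != NULL ||
		cbs->rollback_prepared_cb != NULL ||
		cbs->stream_prepare_cb != NULL ||
		cbs->filter_prepare_cb != NULL;
}

/*
 * Return the name of the first mandatory callback that is missing, or NULL
 * if two-phase decoding is off or every mandatory callback is present.
 */
static const char *
sketch_missing_callback(const SketchCallbacks *cbs)
{
	if (!sketch_twophase_enabled(cbs))
		return NULL;
	if (cbs->begin_prepare_cb == NULL)
		return "begin_prepare_cb";
	if (cbs->prepare_cb == NULL)
		return "prepare_cb";
	if (cbs->commit_prepared_cb == NULL)
		return "commit_prepared_cb";
	if (cbs->rollback_prepared_cb == NULL)
		return "rollback_prepared_cb";
	return NULL;
}

/* Dummy callback used only to populate the struct in examples. */
static void
sketch_noop(void)
{
}
```

Setting only prepare_cb to sketch_noop turns the flag on and makes sketch_missing_callback() name begin_prepare_cb, which mirrors the reason the patch enables two-phase decoding on any single callback: an incomplete plugin gets a specific error instead of silently decoding at commit time.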

v32-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch (application/octet-stream)
From 0ad6397517f9658dc2c8062a78f49eda9abaf9da Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 16 Dec 2020 14:40:27 +0530
Subject: [PATCH v32 2/9] Allow decoding at prepare time in ReorderBuffer.

This patch allows PREPARE-time decoding of two-phase transactions (if the
output plugin supports this capability), in which case the transactions
are replayed at PREPARE and then committed later when COMMIT PREPARED
arrives.

Now that we decode the changes before the commit, concurrent aborts may
cause failures when the output plugin consults catalogs (both system and
user-defined).

We detect such failures with a special sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK introduced by commit 7259736a6e and stop
decoding the remaining changes. Then we roll back the changes when ROLLBACK
PREPARED is encountered.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, Arseny Sher, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 contrib/test_decoding/expected/twophase.out        | 235 ++++++++++++
 contrib/test_decoding/expected/twophase_stream.out | 147 +++++++
 contrib/test_decoding/sql/twophase.sql             | 112 ++++++
 contrib/test_decoding/sql/twophase_stream.sql      |  45 +++
 src/backend/replication/logical/decode.c           | 285 ++++++++++++--
 src/backend/replication/logical/reorderbuffer.c    | 423 +++++++++++++++++----
 src/backend/replication/logical/snapbuild.c        |   7 +
 src/include/replication/reorderbuffer.h            |  33 +-
 9 files changed, 1183 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/twophase.out
 create mode 100644 contrib/test_decoding/expected/twophase_stream.out
 create mode 100644 contrib/test_decoding/sql/twophase.sql
 create mode 100644 contrib/test_decoding/sql/twophase_stream.sql
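
Not part of the patch: a minimal standalone sketch of the control flow the commit message describes for concurrent aborts. It assumes plain return codes instead of PostgreSQL's error mechanism, and every name (SKETCH_OK, SKETCH_ERR_TXN_ROLLBACK, SKETCH_ERR_OTHER, SketchApplyChange, SketchReplayResult, sketch_replay, sketch_apply_until) is hypothetical. When applying a change reports the rollback code, replay stops and the remaining changes are skipped; any other error is recorded for the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; the first error models ERRCODE_TRANSACTION_ROLLBACK. */
enum
{
	SKETCH_OK = 0,
	SKETCH_ERR_TXN_ROLLBACK = 1,
	SKETCH_ERR_OTHER = 2
};

/* Hypothetical "apply one change" callback. */
typedef int (*SketchApplyChange) (int change, void *arg);

typedef struct SketchReplayResult
{
	size_t		applied;			/* changes handed to the plugin */
	int			concurrent_abort;	/* 1 if we stopped on a rollback code */
	int			error;				/* any other error, else SKETCH_OK */
} SketchReplayResult;

/*
 * Replay changes in order. A rollback code means the transaction was aborted
 * concurrently: stop decoding the rest instead of treating it as a failure.
 */
static SketchReplayResult
sketch_replay(const int *changes, size_t nchanges,
			  SketchApplyChange apply, void *arg)
{
	SketchReplayResult res = {0, 0, SKETCH_OK};
	size_t		i;

	assert(apply != NULL);

	for (i = 0; i < nchanges; i++)
	{
		int			rc = apply(changes[i], arg);

		if (rc == SKETCH_ERR_TXN_ROLLBACK)
		{
			res.concurrent_abort = 1;
			break;
		}
		if (rc != SKETCH_OK)
		{
			res.error = rc;
			break;
		}
		res.applied++;
	}

	return res;
}

/* Example callback: pretends the abort becomes visible at change >= *arg. */
static int
sketch_apply_until(int change, void *arg)
{
	int			abort_at = *(int *) arg;

	return (change >= abort_at) ? SKETCH_ERR_TXN_ROLLBACK : SKETCH_OK;
}
```

With changes {1, 2, 3, 4} and an abort point of 3, sketch_replay() applies two changes and reports a concurrent abort with no error; the later rollback of those two changes is left to the ROLLBACK PREPARED handling, as in the commit message.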

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..76d4a69 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate stream stats
+	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
diff --git a/contrib/test_decoding/expected/twophase.out b/contrib/test_decoding/expected/twophase.out
new file mode 100644
index 0000000..f9f6bed
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase.out
@@ -0,0 +1,235 @@
+-- Test prepared transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test that decoding happens at PREPARE time when two-phase-commit is enabled.
+-- Decoding after COMMIT PREPARED must have all the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+ COMMIT PREPARED 'test_prepared#1'
+(5 rows)
+
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+ COMMIT PREPARED 'test_prepared#3'
+(4 rows)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Check 'CLUSTER' (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The
+-- call should return within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ COMMIT PREPARED 'test_prepared_lock'
+(5 rows)
+
+-- Test savepoints and sub-xacts. Creating savepoints will create
+-- sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+ COMMIT PREPARED 'test_prepared_savepoint'
+(4 rows)
+
+-- Test that a GID containing "_nodecode" gets decoded at commit prepared time.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/twophase_stream.out b/contrib/test_decoding/expected/twophase_stream.out
new file mode 100644
index 0000000..3acc4acd3
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase_stream.out
@@ -0,0 +1,147 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK TO s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED and the other changes in the transaction
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ PREPARE TRANSACTION 'test1'
+ COMMIT PREPARED 'test1'
+(23 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with
+-- filtered gid. gids with '_nodecode' will not be decoded at prepare time.
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/twophase.sql b/contrib/test_decoding/sql/twophase.sql
new file mode 100644
index 0000000..894e4f5
--- /dev/null
+++ b/contrib/test_decoding/sql/twophase.sql
@@ -0,0 +1,112 @@
+-- Test prepared transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test that decoding happens at PREPARE time when two-phase-commit is enabled.
+-- Decoding after COMMIT PREPARED must have all the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check 'CLUSTER' (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The
+-- call should return within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test savepoints and sub-xacts. Creating savepoints will create
+-- sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that a GID containing "_nodecode" gets decoded at commit prepared time.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/twophase_stream.sql b/contrib/test_decoding/sql/twophase_stream.sql
new file mode 100644
index 0000000..e9dd44f
--- /dev/null
+++ b/contrib/test_decoding/sql/twophase_stream.sql
@@ -0,0 +1,45 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK TO s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED and the other changes in the transaction
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with
+-- filtered gid. gids with '_nodecode' will not be decoded at prepare time.
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..6ac2a60 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,13 +67,24 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool two_phase);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool two_phase);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 
+/* helper functions for decoding transactions */
+static inline bool FilterPrepare(LogicalDecodingContext *ctx, const char *gid);
+static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx,
+							  XLogRecordBuffer *buf, Oid dbId,
+							  RepOriginId origin_id);
+
 /*
  * Take every XLogReadRecord()ed record and perform the actions required to
  * decode it using the output plugin already setup in the logical decoding
@@ -244,6 +255,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		two_phase = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +265,15 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+					two_phase = !(FilterPrepare(ctx, parsed.twophase_gid));
+
+				DecodeCommit(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +282,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		two_phase = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +292,15 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+					two_phase = !(FilterPrepare(ctx, parsed.twophase_gid));
+
+				DecodeAbort(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,37 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (FilterPrepare(ctx, parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -520,6 +569,23 @@ DecodeHeapOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	}
 }
 
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+static inline bool
+FilterPrepare(LogicalDecodingContext *ctx, const char *gid)
+{
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (ctx->callbacks.filter_prepare_cb == NULL)
+		return false;
+
+	return filter_prepare_cb_wrapper(ctx, gid);
+}
+
 static inline bool
 FilterByOrigin(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -582,10 +648,15 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'two_phase' indicates that caller wants to process the transaction in two
+ * phases, first process prepare if not already done and then process
+ * commit_prepared.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool two_phase)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -606,15 +677,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * the reorderbuffer to forget the content of the (sub-)transactions
 	 * if not.
 	 *
-	 * There can be several reasons we might not be interested in this
-	 * transaction:
-	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
-	 * 2) The transaction happened in another database.
-	 * 3) The output plugin is not interested in the origin.
-	 * 4) We are doing fast-forwarding
-	 *
 	 * We can't just use ReorderBufferAbort() here, because we need to execute
 	 * the transaction's invalidations.  This currently won't be needed if
 	 * we're just skipping over the transaction because currently we only do
@@ -627,9 +689,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * relevant syscaches.
 	 * ---
 	 */
-	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
-		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
-		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
 	{
 		for (i = 0; i < parsed->nsubxacts; i++)
 		{
@@ -647,34 +707,164 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Send the final commit record if the transaction data is already
+	 * decoded, otherwise, process the entire transaction.
+	 */
+	if (two_phase)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ *
+ * Note that we don't skip prepare even if we have detected concurrent abort.
+ * The reason is that it is quite possible that we had already sent some
+ * changes before we detect abort in which case we need to abort those changes
+ * in the subscriber. To abort such changes, we do send the prepare and then
+ * the rollback prepared which is what happened on the publisher-side as well.
+ * We could invent a new abort API which, in such cases, sends an abort and
+ * skips the prepare and rollback prepared. That is not straightforward,
+ * though, because we might have streamed this transaction by that time, in
+ * which case it is handled when the rollback is encountered. Optimizing the
+ * concurrent abort case is not impossible, but it can introduce design
+ * complexity w.r.t. handling the different cases, so we leave it for now as
+ * it doesn't seem worth it.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	SnapBuild  *builder = ctx->snapshot_builder;
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz prepare_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		prepare_time = parsed->origin_timestamp;
+
+	/*
+	 * Remember the prepare info for a txn so that it can be used later in
+	 * commit prepared if required. See ReorderBufferFinishPrepared.
+	 */
+	if (!ReorderBufferRememberPrepareInfo(ctx->reorder, xid, buf->origptr,
+										  buf->endptr, prepare_time, origin_id,
+										  origin_lsn))
+		return;
+
+	/* We can't start streaming unless a consistent state is reached. */
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
+	{
+		ReorderBufferSkipPrepare(ctx->reorder, xid);
+		return;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See
+	 * DecodeTXNNeedSkip for the reasons why we sometimes want to skip the
+	 * transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed, removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache
+	 * invalidations if there are any for the reasons mentioned in
+	 * DecodeCommit.
+	 */
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
+	{
+		ReorderBufferSkipPrepare(ctx->reorder, xid);
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/* Tell the reorderbuffer about the surviving subtransactions. */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'two_phase' indicates that the prepared transaction should be finished.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool two_phase)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz abort_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool		skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		abort_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See
+	 * DecodeTXNNeedSkip for the reasons why we sometimes want to skip the
+	 * transaction.
+	 */
+	skip_xact = DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id);
+
+	/*
+	 * Send the final rollback record for a prepared transaction unless we
+	 * need to skip it. For non-two-phase xacts, simply forget the xact.
+	 */
+	if (two_phase && !skip_xact)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									abort_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
 	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
@@ -1080,3 +1270,24 @@ DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tuple)
 	header->t_infomask2 = xlhdr.t_infomask2;
 	header->t_hoff = xlhdr.t_hoff;
 }
+
+/*
+ * Check whether we are interested in this specific transaction.
+ *
+ * There can be several reasons we might not be interested in this
+ * transaction:
+ * 1) We might not be interested in decoding transactions up to this
+ *	  LSN. This can happen because we previously decoded it and now just
+ *	  are restarting or if we haven't assembled a consistent snapshot yet.
+ * 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
+ * 4) We are doing fast-forwarding
+ */
+static bool
+DecodeTXNNeedSkip(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+				  Oid txn_dbid, RepOriginId origin_id)
+{
+	return (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+			(txn_dbid != InvalidOid && txn_dbid != ctx->slot->data.database) ||
+			ctx->fast_forward || FilterByOrigin(ctx, origin_id));
+}
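
For readers following the decode.c changes, the routing decision made in
DecodeXactOp() for PREPARE and COMMIT PREPARED records can be modelled in
isolation. The following is only a rough standalone sketch, not code from the
patch: DemoCtx, DemoAction and all demo_* functions are hypothetical names made
up for illustration, and the "_nodecode" substring check is an assumption about
how the test_decoding filter behaves, based on the test comments above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for LogicalDecodingContext; not the real struct. */
typedef struct DemoCtx
{
	bool		twophase;	/* output plugin supports two-phase decoding */
	bool		(*filter_prepare_cb) (const char *gid);	/* optional */
} DemoCtx;

/* Hypothetical labels for what the decoder would do with a record. */
typedef enum DemoAction
{
	DEMO_DECODE_AT_PREPARE,		/* replay changes when PREPARE is seen */
	DEMO_DEFER_TO_COMMIT,		/* only track the xid, decode later */
	DEMO_FINISH_PREPARED,		/* send just the final COMMIT PREPARED */
	DEMO_FULL_COMMIT			/* replay the whole xact as a plain commit */
} DemoAction;

/* Assumed filter: skip two-phase decoding for GIDs containing "_nodecode". */
static bool
demo_filter_nodecode(const char *gid)
{
	return strstr(gid, "_nodecode") != NULL;
}

/* Mirrors FilterPrepare(): a missing callback means nothing is filtered. */
static bool
demo_filter_prepare(const DemoCtx *ctx, const char *gid)
{
	if (ctx->filter_prepare_cb == NULL)
		return false;
	return ctx->filter_prepare_cb(gid);
}

/* Decision taken when a PREPARE record is read. */
DemoAction
demo_on_prepare(const DemoCtx *ctx, const char *gid)
{
	if (!ctx->twophase || demo_filter_prepare(ctx, gid))
		return DEMO_DEFER_TO_COMMIT;
	return DEMO_DECODE_AT_PREPARE;
}

/* Decision taken when a COMMIT PREPARED record is read. */
DemoAction
demo_on_commit_prepared(const DemoCtx *ctx, const char *gid)
{
	bool		two_phase = ctx->twophase && !demo_filter_prepare(ctx, gid);

	return two_phase ? DEMO_FINISH_PREPARED : DEMO_FULL_COMMIT;
}
```

The point of the sketch is that the same filter answer has to be used at both
PREPARE and COMMIT PREPARED time: a GID that was filtered at prepare must be
replayed in full at commit, while one that was decoded at prepare only needs
the final commit record.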
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7359fa9..36d5fb9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,18 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or decoding them at PREPARE. Keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots.
+ *
+ * We additionally remove tuplecids after decoding the transaction at prepare
+ * time as we only need to perform invalidation at rollback or commit prepared.
+ *
+ * 'txn_prepared' indicates that we have decoded the transaction at prepare
+ * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1552,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1586,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1755,9 +1792,10 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * If the transaction was (partially) streamed, we need to commit it in a
- * 'streamed' way.  That is, we first stream the remaining part of the
- * transaction, and then invoke stream_commit message.
+ * If the transaction was (partially) streamed, we need to prepare or commit
+ * it in a 'streamed' way.  That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_prepare or stream_commit message as per
+ * the case.
  */
 static void
 ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1767,29 +1805,49 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		/*
+		 * Note, we send stream prepare even if a concurrent abort is
+		 * detected. See DecodePrepare for more information.
+		 */
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of COMMIT PREPARED, so for now
+		 * just truncate the txn by removing changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
  * Set xid to detect concurrent aborts.
  *
- * While streaming an in-progress transaction there is a possibility that the
- * (sub)transaction might get aborted concurrently.  In such case if the
- * (sub)transaction has catalog update then we might decode the tuple using
- * wrong catalog version.  For example, suppose there is one catalog tuple with
- * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
- * and after that we will have two tuples (xmin: 500, xmax: 501) and
- * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
- * say 502 updates the same catalog tuple then the first tuple will be changed
- * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
- * the tuple inserted/updated in 501 after the catalog update, we will see the
- * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
- * consider that the tuple is deleted by xid 502 which is not visible to our
- * snapshot.  And when we will try to decode with that catalog tuple, it can
- * lead to a wrong result or a crash.  So, it is necessary to detect
- * concurrent aborts to allow streaming of in-progress transactions.
+ * While streaming an in-progress transaction or decoding a prepared
+ * transaction there is a possibility that the (sub)transaction might get
+ * aborted concurrently.  In such case if the (sub)transaction has catalog
+ * update then we might decode the tuple using wrong catalog version.  For
+ * example, suppose there is one catalog tuple with (xmin: 500, xmax: 0).  Now,
+ * the transaction 501 updates the catalog tuple and after that we will have
+ * two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).  Now, if 501 is
+ * aborted and some other transaction say 502 updates the same catalog tuple
+ * then the first tuple will be changed to (xmin: 500, xmax: 502).  So, the
+ * problem is that when we try to decode the tuple inserted/updated in 501
+ * after the catalog update, we will see the catalog tuple with (xmin: 500,
+ * xmax: 502) as visible because it will consider that the tuple is deleted by
+ * xid 502 which is not visible to our snapshot.  And when we will try to
+ * decode with that catalog tuple, it can lead to a wrong result or a crash.
+ * So, it is necessary to detect concurrent aborts to allow streaming of
+ * in-progress transactions or decoding of prepared transactions.
  *
  * For detecting the concurrent abort we set CheckXidAlive to the current
  * (sub)transaction's xid for which this change belongs to.  And, during
@@ -1798,7 +1856,10 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * and discard the already streamed changes on such an error.  We might have
  * already streamed some of the changes for the aborted (sub)transaction, but
  * that is fine because when we decode the abort we will stream abort message
- * to truncate the changes in the subscriber.
+ * to truncate the changes in the subscriber. Similarly, for prepared
+ * transactions, we stop decoding if a concurrent abort is detected and then
+ * roll back the changes when rollback prepared is encountered. See
+ * DecodePrepare.
  */
 static inline void
 SetupCheckXidLive(TransactionId xid)
@@ -1900,7 +1961,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1912,15 +1973,19 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		specinsert = NULL;
 	}
 
-	/* Stop the stream. */
-	rb->stream_stop(rb, txn, last_lsn);
-
-	/* Remember the command ID and snapshot for the streaming run */
-	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	/*
+	 * For the streaming case, stop the stream and remember the command ID and
+	 * snapshot for the streaming run.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_stop(rb, txn, last_lsn);
+		ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	}
 }
 
 /*
- * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ * Helper function for ReorderBufferReplay and ReorderBufferStreamTXN.
  *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
@@ -1973,9 +2038,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		else
 			StartTransactionCommand();
 
-		/* We only need to send begin/commit for non-streamed transactions. */
+		/*
+		 * We only need to send begin/begin-prepare for non-streamed
+		 * transactions.
+		 */
 		if (!streaming)
-			rb->begin(rb, txn);
+		{
+			if (rbtxn_prepared(txn))
+				rb->begin_prepare(rb, txn);
+			else
+				rb->begin(rb, txn);
+		}
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -2006,8 +2079,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			prev_lsn = change->lsn;
 
-			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			/*
+			 * Set the current xid to detect concurrent aborts. This is
+			 * required for the cases when we decode the changes before the
+			 * COMMIT record is processed.
+			 */
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2298,7 +2375,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2332,15 +2418,22 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here for one of four reasons: 1. Decoding an
+		 * in-progress txn. 2. Decoding a prepared txn. 3. Decoding of a
+		 * prepared txn that was (partially) streamed. 4. Decoding a committed
+		 * txn.
+		 *
+		 * For 1, we allow truncation of txn data by removing the changes
+		 * already streamed but still keeping other things like invalidations,
+		 * snapshot, and tuplecids. For 2 and 3, we tell
+		 * ReorderBufferTruncateTXN to do a more thorough truncation of txn
+		 * data, as the entire transaction has been decoded except for commit.
+		 * For 4, as the entire txn has been decoded, we can fully clean up
+		 * the TXN reorder buffer.
 		 */
-		if (streaming)
+		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
-
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2373,17 +2466,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2413,26 +2509,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
  * transaction with ReorderBufferAssignChild.
  *
- * This interface is called once a toplevel commit is read for both streamed
- * as well as non-streamed transactions.
+ * This interface is called once a prepare or toplevel commit is read for both
+ * streamed as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferReplay(ReorderBufferTXN *txn,
+					ReorderBuffer *rb, TransactionId xid,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2462,7 +2551,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	if (txn->base_snapshot == NULL)
 	{
 		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+
+		/*
+		 * Removing this txn before a commit might result in the computation
+		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
+		 */
+		if (!rbtxn_prepared(txn))
+			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
 
@@ -2474,6 +2569,169 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Record the prepare information for a transaction.
+ */
+bool
+ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+								 TimestampTz prepare_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return false;
+
+	/*
+	 * Remember the prepare information to be later used by commit prepared in
+	 * case we skip doing prepare.
+	 */
+	txn->final_lsn = prepare_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = prepare_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	return true;
+}
+
+/* Remember that we have skipped prepare */
+void
+ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
+}
+
+/*
+ * Prepare a two-phase transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = pstrdup(gid);
+
+	/* The prepare info must have been updated in txn by now. */
+	Assert(txn->final_lsn != InvalidXLogRecPtr);
+
+	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
+						txn->commit_time, txn->origin_id, txn->origin_lsn);
+}
+
+/*
+ * This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time, RepOriginId origin_id,
+							XLogRecPtr origin_lsn, char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/* add the gid in the txn */
+	txn->gid = pstrdup(gid);
+
+	/*
+	 * It is possible that this transaction is not decoded at prepare time
+	 * either because by that time we didn't have a consistent snapshot or it
+	 * was decoded earlier but we have restarted. We can't distinguish between
+	 * those two cases so we send the prepare in both the cases and let
+	 * downstream decide whether to process or skip it. We don't need to
+	 * decode the xact for aborts if it is not done already.
+	 */
+	if (!rbtxn_prepared(txn) && is_commit)
+	{
+		txn->txn_flags |= RBTXN_PREPARE;
+
+		/*
+		 * The prepare info must have been updated in txn even if we skip
+		 * prepare.
+		 */
+		Assert(txn->final_lsn != InvalidXLogRecPtr);
+
+		/*
+		 * By this time the txn has the prepare record information, and it is
+		 * important to use it so that the downstream gets accurate
+		 * information. If we passed the commit information here instead, the
+		 * downstream could behave as if it had already replayed commit
+		 * prepared after the restart.
+		 */
+		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
+							txn->commit_time, txn->origin_id, txn->origin_lsn);
+	}
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	if (is_commit)
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else
+		rb->rollback_prepared(rb, txn, commit_lsn);
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2605,6 +2863,39 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ *
+ * Note that this is a special-purpose function for prepared transactions where
+ * we don't want to clean up the TXN even when we decide to skip it. See
+ * DecodePrepare.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 9d5d68f..dc3ef74 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -834,6 +834,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
 			continue;
 
+		/*
+		 * We don't need to add snapshot to prepared transactions as they
+		 * should not see the new catalog contents.
+		 */
+		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+			continue;
+
 		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
 			 txn->xid, (uint32) (lsn >> 32), (uint32) lsn);
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 9eaba6a..88c0ed3 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -174,6 +174,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_SKIPPED_PREPARE	  0x0100
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +235,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Was the prepare for this transaction skipped? */
+#define rbtxn_skip_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -258,10 +272,11 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	first_lsn;
 
 	/* ----
-	 * LSN of the record that lead to this xact to be committed or
+	 * LSN of the record that led to this xact to be prepared or committed or
 	 * aborted. This can be a
 	 * * plain commit record
 	 * * plain commit record, of a parent transaction
+	 * * prepared transaction
 	 * * prepared transaction commit
 	 * * plain abort record
 	 * * prepared transaction abort
@@ -293,7 +308,8 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	origin_lsn;
 
 	/*
-	 * Commit time, only known when we read the actual commit record.
+	 * Commit or Prepare time, only known when we read the actual commit or
+	 * prepare record.
 	 */
 	TimestampTz commit_time;
 
@@ -624,12 +640,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -643,10 +665,17 @@ void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr l
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
 											   SharedInvalidationMessage *invalidations);
 void		ReorderBufferProcessXid(ReorderBuffer *, TransactionId xid, XLogRecPtr lsn);
+
 void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLogRecPtr lsn);
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
+											 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+											 TimestampTz prepare_time,
+											 RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
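
As an aside for readers of the patch above: the new flag bits decide which cleanup path ReorderBufferProcessTXN takes once replay finishes. Below is a minimal standalone sketch of that decision only. The flag values are copied from the reorderbuffer.h hunk; SketchTXN, SketchCleanup and sketch_cleanup_after_replay are hypothetical names invented for this illustration and do not exist in the patch, and no actual truncation or freeing is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as they appear in the reorderbuffer.h hunk. */
#define RBTXN_IS_STREAMED		0x0010
#define RBTXN_PREPARE			0x0080
#define RBTXN_SKIPPED_PREPARE	0x0100

/* Hypothetical stand-in for ReorderBufferTXN; only the flags word is kept. */
typedef struct SketchTXN
{
	uint32_t	txn_flags;
} SketchTXN;

/* Same shape as rbtxn_prepared() / rbtxn_skip_prepared() in the patch. */
#define sketch_prepared(txn) \
	(((txn)->txn_flags & RBTXN_PREPARE) != 0)
#define sketch_skip_prepared(txn) \
	(((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0)

/* Hypothetical labels for the three outcomes described in the comment. */
typedef enum SketchCleanup
{
	SKETCH_TRUNCATE_STREAMED,	/* drop streamed changes, keep tuplecids */
	SKETCH_TRUNCATE_PREPARED,	/* drop changes and tuplecids as well */
	SKETCH_FULL_CLEANUP			/* committed txn: free everything */
} SketchCleanup;

/*
 * Mirrors "if (streaming || rbtxn_prepared(txn))" followed by
 * ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn)): the prepared
 * flag wins over streaming when choosing how much to truncate.
 */
static SketchCleanup
sketch_cleanup_after_replay(const SketchTXN *txn, bool streaming)
{
	assert(txn != NULL);

	if (sketch_prepared(txn))
		return SKETCH_TRUNCATE_PREPARED;
	if (streaming)
		return SKETCH_TRUNCATE_STREAMED;
	return SKETCH_FULL_CLEANUP;
}
```

A caller would set RBTXN_PREPARE on the flags word, as ReorderBufferPrepare does, before asking which cleanup applies. A prepared transaction that was partially streamed still gets the prepared-style truncation.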

v32-0003-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From 16e2b2aec96d753610149df1b0f684fac983480d Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 11 Dec 2020 14:31:11 +0530
Subject: [PATCH v32 3/9] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3874939..4f75e85 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -924,30 +926,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -955,7 +948,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -970,7 +963,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1045,6 +1038,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v32-0004-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 335a06b64e075973cae45876bfea89291e14c227 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 12 Dec 2020 16:52:28 +0530
Subject: [PATCH v32 4/9] Add support for apply at prepare time to built-in
 logical replication.

To add support for streaming transactions at prepare time to built-in
logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* We allow skipping a prepared transaction if it has already been prepared.
We skip only when the GID, origin_lsn, and origin_timestamp of a prepared
xact all match, to avoid matching a prepared xact from a different node.
Seeing an already-prepared transaction again can happen when the server or
apply worker restarts after a prepared transaction.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase, and
moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  80 ++++++-
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 256 +++++++++++++++++++++-
 src/backend/replication/logical/worker.c    | 329 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 167 +++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  58 ++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   1 +
 9 files changed, 872 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9b..3ee7f8a 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1133,9 +1133,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
@@ -2433,3 +2433,77 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and LSN is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid confusing prepared xacts
+ * that come from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			if (prepare_end_lsn == InvalidXLogRecPtr)
+			{
+				found = true;
+				break;
+			}
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 15ab8e7..dd33469 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -957,8 +957,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..7af8386 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,260 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr rollback_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, rollback_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4f75e85..2ddb832 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* for skipping prepared transaction */
+bool        skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,263 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &prepare_data);
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	/*
+	 * It is possible that we haven't received the prepare because it occurred
+	 * before we reached a consistent point, in which case we need to skip the
+	 * rollback prepared.
+	 */
+	if (LookupGXact(prepare_data.gid, InvalidXLogRecPtr, 0))
+	{
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(prepare_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1026,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it will
 	 * be committed after processing all the messages. We need the transaction
@@ -800,6 +1079,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -835,6 +1120,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1053,6 +1344,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1176,6 +1473,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1297,6 +1597,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1454,6 +1757,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1823,6 +2129,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1979,6 +2288,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 49d25b0..a6bafaa 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,14 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +65,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +342,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,27 +362,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -378,6 +383,65 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool        send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -766,17 +830,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -857,6 +912,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1171,3 +1244,31 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 	while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
 		entry->replicate_valid = false;
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr	origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..5afb977 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..427a8ee 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,32 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for prepare, commit prepared and rollback prepared
+ * transaction. prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared and rollback lsn and rollback time for
+ * rollback prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +161,23 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr rollback_lsn);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepPreparedTxnData *prepare_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +221,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 88c0ed3..2cd2e5e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e82b4f7..cba9d84 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,6 +1342,7 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

v32-0005-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From c6de1d4cf02ad8ec0890386094c5f7dfe194b4c6 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:32:53 -0500
Subject: [PATCH v32 5/9] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are rolled back on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test (streaming enabled)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then one server crashes before the 2PC transaction is committed.
+# 3. After the crashed server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then one server crashes before the 2PC transaction is committed.
+# 3. After the crashed server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction.
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming enabled.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v32-0006-Support-2PC-documentation.patch (application/octet-stream)
From e3f7e7f95175084c21a9e28d8898bb83f622318d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:43:40 -0500
Subject: [PATCH v32 6/9] Support-2PC-documentation.

Add documentation about two-phase commit support in Logical Decoding.
---
 doc/src/sgml/logicaldecoding.sgml | 99 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 98 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 180699a..bb882fd 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -165,7 +165,57 @@ COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
-  </sect1>
+
+  <para>
+  The following example shows how logical decoding can be used to handle transactions
+  that use a two-phase commit. Before you use two-phase commit commands, you must set
+  <varname>max_prepared_transactions</varname> to at least 1. You must also set the
+  option 'two-phase-commit' to 1 when calling <function>pg_logical_slot_get_changes</function>.
+  </para>
+<programlisting>
+postgres=# BEGIN;
+postgres=*# INSERT INTO data(data) VALUES('5');
+postgres=*# PREPARE TRANSACTION 'test_prepared1';
+
+postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/1689DC0 | 529 | BEGIN 529
+ 0/1689DC0 | 529 | table public.data: INSERT: id[integer]:3 data[text]:'5'
+ 0/1689FC0 | 529 | PREPARE TRANSACTION 'test_prepared1', txid 529
+(3 rows)
+
+postgres=# COMMIT PREPARED 'test_prepared1';
+COMMIT PREPARED
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                    data                    
+-----------+-----+--------------------------------------------
+ 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
+(1 row)
+
+postgres=# -- you can also roll back a prepared transaction
+postgres=# BEGIN;
+BEGIN
+postgres=*# INSERT INTO data(data) VALUES('6');INSERT 0 1
+postgres=*# PREPARE TRANSACTION 'test_prepared2';PREPARE TRANSACTION
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/168A180 | 530 | BEGIN 530
+ 0/168A1E8 | 530 | table public.data: INSERT: id[integer]:4 data[text]:'6'
+ 0/168A430 | 530 | PREPARE TRANSACTION 'test_prepared2', txid 530
+(3 rows)
+
+postgres=# ROLLBACK PREPARED 'test_prepared1';ERROR:  prepared transaction with identifier "test_prepared1" does not exist
+postgres=# ROLLBACK PREPARED 'test_prepared2';
+ROLLBACK PREPARED
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                     data                     
+-----------+-----+----------------------------------------------
+ 0/168A4B8 | 530 | ROLLBACK PREPARED 'test_prepared2', txid 530
+(1 row)
+</programlisting>
+</sect1>
 
   <sect1 id="logicaldecoding-explanation">
    <title>Logical Decoding Concepts</title>
@@ -1119,4 +1169,51 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
    </para>
 
   </sect1>
+
+  <sect1 id="logicaldecoding-two-phase-commits">
+   <title>Two-phase commit support for Logical Decoding</title>
+
+   <para>
+   With the basic output plugin callbacks (e.g., <function>begin_cb</function>,
+   <function>change_cb</function>, <function>commit_cb</function> and
+   <function>message_cb</function>) two-phase commit commands like
+   <command>PREPARE TRANSACTION</command>, <command>COMMIT PREPARED</command>
+   and <command>ROLLBACK PREPARED</command> are not decoded correctly.
+   While <command>PREPARE TRANSACTION</command> is ignored,
+   <command>COMMIT PREPARED</command> is decoded as a <command>COMMIT</command> and
+   <command>ROLLBACK PREPARED</command> is decoded as a <command>ROLLBACK</command>.
+   </para>
+
+   <para>
+   An output plugin may provide additional callbacks to support two-phase commit commands.
+   Several two-phase commit callbacks are required
+   (<function>begin_prepare_cb</function>, <function>prepare_cb</function>,
+   <function>commit_prepared_cb</function>,
+   <function>rollback_prepared_cb</function> and <function>stream_prepare_cb</function>),
+   and one is optional (<function>filter_prepare_cb</function>).
+   </para>
+
+   <para>
+   If the output plugin callbacks for decoding two-phase commit commands are provided,
+   then on <command>PREPARE TRANSACTION</command>, the changes of that transaction are
+   decoded, passed to the output plugin and the <function>prepare_cb</function>
+   callback is invoked. This differs from the basic decoding setup where changes are
+   only passed to the output plugin when a transaction is committed. The start of a
+   prepared transaction is indicated by the <function>begin_prepare_cb</function> callback.
+   </para>
+
+   <para>
+   When a prepared transaction is rolled back using <command>ROLLBACK PREPARED</command>,
+   then the <function>rollback_prepared_cb</function> callback is invoked and when the
+   prepared transaction is committed using <command>COMMIT PREPARED</command>,
+   then the <function>commit_prepared_cb</function> callback is invoked.
+   </para>
+
+   <para>
+   Optionally, the output plugin can specify a name pattern in the
+   <function>filter_prepare_cb</function> callback; transactions whose GID contains
+   that name pattern will not be decoded as two-phase commit transactions.
+   </para>
+
+  </sect1>
  </chapter>
-- 
1.8.3.1
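
For readers following the documentation patch above, here is a minimal
sketch of the callback selection it describes. It is a toy model and not
code from the patchset: PluginModel, callback_for() and skip_nodecode() are
hypothetical names made up for illustration, and the "_nodecode" gid
pattern is an assumed example of what a filter_prepare_cb might match.
Only the callback names in the returned strings come from the
documentation text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Two-phase commands as they appear in the decoded stream. */
typedef enum
{
	EV_PREPARE,
	EV_COMMIT_PREPARED,
	EV_ROLLBACK_PREPARED
} TwoPhaseEvent;

/*
 * Hypothetical, simplified stand-in for an output plugin: it records only
 * whether the two-phase callbacks were registered and an optional gid
 * filter. This is NOT the real OutputPluginCallbacks struct.
 */
typedef struct
{
	bool		has_twophase_cbs;
	bool		(*filter_prepare) (const char *gid);	/* may be NULL */
} PluginModel;

/* Hypothetical filter: skip two-phase decoding for gids containing "_nodecode". */
bool
skip_nodecode(const char *gid)
{
	return strstr(gid, "_nodecode") != NULL;
}

/* Name the callback the documentation says is invoked for each command. */
const char *
callback_for(const PluginModel *plugin, TwoPhaseEvent ev, const char *gid)
{
	bool		twophase;

	assert(plugin != NULL && gid != NULL);

	/* A filtered gid falls back to the basic (non two-phase) behaviour. */
	twophase = plugin->has_twophase_cbs;
	if (twophase && plugin->filter_prepare != NULL && plugin->filter_prepare(gid))
		twophase = false;

	switch (ev)
	{
		case EV_PREPARE:
			return twophase ? "prepare_cb" : "(ignored)";
		case EV_COMMIT_PREPARED:
			return twophase ? "commit_prepared_cb" : "commit_cb";
		case EV_ROLLBACK_PREPARED:
			return twophase ? "rollback_prepared_cb" : "(decoded as ROLLBACK)";
	}
	return "(unknown)";
}
```

With a model that has no two-phase callbacks, callback_for() reports
PREPARE TRANSACTION as ignored and COMMIT PREPARED as a plain commit_cb,
which is the behaviour the first paragraph of the new section warns about.
With the callbacks present and a gid that the filter does not match, the
prepare_cb, commit_prepared_cb and rollback_prepared_cb names are returned
instead.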

v32-0007-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From b0cb3cdfd44721fadae82c8dfc0d7ad4cc5b316d Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Mon, 14 Dec 2020 12:07:14 +0530
Subject: [PATCH v32 7/9] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.

Note: The tablesync worker slot always has two_phase disabled, regardless of the option.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index db5e59f..dbe2a43 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -166,8 +166,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, the decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index ca78d39..886839e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -67,6 +67,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c21..5f4e191 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1149,7 +1149,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 1696454..b0745d5 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -64,7 +64,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -105,6 +106,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -210,6 +216,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -355,6 +370,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -379,7 +396,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -447,6 +465,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -720,6 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -730,7 +751,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -769,6 +791,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -787,7 +816,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -832,7 +862,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -875,7 +906,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 24f8b3e..1f404cd 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2ddb832..2362fba 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2794,6 +2794,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		(!am_tablesync_worker() && newsub->twophase != MySubscription->twophase) ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3440,6 +3441,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase && !am_tablesync_worker();
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index a6bafaa..dd819c9 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 673a670..cb707bf 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4221,6 +4221,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4264,9 +4265,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4287,6 +4293,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4312,6 +4319,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4380,6 +4389,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 317bb83..22e4e6c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -629,6 +629,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 14150d0..47306a2 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5997,7 +5997,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6023,13 +6023,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3fa02af..e07eed0 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -53,6 +53,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -90,6 +92,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 427a8ee..252f43c 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39..f96c891 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 2fa9bce..23d876e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,42 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 14fa0b2..2a0b366 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -147,6 +147,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

v32-0008-Support-2PC-consistent-snapshot-isolation-tests.patch (application/octet-stream)
From 5ab6a40149965bfba83861e4e1a65e9b62f98c05 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 17 Dec 2020 17:30:21 +0530
Subject: [PATCH v32 8/9] Support 2PC consistent snapshot isolation tests

Added isolation test-case to test that if a consistent snapshot is created
between a PREPARE and a COMMIT PREPARED, then the whole transaction is decoded
on COMMIT PREPARED.
---
 contrib/test_decoding/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 76d4a69..380b716 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
+	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream twophase_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
-- 
1.8.3.1

v32-0009-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From 6713624657fa349e6672540e814e4b7089390e14 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 17 Dec 2020 17:30:59 +0530
Subject: [PATCH v32 9/9] Support 2PC txn tests for concurrent aborts.

Add tap tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                  |  2 +
 contrib/test_decoding/test_decoding.c           | 58 +++++++++++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c |  5 +++
 3 files changed, 65 insertions(+)

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 380b716..e314bad 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream twophase_snapshot
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 94ba227..83ca592 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -173,6 +176,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -274,6 +278,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -468,6 +490,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -617,6 +663,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -703,6 +752,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -915,6 +967,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -968,6 +1023,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 36d5fb9..4bc9e1e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2488,6 +2488,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
1.8.3.1
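
A side note on the check-xid-aborted option in the patch above: it parses the XID with strtoul() and a NULL end pointer, so trailing characters after the number are not rejected. A stricter parse could look roughly like the sketch below. This is a standalone illustration and not part of the patch; xid_t, INVALID_XID and parse_xid() are hypothetical stand-ins for TransactionId, InvalidTransactionId and the option-parsing code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for TransactionId and InvalidTransactionId. */
typedef uint32_t xid_t;
#define INVALID_XID ((xid_t) 0)

/*
 * Parse a transaction ID from text.  Returns true and stores the value in
 * *out on success.  Returns false for empty input, trailing garbage,
 * values that do not fit in 32 bits, or the invalid XID 0.
 */
bool
parse_xid(const char *text, xid_t *out)
{
	char	   *end;
	unsigned long val;

	assert(text != NULL && out != NULL);

	errno = 0;
	val = strtoul(text, &end, 0);

	/* Reject overflow, "no digits at all", and trailing characters. */
	if (errno != 0 || end == text || *end != '\0')
		return false;

	/* Reject values outside the 32-bit XID range and the invalid XID. */
	if (val > UINT32_MAX || (xid_t) val == INVALID_XID)
		return false;

	*out = (xid_t) val;
	return true;
}
```

With this shape, an input such as "12abc" is refused instead of being silently read as 12. Whether that strictness is worth it for a test-only option is a judgment call.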

#162Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#161)

On Thu, Dec 17, 2020 at 6:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 16, 2020 at 2:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 16, 2020 at 1:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Also, I guess we can improve the description of
’two_phase’ option of CREATE SUBSCRIPTION in the doc by adding the
fact that when this option is not enabled the transaction prepared on
the publisher is decoded as a normal transaction:

Sounds reasonable.

Fixed in the attached.

------
+   if (LookupGXact(begin_data.gid))
+   {
+       /*
+        * If this gid has already been prepared then we don't want to apply
+        * this txn again. This can happen after restart where upstream can
+        * send the prepared transaction again. See
+        * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+        */
+       skip_prepared_txn = true;
+       return;
+   }

When PREPARE arrives at the subscriber node but a prepared
transaction with the same transaction identifier already exists, the
apply worker skips the whole transaction. So if a user prepared a
transaction with the same identifier on the subscriber, the prepared
transaction that came from the publisher would be ignored without any
message. On the other hand, if applying other operations such as
HEAP_INSERT conflicts (for example, by violating a unique constraint),
the apply worker raises an ERROR and stops logical replication until the
conflict is resolved. IIUC, since we can know that the prepared
transaction came from the same publisher again by checking origin_lsn
in TwoPhaseFileHeader, I guess we can skip the PREPARE message only
when the existing prepared transaction has the same LSN and the same
identifier. To be exact, it’s still possible that the subscriber gets
two PREPARE messages having the same LSN and name from two different
publishers, but that’s unlikely to happen in practice.

The idea sounds reasonable. I'll try and see if this works.

I went ahead and used both origin_lsn and origin_timestamp to avoid
the possibility of a match of prepared xact from two different nodes.
We can handle this at begin_prepare and prepare time but we don't have
prepare_lsn and prepare_timestamp at rollback_prepared time, so what
do we do about that? As of now, I am using just the GID at
rollback_prepared time, and that would have been sufficient if we always
received the prepare before the rollback, because at prepare time we
would have checked origin_lsn and origin_timestamp. But it is possible
that we get a rollback prepared without a prepare if the prepare
happened before the consistent snapshot was reached and the rollback
happens after that.

Note that it is not easy to detect this case; otherwise, we would have
avoided sending rollback_prepared. See comments in
ReorderBufferFinishPrepared in patch
v32-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.
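
To make the matching rule concrete, here is a minimal standalone sketch of the check discussed above: a prepared transaction counts as "already prepared" only when the GID, origin LSN and origin timestamp all match. The PreparedXact struct and lookup_prepared() are simplified stand-ins invented for illustration; they are not the patch's data structures, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of one locally prepared transaction. */
typedef struct PreparedXact
{
	const char *gid;			/* global transaction identifier */
	uint64_t	origin_lsn;		/* LSN where the remote prepare ended */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
	bool		valid;			/* false while the prepare is in progress */
} PreparedXact;

/*
 * Return true only if a valid entry matches on GID, origin LSN and origin
 * timestamp.  A GID-only match is deliberately not enough.
 */
bool
lookup_prepared(const PreparedXact *xacts, size_t n, const char *gid,
				uint64_t prepare_end_lsn, int64_t prepare_timestamp)
{
	assert(xacts != NULL && gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		const PreparedXact *x = &xacts[i];

		/* Ignore not-yet-valid entries and entries with another GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == prepare_timestamp)
			return true;
	}
	return false;
}
```

A locally prepared transaction that happens to reuse the publisher's GID would carry a different origin LSN and timestamp, so it would not be mistaken for a resend of the remote prepare.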

--
With Regards,
Amit Kapila.

#163Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#157)

On Thu, Dec 17, 2020 at 9:30 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Dec 17, 2020 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Thinking about this again, I think it is good to disable it during
slot initialization, but does leaving it enabled actually cause any
problem? During slot initialization we don't stream any xact, and we
stop processing WAL as soon as we reach CONSISTENT_STATE. Did you
observe any problem with this?

Yes, it did not stream any xact during initialization but I was
surprised that the DecodePrepare code was invoked even though
I hadn't created the subscription with twophase enabled. No problem
was observed.

Fair enough, I have fixed this in the patch series posted some time back.

--
With Regards,
Amit Kapila.

#164Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#161)

On Thu, Dec 17, 2020 at 11:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I went ahead and used both origin_lsn and origin_timestamp to avoid
the possibility of a match of prepared xact from two different nodes.
We can handle this at begin_prepare and prepare time but we don't have
prepare_lsn and prepare_timestamp at rollback_prepared time, so what
do we do about that? As of now, I am using just the GID at
rollback_prepared time, and that would have been sufficient if we always
received the prepare before the rollback, because at prepare time we
would have checked origin_lsn and origin_timestamp. But it is possible
that we get a rollback prepared without a prepare if the prepare
happened before the consistent snapshot was reached and the rollback
happens after that. For the commit case, we do send the prepare and all
the data at commit time in such a situation, but doing so for the
rollback case doesn't sound like a good idea. Another possibility is
that we send prepare_lsn and prepare_time in the rollback_prepared API
to deal with this. I am not sure if it is a good idea to just rely on
the GID in rollback_prepared. What do you think?

Thinking about it for some time, my initial reaction was that the
distributed servers should maintain uniqueness of GIDs and re-checking
with LSNs is just overkill. But thinking some more, I realise that
since we allow reuse of GIDs, there could be a race condition where a
previously aborted or committed txn's GID is reused, which could lead
to this. Yes, I think we could change rollback_prepare to send out
prepare_lsn and prepare_time as well, just to be safe.

regards,
Ajin Cherian
Fujitsu Australia.

#165Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#164)
10 attachment(s)

On Fri, Dec 18, 2020 at 11:23 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Dec 17, 2020 at 11:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I went ahead and used both origin_lsn and origin_timestamp to avoid
the possibility of a match of prepared xact from two different nodes.
We can handle this at begin_prepare and prepare time but we don't have
prepare_lsn and prepare_timestamp at rollback_prepared time, so what
do we do about that? As of now, I am using just the GID at
rollback_prepared time, and that would have been sufficient if we always
received the prepare before the rollback, because at prepare time we
would have checked origin_lsn and origin_timestamp. But it is possible
that we get a rollback prepared without a prepare if the prepare
happened before the consistent snapshot was reached and the rollback
happens after that. For the commit case, we do send the prepare and all
the data at commit time in such a situation, but doing so for the
rollback case doesn't sound like a good idea. Another possibility is
that we send prepare_lsn and prepare_time in the rollback_prepared API
to deal with this. I am not sure if it is a good idea to just rely on
the GID in rollback_prepared. What do you think?

Thinking about it for some time, my initial reaction was that the
distributed servers should maintain uniqueness of GIDs and re-checking
with LSNs is just overkill. But thinking some more, I realise that
since we allow reuse of GIDs, there could be a race condition where a
previously aborted or committed txn's GID is reused, which could lead
to this. Yes, I think we could change rollback_prepare to send out
prepare_lsn and prepare_time as well, just to be safe.

Okay, I have changed the rollback_prepare API as discussed above and
accordingly handled the case where a rollback is received without a
prepare in apply_handle_rollback_prepared.
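
As a rough illustration of that handling, the sketch below models a subscriber that receives ROLLBACK PREPARED carrying the prepare end LSN and prepare time. It rolls back only when a matching local prepared transaction exists and otherwise skips the message. All names here (LocalPrepared, RollbackResult, handle_rollback_prepared) are hypothetical and the "rollback" is reduced to clearing a flag; this is a model of the decision, not the worker code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum RollbackResult
{
	ROLLBACK_APPLIED,			/* matching prepare found and rolled back */
	ROLLBACK_SKIPPED			/* no matching prepare, nothing to do */
} RollbackResult;

/* Hypothetical record of a transaction prepared on the subscriber. */
typedef struct LocalPrepared
{
	const char *gid;
	uint64_t	prepare_end_lsn;
	int64_t		prepare_time;
	bool		in_use;
} LocalPrepared;

/*
 * Handle a ROLLBACK PREPARED message.  The GID alone is not trusted: the
 * prepare end LSN and prepare time sent with the rollback must match too.
 */
RollbackResult
handle_rollback_prepared(LocalPrepared *slots, size_t nslots,
						 const char *gid, uint64_t prepare_end_lsn,
						 int64_t prepare_time)
{
	assert(slots != NULL && gid != NULL);

	for (size_t i = 0; i < nslots; i++)
	{
		LocalPrepared *p = &slots[i];

		if (p->in_use &&
			strcmp(p->gid, gid) == 0 &&
			p->prepare_end_lsn == prepare_end_lsn &&
			p->prepare_time == prepare_time)
		{
			p->in_use = false;	/* stands in for the real rollback */
			return ROLLBACK_APPLIED;
		}
	}

	/* The prepare was never received (or was already finished): skip. */
	return ROLLBACK_SKIPPED;
}
```

The skip branch covers the case where the prepare happened before the consistent snapshot was reached, so the subscriber never saw it.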

While testing for this case, I noticed that the tracking of
replication progress for aborts is not complete, due to which after a
restart we can again ask for the rollback LSN. This shouldn't be a
problem with the latest code because we will simply skip the rollback
when there is no corresponding prepare, but this is far from ideal
because tracking progress is the sole purpose of replication origins.
This was due to the incomplete handling of aborts in the original
commit 1eb6d6527a. I have fixed this now in a separate patch,
v33-0004-Track-replication-origin-progress-for-rollbacks. If you want
to see the problem, leave out
v33-0004-Track-replication-origin-progress-for-rollbacks and change the
code below as shown; the resulting regression failure occurs because we
are not tracking progress for aborts:

apply_handle_rollback_prepared
{
..
if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
rollback_data.preparetime))
..
}

to
apply_handle_rollback_prepared
{
..
Assert (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
rollback_data.preparetime));
..
}

--
With Regards,
Amit Kapila.

Attachments:

v33-0005-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From b4b5f14aa9851d8fb59283ce016caca1f37e4ed3 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 12 Dec 2020 16:52:28 +0530
Subject: [PATCH v33 05/10] Add support for apply at prepare time to built-in
 logical replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* We allow skipping prepared transactions if they are already prepared.
We ensure that we skip only when the GID, origin_lsn, and
origin_timestamp of a prepared xact all match, to avoid the possibility
of a match of prepared xact from two different nodes. Receiving an
already-prepared transaction can happen when the server or apply worker
restarts after a prepared transaction.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  74 ++++++-
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 260 +++++++++++++++++++++-
 src/backend/replication/logical/worker.c    | 330 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 172 ++++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 ++++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 895 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index fe10809..71cca00 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1133,9 +1133,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
@@ -2446,3 +2446,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if a prepared transaction with the given GID, LSN and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char* buf;
+			TwoPhaseFileHeader* hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 15ab8e7..dd33469 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -957,8 +957,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..1047385 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,264 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4f75e85..0acfc03 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* true while skipping a prepared transaction that was already applied */
+bool        skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,264 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in a single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1027,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it will
 	 * be committed after processing all the messages. We need the transaction
@@ -800,6 +1080,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -835,6 +1121,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1053,6 +1345,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1176,6 +1474,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1297,6 +1598,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1454,6 +1758,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1823,6 +2130,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1979,6 +2289,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 49d25b0..7cf2951 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,27 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -378,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool        send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -766,17 +835,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -857,6 +917,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1171,3 +1249,31 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 	while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
 		entry->replicate_valid = false;
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr	origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..5afb977 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..13ea3b7 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for prepare, and commit prepared transaction.
+ * prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6d63338..4b92e68 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e82b4f7..c217f54 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1339,12 +1339,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
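
The rbtxn_commit_prepared() and rbtxn_rollback_prepared() macros added to
reorderbuffer.h in the patch above are plain bit tests on txn_flags. A
minimal standalone sketch of how they behave follows. The flag values, the
SketchTXN struct, and sketch_prepared_txn_finished() are assumptions made up
for illustration; the real RBTXN_* values and ReorderBufferTXN are defined
elsewhere in the patch series and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define RBTXN_COMMIT_PREPARED	0x0100
#define RBTXN_ROLLBACK_PREPARED	0x0200

/* Stand-in for ReorderBufferTXN, reduced to the one field the macros read. */
typedef struct SketchTXN
{
	uint32_t	txn_flags;
} SketchTXN;

/* Same shape as the macros in the patch: test a single flag bit. */
#define rbtxn_commit_prepared(txn) \
( \
	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
)

#define rbtxn_rollback_prepared(txn) \
( \
	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
)

/*
 * Hypothetical helper: a prepared transaction counts as finished once
 * either outcome flag is set.
 */
static bool
sketch_prepared_txn_finished(const SketchTXN *txn)
{
	/* Assumption of this sketch: the two outcomes never coexist. */
	assert(!(rbtxn_commit_prepared(txn) && rbtxn_rollback_prepared(txn)));

	return rbtxn_commit_prepared(txn) || rbtxn_rollback_prepared(txn);
}
```

Setting RBTXN_COMMIT_PREPARED on a zero-initialised SketchTXN makes
rbtxn_commit_prepared() true while rbtxn_rollback_prepared() stays false,
which is all the macros are meant to express.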

v33-0006-Support-2PC-documentation.patch (application/octet-stream)
From 7a5a8017525d16c06f8b910978bc844b29b5377d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:43:40 -0500
Subject: [PATCH v33 06/10] Support-2PC-documentation.

Add documentation about two-phase commit support in Logical Decoding.
---
 doc/src/sgml/logicaldecoding.sgml | 99 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 98 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 829bbc1..e9394bf 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -165,7 +165,57 @@ COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
-  </sect1>
+
+  <para>
+  The following example shows how logical decoding can be used to handle transactions
+  that use a two-phase commit. Before you use two-phase commit commands, you must set
+  <varname>max_prepared_transactions</varname> to at least 1. You must also set the
+  option 'two-phase-commit' to 1 when calling <function>pg_logical_slot_get_changes</function>.
+  </para>
+<programlisting>
+postgres=# BEGIN;
+postgres=*# INSERT INTO data(data) VALUES('5');
+postgres=*# PREPARE TRANSACTION 'test_prepared1';
+
+postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/1689DC0 | 529 | BEGIN 529
+ 0/1689DC0 | 529 | table public.data: INSERT: id[integer]:3 data[text]:'5'
+ 0/1689FC0 | 529 | PREPARE TRANSACTION 'test_prepared1', txid 529
+(3 rows)
+
+postgres=# COMMIT PREPARED 'test_prepared1';
+COMMIT PREPARED
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                    data                    
+-----------+-----+--------------------------------------------
+ 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
+(1 row)
+
+postgres=# -- you can also roll back a prepared transaction
+postgres=# BEGIN;
+BEGIN
+postgres=*# INSERT INTO data(data) VALUES('6');
+postgres=*# PREPARE TRANSACTION 'test_prepared2';
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/168A180 | 530 | BEGIN 530
+ 0/168A1E8 | 530 | table public.data: INSERT: id[integer]:4 data[text]:'6'
+ 0/168A430 | 530 | PREPARE TRANSACTION 'test_prepared2', txid 530
+(3 rows)
+
+postgres=# ROLLBACK PREPARED 'test_prepared1';ERROR:  prepared transaction with identifier "test_prepared1" does not exist
+postgres=# ROLLBACK PREPARED 'test_prepared2';
+ROLLBACK PREPARED
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                     data                     
+-----------+-----+----------------------------------------------
+ 0/168A4B8 | 530 | ROLLBACK PREPARED 'test_prepared2', txid 530
+(1 row)
+</programlisting>
+</sect1>
 
   <sect1 id="logicaldecoding-explanation">
    <title>Logical Decoding Concepts</title>
@@ -1126,4 +1176,51 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
    </para>
 
   </sect1>
+
+  <sect1 id="logicaldecoding-two-phase-commits">
+   <title>Two-phase commit support for Logical Decoding</title>
+
+   <para>
+   With the basic output plugin callbacks (e.g., <function>begin_cb</function>,
+   <function>change_cb</function>, <function>commit_cb</function> and
+   <function>message_cb</function>) two-phase commit commands like
+   <command>PREPARE TRANSACTION</command>, <command>COMMIT PREPARED</command>
+   and <command>ROLLBACK PREPARED</command> are not decoded correctly.
+   <command>PREPARE TRANSACTION</command> is ignored, while
+   <command>COMMIT PREPARED</command> is decoded as a <command>COMMIT</command> and
+   <command>ROLLBACK PREPARED</command> is decoded as a <command>ROLLBACK</command>.
+   </para>
+
+   <para>
+   An output plugin may provide additional callbacks to support two-phase commit commands.
+   Several two-phase commit callbacks are required
+   (<function>begin_prepare_cb</function>, <function>prepare_cb</function>,
+   <function>commit_prepared_cb</function>,
+   <function>rollback_prepared_cb</function> and <function>stream_prepare_cb</function>)
+   and one is optional (<function>filter_prepare_cb</function>).
+   </para>
+
+   <para>
+   If the output plugin callbacks for decoding two-phase commit commands are provided,
+   then on <command>PREPARE TRANSACTION</command>, the changes of that transaction are
+   decoded, passed to the output plugin and the <function>prepare_cb</function>
+   callback is invoked. This differs from the basic decoding setup where changes are
+   only passed to the output plugin when a transaction is committed. The start of a
+   prepared transaction is indicated by the <function>begin_prepare_cb</function> callback.
+   </para>
+
+   <para>
+   When a prepared transaction is rolled back using <command>ROLLBACK PREPARED</command>,
+   the <function>rollback_prepared_cb</function> callback is invoked, and when the
+   prepared transaction is committed using <command>COMMIT PREPARED</command>,
+   the <function>commit_prepared_cb</function> callback is invoked.
+   </para>
+
+   <para>
+   Optionally, the output plugin can specify a name pattern in
+   <function>filter_prepare_cb</function>; transactions whose GID contains
+   that name pattern will not be decoded as two-phase commit transactions.
+   </para>
+
+  </sect1>
  </chapter>
-- 
1.8.3.1
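
The callback behaviour described in the documentation patch above can be
summarised as a small lookup: which output plugin callback ends up handling
each two-phase command, depending on whether the plugin provides the
two-phase callbacks. The following is only a sketch under that reading of
the documentation text. SketchEvent and sketch_callback_for() are
hypothetical names that do not exist in PostgreSQL, and the strings returned
for the "ignored" and "plain ROLLBACK" cases are placeholders.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical event type covering the three two-phase commands. */
typedef enum SketchEvent
{
	SKETCH_PREPARE,
	SKETCH_COMMIT_PREPARED,
	SKETCH_ROLLBACK_PREPARED
} SketchEvent;

/*
 * Return the name of the callback that handles the event, following the
 * description in the documentation patch. This is not the decoding code
 * itself, only a table of its documented outcomes.
 */
static const char *
sketch_callback_for(SketchEvent ev, bool has_twophase_cbs)
{
	if (has_twophase_cbs)
	{
		switch (ev)
		{
			case SKETCH_PREPARE:
				/* Preceded by begin_prepare_cb and the change callbacks. */
				return "prepare_cb";
			case SKETCH_COMMIT_PREPARED:
				return "commit_prepared_cb";
			case SKETCH_ROLLBACK_PREPARED:
				return "rollback_prepared_cb";
		}
	}
	else
	{
		switch (ev)
		{
			case SKETCH_PREPARE:
				/* Basic callbacks only: the PREPARE is ignored. */
				return "(ignored)";
			case SKETCH_COMMIT_PREPARED:
				/* Decoded as an ordinary COMMIT. */
				return "commit_cb";
			case SKETCH_ROLLBACK_PREPARED:
				/* Decoded as an ordinary ROLLBACK. */
				return "(treated as ROLLBACK)";
		}
	}

	return "(unknown)";
}
```

With has_twophase_cbs set to false, SKETCH_COMMIT_PREPARED maps to
commit_cb, matching the statement that COMMIT PREPARED is decoded as a plain
COMMIT when only the basic callbacks are available.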

v33-0007-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From d0b2026ff89fc994df6a03219c642e3ea78e8ca3 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Dec 2020 02:32:53 -0500
Subject: [PATCH v33 07/10] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and not streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then one server crashes before the 2PC transaction is committed.
+# 3. After that server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then one server crashes before the 2PC transaction is committed.
+# 3. After that server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (the DELETE at step 3 does nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v33-0008-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 464e5f19ba247463b21b3ddd4bad23aa91153bcc Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Mon, 14 Dec 2020 12:07:14 +0530
Subject: [PATCH v33 08/10] Support 2PC txn - Subscription option.

This patch implements new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.

Note: The tablesync worker slot always has two_phase disabled, regardless of the option.
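
For readers skimming the thread, the gating described above can be sketched as a small standalone C program fragment. This is only an illustration under assumptions, not code from the patch: the sketch_* names and the enum are hypothetical, and the constant simply mirrors the value this patch gives LOGICALREP_PROTO_2PC_VERSION_NUM. The order of checks follows the pgoutput_startup() hunk further down, and the tablesync exclusion follows the worker.c hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value; mirrors LOGICALREP_PROTO_2PC_VERSION_NUM in this patch. */
#define SKETCH_PROTO_2PC_VERSION 2

/* Hypothetical result codes; the real code raises ERROR via ereport(). */
typedef enum
{
	SKETCH_2PC_DISABLED,		/* prepared xacts are decoded at commit */
	SKETCH_2PC_ENABLED,			/* prepared xacts are decoded at PREPARE */
	SKETCH_2PC_ERR_PROTOCOL,	/* requested proto_version is too old */
	SKETCH_2PC_ERR_PLUGIN		/* decoding context lacks 2PC support */
} Sketch2PCResult;

/* Subscriber side: a tablesync worker never asks for two_phase. */
bool
sketch_worker_requests_twophase(bool sub_twophase, bool is_tablesync_worker)
{
	return sub_twophase && !is_tablesync_worker;
}

/* Publisher side: same order of checks as the pgoutput_startup() hunk. */
Sketch2PCResult
sketch_decide_twophase(bool requested, unsigned int proto_version,
					   bool ctx_supports_twophase)
{
	if (!requested)
		return SKETCH_2PC_DISABLED;
	if (proto_version < SKETCH_PROTO_2PC_VERSION)
		return SKETCH_2PC_ERR_PROTOCOL;
	if (!ctx_supports_twophase)
		return SKETCH_2PC_ERR_PLUGIN;
	return SKETCH_2PC_ENABLED;
}

/* Usage example: walks through the four outcomes. */
void
sketch_self_check(void)
{
	/* Option off (the default): nothing else is checked. */
	assert(sketch_decide_twophase(false, 1, false) == SKETCH_2PC_DISABLED);

	/* Option on, but the protocol version is too old. */
	assert(sketch_decide_twophase(true, 1, true) == SKETCH_2PC_ERR_PROTOCOL);

	/* Option on, protocol fine, but no 2PC support in the context. */
	assert(sketch_decide_twophase(true, 2, false) == SKETCH_2PC_ERR_PLUGIN);

	/* Apply worker with two_phase = on and everything supported. */
	assert(sketch_decide_twophase(
			   sketch_worker_requests_twophase(true, false), 2, true)
		   == SKETCH_2PC_ENABLED);

	/* Tablesync worker: the request is dropped before it reaches pgoutput. */
	assert(sketch_decide_twophase(
			   sketch_worker_requests_twophase(true, true), 2, true)
		   == SKETCH_2PC_DISABLED);
}
```

Calling sketch_self_check() from a test main() exercises each branch. The sketch returns codes instead of raising errors purely so that it can be compiled outside the backend.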
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index db5e59f..dbe2a43 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -166,8 +166,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index ca78d39..886839e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -67,6 +67,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c21..5f4e191 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1149,7 +1149,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 1696454..b0745d5 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -64,7 +64,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -105,6 +106,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -210,6 +216,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -355,6 +370,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -379,7 +396,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -447,6 +465,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -720,6 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -730,7 +751,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -769,6 +791,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -787,7 +816,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -832,7 +862,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -875,7 +906,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 24f8b3e..1f404cd 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0acfc03..cd9056c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2795,6 +2795,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		(!am_tablesync_worker() && newsub->twophase != MySubscription->twophase) ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3441,6 +3442,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase && !am_tablesync_worker();
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7cf2951..7e42a70 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with a sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 673a670..cb707bf 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4221,6 +4221,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4264,9 +4265,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4287,6 +4293,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4312,6 +4319,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4380,6 +4389,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 317bb83..22e4e6c 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -629,6 +629,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 14150d0..47306a2 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5997,7 +5997,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6023,13 +6023,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3fa02af..e07eed0 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -53,6 +53,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -90,6 +92,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 13ea3b7..4f5aec9 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39..f96c891 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 2fa9bce..23d876e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,42 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 14fa0b2..2a0b366 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -147,6 +147,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

Attachment: v33-0009-Support-2PC-consistent-snapshot-isolation-tests.patch (application/octet-stream)
From 0ad9fbc357144d5323e04281acea584970864360 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 19 Dec 2020 08:17:48 +0530
Subject: [PATCH v33 09/10] Support 2PC consistent snapshot isolation tests

Add an isolation test case to verify that if a consistent snapshot is created
between a PREPARE and a COMMIT PREPARED, the whole transaction is decoded
on COMMIT PREPARED.
---
 contrib/test_decoding/Makefile                     |  3 +-
 .../test_decoding/expected/twophase_snapshot.out   | 43 +++++++++++++++++++
 contrib/test_decoding/specs/twophase_snapshot.spec | 49 ++++++++++++++++++++++
 3 files changed, 94 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/twophase_snapshot.out
 create mode 100644 contrib/test_decoding/specs/twophase_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 76d4a69..c5e28ce 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
+	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
+	twophase_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/twophase_snapshot.out b/contrib/test_decoding/expected/twophase_snapshot.out
new file mode 100644
index 0000000..53aaf01
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase_snapshot.out
@@ -0,0 +1,43 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s2b s2txid s1init s3b s3txid s2alter s2c s4b s4insert s4prepare s3c s1insert s1checkpoint s1start s4commit s1start
+step s2b: BEGIN;
+step s2txid: SELECT pg_current_xact_id() IS NULL;
+?column?       
+
+f              
+step s1init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); <waiting ...>
+step s3b: BEGIN;
+step s3txid: SELECT pg_current_xact_id() IS NULL;
+?column?       
+
+f              
+step s2alter: ALTER TABLE do_write ADD COLUMN addedbys2 int;
+step s2c: COMMIT;
+step s4b: BEGIN;
+step s4insert: INSERT INTO do_write DEFAULT VALUES;
+step s4prepare: PREPARE TRANSACTION 'test1';
+step s3c: COMMIT;
+step s1init: <... completed>
+?column?       
+
+init           
+step s1insert: INSERT INTO do_write DEFAULT VALUES;
+step s1checkpoint: CHECKPOINT;
+step s1start: SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');
+data           
+
+BEGIN          
+table public.do_write: INSERT: id[integer]:2 addedbys2[integer]:null
+COMMIT         
+step s4commit: COMMIT PREPARED 'test1';
+step s1start: SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');
+data           
+
+BEGIN          
+table public.do_write: INSERT: id[integer]:1 addedbys2[integer]:null
+PREPARE TRANSACTION 'test1'
+COMMIT PREPARED 'test1'
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/twophase_snapshot.spec b/contrib/test_decoding/specs/twophase_snapshot.spec
new file mode 100644
index 0000000..505e5e3
--- /dev/null
+++ b/contrib/test_decoding/specs/twophase_snapshot.spec
@@ -0,0 +1,49 @@
+# Test decoding of two-phase transactions during the build of a consistent snapshot.
+setup
+{
+    DROP TABLE IF EXISTS do_write;
+    CREATE TABLE do_write(id serial primary key);
+}
+
+teardown
+{
+    DROP TABLE do_write;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1init" {SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');}
+step "s1start" {SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');}
+step "s1insert" { INSERT INTO do_write DEFAULT VALUES; }
+step "s1checkpoint" { CHECKPOINT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2b" { BEGIN; }
+step "s2txid" { SELECT pg_current_xact_id() IS NULL; }
+step "s2alter" { ALTER TABLE do_write ADD COLUMN addedbys2 int; }
+step "s2c" { COMMIT; }
+
+
+session "s3"
+setup { SET synchronous_commit=on; }
+
+step "s3b" { BEGIN; }
+step "s3txid" { SELECT pg_current_xact_id() IS NULL; }
+step "s3c" { COMMIT; }
+
+session "s4"
+setup { SET synchronous_commit=on; }
+
+step "s4b" { BEGIN; }
+step "s4insert" { INSERT INTO do_write DEFAULT VALUES; }
+step "s4prepare" { PREPARE TRANSACTION 'test1'; }
+step "s4commit" { COMMIT PREPARED 'test1'; }
+
+# Force building of a consistent snapshot between a PREPARE and COMMIT PREPARED.
+# Ensure that the whole transaction is decoded fresh at the time of COMMIT PREPARED.
+permutation "s2b" "s2txid" "s1init" "s3b" "s3txid" "s2alter" "s2c" "s4b" "s4insert" "s4prepare" "s3c" "s1insert" "s1checkpoint" "s1start" "s4commit" "s1start"
-- 
1.8.3.1

Attachment: v33-0010-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From 575b3c46829d986c4bd3e149e6efcf0c30387ba9 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 19 Dec 2020 08:20:35 +0530
Subject: [PATCH v33 10/10] Support 2PC txn tests for concurrent aborts.

Add TAP tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  58 ++++++++++
 src/backend/replication/logical/reorderbuffer.c   |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c5e28ce..e0cd841 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -10,6 +10,8 @@ ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the XID of the 2PC transaction to the plugin as an option.
+# On receipt of a valid "check-xid-aborted", the change API in the test_decoding
+# plugin waits for that transaction to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while this decoding
+# is waiting.
+#
+# The status of "check-xid-aborted" then changes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected text did not appear in the log within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test exercises a concurrent abort while logical decoding is in
+# progress. We pass the XID of the 2PC transaction to the plugin as an option.
+# On receipt of a valid "check-xid-aborted", the change API in the test_decoding
+# plugin waits for that transaction to be aborted.
+#
+# We fire off a ROLLBACK PREPARED from another session while this decoding
+# is waiting.
+#
+# The status of "check-xid-aborted" then changes from in-progress to
+# not-committed (hence aborted), and decoding stops because the subsequent
+# system catalog scan errors out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected text did not appear in the log within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 0576355..efe7f5c 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -174,6 +177,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -275,6 +279,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -471,6 +493,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -706,6 +755,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -918,6 +970,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -971,6 +1026,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4aa9df2..2b5bb05 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2488,6 +2488,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
1.8.3.1
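
A note on the "check-xid-aborted" parsing in the patch above: it calls strtoul() with a NULL end pointer, so a value such as "12abc" would be accepted as 12, and a value wider than 32 bits would be silently truncated by the cast. Below is a minimal standalone sketch of a stricter parse, offered only as an illustration and not as part of the patch. The names parse_xid_option, xid_t and INVALID_XID are hypothetical stand-ins (for a helper, TransactionId and InvalidTransactionId respectively) so that the snippet compiles without PostgreSQL headers.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's TransactionId, a 32-bit unsigned value. */
typedef uint32_t xid_t;

/* Stand-in for InvalidTransactionId, which is 0 in PostgreSQL. */
#define INVALID_XID ((xid_t) 0)

/*
 * Hypothetical helper (not in the patch): parse an xid option string.
 *
 * Returns true and stores the parsed value in *result on success.
 * Returns false for NULL or empty input, trailing non-numeric characters,
 * a range error reported by strtoul(), a value that does not fit in
 * 32 bits, or the invalid xid 0.
 */
bool
parse_xid_option(const char *str, xid_t *result)
{
	char	   *end;
	unsigned long val;

	if (str == NULL || *str == '\0')
		return false;

	errno = 0;
	val = strtoul(str, &end, 0);	/* base 0: decimal, octal or hex */

	/* Reject range errors, "no digits", and trailing junk. */
	if (errno != 0 || end == str || *end != '\0')
		return false;

	/* Reject values that the cast to a 32-bit xid would truncate. */
	if (val > UINT32_MAX)
		return false;

	if ((xid_t) val == INVALID_XID)
		return false;

	*result = (xid_t) val;
	return true;
}
```

In the plugin itself, a false return would map onto the existing ereport(ERROR, ...) branch. Whether the extra strictness is worth it for a test-only option is a judgment call, since the TAP tests always pass a value read from pg_prepared_xacts.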

Attachment: v33-0001-Extend-the-output-plugin-API-to-allow-decoding-o.patch (application/octet-stream)
From 6737b91ef728b45c614dddeef19e353e5f59faf9 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sat, 12 Dec 2020 16:41:33 +0530
Subject: [PATCH v33 01/10] Extend the output plugin API to allow decoding of
 prepared xacts.

This adds six methods to the output plugin API, adding support for
streaming changes of two-phase transactions at prepare time.

* begin_prepare
* filter_prepare
* prepare
* commit_prepared
* rollback_prepared
* stream_prepare

Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction is not yet committed and may be
aborted later.

Until now two-phase transactions were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the
two-phase commands were communicated to the subscriber.

This patch provides the infrastructure for logical decoding plugins to be
informed of two-phase commands such as PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, along with the corresponding GID.

This also extends the 'test_decoding' plugin, implementing these new
methods.

This commit only adds the new APIs; the upcoming patch to "allow the
decoding at prepare time in ReorderBuffer" will use them.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c     | 167 +++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 172 ++++++++++++++++-
 src/backend/replication/logical/logical.c | 297 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   6 +
 src/include/replication/output_plugin.h   |  56 ++++++
 src/include/replication/reorderbuffer.h   |  41 +++++
 src/tools/pgindent/typedefs.list          |  12 ++
 7 files changed, 744 insertions(+), 7 deletions(-)
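
For readers skimming the thread, here is a minimal standalone sketch of the
enabling rule this patch introduces: two-phase decoding is considered on as
soon as any two-phase callback is registered (and the plugin asked for it),
while the required callbacks are only checked when they are actually needed.
Everything prefixed with Sketch/sketch_ is a hypothetical stand-in and not a
PostgreSQL type or function; only the four callbacks whose wrappers visibly
raise an ERROR in this patch are treated as required here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a plugin callback pointer (not a PostgreSQL type). */
typedef void (*SketchCB) (void);

/* Hypothetical stand-in for the two-phase part of OutputPluginCallbacks. */
typedef struct SketchCallbacks
{
	SketchCB	begin_prepare_cb;
	SketchCB	prepare_cb;
	SketchCB	commit_prepared_cb;
	SketchCB	rollback_prepared_cb;
	SketchCB	stream_prepare_cb;
	SketchCB	filter_prepare_cb;	/* optional */
} SketchCallbacks;

/* Placeholder callback body, used only to obtain a non-NULL pointer. */
void
sketch_noop(void)
{
}

/*
 * Two-phase decoding is on when at least one two-phase callback is
 * registered AND the plugin requested it at startup (test_decoding does the
 * latter with "ctx->twophase &= enable_twophase").
 */
bool
sketch_twophase_enabled(const SketchCallbacks *cb, bool requested)
{
	bool		any_registered;

	assert(cb != NULL);

	any_registered = cb->begin_prepare_cb != NULL ||
		cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;

	return any_registered && requested;
}

/*
 * Return the name of the first missing required callback, or NULL when all
 * four are present. This up-front scan is an assumption of the sketch.
 */
const char *
sketch_first_missing_required(const SketchCallbacks *cb)
{
	assert(cb != NULL);

	if (cb->begin_prepare_cb == NULL)
		return "begin_prepare_cb";
	if (cb->prepare_cb == NULL)
		return "prepare_cb";
	if (cb->commit_prepared_cb == NULL)
		return "commit_prepared_cb";
	if (cb->rollback_prepared_cb == NULL)
		return "rollback_prepared_cb";
	return NULL;
}
```

Note that the patch itself does not validate the callback set up front: each
*_cb_wrapper raises an ERROR only when its own callback turns out to be
missing. Registering just filter_prepare_cb would therefore enable two-phase
decoding and fail later, at the first prepared transaction.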

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278b..0576355 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -76,6 +76,20 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 const char *gid);
+static void pg_decode_begin_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr prepare_end_lsn,
+											TimestampTz prepare_time);
 static void pg_decode_stream_start(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn);
 static void pg_output_stream_start(LogicalDecodingContext *ctx,
@@ -87,6 +101,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -123,9 +140,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->begin_prepare_cb = pg_decode_begin_prepare_txn;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
@@ -141,6 +164,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -241,6 +265,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +286,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +355,111 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* BEGIN PREPARE callback */
+static void
+pg_decode_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata =
+	MemoryContextAllocZero(ctx->context, sizeof(TestDecodingTxnData));
+
+	txndata->xact_wrote_changes = false;
+	txn->output_plugin_private = txndata;
+
+	if (data->skip_empty_xacts)
+		return;
+
+	pg_output_begin(ctx, data, txn, true);
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_end_lsn,
+								TimestampTz prepare_time)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate a
+ * simple logic by checking the GID. If the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -702,6 +842,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..829bbc1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,9 +389,15 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodeBeginPrepareCB begin_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +419,20 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits,
+    which allows actions to be decoded at <command>PREPARE TRANSACTION</command>
+    time. The <function>begin_prepare_cb</function>, <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +493,15 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. It is possible that the transaction currently
+     being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too. All changes of such a transaction are skipped once
+     the abort is detected, and the transaction is aborted when the WAL for
+     <command>ROLLBACK PREPARED</command> is read.
     </para>
 
     <note>
@@ -587,7 +611,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. When decoding a prepared (but not yet
+      committed) transaction or any other uncommitted transaction, this
+      change callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -685,7 +715,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, when decoding a prepared (but not yet committed)
+      transaction or any other uncommitted transaction, this message
+      callback might also error out due to a simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -698,6 +734,111 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decoding
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents as for the
+      other callbacks. The <parameter>gid</parameter> is the identifier that later
+      identifies this transaction for <command>COMMIT PREPARED</command> or
+      <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given
+      <parameter>gid</parameter> every time it is called.
+     </para>
+     </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-begin-prepare">
+     <title>Transaction Begin Prepare Callback</title>
+
+     <para>
+      The required <function>begin_prepare_cb</function> callback is called
+      whenever the start of a prepared transaction has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback to
+      check if the plugin has already received this prepare, in which case it
+      can skip the remaining changes of the transaction. This can only happen
+      if the user restarts the decoding after receiving the prepare for a
+      transaction but before receiving the commit prepared, say, because of
+      some error.
+      <programlisting>
+       typedef void (*LogicalDecodeBeginPrepareCB) (struct LogicalDecodingContext *ctx,
+                                                    ReorderBufferTXN *txn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callback for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr prepare_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called
+      whenever a transaction commit prepared has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                      ReorderBufferTXN *txn,
+                                                      XLogRecPtr commit_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called
+      whenever a transaction rollback prepared has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this prepared transaction, in which case it can apply the
+      rollback; otherwise, it can skip the rollback operation. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
+      <programlisting>
+       typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                        ReorderBufferTXN *txn,
+                                                        XLogRecPtr prepare_end_lsn,
+                                                        TimestampTz prepare_time);
+      </programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-start">
      <title>Stream Start Callback</title>
      <para>
@@ -735,6 +876,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1067,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, then committed using the
+    <function>commit_prepared_cb</function> callback or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f1f4df7..6244fb9 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,13 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +81,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -237,11 +246,37 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
 	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
 
+
+	/*
+	 * To support two-phase logical decoding, we require the
+	 * begin_prepare/prepare/commit_prepared/rollback_prepared callbacks. The
+	 * filter_prepare callback is optional. We however enable two-phase
+	 * logical decoding when at least one of the methods is enabled so that we
+	 * can easily identify missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.begin_prepare_cb != NULL) ||
+		(ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
+	 * Callbacks to support decoding at prepare time.
+	 */
+	ctx->reorder->begin_prepare = begin_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -782,6 +817,186 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+/*
+ * The functionality of begin_prepare is quite similar to begin, with the
+ * exception that this will have the gid (global transaction id) information,
+ * which can be used by the plugin. We considered extending the existing
+ * begin callback, but that would break the replication protocol, and a
+ * separate callback also looks cleaner.
+ */
+static void
+begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "begin_prepare";
+	state.report_location = txn->first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->first_lsn;
+
+	/*
+	 * If the plugin supports two-phase commits then begin prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.begin_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires begin_prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.begin_prepare_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then rollback prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, prepare_end_lsn,
+										prepare_time);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
@@ -860,6 +1075,45 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 bool
+filter_prepare_cb_wrapper(LogicalDecodingContext *ctx, const char *gid)
+{
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case, all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
+bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
 	LogicalErrorCallbackState state;
@@ -1057,6 +1311,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..28c9c1f 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
@@ -120,6 +125,7 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 												  XLogRecPtr restart_lsn);
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
+extern bool filter_prepare_cb_wrapper(LogicalDecodingContext *ctx, const char *gid);
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
 extern void ResetLogicalStreamingState(void);
 extern void UpdateDecodingStats(LogicalDecodingContext *ctx);
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..89e1dc3 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,45 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to the prepare and
+ * commit_prepared/rollback_prepared callbacks, or should wait until COMMIT
+ * PREPARED and be sent as a regular transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  const char *gid);
+
+/*
+ * Callback called for every BEGIN of a prepared transaction.
+ */
+typedef void (*LogicalDecodeBeginPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn);
+
+/*
+ * Called for a PREPARE record unless it was filtered out by the
+ * filter_prepare() callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr prepare_end_lsn,
+												 TimestampTz prepare_time);
+
+
+/*
  * Called when starting to stream a block of changes from in-progress
  * transaction (may be called repeatedly, if it's streamed in multiple
  * chunks).
@@ -124,6 +163,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -173,10 +220,19 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+
+	/* streaming of changes at prepare time */
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodeBeginPrepareCB begin_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
+
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7e..1e60afe 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -245,6 +245,12 @@ typedef struct ReorderBufferTXN
 	TransactionId toplevel_xid;
 
 	/*
+	 * Global transaction id required for identification of prepared
+	 * transactions.
+	 */
+	char	   *gid;
+
+	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
 	 * the previous records aren't relevant for logical decoding.
@@ -418,6 +424,26 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* begin prepare callback signature */
+typedef void (*ReorderBufferBeginPrepareCB) (ReorderBuffer *rb,
+											 ReorderBufferTXN *txn);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr prepare_end_lsn,
+												 TimestampTz prepare_time);
+
 /* start streaming transaction callback signature */
 typedef void (*ReorderBufferStreamStartCB) (
 											ReorderBuffer *rb,
@@ -436,6 +462,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -505,11 +537,20 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction at prepare time.
+	 */
+	ReorderBufferBeginCB begin_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
+
+	/*
 	 * Callbacks to be called when streaming a transaction.
 	 */
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a9dca71..e82b4f7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1315,9 +1315,21 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodeBeginPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1

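For readers following the callback declarations in output_plugin.h above, here is a minimal sketch of what a plugin-side filter_prepare decision might look like. It is only an illustration under stated assumptions: the function name sketch_filter_prepare is hypothetical, the LogicalDecodingContext argument of the real callback is dropped so the snippet compiles without PostgreSQL headers, and the "_nodecode" substring rule is inferred from the regression tests in this patchset rather than copied from the plugin source. Returning true means "filtered out", that is, the transaction is not decoded at PREPARE and is sent as a regular transaction at COMMIT PREPARED, which matches what filter_prepare_cb_wrapper() returns when two-phase decoding is disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for a plugin's filter_prepare callback.
 *
 * Assumption: the real callback also takes a LogicalDecodingContext
 * pointer; it is omitted here to keep the sketch self-contained.
 *
 * Returns true when the prepared transaction identified by gid should be
 * filtered out of PREPARE-time decoding (and so decoded only at
 * COMMIT PREPARED), false when it should be decoded at PREPARE.
 */
static bool
sketch_filter_prepare(const char *gid)
{
	assert(gid != NULL);

	/* Assumed rule: any GID containing "_nodecode" is filtered out. */
	return strstr(gid, "_nodecode") != NULL;
}
```

With this rule, a GID such as 'test_prepared_nodecode' would be skipped at PREPARE time, while 'test_prepared#1' would be decoded at PREPARE, which is the behaviour the twophase tests in the next patch expect.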
v33-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch (application/octet-stream)
From 4bed675dbdb64f70d2a4785ab1cc5ed26c9356c3 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Wed, 16 Dec 2020 14:40:27 +0530
Subject: [PATCH v33 02/10] Allow decoding at prepare time in ReorderBuffer.

This patch allows PREPARE-time decoding of two-phase transactions (if the
output plugin supports this capability), in which case the transactions
are replayed at PREPARE and then committed later when COMMIT PREPARED
arrives.

Now that we decode the changes before the commit, concurrent aborts may
cause failures when the output plugin consults catalogs (both system and
user-defined).

We detect such failures with a special sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK introduced by commit 7259736a6e and stop
decoding the remaining changes. Then we roll back the changes when ROLLBACK
PREPARED is encountered.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, Arseny Sher, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 contrib/test_decoding/expected/twophase.out        | 235 +++++++++++
 contrib/test_decoding/expected/twophase_stream.out | 147 +++++++
 contrib/test_decoding/sql/twophase.sql             | 112 ++++++
 contrib/test_decoding/sql/twophase_stream.sql      |  45 +++
 src/backend/replication/logical/decode.c           | 285 ++++++++++++--
 src/backend/replication/logical/reorderbuffer.c    | 432 +++++++++++++++++----
 src/backend/replication/logical/snapbuild.c        |   7 +
 src/include/replication/reorderbuffer.h            |  33 +-
 9 files changed, 1192 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/twophase.out
 create mode 100644 contrib/test_decoding/expected/twophase_stream.out
 create mode 100644 contrib/test_decoding/sql/twophase.sql
 create mode 100644 contrib/test_decoding/sql/twophase_stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..76d4a69 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate stream stats
+	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
diff --git a/contrib/test_decoding/expected/twophase.out b/contrib/test_decoding/expected/twophase.out
new file mode 100644
index 0000000..f9f6bed
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase.out
@@ -0,0 +1,235 @@
+-- Test prepared transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test that decoding happens at PREPARE time when two-phase-commit is enabled.
+-- Decoding after COMMIT PREPARED must have all the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+ COMMIT PREPARED 'test_prepared#1'
+(5 rows)
+
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+ COMMIT PREPARED 'test_prepared#3'
+(4 rows)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Check 'CLUSTER' (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The
+-- call should return within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ COMMIT PREPARED 'test_prepared_lock'
+(5 rows)
+
+-- Test savepoints and sub-xacts. Creating savepoints will create
+-- sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+ COMMIT PREPARED 'test_prepared_savepoint'
+(4 rows)
+
+-- Test that a GID containing "_nodecode" gets decoded at commit prepared time.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/twophase_stream.out b/contrib/test_decoding/expected/twophase_stream.out
new file mode 100644
index 0000000..3acc4acd3
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase_stream.out
@@ -0,0 +1,147 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK TO s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED and the other changes in the transaction
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ PREPARE TRANSACTION 'test1'
+ COMMIT PREPARED 'test1'
+(23 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with
+-- filtered gid. gids with '_nodecode' will not be decoded at prepare time.
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/twophase.sql b/contrib/test_decoding/sql/twophase.sql
new file mode 100644
index 0000000..894e4f5
--- /dev/null
+++ b/contrib/test_decoding/sql/twophase.sql
@@ -0,0 +1,112 @@
+-- Test prepared transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test that decoding happens at PREPARE time when two-phase-commit is enabled.
+-- Decoding after COMMIT PREPARED must have all the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check 'CLUSTER' (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The
+-- call should return within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test savepoints and sub-xacts. Creating savepoints will create
+-- sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that a GID containing "_nodecode" gets decoded at commit prepared time.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/twophase_stream.sql b/contrib/test_decoding/sql/twophase_stream.sql
new file mode 100644
index 0000000..e9dd44f
--- /dev/null
+++ b/contrib/test_decoding/sql/twophase_stream.sql
@@ -0,0 +1,45 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK TO s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED and the other changes in the transaction
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with
+-- filtered gid. gids with '_nodecode' will not be decoded at prepare time.
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..6ac2a60 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,13 +67,24 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool two_phase);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool two_phase);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 
+/* helper functions for decoding transactions */
+static inline bool FilterPrepare(LogicalDecodingContext *ctx, const char *gid);
+static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx,
+							  XLogRecordBuffer *buf, Oid dbId,
+							  RepOriginId origin_id);
+
 /*
  * Take every XLogReadRecord()ed record and perform the actions required to
  * decode it using the output plugin already setup in the logical decoding
@@ -244,6 +255,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		two_phase = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +265,15 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+					two_phase = !(FilterPrepare(ctx, parsed.twophase_gid));
+
+				DecodeCommit(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +282,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		two_phase = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +292,15 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+					two_phase = !(FilterPrepare(ctx, parsed.twophase_gid));
+
+				DecodeAbort(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,37 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (FilterPrepare(ctx, parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -520,6 +569,23 @@ DecodeHeapOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	}
 }
 
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+static inline bool
+FilterPrepare(LogicalDecodingContext *ctx, const char *gid)
+{
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (ctx->callbacks.filter_prepare_cb == NULL)
+		return false;
+
+	return filter_prepare_cb_wrapper(ctx, gid);
+}
+
 static inline bool
 FilterByOrigin(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -582,10 +648,15 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'two_phase' indicates that caller wants to process the transaction in two
+ * phases, first process prepare if not already done and then process
+ * commit_prepared.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool two_phase)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -606,15 +677,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * the reorderbuffer to forget the content of the (sub-)transactions
 	 * if not.
 	 *
-	 * There can be several reasons we might not be interested in this
-	 * transaction:
-	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
-	 * 2) The transaction happened in another database.
-	 * 3) The output plugin is not interested in the origin.
-	 * 4) We are doing fast-forwarding
-	 *
 	 * We can't just use ReorderBufferAbort() here, because we need to execute
 	 * the transaction's invalidations.  This currently won't be needed if
 	 * we're just skipping over the transaction because currently we only do
@@ -627,9 +689,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * relevant syscaches.
 	 * ---
 	 */
-	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
-		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
-		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
 	{
 		for (i = 0; i < parsed->nsubxacts; i++)
 		{
@@ -647,34 +707,164 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Send the final commit record if the transaction data is already
+	 * decoded, otherwise, process the entire transaction.
+	 */
+	if (two_phase)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ *
+ * Note that we don't skip the prepare even if we have detected a concurrent
+ * abort. We may already have sent some changes before detecting the abort,
+ * in which case those changes must be aborted on the subscriber. To do that,
+ * we send the prepare and then the rollback prepared, which is also what
+ * happened on the publisher side.
+ *
+ * We could invent a new abort API that, in such cases, sends an abort and
+ * skips both the prepare and the rollback prepared. That is not
+ * straightforward, though, because the transaction might already have been
+ * streamed by then, in which case it is handled when the rollback is
+ * encountered. Optimizing the concurrent-abort case is possible, but the
+ * added design complexity doesn't seem worth it for now.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	SnapBuild  *builder = ctx->snapshot_builder;
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz prepare_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		prepare_time = parsed->origin_timestamp;
+
+	/*
+	 * Remember the prepare info for a txn so that it can be used later in
+	 * commit prepared if required. See ReorderBufferFinishPrepared.
+	 */
+	if (!ReorderBufferRememberPrepareInfo(ctx->reorder, xid, buf->origptr,
+										  buf->endptr, prepare_time, origin_id,
+										  origin_lsn))
+		return;
+
+	/* We can't start streaming unless a consistent state is reached. */
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
+	{
+		ReorderBufferSkipPrepare(ctx->reorder, xid);
+		return;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See
+	 * DecodeTXNNeedSkip for the reasons why we sometimes want to skip the
+	 * transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed, removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache
+	 * invalidations if there are any for the reasons mentioned in
+	 * DecodeCommit.
+	 */
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
+	{
+		ReorderBufferSkipPrepare(ctx->reorder, xid);
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/* Tell the reorderbuffer about the surviving subtransactions. */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'two_phase' indicates that the prepared transaction should be finished.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool two_phase)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz abort_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool		skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		abort_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See
+	 * DecodeTXNNeedSkip for the reasons why we sometimes want to skip the
+	 * transaction.
+	 */
+	skip_xact = DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id);
+
+	/*
+	 * Send the final rollback record for a prepared transaction unless we
+	 * need to skip it. For non-two-phase xacts, simply forget the xact.
+	 */
+	if (two_phase && !skip_xact)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									abort_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
 	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
@@ -1080,3 +1270,24 @@ DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tuple)
 	header->t_infomask2 = xlhdr.t_infomask2;
 	header->t_hoff = xlhdr.t_hoff;
 }
+
+/*
+ * Check whether we are interested in this specific transaction.
+ *
+ * There can be several reasons we might not be interested in this
+ * transaction:
+ * 1) We might not be interested in decoding transactions up to this
+ *	  LSN. This can happen because we previously decoded it and now just
+ *	  are restarting or if we haven't assembled a consistent snapshot yet.
+ * 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
+ * 4) We are doing fast-forwarding
+ */
+static bool
+DecodeTXNNeedSkip(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+				  Oid txn_dbid, RepOriginId origin_id)
+{
+	return (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+			(txn_dbid != InvalidOid && txn_dbid != ctx->slot->data.database) ||
+			ctx->fast_forward || FilterByOrigin(ctx, origin_id));
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7359fa9..4aa9df2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1515,12 +1522,18 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or decoding them at PREPARE. Keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots.
+ *
+ * We additionally remove tuplecids after decoding the transaction at prepare
+ * time as we only need to perform invalidation at rollback or commit prepared.
+ *
+ * 'txn_prepared' indicates that we have decoded the transaction at prepare
+ * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1539,7 +1552,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1573,9 +1586,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1755,9 +1792,10 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * If the transaction was (partially) streamed, we need to commit it in a
- * 'streamed' way.  That is, we first stream the remaining part of the
- * transaction, and then invoke stream_commit message.
+ * If the transaction was (partially) streamed, we need to prepare or commit
+ * it in a 'streamed' way.  That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_prepare or stream_commit message as per
+ * the case.
  */
 static void
 ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1767,29 +1805,49 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		/*
+		 * Note, we send stream prepare even if a concurrent abort is
+		 * detected. See DecodePrepare for more information.
+		 */
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
+		 * just truncate txn by removing changes and tuple_cids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
  * Set xid to detect concurrent aborts.
  *
- * While streaming an in-progress transaction there is a possibility that the
- * (sub)transaction might get aborted concurrently.  In such case if the
- * (sub)transaction has catalog update then we might decode the tuple using
- * wrong catalog version.  For example, suppose there is one catalog tuple with
- * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
- * and after that we will have two tuples (xmin: 500, xmax: 501) and
- * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
- * say 502 updates the same catalog tuple then the first tuple will be changed
- * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
- * the tuple inserted/updated in 501 after the catalog update, we will see the
- * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
- * consider that the tuple is deleted by xid 502 which is not visible to our
- * snapshot.  And when we will try to decode with that catalog tuple, it can
- * lead to a wrong result or a crash.  So, it is necessary to detect
- * concurrent aborts to allow streaming of in-progress transactions.
+ * While streaming an in-progress transaction or decoding a prepared
+ * transaction there is a possibility that the (sub)transaction might get
+ * aborted concurrently.  In such case if the (sub)transaction has catalog
+ * update then we might decode the tuple using wrong catalog version.  For
+ * example, suppose there is one catalog tuple with (xmin: 500, xmax: 0).  Now,
+ * the transaction 501 updates the catalog tuple and after that we will have
+ * two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).  Now, if 501 is
+ * aborted and some other transaction say 502 updates the same catalog tuple
+ * then the first tuple will be changed to (xmin: 500, xmax: 502).  So, the
+ * problem is that when we try to decode the tuple inserted/updated in 501
+ * after the catalog update, we will see the catalog tuple with (xmin: 500,
+ * xmax: 502) as visible because it will consider that the tuple is deleted by
+ * xid 502 which is not visible to our snapshot.  And when we will try to
+ * decode with that catalog tuple, it can lead to a wrong result or a crash.
+ * So, it is necessary to detect concurrent aborts to allow streaming of
+ * in-progress transactions or decoding of prepared transactions.
  *
  * For detecting the concurrent abort we set CheckXidAlive to the current
  * (sub)transaction's xid for which this change belongs to.  And, during
@@ -1798,7 +1856,10 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * and discard the already streamed changes on such an error.  We might have
  * already streamed some of the changes for the aborted (sub)transaction, but
  * that is fine because when we decode the abort we will stream abort message
- * to truncate the changes in the subscriber.
+ * to truncate the changes in the subscriber. Similarly, for prepared
+ * transactions, we stop decoding if a concurrent abort is detected and then
+ * roll back the changes when rollback prepared is encountered. See
+ * DecodePrepare.
  */
 static inline void
 SetupCheckXidLive(TransactionId xid)
@@ -1900,7 +1961,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1912,15 +1973,19 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		specinsert = NULL;
 	}
 
-	/* Stop the stream. */
-	rb->stream_stop(rb, txn, last_lsn);
-
-	/* Remember the command ID and snapshot for the streaming run */
-	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	/*
+	 * For the streaming case, stop the stream and remember the command ID and
+	 * snapshot for the streaming run.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_stop(rb, txn, last_lsn);
+		ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	}
 }
 
 /*
- * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ * Helper function for ReorderBufferReplay and ReorderBufferStreamTXN.
  *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
@@ -1973,9 +2038,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		else
 			StartTransactionCommand();
 
-		/* We only need to send begin/commit for non-streamed transactions. */
+		/*
+		 * We only need to send begin/begin-prepare for non-streamed
+		 * transactions.
+		 */
 		if (!streaming)
-			rb->begin(rb, txn);
+		{
+			if (rbtxn_prepared(txn))
+				rb->begin_prepare(rb, txn);
+			else
+				rb->begin(rb, txn);
+		}
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -2006,8 +2079,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			prev_lsn = change->lsn;
 
-			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			/*
+			 * Set the current xid to detect concurrent aborts. This is
+			 * required for the cases when we decode the changes before the
+			 * COMMIT record is processed.
+			 */
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2298,7 +2375,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2332,15 +2418,22 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the four reasons: 1. Decoding an
+		 * in-progress txn. 2. Decoding a prepared txn. 3. Decoding of a
+		 * prepared txn that was (partially) streamed. 4. Decoding a committed
+		 * txn.
+		 *
+		 * For 1, we allow truncation of txn data by removing the changes
+		 * already streamed but still keeping other things like invalidations,
+		 * snapshot, and tuplecids. For 2 and 3, we indicate
+		 * ReorderBufferTruncateTXN to do more elaborate truncation of txn
+		 * data as the entire transaction has been decoded except for commit.
+		 * For 4, as the entire txn has been decoded, we can fully clean up
+		 * the TXN reorder buffer.
 		 */
-		if (streaming)
+		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
-
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2373,17 +2466,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2413,26 +2509,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
  * transaction with ReorderBufferAssignChild.
  *
- * This interface is called once a toplevel commit is read for both streamed
- * as well as non-streamed transactions.
+ * This interface is called once a prepare or toplevel commit is read for both
+ * streamed as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferReplay(ReorderBufferTXN *txn,
+					ReorderBuffer *rb, TransactionId xid,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2462,7 +2551,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	if (txn->base_snapshot == NULL)
 	{
 		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+
+		/*
+		 * Removing this txn before a commit might result in the computation
+		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
+		 */
+		if (!rbtxn_prepared(txn))
+			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
 
@@ -2474,6 +2569,178 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Record the prepare information for a transaction.
+ */
+bool
+ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+								 TimestampTz prepare_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return false;
+
+	/*
+	 * Remember the prepare information to be later used by commit prepared in
+	 * case we skip doing prepare.
+	 */
+	txn->final_lsn = prepare_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = prepare_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	return true;
+}
+
+/* Remember that we have skipped prepare */
+void
+ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
+}
+
+/*
+ * Prepare a two-phase transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = pstrdup(gid);
+
+	/* The prepare info must have been updated in txn by now. */
+	Assert(txn->final_lsn != InvalidXLogRecPtr);
+
+	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
+						txn->commit_time, txn->origin_id, txn->origin_lsn);
+}
+
+/*
+ * This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time, RepOriginId origin_id,
+							XLogRecPtr origin_lsn, char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+	XLogRecPtr	prepare_end_lsn;
+	TimestampTz	prepare_time;
+
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * By this time the txn has the prepare record information, remember it to
+	 * be later used for rollback.
+	 */
+	prepare_end_lsn = txn->end_lsn;
+	prepare_time = txn->commit_time;
+
+	/* add the gid in the txn */
+	txn->gid = pstrdup(gid);
+
+	/*
+	 * It is possible that this transaction is not decoded at prepare time
+	 * either because by that time we didn't have a consistent snapshot or it
+	 * was decoded earlier but we have restarted. We can't distinguish between
+	 * those two cases so we send the prepare in both the cases and let
+	 * downstream decide whether to process or skip it. We don't need to
+	 * decode the xact for aborts if it is not done already.
+	 */
+	if (!rbtxn_prepared(txn) && is_commit)
+	{
+		txn->txn_flags |= RBTXN_PREPARE;
+
+		/*
+		 * The prepare info must have been updated in txn even if we skip
+		 * prepare.
+		 */
+		Assert(txn->final_lsn != InvalidXLogRecPtr);
+
+		/*
+		 * By this time the txn has the prepare record information, and it is
+		 * important to use it so that downstream gets accurate information.
+		 * If we passed the commit information here instead, downstream could
+		 * behave as if it had already replayed the commit prepared after the
+		 * restart.
+		 */
+		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
+							txn->commit_time, txn->origin_id, txn->origin_lsn);
+	}
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	if (is_commit)
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else
+		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2605,6 +2872,39 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ *
+ * Note that this is a special-purpose function for prepared transactions where
+ * we don't want to clean up the TXN even when we decide to skip it. See
+ * DecodePrepare.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 9d5d68f..dc3ef74 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -834,6 +834,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
 			continue;
 
+		/*
+		 * We don't need to add snapshot to prepared transactions as they
+		 * should not see the new catalog contents.
+		 */
+		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+			continue;
+
 		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
 			 txn->xid, (uint32) (lsn >> 32), (uint32) lsn);
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1e60afe..6d63338 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -174,6 +174,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_SKIPPED_PREPARE	  0x0100
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +235,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* prepare for this transaction skipped? */
+#define rbtxn_skip_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -258,10 +272,11 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	first_lsn;
 
 	/* ----
-	 * LSN of the record that lead to this xact to be committed or
+	 * LSN of the record that lead to this xact to be prepared or committed or
 	 * aborted. This can be a
 	 * * plain commit record
 	 * * plain commit record, of a parent transaction
+	 * * prepared transaction
 	 * * prepared transaction commit
 	 * * plain abort record
 	 * * prepared transaction abort
@@ -293,7 +308,8 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	origin_lsn;
 
 	/*
-	 * Commit time, only known when we read the actual commit record.
+	 * Commit or Prepare time, only known when we read the actual commit or
+	 * prepare record.
 	 */
 	TimestampTz commit_time;
 
@@ -625,12 +641,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -644,10 +666,17 @@ void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr l
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
 											   SharedInvalidationMessage *invalidations);
 void		ReorderBufferProcessXid(ReorderBuffer *, TransactionId xid, XLogRecPtr lsn);
+
 void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLogRecPtr lsn);
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
+											 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+											 TimestampTz prepare_time,
+											 RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
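
For readers following the flag handling in this patch, here is a minimal sketch of how the two new txn_flags bits relate to the check made in ReorderBufferFinishPrepared. It is an illustration under stated assumptions, not the patch's code: MockTXN and must_replay_prepare are hypothetical names, a zero final_lsn stands in for InvalidXLogRecPtr, and only the flag values and macro bodies are copied from the hunks above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as they appear in the reorderbuffer.h hunk. */
#define RBTXN_PREPARE          0x0080
#define RBTXN_SKIPPED_PREPARE  0x0100

/* Hypothetical stand-in for ReorderBufferTXN; only the fields used here. */
typedef struct MockTXN
{
	uint32_t	txn_flags;
	uint64_t	final_lsn;		/* 0 plays the role of InvalidXLogRecPtr */
} MockTXN;

/* Same shape as the macros added by the patch. */
#define rbtxn_prepared(txn) \
	(((txn)->txn_flags & RBTXN_PREPARE) != 0)
#define rbtxn_skip_prepared(txn) \
	(((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0)

/*
 * Hypothetical helper mirroring the test in ReorderBufferFinishPrepared:
 * the prepare has to be replayed at COMMIT PREPARED time only if the txn
 * was not already marked prepared. For ROLLBACK PREPARED nothing is replayed.
 */
static bool
must_replay_prepare(MockTXN *txn, bool is_commit)
{
	if (!rbtxn_prepared(txn) && is_commit)
	{
		txn->txn_flags |= RBTXN_PREPARE;

		/* The prepare info must have been remembered even if we skipped it. */
		assert(txn->final_lsn != 0);
		return true;
	}
	return false;
}
```

Note that the two bits are independent: a transaction whose prepare was skipped carries RBTXN_SKIPPED_PREPARE without RBTXN_PREPARE, which is why the snapbuild.c hunk tests both macros.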

v33-0003-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From f4c6fda3ffd1d2fdb4d8ec870ae68dd169c4390e Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 11 Dec 2020 14:31:11 +0530
Subject: [PATCH v33 03/10] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3874939..4f75e85 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -924,30 +926,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -955,7 +948,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -970,7 +963,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1045,6 +1038,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1
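
The shape of this refactoring can be sketched outside the tree as follows. This is only an illustration of the pattern, assuming an in-memory array in place of the BufFile spool; MockSpool, replay_spool, handle_stream_commit and handle_stream_prepare are hypothetical names, and the prepare-side caller is the "another place" that the commit message says a later patch will add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for a spool file of serialized changes. */
typedef struct MockSpool
{
	const char *const *messages;
	int			nmessages;
} MockSpool;

/* Plays the role of the worker's remote_final_lsn variable. */
static uint64_t remote_final_lsn = 0;

/*
 * Common spool processing: remember the final LSN handed in by the caller,
 * "apply" every message, and return how many changes were applied.
 */
static int
replay_spool(const MockSpool *spool, uint64_t lsn)
{
	int			nchanges = 0;

	assert(spool != NULL);
	remote_final_lsn = lsn;

	for (int i = 0; i < spool->nmessages; i++)
	{
		if (spool->messages[i] != NULL)	/* stand-in for dispatching a change */
			nchanges++;
	}
	return nchanges;
}

/* Commit path: passes the commit LSN, as apply_handle_stream_commit does. */
static int
handle_stream_commit(const MockSpool *spool, uint64_t commit_lsn)
{
	return replay_spool(spool, commit_lsn);
}

/* Assumed prepare path: same helper, different LSN source. */
static int
handle_stream_prepare(const MockSpool *spool, uint64_t prepare_lsn)
{
	return replay_spool(spool, prepare_lsn);
}
```

Passing the LSN as a parameter, rather than reading it from the commit data inside the helper, is what lets both callers share the code.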

v33-0004-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From 77c17a7a0b1baa4ad8aa915cb1e830707c59a41c Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 18 Dec 2020 11:06:58 +0530
Subject: [PATCH v33 04/10] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replica origin replay progress for
2PC, but it was incomplete. It did not properly track the progress for
rollback prepared; in particular, it did not update the recovery code.
Additionally, we need to allow tracking it on subscriber nodes, where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9b..fe10809 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2277,6 +2277,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2299,6 +2307,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9cd0b7c..b7470ce 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5720,8 +5720,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5927,7 +5926,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5976,6 +5976,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6017,7 +6024,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6025,7 +6033,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1
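
The two conditions this patch touches can be sketched in isolation as below. This is a standalone approximation, not the backend code: the function names are hypothetical, the typedef and the two constants are reproduced here on the assumption that they match replication/origin.h, and wal_level_logical stands in for XLogLogicalInfoActive().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror replication/origin.h. */
typedef uint16_t RepOriginId;
#define InvalidRepOriginId 0
#define DoNotReplicateId   UINT16_MAX

/*
 * The test added to RecordTransactionAbortPrepared: are we replaying
 * remote actions under a real replication origin?
 */
static bool
replaying_remote_actions(RepOriginId session_origin)
{
	return session_origin != InvalidRepOriginId &&
		session_origin != DoNotReplicateId;
}

/* XactLogAbortRecord before the patch: origin info needs wal_level=logical. */
static bool
abort_has_origin_before(RepOriginId session_origin, bool twophase_xid_valid,
						bool wal_level_logical)
{
	return session_origin != InvalidRepOriginId &&
		twophase_xid_valid &&
		wal_level_logical;
}

/* After the patch: the wal_level requirement is dropped. */
static bool
abort_has_origin_after(RepOriginId session_origin, bool twophase_xid_valid)
{
	return session_origin != InvalidRepOriginId && twophase_xid_valid;
}
```

With the wal_level test removed, a subscriber running below wal_level=logical still writes origin information into the abort-prepared record, which is what the new code in xact_redo_abort relies on during recovery.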

#166Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#165)

On Sat, Dec 19, 2020 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I have changed the rollback_prepare API as discussed above and
accordingly handle the case where rollback is received without prepare
in apply_handle_rollback_prepared.

I have reviewed and tested your new patchset, I agree with all the
changes that you have made and have tested quite a few scenarios and
they seem to be working as expected.
No major comments but some minor observations:

Patch 1:
logical.c: 984
Comment should be "rollback prepared" rather than "abort prepared".

Patch 2:
decode.c: 737: The comments in the header of DecodePrepare seem out of
place, I think here it should describe what the function does rather
than what it does not.
reorderbuffer.c: 2422: It looks like pgindent has mangled the
comments, the numbering is no longer aligned.

Patch 5:
worker.c: 753: Typo: change "dont" to "don't"

Patch 6: logicaldecoding.sgml
logicaldecoding example is no longer correct. This was true prior to
the changes done to replay prepared transactions after a restart. Now
the whole transaction will get decoded again after the commit
prepared.

postgres=# COMMIT PREPARED 'test_prepared1';
COMMIT PREPARED
postgres=# select * from
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'two-phase-commit', '1');
lsn | xid | data
-----------+-----+--------------------------------------------
0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
(1 row)

Patch 8:
worker.c: 2798 :
worker.c: 3445 : disabling two-phase in tablesync worker.
considering new design of multiple commits in tablesync, do we need
to disable two-phase in tablesync?

Other than this I've noticed a few typos that are not in the patch but
in the surrounding code.
logical.c: 1383: Comment should mention stream_commit_cb not stream_abort_cb.
decode.c: 686 - Extra "it's" here: "because it's it happened"

regards,
Ajin Cherian
Fujitsu Australia

#167Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#166)

On Tue, Dec 22, 2020 at 2:51 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Sat, Dec 19, 2020 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I have changed the rollback_prepare API as discussed above and
accordingly handle the case where rollback is received without prepare
in apply_handle_rollback_prepared.

I have reviewed and tested your new patchset, I agree with all the
changes that you have made and have tested quite a few scenarios and
they seem to be working as expected.
No major comments but some minor observations:

Patch 1:
logical.c: 984
Comment should be "rollback prepared" rather than "abort prepared".

Agreed.

Patch 2:
decode.c: 737: The comments in the header of DecodePrepare seem out of
place, I think here it should describe what the function does rather
than what it does not.

Hmm, I have written it because it is important to explain the theory
of concurrent aborts as that is not quite obvious. Also, the
functionality is quite similar to DecodeCommit and the comments inside
the function explain clearly if there is any difference so not sure
what additional we can write, do you have any suggestions?

reorderbuffer.c: 2422: It looks like pgindent has mangled the
comments, the numbering is no longer aligned.

Yeah, I had also noticed that but not sure if there is a better
alternative because we don't want to change it after each pgindent
run. We might want to use (a), (b) .. notation instead but otherwise,
there is no big problem with how it is.

Patch 5:
worker.c: 753: Typo: change "dont" to "don't"

Okay.

Patch 6: logicaldecoding.sgml
logicaldecoding example is no longer correct. This was true prior to
the changes done to replay prepared transactions after a restart. Now
the whole transaction will get decoded again after the commit
prepared.

postgres=# COMMIT PREPARED 'test_prepared1';
COMMIT PREPARED
postgres=# select * from
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'two-phase-commit', '1');
lsn | xid | data
-----------+-----+--------------------------------------------
0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
(1 row)

Agreed.

Patch 8:
worker.c: 2798 :
worker.c: 3445 : disabling two-phase in tablesync worker.
considering new design of multiple commits in tablesync, do we need
to disable two-phase in tablesync?

No, but let Peter's patch get committed then we can change it.

Other than this I've noticed a few typos that are not in the patch but
in the surrounding code.
logical.c: 1383: Comment should mention stream_commit_cb not stream_abort_cb.
decode.c: 686 - Extra "it's" here: "because it's it happened"

Anything not related to this patch, please post in a separate email.

Can you please update the patch for the points we agreed upon?

--
With Regards,
Amit Kapila.

#168Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#167)
10 attachment(s)

On Tue, Dec 22, 2020 at 8:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 22, 2020 at 2:51 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Sat, Dec 19, 2020 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I have changed the rollback_prepare API as discussed above and
accordingly handle the case where rollback is received without prepare
in apply_handle_rollback_prepared.

I have reviewed and tested your new patchset, I agree with all the
changes that you have made and have tested quite a few scenarios and
they seem to be working as expected.
No major comments but some minor observations:

Patch 1:
logical.c: 984
Comment should be "rollback prepared" rather than "abort prepared".

Agreed.

Changed.

Patch 2:
decode.c: 737: The comments in the header of DecodePrepare seem out of
place, I think here it should describe what the function does rather
than what it does not.

Hmm, I have written it because it is important to explain the theory
of concurrent aborts as that is not quite obvious. Also, the
functionality is quite similar to DecodeCommit and the comments inside
the function explain clearly if there is any difference so not sure
what additional we can write, do you have any suggestions?

I have slightly re-worded it. Have a look.

reorderbuffer.c: 2422: It looks like pgindent has mangled the
comments, the numbering is no longer aligned.

Yeah, I had also noticed that but not sure if there is a better
alternative because we don't want to change it after each pgindent
run. We might want to use (a), (b) .. notation instead but otherwise,
there is no big problem with how it is.

Leaving this as is.

Patch 5:
worker.c: 753: Typo: change "dont" to "don't"

Okay.

Changed.

Patch 6: logicaldecoding.sgml
logicaldecoding example is no longer correct. This was true prior to
the changes done to replay prepared transactions after a restart. Now
the whole transaction will get decoded again after the commit
prepared.

postgres=# COMMIT PREPARED 'test_prepared1';
COMMIT PREPARED
postgres=# select * from
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'two-phase-commit', '1');
lsn | xid | data
-----------+-----+--------------------------------------------
0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
(1 row)

Agreed.

Changed.

Patch 8:
worker.c: 2798 :
worker.c: 3445 : disabling two-phase in tablesync worker.
considering new design of multiple commits in tablesync, do we need
to disable two-phase in tablesync?

No, but let Peter's patch get committed then we can change it.

OK, leaving it.

Can you please update the patch for the points we agreed upon?

Changed and attached.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v34-0005-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From c96e694b9efb9f6d542f0eb32e438e67821f3e7e Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 02:38:34 -0500
Subject: [PATCH v34] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time to the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* We allow skipping prepared transactions if they are already prepared.
This can happen when the server or apply worker restarts after a
prepared transaction. We skip only when the GID, origin_lsn, and
origin_timestamp of a prepared xact all match, to avoid the possibility
of a match of prepared xact from two different nodes.
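
As a rough illustration of the matching rule in the last bullet, here is
a minimal standalone sketch in C. It is not the patch's LookupGXact,
which walks TwoPhaseState under TwoPhaseStateLock and reads the 2PC file
header. The struct PreparedXactSketch, its fields, and
lookup_prepared_sketch() are hypothetical names invented for this
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one prepared transaction. */
typedef struct PreparedXactSketch
{
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* end LSN of the PREPARE on the origin */
	int64_t		origin_timestamp;	/* prepare time on the origin */
	bool		valid;				/* false until the prepare has completed */
} PreparedXactSketch;

/*
 * Return true only if a valid entry matches on all three of GID,
 * origin LSN and origin timestamp. A GID match alone is not enough,
 * because an unrelated prepared xact may reuse the same GID.
 */
static bool
lookup_prepared_sketch(const PreparedXactSketch *xacts, size_t n,
					   const char *gid, uint64_t prepare_end_lsn,
					   int64_t prepare_timestamp)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		const PreparedXactSketch *x = &xacts[i];

		/* Ignore not-yet-valid entries and entries with another GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == prepare_timestamp)
			return true;
	}
	return false;
}
```

With this rule, a resent prepare is skipped only when the stored origin
LSN and timestamp agree with the incoming ones; a same-named GID with a
different origin position is treated as a new transaction.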

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.
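
The "extended replication protocol" mentioned in the first bullet above
adds, among others, a PREPARE message. Going by logicalrep_write_prepare
in this patch, it carries a message type byte, a flags byte, three
64-bit fields (prepare LSN, end LSN, timestamp) and a NUL-terminated
GID. The sketch below encodes and decodes that layout in plain C, with
64-bit fields in network byte order as pq_sendint64 emits them. The type
byte value 'P', the GID buffer size, and all *_sketch names are
assumptions made for this example, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MSG_PREPARE 'P'	/* assumed type byte, illustration only */
#define SKETCH_GID_MAX 200		/* assumed GID buffer size, illustration only */

typedef struct PrepareMsgSketch
{
	uint8_t		flags;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[SKETCH_GID_MAX];
} PrepareMsgSketch;

/* Store a 64-bit value in network (big-endian) byte order. */
static void
put_u64(uint8_t *buf, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t) (v >> (56 - 8 * i));
}

/* Load a 64-bit value stored in network byte order. */
static uint64_t
get_u64(const uint8_t *buf)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Returns the number of bytes written, or 0 if the buffer is too small. */
static size_t
encode_prepare_sketch(uint8_t *out, size_t cap, const PrepareMsgSketch *m)
{
	size_t		gidlen;
	size_t		need;

	assert(out != NULL && m != NULL);
	gidlen = strlen(m->gid) + 1;	/* include the terminating NUL */
	need = 2 + 3 * 8 + gidlen;
	if (cap < need)
		return 0;

	out[0] = SKETCH_MSG_PREPARE;
	out[1] = m->flags;
	put_u64(out + 2, m->prepare_lsn);
	put_u64(out + 10, m->end_lsn);
	put_u64(out + 18, (uint64_t) m->prepare_time);
	memcpy(out + 26, m->gid, gidlen);
	return need;
}

/* Returns false on a short message, wrong type, unknown flags or long GID. */
static bool
decode_prepare_sketch(const uint8_t *in, size_t len, PrepareMsgSketch *m)
{
	const uint8_t *gid_end;
	size_t		gidlen;

	if (len < 27 || in[0] != SKETCH_MSG_PREPARE || in[1] != 0)
		return false;

	gid_end = memchr(in + 26, '\0', len - 26);
	if (gid_end == NULL)
		return false;
	gidlen = (size_t) (gid_end - (in + 26)) + 1;
	if (gidlen > SKETCH_GID_MAX)
		return false;

	m->flags = in[1];
	m->prepare_lsn = get_u64(in + 2);
	m->end_lsn = get_u64(in + 10);
	m->prepare_time = (int64_t) get_u64(in + 18);
	memcpy(m->gid, in + 26, gidlen);
	return true;
}
```

One deliberate difference from the patch: the decoder here rejects a GID
that would not fit in the destination buffer, whereas the patch's read
functions use strcpy into a pre-allocated buffer.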

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  74 ++++++-
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 260 +++++++++++++++++++++-
 src/backend/replication/logical/worker.c    | 330 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 172 ++++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 ++++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 895 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index fe10809..71cca00 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1133,9 +1133,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
@@ -2446,3 +2446,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 15ab8e7..dd33469 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -957,8 +957,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..1047385 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,264 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4f75e85..4f57a8a 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* for skipping prepared transaction */
+bool        skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,264 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1027,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it will
 	 * be committed after processing all the messages. We need the transaction
@@ -800,6 +1080,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -835,6 +1121,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1053,6 +1345,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1176,6 +1474,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1297,6 +1598,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1454,6 +1758,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1823,6 +2130,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1979,6 +2289,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 49d25b0..7cf2951 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,27 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -378,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool        send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -766,17 +835,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -857,6 +917,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1171,3 +1249,31 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 	while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
 		entry->replicate_valid = false;
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr	origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..5afb977 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..13ea3b7 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for prepare, and commit prepared transaction.
+ * prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6d63338..4b92e68 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9cd047ba..ecba4ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1339,12 +1339,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
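
For readers following the protocol changes in the patch above, here is a minimal sketch (not part of the patchset) of how a receiver might map the new two-phase message-type bytes to readable names. The byte values are copied from the LogicalRepMsgType additions in logicalproto.h; the names two_phase_msg_name and two_phase_msg_selfcheck are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): map the two-phase message-type
 * bytes that the patch adds to LogicalRepMsgType to readable names.
 * Returns NULL for any byte that is not one of the five new types.
 */
static const char *
two_phase_msg_name(char msgtype)
{
	switch (msgtype)
	{
		case 'b':
			return "BEGIN PREPARE";
		case 'P':
			return "PREPARE";
		case 'K':
			return "COMMIT PREPARED";
		case 'r':
			return "ROLLBACK PREPARED";
		case 'p':
			return "STREAM PREPARE";
		default:
			return NULL;
	}
}

/* Hypothetical self-check: the byte values are case-sensitive. */
static void
two_phase_msg_selfcheck(void)
{
	/* 'P' is PREPARE, while lowercase 'p' is STREAM PREPARE */
	assert(strcmp(two_phase_msg_name('P'), "PREPARE") == 0);
	assert(strcmp(two_phase_msg_name('p'), "STREAM PREPARE") == 0);
	/* 'B' (regular BEGIN) is not one of the new two-phase types */
	assert(two_phase_msg_name('B') == NULL);
}
```

Note that the new bytes differ from existing ones only by case in some places ('b' vs. 'B', 'P' vs. 'p'), so a receiver must compare the raw byte and not a case-folded value.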

v34-0004-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From a3462f6f7e4a5fcda573fd783e139ea2e3114583 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 02:36:02 -0500
Subject: [PATCH v34] Track replication origin progress for rollbacks.

Commit 1eb6d6527a added tracking of replication origin replay progress for
2PC, but it was incomplete. It did not properly track the progress for
ROLLBACK PREPARED; in particular, it did not update the recovery code.
Additionally, we need to allow tracking it on subscriber nodes, where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9b..fe10809 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2277,6 +2277,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2299,6 +2307,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9cd0b7c..b7470ce 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5720,8 +5720,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5927,7 +5926,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5976,6 +5976,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6017,7 +6024,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6025,7 +6033,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1
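
To illustrate the recovery-side change in the patch above, here is a toy model (not PostgreSQL code) of forward-only origin progress tracking. It assumes the behaviour suggested by the "false /* backward */" argument passed to replorigin_advance() in xact_redo_abort(): the stored positions only move forward unless the caller explicitly allows going backward. The names ToyLsn, ToyOriginState and toy_origin_advance are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ToyLsn;		/* stand-in for XLogRecPtr; 0 means "invalid" */

/* Hypothetical per-origin progress: remote position and local WAL position */
typedef struct ToyOriginState
{
	ToyLsn		remote_lsn;
	ToyLsn		local_lsn;
} ToyOriginState;

/*
 * Assumed semantics, for illustration only: a position is overwritten when
 * it moves forward, or unconditionally when go_backward is true. An invalid
 * (zero) local_lsn leaves the stored local position untouched.
 */
static void
toy_origin_advance(ToyOriginState *state, ToyLsn remote_lsn,
				   ToyLsn local_lsn, bool go_backward)
{
	assert(state != NULL);

	if (go_backward || state->remote_lsn < remote_lsn)
		state->remote_lsn = remote_lsn;

	if (local_lsn != 0 && (go_backward || state->local_lsn < local_lsn))
		state->local_lsn = local_lsn;
}
```

Under this model, replaying an older abort-prepared record after a newer one leaves the recorded progress unchanged, which is why redo can call the advance function for every record that carries origin information.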

v34-0001-Extend-the-output-plugin-API-to-allow-decoding-o.patch (application/octet-stream)
From 24468b822d0a0e9fd27706e9dd68cf7b3624b5cc Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 22 Dec 2020 21:55:46 -0500
Subject: [PATCH v34] Extend the output plugin API to allow decoding of
 prepared xacts.

This adds six methods to the output plugin API, adding support for
streaming changes of two-phase transactions at prepare time.

* begin_prepare
* filter_prepare
* prepare
* commit_prepared
* rollback_prepared
* stream_prepare

Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction is not yet committed and may be
aborted later.

Until now two-phase transactions were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the
two-phase commands were communicated to the subscriber.

This patch provides the infrastructure for logical decoding plugins to be
informed of two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED, with the corresponding GID.

This also extends the 'test_decoding' plugin, implementing these new
methods.

This commit simply adds these new APIs and the upcoming patch to "allow
the decoding at prepare time in ReorderBuffer" will use these APIs.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c     | 167 +++++++++++++++++
 doc/src/sgml/logicaldecoding.sgml         | 172 ++++++++++++++++-
 src/backend/replication/logical/logical.c | 297 ++++++++++++++++++++++++++++++
 src/include/replication/logical.h         |   6 +
 src/include/replication/output_plugin.h   |  56 ++++++
 src/include/replication/reorderbuffer.h   |  41 +++++
 src/tools/pgindent/typedefs.list          |  12 ++
 7 files changed, 744 insertions(+), 7 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e12278b..0576355 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -76,6 +76,20 @@ static void pg_decode_message(LogicalDecodingContext *ctx,
 							  ReorderBufferTXN *txn, XLogRecPtr message_lsn,
 							  bool transactional, const char *prefix,
 							  Size sz, const char *message);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+									 const char *gid);
+static void pg_decode_begin_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn);
+static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
+								  ReorderBufferTXN *txn,
+								  XLogRecPtr prepare_lsn);
+static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
+										  ReorderBufferTXN *txn,
+										  XLogRecPtr commit_lsn);
+static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+											ReorderBufferTXN *txn,
+											XLogRecPtr prepare_end_lsn,
+											TimestampTz prepare_time);
 static void pg_decode_stream_start(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn);
 static void pg_output_stream_start(LogicalDecodingContext *ctx,
@@ -87,6 +101,9 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx,
 static void pg_decode_stream_abort(LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr abort_lsn);
+static void pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+									 ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
 static void pg_decode_stream_commit(LogicalDecodingContext *ctx,
 									ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
@@ -123,9 +140,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->filter_by_origin_cb = pg_decode_filter;
 	cb->shutdown_cb = pg_decode_shutdown;
 	cb->message_cb = pg_decode_message;
+	cb->filter_prepare_cb = pg_decode_filter_prepare;
+	cb->begin_prepare_cb = pg_decode_begin_prepare_txn;
+	cb->prepare_cb = pg_decode_prepare_txn;
+	cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+	cb->rollback_prepared_cb = pg_decode_rollback_prepared_txn;
 	cb->stream_start_cb = pg_decode_stream_start;
 	cb->stream_stop_cb = pg_decode_stream_stop;
 	cb->stream_abort_cb = pg_decode_stream_abort;
+	cb->stream_prepare_cb = pg_decode_stream_prepare;
 	cb->stream_commit_cb = pg_decode_stream_commit;
 	cb->stream_change_cb = pg_decode_stream_change;
 	cb->stream_message_cb = pg_decode_stream_message;
@@ -141,6 +164,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	ListCell   *option;
 	TestDecodingData *data;
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 
 	data = palloc0(sizeof(TestDecodingData));
 	data->context = AllocSetContextCreate(ctx->context,
@@ -241,6 +265,16 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "two-phase-commit") == 0)
+		{
+			if (elem->arg == NULL)
+				continue;
+			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
+								strVal(elem->arg), elem->defname)));
+		}
 		else
 		{
 			ereport(ERROR,
@@ -252,6 +286,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 
 	ctx->streaming &= enable_streaming;
+	ctx->twophase &= enable_twophase;
 }
 
 /* cleanup this plugin's resources */
@@ -320,6 +355,111 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	OutputPluginWrite(ctx, true);
 }
 
+/* BEGIN PREPARE callback */
+static void
+pg_decode_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata =
+	MemoryContextAllocZero(ctx->context, sizeof(TestDecodingTxnData));
+
+	txndata->xact_wrote_changes = false;
+	txn->output_plugin_private = txndata;
+
+	if (data->skip_empty_xacts)
+		return;
+
+	pg_output_begin(ctx, data, txn, true);
+}
+
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					  XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "PREPARE TRANSACTION %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* COMMIT PREPARED callback */
+static void
+pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							  XLogRecPtr commit_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "COMMIT PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/* ROLLBACK PREPARED callback */
+static void
+pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_end_lsn,
+								TimestampTz prepare_time)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	appendStringInfo(ctx->out, "ROLLBACK PREPARED %s",
+					 quote_literal_cstr(txn->gid));
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, ", txid %u", txn->xid);
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here we demonstrate a
+ * simple logic by checking the GID. If the GID contains the "_nodecode"
+ * substring, then we filter it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, const char *gid)
+{
+	if (strstr(gid, "_nodecode") != NULL)
+		return true;
+
+	return false;
+}
+
 static bool
 pg_decode_filter(LogicalDecodingContext *ctx,
 				 RepOriginId origin_id)
@@ -702,6 +842,33 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx,
 }
 
 static void
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+						 ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	TestDecodingData *data = ctx->output_plugin_private;
+	TestDecodingTxnData *txndata = txn->output_plugin_private;
+
+	if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+		return;
+
+	OutputPluginPrepareWrite(ctx, true);
+
+	if (data->include_xids)
+		appendStringInfo(ctx->out, "preparing streamed transaction TXN %s, txid %u",
+						 quote_literal_cstr(txn->gid), txn->xid);
+	else
+		appendStringInfo(ctx->out, "preparing streamed transaction %s",
+						 quote_literal_cstr(txn->gid));
+
+	if (data->include_timestamp)
+		appendStringInfo(ctx->out, " (at %s)",
+						 timestamptz_to_str(txn->commit_time));
+
+	OutputPluginWrite(ctx, true);
+}
+
+static void
 pg_decode_stream_commit(LogicalDecodingContext *ctx,
 						ReorderBufferTXN *txn,
 						XLogRecPtr commit_lsn)
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 813a037..829bbc1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -389,9 +389,15 @@ typedef struct OutputPluginCallbacks
     LogicalDecodeMessageCB message_cb;
     LogicalDecodeFilterByOriginCB filter_by_origin_cb;
     LogicalDecodeShutdownCB shutdown_cb;
+    LogicalDecodeFilterPrepareCB filter_prepare_cb;
+    LogicalDecodeBeginPrepareCB begin_prepare_cb;
+    LogicalDecodePrepareCB prepare_cb;
+    LogicalDecodeCommitPreparedCB commit_prepared_cb;
+    LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
     LogicalDecodeStreamStartCB stream_start_cb;
     LogicalDecodeStreamStopCB stream_stop_cb;
     LogicalDecodeStreamAbortCB stream_abort_cb;
+    LogicalDecodeStreamPrepareCB stream_prepare_cb;
     LogicalDecodeStreamCommitCB stream_commit_cb;
     LogicalDecodeStreamChangeCB stream_change_cb;
     LogicalDecodeStreamMessageCB stream_message_cb;
@@ -413,10 +419,20 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function> and <function>stream_change_cb</function>
+     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
+     and <function>stream_prepare_cb</function>
      are required, while <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
+
+    <para>
+    An output plugin may also define functions to support two-phase commits,
+    which allows actions to be decoded on the <command>PREPARE TRANSACTION</command>.
+    The <function>begin_prepare_cb</function>, <function>prepare_cb</function>,
+    <function>stream_prepare_cb</function>,
+    <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
+    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    </para>
    </sect2>
 
    <sect2 id="logicaldecoding-capabilities">
@@ -477,7 +493,15 @@ CREATE TABLE another_catalog_table(data text) WITH (user_catalog_table = true);
      never get
      decoded. Successful savepoints are
      folded into the transaction containing them in the order they were
-     executed within that transaction.
+     executed within that transaction. A transaction that is prepared for
+     a two-phase commit using <command>PREPARE TRANSACTION</command> will
+     also be decoded if the output plugin callbacks needed for decoding
+     it are provided. It is possible that the current transaction which
+     is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+     command. In that case, the logical decoding of this transaction will
+     be aborted too. We will skip all the changes of such a transaction once
+     the abort is detected and abort the transaction when we read WAL for
+     <command>ROLLBACK PREPARED</command>.
     </para>
 
     <note>
@@ -587,7 +611,13 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
       an <command>INSERT</command>, <command>UPDATE</command>,
       or <command>DELETE</command>. Even if the original command modified
       several rows at once the callback will be called individually for each
-      row.
+      row. The <function>change_cb</function> callback may access system or
+      user catalog tables to aid in the process of outputting the row
+      modification details. In case of decoding a prepared (but not yet
+      committed) transaction or decoding of an uncommitted transaction, this
+      change callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
 <programlisting>
 typedef void (*LogicalDecodeChangeCB) (struct LogicalDecodingContext *ctx,
                                        ReorderBufferTXN *txn,
@@ -685,7 +715,13 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
       non-transactional and the XID was not assigned yet in the transaction
       which logged the message. The <parameter>lsn</parameter> has WAL
       location of the message. The <parameter>transactional</parameter> says
-      if the message was sent as transactional or not.
+      if the message was sent as transactional or not. Similar to the change
+      callback, in case of decoding a prepared (but not yet committed)
+      transaction or decoding of an uncommitted transaction, this message
+      callback might also error out due to simultaneous rollback of
+      this very same transaction. In that case, the logical decoding of this
+      aborted transaction is stopped gracefully.
+
       The <parameter>prefix</parameter> is arbitrary null-terminated prefix
       which can be used for identifying interesting messages for the current
       plugin. And finally the <parameter>message</parameter> parameter holds
@@ -698,6 +734,111 @@ typedef void (*LogicalDecodeMessageCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-filter-prepare">
+     <title>Prepare Filter Callback</title>
+
+     <para>
+       The optional <function>filter_prepare_cb</function> callback
+       is called to determine whether data that is part of the current
+       two-phase commit transaction should be considered for decoding
+       at this prepare stage or as a regular one-phase transaction at
+       <command>COMMIT PREPARED</command> time later. To signal that
+       decoding should be skipped, return <literal>true</literal>;
+       <literal>false</literal> otherwise. When the callback is not
+       defined, <literal>false</literal> is assumed (i.e. nothing is
+       filtered).
+<programlisting>
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              const char *gid);
+</programlisting>
+      The <parameter>ctx</parameter> parameter has the same contents as for the
+      other callbacks. The <parameter>gid</parameter> is the identifier that later
+      identifies this transaction for <command>COMMIT PREPARED</command> or
+      <command>ROLLBACK PREPARED</command>.
+     </para>
+     <para>
+      The callback has to provide the same static answer for a given
+      <parameter>gid</parameter> every time it is called.
+     </para>
+     </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-begin-prepare">
+     <title>Transaction Begin Prepare Callback</title>
+
+     <para>
+      The required <function>begin_prepare_cb</function> callback is called
+      whenever the start of a prepared transaction has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback to
+      check if the plugin has already received this prepare, in which case it
+      can skip the remaining changes of the transaction. This can only happen
+      if the user restarts the decoding after receiving the prepare for a
+      transaction but before receiving the commit prepared, for example
+      because of some error.
+      <programlisting>
+       typedef void (*LogicalDecodeBeginPrepareCB) (struct LogicalDecodingContext *ctx,
+                                                    ReorderBufferTXN *txn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-prepare">
+     <title>Transaction Prepare Callback</title>
+
+     <para>
+      The required <function>prepare_cb</function> callback is called whenever
+      a transaction which is prepared for two-phase commit has been
+      decoded. The <function>change_cb</function> callback for all modified
+      rows will have been called before this, if there have been any modified
+      rows. The <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                               ReorderBufferTXN *txn,
+                                               XLogRecPtr prepare_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+     <title>Transaction Commit Prepared Callback</title>
+
+     <para>
+      The required <function>commit_prepared_cb</function> callback is called
+      whenever a transaction commit prepared has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback.
+      <programlisting>
+       typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                      ReorderBufferTXN *txn,
+                                                      XLogRecPtr commit_lsn);
+      </programlisting>
+     </para>
+    </sect3>
+
+    <sect3 id="logicaldecoding-output-plugin-rollback-prepared">
+     <title>Transaction Rollback Prepared Callback</title>
+
+     <para>
+      The required <function>rollback_prepared_cb</function> callback is called
+      whenever a transaction rollback prepared has been decoded. The
+      <parameter>gid</parameter> field, which is part of the
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this prepared transaction, in which case it can apply the
+      rollback; otherwise, it can skip the rollback operation. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
+      <programlisting>
+       typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+                                                        ReorderBufferTXN *txn,
+                                                        XLogRecPtr prepare_end_lsn,
+                                                        TimestampTz prepare_time);
+      </programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-start">
      <title>Stream Start Callback</title>
      <para>
@@ -735,6 +876,19 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
      </para>
     </sect3>
 
+    <sect3 id="logicaldecoding-output-plugin-stream-prepare">
+     <title>Stream Prepare Callback</title>
+     <para>
+      The <function>stream_prepare_cb</function> callback is called to prepare
+      a previously streamed transaction as part of a two-phase commit.
+<programlisting>
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+                                              ReorderBufferTXN *txn,
+                                              XLogRecPtr prepare_lsn);
+</programlisting>
+     </para>
+    </sect3>
+
     <sect3 id="logicaldecoding-output-plugin-stream-commit">
      <title>Stream Commit Callback</title>
      <para>
@@ -913,9 +1067,13 @@ OutputPluginWrite(ctx, true);
     When streaming an in-progress transaction, the changes (and messages) are
     streamed in blocks demarcated by <function>stream_start_cb</function>
     and <function>stream_stop_cb</function> callbacks. Once all the decoded
-    changes are transmitted, the transaction is committed using the
-    <function>stream_commit_cb</function> callback (or possibly aborted using
-    the <function>stream_abort_cb</function> callback).
+    changes are transmitted, the transaction can be committed using the
+    <function>stream_commit_cb</function> callback
+    (or possibly aborted using the <function>stream_abort_cb</function> callback).
+    If two-phase commits are supported, the transaction can be prepared using the
+    <function>stream_prepare_cb</function> callback, committed using the
+    <function>commit_prepared_cb</function> callback, or rolled back using the
+    <function>rollback_prepared_cb</function> callback.
    </para>
 
    <para>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f1f4df7..3399483 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -59,6 +59,13 @@ static void shutdown_cb_wrapper(LogicalDecodingContext *ctx);
 static void begin_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
 static void commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  XLogRecPtr commit_lsn);
+static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn);
+static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_lsn);
+static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									   XLogRecPtr commit_lsn);
+static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							  Relation relation, ReorderBufferChange *change);
 static void truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -74,6 +81,8 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 								   XLogRecPtr last_lsn);
 static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									XLogRecPtr abort_lsn);
+static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+									  XLogRecPtr prepare_lsn);
 static void stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 									 XLogRecPtr commit_lsn);
 static void stream_change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -237,11 +246,37 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder->stream_start = stream_start_cb_wrapper;
 	ctx->reorder->stream_stop = stream_stop_cb_wrapper;
 	ctx->reorder->stream_abort = stream_abort_cb_wrapper;
+	ctx->reorder->stream_prepare = stream_prepare_cb_wrapper;
 	ctx->reorder->stream_commit = stream_commit_cb_wrapper;
 	ctx->reorder->stream_change = stream_change_cb_wrapper;
 	ctx->reorder->stream_message = stream_message_cb_wrapper;
 	ctx->reorder->stream_truncate = stream_truncate_cb_wrapper;
 
+
+	/*
+	 * To support two-phase logical decoding, we require
+	 * begin_prepare/prepare/commit_prepared/rollback_prepared callbacks. The
+	 * filter_prepare callback is optional. We however enable two-phase
+	 * logical decoding when at least one of the methods is enabled so that we
+	 * can easily identify missing methods.
+	 *
+	 * We decide it here, but only check it later in the wrappers.
+	 */
+	ctx->twophase = (ctx->callbacks.begin_prepare_cb != NULL) ||
+		(ctx->callbacks.prepare_cb != NULL) ||
+		(ctx->callbacks.commit_prepared_cb != NULL) ||
+		(ctx->callbacks.rollback_prepared_cb != NULL) ||
+		(ctx->callbacks.stream_prepare_cb != NULL) ||
+		(ctx->callbacks.filter_prepare_cb != NULL);
+
+	/*
+	 * Callback to support decoding at prepare time.
+	 */
+	ctx->reorder->begin_prepare = begin_prepare_cb_wrapper;
+	ctx->reorder->prepare = prepare_cb_wrapper;
+	ctx->reorder->commit_prepared = commit_prepared_cb_wrapper;
+	ctx->reorder->rollback_prepared = rollback_prepared_cb_wrapper;
+
 	ctx->out = makeStringInfo();
 	ctx->prepare_write = prepare_write;
 	ctx->write = do_write;
@@ -782,6 +817,186 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	error_context_stack = errcallback.previous;
 }
 
+/*
+ * The functionality of begin_prepare is quite similar to begin with the
+ * exception that this will have gid (global transaction id) information which
+ * can be used by plugin. Now, we thought about extending the existing begin
+ * but that would break the replication protocol and additionally this looks
+ * cleaner.
+ */
+static void
+begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "begin_prepare";
+	state.report_location = txn->first_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->first_lsn;
+
+	/*
+	 * If the plugin supports two-phase commits then begin prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.begin_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires begin_prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.begin_prepare_cb(ctx, txn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+				   XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "prepare";
+	state.report_location = txn->final_lsn; /* beginning of prepare record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then prepare callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires prepare_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						   XLogRecPtr commit_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "commit_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then commit prepared callback
+	 * is mandatory
+	 */
+	if (ctx->callbacks.commit_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
+rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+							 XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/* We're only supposed to call this when two-phase commits are supported */
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "rollback_prepared";
+	state.report_location = txn->final_lsn; /* beginning of commit record */
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn; /* points to the end of the record */
+
+	/*
+	 * If the plugin supports two-phase commits then rollback prepared callback is
+	 * mandatory
+	 */
+	if (ctx->callbacks.rollback_prepared_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical replication at prepare time requires rollback_prepared_cb callback")));
+
+	/* do the actual work: call callback */
+	ctx->callbacks.rollback_prepared_cb(ctx, txn, prepare_end_lsn,
+										prepare_time);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
 static void
 change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				  Relation relation, ReorderBufferChange *change)
@@ -860,6 +1075,45 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 bool
+filter_prepare_cb_wrapper(LogicalDecodingContext *ctx, const char *gid)
+{
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+	bool		ret;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case, all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "filter_prepare";
+	state.report_location = InvalidXLogRecPtr;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = false;
+
+	/* do the actual work: call callback */
+	ret = ctx->callbacks.filter_prepare_cb(ctx, gid);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
+	return ret;
+}
+
+bool
 filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
 	LogicalErrorCallbackState state;
@@ -1057,6 +1311,49 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }
 
 static void
+stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+						  XLogRecPtr prepare_lsn)
+{
+	LogicalDecodingContext *ctx = cache->private_data;
+	LogicalErrorCallbackState state;
+	ErrorContextCallback errcallback;
+
+	Assert(!ctx->fast_forward);
+
+	/*
+	 * We're only supposed to call this when streaming and two-phase commits
+	 * are supported.
+	 */
+	Assert(ctx->streaming);
+	Assert(ctx->twophase);
+
+	/* Push callback + info on the error context stack */
+	state.ctx = ctx;
+	state.callback_name = "stream_prepare";
+	state.report_location = txn->final_lsn;
+	errcallback.callback = output_plugin_error_callback;
+	errcallback.arg = (void *) &state;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* set output state */
+	ctx->accept_writes = true;
+	ctx->write_xid = txn->xid;
+	ctx->write_location = txn->end_lsn;
+
+	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
+
+	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
+
+static void
 stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						 XLogRecPtr commit_lsn)
 {
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 40bab7e..28c9c1f 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,6 +85,11 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
+	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 */
+	bool		twophase;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
@@ -120,6 +125,7 @@ extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 												  XLogRecPtr restart_lsn);
 extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
+extern bool filter_prepare_cb_wrapper(LogicalDecodingContext *ctx, const char *gid);
 extern bool filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id);
 extern void ResetLogicalStreamingState(void);
 extern void UpdateDecodingStats(LogicalDecodingContext *ctx);
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index b78c796..89e1dc3 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -100,6 +100,45 @@ typedef bool (*LogicalDecodeFilterByOriginCB) (struct LogicalDecodingContext *ct
 typedef void (*LogicalDecodeShutdownCB) (struct LogicalDecodingContext *ctx);
 
 /*
+ * Called before decoding of a PREPARE record to decide whether this
+ * transaction should be decoded with separate calls to prepare and
+ * commit_prepared/rollback_prepared callbacks, or wait until COMMIT PREPARED
+ * and be sent as a usual transaction.
+ */
+typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
+											  const char *gid);
+
+/*
+ * Callback called for every BEGIN of a prepared transaction.
+ */
+typedef void (*LogicalDecodeBeginPrepareCB) (struct LogicalDecodingContext *ctx,
+											 ReorderBufferTXN *txn);
+
+/*
+ * Called for PREPARE record unless it was filtered by filter_prepare()
+ * callback.
+ */
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/*
+ * Called for COMMIT PREPARED.
+ */
+typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/*
+ * Called for ROLLBACK PREPARED.
+ */
+typedef void (*LogicalDecodeRollbackPreparedCB) (struct LogicalDecodingContext *ctx,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr prepare_end_lsn,
+												 TimestampTz prepare_time);
+
+
+/*
  * Called when starting to stream a block of changes from in-progress
  * transaction (may be called repeatedly, if it's streamed in multiple
  * chunks).
@@ -124,6 +163,14 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
 											XLogRecPtr abort_lsn);
 
 /*
+ * Called to prepare changes streamed to remote node from in-progress
+ * transaction. This is called as part of a two-phase commit.
+ */
+typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
+/*
  * Called to apply changes streamed to remote node from in-progress
  * transaction.
  */
@@ -173,10 +220,19 @@ typedef struct OutputPluginCallbacks
 	LogicalDecodeMessageCB message_cb;
 	LogicalDecodeFilterByOriginCB filter_by_origin_cb;
 	LogicalDecodeShutdownCB shutdown_cb;
+
+	/* streaming of changes at prepare time */
+	LogicalDecodeFilterPrepareCB filter_prepare_cb;
+	LogicalDecodeBeginPrepareCB begin_prepare_cb;
+	LogicalDecodePrepareCB prepare_cb;
+	LogicalDecodeCommitPreparedCB commit_prepared_cb;
+	LogicalDecodeRollbackPreparedCB rollback_prepared_cb;
+
 	/* streaming of changes */
 	LogicalDecodeStreamStartCB stream_start_cb;
 	LogicalDecodeStreamStopCB stream_stop_cb;
 	LogicalDecodeStreamAbortCB stream_abort_cb;
+	LogicalDecodeStreamPrepareCB stream_prepare_cb;
 	LogicalDecodeStreamCommitCB stream_commit_cb;
 	LogicalDecodeStreamChangeCB stream_change_cb;
 	LogicalDecodeStreamMessageCB stream_message_cb;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bd9dd7e..1e60afe 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -245,6 +245,12 @@ typedef struct ReorderBufferTXN
 	TransactionId toplevel_xid;
 
 	/*
+	 * Global transaction id required for identification of prepared
+	 * transactions.
+	 */
+	char	   *gid;
+
+	/*
 	 * LSN of the first data carrying, WAL record with knowledge about this
 	 * xid. This is allowed to *not* be first record adorned with this xid, if
 	 * the previous records aren't relevant for logical decoding.
@@ -418,6 +424,26 @@ typedef void (*ReorderBufferMessageCB) (ReorderBuffer *rb,
 										const char *prefix, Size sz,
 										const char *message);
 
+/* begin prepare callback signature */
+typedef void (*ReorderBufferBeginPrepareCB) (ReorderBuffer *rb,
+											 ReorderBufferTXN *txn);
+
+/* prepare callback signature */
+typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
+										ReorderBufferTXN *txn,
+										XLogRecPtr prepare_lsn);
+
+/* commit prepared callback signature */
+typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
+											   ReorderBufferTXN *txn,
+											   XLogRecPtr commit_lsn);
+
+/* rollback prepared callback signature */
+typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
+												 ReorderBufferTXN *txn,
+												 XLogRecPtr prepare_end_lsn,
+												 TimestampTz prepare_time);
+
 /* start streaming transaction callback signature */
 typedef void (*ReorderBufferStreamStartCB) (
 											ReorderBuffer *rb,
@@ -436,6 +462,12 @@ typedef void (*ReorderBufferStreamAbortCB) (
 											ReorderBufferTXN *txn,
 											XLogRecPtr abort_lsn);
 
+/* prepare streamed transaction callback signature */
+typedef void (*ReorderBufferStreamPrepareCB) (
+											  ReorderBuffer *rb,
+											  ReorderBufferTXN *txn,
+											  XLogRecPtr prepare_lsn);
+
 /* commit streamed transaction callback signature */
 typedef void (*ReorderBufferStreamCommitCB) (
 											 ReorderBuffer *rb,
@@ -505,11 +537,20 @@ struct ReorderBuffer
 	ReorderBufferMessageCB message;
 
 	/*
+	 * Callbacks to be called when streaming a transaction at prepare time.
+	 */
+	ReorderBufferBeginCB begin_prepare;
+	ReorderBufferPrepareCB prepare;
+	ReorderBufferCommitPreparedCB commit_prepared;
+	ReorderBufferRollbackPreparedCB rollback_prepared;
+
+	/*
 	 * Callbacks to be called when streaming a transaction.
 	 */
 	ReorderBufferStreamStartCB stream_start;
 	ReorderBufferStreamStopCB stream_stop;
 	ReorderBufferStreamAbortCB stream_abort;
+	ReorderBufferStreamPrepareCB stream_prepare;
 	ReorderBufferStreamCommitCB stream_commit;
 	ReorderBufferStreamChangeCB stream_change;
 	ReorderBufferStreamMessageCB stream_message;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bca37c5..9cd047ba 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1315,9 +1315,21 @@ LogStmtLevel
 LogicalDecodeBeginCB
 LogicalDecodeChangeCB
 LogicalDecodeCommitCB
+LogicalDecodeFilterPrepareCB
+LogicalDecodeBeginPrepareCB
+LogicalDecodePrepareCB
+LogicalDecodeCommitPreparedCB
+LogicalDecodeRollbackPreparedCB
 LogicalDecodeFilterByOriginCB
 LogicalDecodeMessageCB
 LogicalDecodeShutdownCB
+LogicalDecodeStreamStartCB
+LogicalDecodeStreamStopCB
+LogicalDecodeStreamAbortCB
+LogicalDecodeStreamPrepareCB
+LogicalDecodeStreamCommitCB
+LogicalDecodeStreamChangeCB
+LogicalDecodeStreamMessageCB
 LogicalDecodeStartupCB
 LogicalDecodeTruncateCB
 LogicalDecodingContext
-- 
1.8.3.1
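
For anyone reading along, the enablement rule in StartupDecodingContext() above (two-phase is switched on as soon as any one of the two-phase callbacks is set, and missing required callbacks are reported later from the wrappers) can be sketched outside the backend roughly as follows. This is only an illustration under stated assumptions: SketchCallbacks, sketch_cb, sketch_noop_cb, sketch_twophase_enabled and sketch_first_missing are hypothetical names that do not exist in the patch, and the callback pointers are reduced to argument-less functions so the snippet compiles without PostgreSQL headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a callback pointer; real callbacks take arguments. */
typedef void (*sketch_cb) (void);

/* Hypothetical subset of OutputPluginCallbacks: only the two-phase slots. */
typedef struct SketchCallbacks
{
	sketch_cb	filter_prepare_cb;	/* optional */
	sketch_cb	begin_prepare_cb;
	sketch_cb	prepare_cb;
	sketch_cb	commit_prepared_cb;
	sketch_cb	rollback_prepared_cb;
	sketch_cb	stream_prepare_cb;	/* checked only when streaming */
} SketchCallbacks;

/* Placeholder callback so that a slot can be marked as "provided". */
void
sketch_noop_cb(void)
{
}

/* Two-phase decoding is enabled as soon as any two-phase callback is set. */
bool
sketch_twophase_enabled(const SketchCallbacks *cb)
{
	assert(cb != NULL);

	return cb->begin_prepare_cb != NULL ||
		cb->prepare_cb != NULL ||
		cb->commit_prepared_cb != NULL ||
		cb->rollback_prepared_cb != NULL ||
		cb->stream_prepare_cb != NULL ||
		cb->filter_prepare_cb != NULL;
}

/*
 * Return the name of the first required callback that is missing, or NULL if
 * the set is complete.  The patch performs the equivalent check lazily, one
 * callback at a time, inside each wrapper.
 */
const char *
sketch_first_missing(const SketchCallbacks *cb, bool streaming)
{
	assert(cb != NULL);

	if (cb->begin_prepare_cb == NULL)
		return "begin_prepare_cb";
	if (cb->prepare_cb == NULL)
		return "prepare_cb";
	if (cb->commit_prepared_cb == NULL)
		return "commit_prepared_cb";
	if (cb->rollback_prepared_cb == NULL)
		return "rollback_prepared_cb";
	if (streaming && cb->stream_prepare_cb == NULL)
		return "stream_prepare_cb";
	return NULL;
}
```

The point of the "any one callback enables it" rule is that a plugin which sets only, say, filter_prepare_cb does not silently fall back to commit-time decoding: two-phase is considered enabled, and the first wrapper that needs an absent callback raises an error naming it.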

v34-0006-Support-2PC-documentation.patchapplication/octet-stream; name=v34-0006-Support-2PC-documentation.patchDownload
From 465bd37fa52604ce2404191c704b12b47c19d639 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 04:04:25 -0500
Subject: [PATCH v34] Support-2PC-documentation.

Add documentation about two-phase commit support in Logical Decoding.
---
 doc/src/sgml/logicaldecoding.sgml | 98 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 829bbc1..b7c91fc 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -165,7 +165,56 @@ COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
-  </sect1>
+
+  <para>
+  The following example shows how logical decoding can be used to handle transactions
+  that use a two-phase commit. Before you use two-phase commit commands, you must set
+  <varname>max_prepared_transactions</varname> to at least 1. You must also set the
+  option 'two-phase-commit' to 1 when calling <function>pg_logical_slot_get_changes</function>.
+  </para>
+<programlisting>
+postgres=# BEGIN;
+postgres=*# INSERT INTO data(data) VALUES('5');
+postgres=*# PREPARE TRANSACTION 'test_prepared1';
+
+postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/1689DC0 | 529 | BEGIN 529
+ 0/1689DC0 | 529 | table public.data: INSERT: id[integer]:3 data[text]:'5'
+ 0/1689FC0 | 529 | PREPARE TRANSACTION 'test_prepared1', txid 529
+(3 rows)
+
+postgres=# COMMIT PREPARED 'test_prepared1';
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                    data                    
+-----------+-----+--------------------------------------------
+ 0/1689DC0 | 529 | BEGIN 529
+ 0/1689DC0 | 529 | table public.data: INSERT: id[integer]:3 data[text]:'5'
+ 0/1689FC0 | 529 | PREPARE TRANSACTION 'test_prepared1', txid 529
+ 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
+(4 rows)
+
+postgres=# -- you can also roll back a prepared transaction
+postgres=# BEGIN;
+postgres=*# INSERT INTO data(data) VALUES('6');
+postgres=*# PREPARE TRANSACTION 'test_prepared2';
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/168A180 | 530 | BEGIN 530
+ 0/168A1E8 | 530 | table public.data: INSERT: id[integer]:4 data[text]:'6'
+ 0/168A430 | 530 | PREPARE TRANSACTION 'test_prepared2', txid 530
+(3 rows)
+
+postgres=# ROLLBACK PREPARED 'test_prepared2';
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                     data                     
+-----------+-----+----------------------------------------------
+ 0/168A4B8 | 530 | ROLLBACK PREPARED 'test_prepared2', txid 530
+(1 row)
+</programlisting>
+</sect1>
 
   <sect1 id="logicaldecoding-explanation">
    <title>Logical Decoding Concepts</title>
@@ -1126,4 +1175,51 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
    </para>
 
   </sect1>
+
+  <sect1 id="logicaldecoding-two-phase-commits">
+   <title>Two-phase commit support for Logical Decoding</title>
+
+   <para>
+   With the basic output plugin callbacks (e.g., <function>begin_cb</function>,
+   <function>change_cb</function>, <function>commit_cb</function> and
+   <function>message_cb</function>) two-phase commit commands like
+   <command>PREPARE TRANSACTION</command>, <command>COMMIT PREPARED</command>
+   and <command>ROLLBACK PREPARED</command> are not decoded correctly.
+   While <command>PREPARE TRANSACTION</command> is ignored,
+   <command>COMMIT PREPARED</command> is decoded as a <command>COMMIT</command> and
+   <command>ROLLBACK PREPARED</command> is decoded as a <command>ROLLBACK</command>.
+   </para>
+
+   <para>
+   An output plugin may provide additional callbacks to support two-phase commit commands.
+   Several two-phase commit callbacks are required
+   (<function>begin_prepare_cb</function>, <function>prepare_cb</function>,
+   <function>commit_prepared_cb</function>,
+   <function>rollback_prepared_cb</function> and <function>stream_prepare_cb</function>)
+   and one is optional (<function>filter_prepare_cb</function>).
+   </para>
+
+   <para>
+   If the output plugin callbacks for decoding two-phase commit commands are provided,
+   then on <command>PREPARE TRANSACTION</command>, the changes of that transaction are
+   decoded, passed to the output plugin and the <function>prepare_cb</function>
+   callback is invoked. This differs from the basic decoding setup where changes are
+   only passed to the output plugin when a transaction is committed. The start of a
+   prepared transaction is indicated by the <function>begin_prepare_cb</function> callback.
+   </para>
+
+   <para>
+   When a prepared transaction is rolled back using <command>ROLLBACK PREPARED</command>,
+   then the <function>rollback_prepared_cb</function> callback is invoked and when the
+   prepared transaction is committed using <command>COMMIT PREPARED</command>,
+   then the <function>commit_prepared_cb</function> callback is invoked.
+   </para>
+
+   <para>
+   Optionally, the output plugin can specify a name pattern in the
+   <function>filter_prepare_cb</function> callback, and transactions whose gid contains
+   that name pattern will not be decoded as two-phase commit transactions.
+   </para>
+
+  </sect1>
  </chapter>
-- 
1.8.3.1
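
As a small illustration of the filter_prepare_cb paragraph in the documentation patch above, a gid-pattern filter could look roughly like the sketch below. It is only a sketch under assumptions: the function name sketch_filter_prepare and the marker string "_nodecode" are made up for this example, and the LogicalDecodingContext argument of the real callback is left out so that the snippet compiles on its own. The return convention follows filter_prepare_cb_wrapper() in the earlier patch, where true means the transaction is filtered out and sent as a regular transaction at COMMIT PREPARED.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical marker; a real plugin would choose its own convention. */
#define SKETCH_NODECODE_MARKER "_nodecode"

/*
 * Return true when the transaction should be filtered, i.e. NOT decoded at
 * PREPARE time but sent as a regular transaction at COMMIT PREPARED.
 * The real callback also receives the decoding context; it is omitted here.
 */
bool
sketch_filter_prepare(const char *gid)
{
	assert(gid != NULL);

	return strstr(gid, SKETCH_NODECODE_MARKER) != NULL;
}
```

With this convention, PREPARE TRANSACTION 'test_prepared_nodecode' would be decoded only at COMMIT PREPARED time, while 'test_prepared1' would be decoded at PREPARE time as in the examples above.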

v34-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patchapplication/octet-stream; name=v34-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patchDownload
From 09f8d1943fa1780aa8d1404072bf2cf574d21573 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 01:19:55 -0500
Subject: [PATCH v34] Allow decoding at prepare time in ReorderBuffer.

This patch allows PREPARE-time decoding of two-phase transactions (if the
output plugin supports this capability), in which case the transactions
are replayed at PREPARE and then committed later when COMMIT PREPARED
arrives.

Now that we decode the changes before the commit, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We detect such failures with a special sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK introduced by commit 7259736a6e and stop
decoding the remaining changes. Then we roll back the changes when rollback
prepared is encountered.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, Arseny Sher, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 contrib/test_decoding/expected/twophase.out        | 235 +++++++++++
 contrib/test_decoding/expected/twophase_stream.out | 147 +++++++
 contrib/test_decoding/sql/twophase.sql             | 112 ++++++
 contrib/test_decoding/sql/twophase_stream.sql      |  45 +++
 src/backend/replication/logical/decode.c           | 286 ++++++++++++--
 src/backend/replication/logical/reorderbuffer.c    | 432 +++++++++++++++++----
 src/backend/replication/logical/snapbuild.c        |   7 +
 src/include/replication/reorderbuffer.h            |  33 +-
 9 files changed, 1193 insertions(+), 106 deletions(-)
 create mode 100644 contrib/test_decoding/expected/twophase.out
 create mode 100644 contrib/test_decoding/expected/twophase_stream.out
 create mode 100644 contrib/test_decoding/sql/twophase.sql
 create mode 100644 contrib/test_decoding/sql/twophase_stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..76d4a69 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate stream stats
+	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
diff --git a/contrib/test_decoding/expected/twophase.out b/contrib/test_decoding/expected/twophase.out
new file mode 100644
index 0000000..f9f6bed
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase.out
@@ -0,0 +1,235 @@
+-- Test prepared transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test that decoding happens at PREPARE time when two-phase-commit is enabled.
+-- Decoding after COMMIT PREPARED must have all the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+ COMMIT PREPARED 'test_prepared#1'
+(5 rows)
+
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+ COMMIT PREPARED 'test_prepared#3'
+(4 rows)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Check 'CLUSTER' (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The
+-- call should return within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ COMMIT PREPARED 'test_prepared_lock'
+(5 rows)
+
+-- Test savepoints and sub-xacts. Creating savepoints will create
+-- sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+ COMMIT PREPARED 'test_prepared_savepoint'
+(4 rows)
+
+-- Test that a GID containing "_nodecode" gets decoded at commit prepared time.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/twophase_stream.out b/contrib/test_decoding/expected/twophase_stream.out
new file mode 100644
index 0000000..3acc4acd3
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase_stream.out
@@ -0,0 +1,147 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK TO s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED and the other changes in the transaction
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ PREPARE TRANSACTION 'test1'
+ COMMIT PREPARED 'test1'
+(23 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with
+-- filtered gid. gids with '_nodecode' will not be decoded at prepare time.
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/twophase.sql b/contrib/test_decoding/sql/twophase.sql
new file mode 100644
index 0000000..894e4f5
--- /dev/null
+++ b/contrib/test_decoding/sql/twophase.sql
@@ -0,0 +1,112 @@
+-- Test prepared transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test that decoding happens at PREPARE time when two-phase-commit is enabled.
+-- Decoding after COMMIT PREPARED must have all the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check 'CLUSTER' (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The
+-- call should return within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test savepoints and sub-xacts. Creating savepoints will create
+-- sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that a GID containing "_nodecode" gets decoded at commit prepared time.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/twophase_stream.sql b/contrib/test_decoding/sql/twophase_stream.sql
new file mode 100644
index 0000000..e9dd44f
--- /dev/null
+++ b/contrib/test_decoding/sql/twophase_stream.sql
@@ -0,0 +1,45 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK TO s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED and the other changes in the transaction
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with
+-- filtered gid. gids with '_nodecode' will not be decoded at prepare time.
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..11df519 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,13 +67,24 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool two_phase);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool two_phase);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 
+/* helper functions for decoding transactions */
+static inline bool FilterPrepare(LogicalDecodingContext *ctx, const char *gid);
+static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx,
+							  XLogRecordBuffer *buf, Oid dbId,
+							  RepOriginId origin_id);
+
 /*
  * Take every XLogReadRecord()ed record and perform the actions required to
  * decode it using the output plugin already setup in the logical decoding
@@ -244,6 +255,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		two_phase = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +265,15 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
+					two_phase = !(FilterPrepare(ctx, parsed.twophase_gid));
+
+				DecodeCommit(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +282,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		two_phase = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +292,15 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED && ctx->twophase)
+					two_phase = !(FilterPrepare(ctx, parsed.twophase_gid));
+
+				DecodeAbort(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,37 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* check that output plugin is capable of two-phase decoding */
+				if (!ctx->twophase)
+				{
+					ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+					break;
+				}
+
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (FilterPrepare(ctx, parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -520,6 +569,23 @@ DecodeHeapOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	}
 }
 
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+static inline bool
+FilterPrepare(LogicalDecodingContext *ctx, const char *gid)
+{
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (ctx->callbacks.filter_prepare_cb == NULL)
+		return false;
+
+	return filter_prepare_cb_wrapper(ctx, gid);
+}
+
 static inline bool
 FilterByOrigin(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -582,10 +648,15 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'two_phase' indicates that caller wants to process the transaction in two
+ * phases, first process prepare if not already done and then process
+ * commit_prepared.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool two_phase)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -606,15 +677,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * the reorderbuffer to forget the content of the (sub-)transactions
 	 * if not.
 	 *
-	 * There can be several reasons we might not be interested in this
-	 * transaction:
-	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
-	 * 2) The transaction happened in another database.
-	 * 3) The output plugin is not interested in the origin.
-	 * 4) We are doing fast-forwarding
-	 *
 	 * We can't just use ReorderBufferAbort() here, because we need to execute
 	 * the transaction's invalidations.  This currently won't be needed if
 	 * we're just skipping over the transaction because currently we only do
@@ -627,9 +689,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * relevant syscaches.
 	 * ---
 	 */
-	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
-		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
-		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
 	{
 		for (i = 0; i < parsed->nsubxacts; i++)
 		{
@@ -647,34 +707,165 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Send the final commit record if the transaction data is already
+	 * decoded, otherwise, process the entire transaction.
+	 */
+	if (two_phase)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ *
+ * Note that we could check for concurrent abort and skip prepare here.
+ * We don't do that because it is quite possible that we had already
+ * sent some changes before we detect the abort, in which case we need
+ * to abort those changes in the subscriber.
+ * To abort such changes, we send the prepare and then the rollback
+ * prepared, which is what happened on the publisher side as well.
+ * We could invent a new abort API wherein in such cases we send abort and
+ * skip sending prepare and rollback prepared, but that is not
+ * straightforward because we might have streamed this transaction by that
+ * time, in which case it is handled when the rollback is encountered. It is
+ * not impossible to optimize the concurrent abort case, but it can introduce
+ * design complexity w.r.t. handling different cases, so we leave it for now
+ * as it doesn't seem worth it.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	SnapBuild  *builder = ctx->snapshot_builder;
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz prepare_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		prepare_time = parsed->origin_timestamp;
+
+	/*
+	 * Remember the prepare info for a txn so that it can be used later in
+	 * commit prepared if required. See ReorderBufferFinishPrepared.
+	 */
+	if (!ReorderBufferRememberPrepareInfo(ctx->reorder, xid, buf->origptr,
+										  buf->endptr, prepare_time, origin_id,
+										  origin_lsn))
+		return;
+
+	/* We can't start streaming unless a consistent state is reached. */
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
+	{
+		ReorderBufferSkipPrepare(ctx->reorder, xid);
+		return;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See
+	 * DecodeTXNNeedSkip for the reasons why we sometimes want to skip the
+	 * transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed, removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache
+	 * invalidations if there are any for the reasons mentioned in
+	 * DecodeCommit.
+	 */
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
+	{
+		ReorderBufferSkipPrepare(ctx->reorder, xid);
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/* Tell the reorderbuffer about the surviving subtransactions. */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'two_phase' indicates that we are finishing a prepared transaction.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool two_phase)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz abort_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool		skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		abort_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See
+	 * DecodeTXNNeedSkip for the reasons why we sometimes want to skip the
+	 * transaction.
+	 */
+	skip_xact = DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id);
+
+	/*
+	 * Send the final rollback record for a prepared transaction unless we
+	 * need to skip it. For non-two-phase xacts, simply forget the xact.
+	 */
+	if (two_phase && !skip_xact)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									abort_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
 	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
@@ -1080,3 +1271,24 @@ DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tuple)
 	header->t_infomask2 = xlhdr.t_infomask2;
 	header->t_hoff = xlhdr.t_hoff;
 }
+
+/*
+ * Check whether we are interested in this specific transaction.
+ *
+ * There can be several reasons we might not be interested in this
+ * transaction:
+ * 1) We might not be interested in decoding transactions up to this
+ *	  LSN. This can happen because we previously decoded it and now just
+ *	  are restarting or if we haven't assembled a consistent snapshot yet.
+ * 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
+ * 4) We are doing fast-forwarding
+ */
+static bool
+DecodeTXNNeedSkip(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+				  Oid txn_dbid, RepOriginId origin_id)
+{
+	return (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+			(txn_dbid != InvalidOid && txn_dbid != ctx->slot->data.database) ||
+			ctx->fast_forward || FilterByOrigin(ctx, origin_id));
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 6b0a59e..bbefb68 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1516,12 +1523,18 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or decoding them at PREPARE. Keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots.
+ *
+ * We additionally remove tuplecids after decoding the transaction at prepare
+ * time as we only need to perform invalidation at rollback or commit prepared.
+ *
+ * 'txn_prepared' indicates that we have decoded the transaction at prepare
+ * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1540,7 +1553,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1574,9 +1587,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1756,9 +1793,10 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * If the transaction was (partially) streamed, we need to commit it in a
- * 'streamed' way.  That is, we first stream the remaining part of the
- * transaction, and then invoke stream_commit message.
+ * If the transaction was (partially) streamed, we need to prepare or commit
+ * it in a 'streamed' way.  That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_prepare or stream_commit message as per
+ * the case.
  */
 static void
 ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1768,29 +1806,49 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		/*
+		 * Note, we send stream prepare even if a concurrent abort is
+		 * detected. See DecodePrepare for more information.
+		 */
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
+		 * just truncate txn by removing changes and tuplecids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
  * Set xid to detect concurrent aborts.
  *
- * While streaming an in-progress transaction there is a possibility that the
- * (sub)transaction might get aborted concurrently.  In such case if the
- * (sub)transaction has catalog update then we might decode the tuple using
- * wrong catalog version.  For example, suppose there is one catalog tuple with
- * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
- * and after that we will have two tuples (xmin: 500, xmax: 501) and
- * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
- * say 502 updates the same catalog tuple then the first tuple will be changed
- * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
- * the tuple inserted/updated in 501 after the catalog update, we will see the
- * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
- * consider that the tuple is deleted by xid 502 which is not visible to our
- * snapshot.  And when we will try to decode with that catalog tuple, it can
- * lead to a wrong result or a crash.  So, it is necessary to detect
- * concurrent aborts to allow streaming of in-progress transactions.
+ * While streaming an in-progress transaction or decoding a prepared
+ * transaction there is a possibility that the (sub)transaction might get
+ * aborted concurrently.  In such case if the (sub)transaction has catalog
+ * update then we might decode the tuple using wrong catalog version.  For
+ * example, suppose there is one catalog tuple with (xmin: 500, xmax: 0).  Now,
+ * the transaction 501 updates the catalog tuple and after that we will have
+ * two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).  Now, if 501 is
+ * aborted and some other transaction say 502 updates the same catalog tuple
+ * then the first tuple will be changed to (xmin: 500, xmax: 502).  So, the
+ * problem is that when we try to decode the tuple inserted/updated in 501
+ * after the catalog update, we will see the catalog tuple with (xmin: 500,
+ * xmax: 502) as visible because it will consider that the tuple is deleted by
+ * xid 502 which is not visible to our snapshot.  And when we will try to
+ * decode with that catalog tuple, it can lead to a wrong result or a crash.
+ * So, it is necessary to detect concurrent aborts to allow streaming of
+ * in-progress transactions or decoding of prepared transactions.
  *
  * For detecting the concurrent abort we set CheckXidAlive to the current
  * (sub)transaction's xid for which this change belongs to.  And, during
@@ -1799,7 +1857,10 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * and discard the already streamed changes on such an error.  We might have
  * already streamed some of the changes for the aborted (sub)transaction, but
  * that is fine because when we decode the abort we will stream abort message
- * to truncate the changes in the subscriber.
+ * to truncate the changes in the subscriber. Similarly, for prepared
+ * transactions, we stop decoding if concurrent abort is detected and then
+ * roll back the changes when rollback prepared is encountered. See
+ * DecodePrepare.
  */
 static inline void
 SetupCheckXidLive(TransactionId xid)
@@ -1901,7 +1962,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1913,15 +1974,19 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		specinsert = NULL;
 	}
 
-	/* Stop the stream. */
-	rb->stream_stop(rb, txn, last_lsn);
-
-	/* Remember the command ID and snapshot for the streaming run */
-	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	/*
+	 * For the streaming case, stop the stream and remember the command ID and
+	 * snapshot for the streaming run.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_stop(rb, txn, last_lsn);
+		ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	}
 }
 
 /*
- * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ * Helper function for ReorderBufferReplay and ReorderBufferStreamTXN.
  *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
@@ -1974,9 +2039,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		else
 			StartTransactionCommand();
 
-		/* We only need to send begin/commit for non-streamed transactions. */
+		/*
+		 * We only need to send begin/begin-prepare for non-streamed
+		 * transactions.
+		 */
 		if (!streaming)
-			rb->begin(rb, txn);
+		{
+			if (rbtxn_prepared(txn))
+				rb->begin_prepare(rb, txn);
+			else
+				rb->begin(rb, txn);
+		}
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -2007,8 +2080,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			prev_lsn = change->lsn;
 
-			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			/*
+			 * Set the current xid to detect concurrent aborts. This is
+			 * required for the cases when we decode the changes before the
+			 * COMMIT record is processed.
+			 */
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2299,7 +2376,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2333,15 +2419,22 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the four reasons: 1. Decoding an
+		 * in-progress txn. 2. Decoding a prepared txn. 3. Decoding of a
+		 * prepared txn that was (partially) streamed. 4. Decoding a committed
+		 * txn.
+		 *
+		 * For 1, we allow truncation of txn data by removing the changes
+		 * already streamed but still keeping other things like invalidations,
+		 * snapshot, and tuplecids. For 2 and 3, we tell
+		 * ReorderBufferTruncateTXN to do more elaborate truncation of txn
+		 * data as the entire transaction has been decoded except for commit.
+		 * For 4, as the entire txn has been decoded, we can fully clean up
+		 * the TXN reorder buffer.
 		 */
-		if (streaming)
+		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
-
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2374,17 +2467,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2414,26 +2510,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
  * transaction with ReorderBufferAssignChild.
  *
- * This interface is called once a toplevel commit is read for both streamed
- * as well as non-streamed transactions.
+ * This interface is called once a prepare or toplevel commit is read for both
+ * streamed as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferReplay(ReorderBufferTXN *txn,
+					ReorderBuffer *rb, TransactionId xid,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2463,7 +2552,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	if (txn->base_snapshot == NULL)
 	{
 		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+
+		/*
+		 * Removing this txn before a commit might result in the computation
+		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
+		 */
+		if (!rbtxn_prepared(txn))
+			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
 
@@ -2475,6 +2570,178 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Record the prepare information for a transaction.
+ */
+bool
+ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+								 TimestampTz prepare_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return false;
+
+	/*
+	 * Remember the prepare information to be later used by commit prepared in
+	 * case we skip doing prepare.
+	 */
+	txn->final_lsn = prepare_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = prepare_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	return true;
+}
+
+/* Remember that we have skipped prepare */
+void
+ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
+}
+
+/*
+ * Prepare a two-phase transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = pstrdup(gid);
+
+	/* The prepare info must have been updated in txn by now. */
+	Assert(txn->final_lsn != InvalidXLogRecPtr);
+
+	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
+						txn->commit_time, txn->origin_id, txn->origin_lsn);
+}
+
+/*
+ * This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time, RepOriginId origin_id,
+							XLogRecPtr origin_lsn, char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+	XLogRecPtr	prepare_end_lsn;
+	TimestampTz	prepare_time;
+
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * By this time the txn has the prepare record information, remember it to
+	 * be later used for rollback.
+	 */
+	prepare_end_lsn = txn->end_lsn;
+	prepare_time = txn->commit_time;
+
+	/* add the gid in the txn */
+	txn->gid = pstrdup(gid);
+
+	/*
+	 * It is possible that this transaction is not decoded at prepare time
+	 * either because by that time we didn't have a consistent snapshot or it
+	 * was decoded earlier but we have restarted. We can't distinguish between
+	 * those two cases so we send the prepare in both the cases and let
+	 * downstream decide whether to process or skip it. We don't need to
+	 * decode the xact for aborts if it is not done already.
+	 */
+	if (!rbtxn_prepared(txn) && is_commit)
+	{
+		txn->txn_flags |= RBTXN_PREPARE;
+
+		/*
+		 * The prepare info must have been updated in txn even if we skip
+		 * prepare.
+		 */
+		Assert(txn->final_lsn != InvalidXLogRecPtr);
+
+		/*
+		 * By this time the txn has the prepare record information and it is
+		 * important to use that so that downstream gets the accurate
+		 * information. If we instead passed commit information here,
+		 * downstream could behave as if it had already replayed commit
+		 * prepared after the restart.
+		 */
+		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
+							txn->commit_time, txn->origin_id, txn->origin_lsn);
+	}
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	if (is_commit)
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else
+		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2606,6 +2873,39 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ *
+ * Note that this is a special-purpose function for prepared transactions where
+ * we don't want to clean up the TXN even when we decide to skip it. See
+ * DecodePrepare.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 9d5d68f..dc3ef74 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -834,6 +834,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
 			continue;
 
+		/*
+		 * We don't need to add snapshot to prepared transactions as they
+		 * should not see the new catalog contents.
+		 */
+		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+			continue;
+
 		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
 			 txn->xid, (uint32) (lsn >> 32), (uint32) lsn);
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1e60afe..6d63338 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -174,6 +174,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_SKIPPED_PREPARE	  0x0100
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +235,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* prepare for this transaction skipped? */
+#define rbtxn_skip_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -258,10 +272,11 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	first_lsn;
 
 	/* ----
-	 * LSN of the record that lead to this xact to be committed or
+	 * LSN of the record that led to this xact being prepared or committed or
 	 * aborted. This can be a
 	 * * plain commit record
 	 * * plain commit record, of a parent transaction
+	 * * prepared transaction
 	 * * prepared transaction commit
 	 * * plain abort record
 	 * * prepared transaction abort
@@ -293,7 +308,8 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	origin_lsn;
 
 	/*
-	 * Commit time, only known when we read the actual commit record.
+	 * Commit or Prepare time, only known when we read the actual commit or
+	 * prepare record.
 	 */
 	TimestampTz commit_time;
 
@@ -625,12 +641,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -644,10 +666,17 @@ void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr l
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
 											   SharedInvalidationMessage *invalidations);
 void		ReorderBufferProcessXid(ReorderBuffer *, TransactionId xid, XLogRecPtr lsn);
+
 void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLogRecPtr lsn);
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
+											 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+											 TimestampTz prepare_time,
+											 RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
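
For anyone skimming the reorderbuffer.h and snapbuild.c hunks above, here is a minimal standalone sketch of how the two new flag bits and their test macros gate snapshot distribution. The flag values and macro names are copied from the patch. DemoTXN, demo_wants_new_snapshot() and demo_run() are hypothetical stand-ins made up for this illustration, not PostgreSQL code, and the real ReorderBufferTXN carries far more state than a flags word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as they appear in the reorderbuffer.h hunk. */
#define RBTXN_PREPARE			0x0080
#define RBTXN_SKIPPED_PREPARE	0x0100

/* Hypothetical stand-in for ReorderBufferTXN; only the flags are modelled. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
} DemoTXN;

/* Same shape as the macros added by the patch. */
#define rbtxn_prepared(txn) \
	(((txn)->txn_flags & RBTXN_PREPARE) != 0)
#define rbtxn_skip_prepared(txn) \
	(((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0)

/*
 * Mirrors the new "continue" test in SnapBuildDistributeNewCatalogSnapshot():
 * a transaction that is prepared, or whose prepare was skipped, does not
 * receive the new catalog snapshot.
 */
bool
demo_wants_new_snapshot(const DemoTXN *txn)
{
	return !(rbtxn_prepared(txn) || rbtxn_skip_prepared(txn));
}

/* Exercise the sketch; returns 0 when every check holds. */
int
demo_run(void)
{
	DemoTXN		plain = {0};
	DemoTXN		prepared = {RBTXN_PREPARE};
	DemoTXN		skipped = {RBTXN_SKIPPED_PREPARE};

	assert(demo_wants_new_snapshot(&plain));
	assert(!demo_wants_new_snapshot(&prepared));
	assert(!demo_wants_new_snapshot(&skipped));
	return 0;
}
```

The two bits are independent, so a single txn_flags word can record either state without disturbing the existing streaming and toast flags.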

Attachment: v34-0003-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From f50e2f6e76427c59495cec0a212a463b81fc70a8 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 01:39:44 -0500
Subject: [PATCH v34] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3874939..4f75e85 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -924,30 +926,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -955,7 +948,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -970,7 +963,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1045,6 +1038,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1
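
To make the intent of the refactoring above easier to see, here is a rough standalone sketch of the pattern: the spool replay lives in one helper that takes the xid and the final LSN and returns the number of changes applied, and each message handler becomes a thin caller. Everything prefixed demo_ or Demo is hypothetical and exists only for this illustration. In particular demo_handle_stream_prepare() is just a guess at what "another place" in a later patch might look like, and the in-memory table stands in for the real spool files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DemoXid;
typedef uint64_t DemoLsn;

/* Hypothetical in-memory stand-in for a per-transaction spool file. */
typedef struct DemoSpool
{
	DemoXid		xid;
	int			nchanges;
} DemoSpool;

static const DemoSpool demo_spools[] = {{100, 3}, {101, 0}};

/* Plays the role of remote_final_lsn in worker.c. */
static DemoLsn demo_remote_final_lsn = 0;

/* Common spool processing; returns how many changes were applied. */
int
demo_apply_spooled_messages(DemoXid xid, DemoLsn lsn)
{
	int			nchanges = 0;
	size_t		i;

	demo_remote_final_lsn = lsn;

	for (i = 0; i < sizeof(demo_spools) / sizeof(demo_spools[0]); i++)
	{
		int			j;

		if (demo_spools[i].xid != xid)
			continue;

		/* A real worker would dispatch each spooled change here. */
		for (j = 0; j < demo_spools[i].nchanges; j++)
			nchanges++;
	}

	return nchanges;
}

/* Existing caller: passes the commit LSN. */
int
demo_handle_stream_commit(DemoXid xid, DemoLsn commit_lsn)
{
	return demo_apply_spooled_messages(xid, commit_lsn);
}

/* Hypothetical second caller: passes a prepare LSN instead. */
int
demo_handle_stream_prepare(DemoXid xid, DemoLsn prepare_lsn)
{
	return demo_apply_spooled_messages(xid, prepare_lsn);
}

DemoLsn
demo_get_remote_final_lsn(void)
{
	return demo_remote_final_lsn;
}

/* Exercise the sketch; returns 0 when every check holds. */
int
demo_run(void)
{
	assert(demo_handle_stream_commit(100, 42) == 3);
	assert(demo_get_remote_final_lsn() == 42);
	assert(demo_handle_stream_prepare(101, 7) == 0);
	assert(demo_get_remote_final_lsn() == 7);
	return 0;
}
```

Passing the LSN as a plain argument, rather than a LogicalRepCommitData, is what lets a non-commit caller reuse the helper.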

Attachment: v34-0007-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 8d253599ba48e109a67265d0eb10c531e272981d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 04:09:22 -0500
Subject: [PATCH v34] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber tests (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test (streaming enabled)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
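
A note on the expected counts in the streaming tests above, since 3334 and 3335 look like magic numbers at first glance. The following is only a small model of the workload written for this note, not part of the patch. It assumes test_tab holds keys 1 and 2 before each test and that the UPDATE rewrites column b only, so it never changes which keys exist.

```python
# Model of the 2PC workload used by the streaming tests.
# Assumption: test_tab starts with keys 1 and 2, and the UPDATE only
# rewrites column b, so it has no effect on the set of keys.
initial_keys = {1, 2}
inserted_keys = set(range(3, 5001))           # generate_series(3, 5000)
after_insert = initial_keys | inserted_keys   # keys 1..5000

# DELETE FROM test_tab WHERE mod(a,3) = 0
after_delete = {a for a in after_insert if a % 3 != 0}

print(len(after_delete))             # rows visible once COMMIT PREPARED is applied
print(len(after_delete | {99999}))   # plus the single-row INSERT done outside the 2PC tx
print(5 in after_delete)             # key 5 is not a multiple of 3, so it is kept
```

This prints 3334, 3335 and True. Key 99999 is itself a multiple of 3, but it is inserted by a separate transaction after the DELETE inside the prepared transaction has already run, so it is not removed. Likewise the outside "DELETE ... WHERE a = 5" finds nothing to delete because row 5 is not yet committed at that point, which is why the count stays at 3334 in that test.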
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are only 2 rows in the table (from the previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
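
As a reading aid for the v34-0008 patch that follows, here is a minimal standalone sketch of two rules it applies on the subscriber side: the tablesync worker never requests two_phase, and the ", two_phase 'on'" option is only appended when the publisher reports version 140000 or later. The names SketchSubscription, sketch_effective_twophase and sketch_append_twophase_option are hypothetical and exist only for illustration; the real logic lives in ApplyWorkerMain() and libpqrcv_startstreaming() and uses StringInfo rather than a fixed buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the subscription row; not the real struct. */
typedef struct SketchSubscription
{
	bool		twophase;		/* value of the two_phase option, default false */
} SketchSubscription;

/* Only the apply worker may request two_phase; tablesync never does. */
static bool
sketch_effective_twophase(const SketchSubscription *sub, bool is_tablesync_worker)
{
	return sub->twophase && !is_tablesync_worker;
}

/*
 * Append ", two_phase 'on'" to the option string only when it was requested
 * and the publisher is v14 (140000) or newer. Returns true if appended.
 * The fixed-size buffer is an assumption made to keep the sketch standalone.
 */
static bool
sketch_append_twophase_option(char *cmd, size_t cmdsize,
							  bool twophase, int server_version)
{
	const char *opt = ", two_phase 'on'";

	if (!twophase || server_version < 140000)
		return false;
	if (strlen(cmd) + strlen(opt) + 1 > cmdsize)
		return false;			/* not enough room; leave cmd untouched */
	strcat(cmd, opt);
	return true;
}
```

Calling sketch_effective_twophase() with is_tablesync_worker set to true yields false even when the subscription has two_phase = on, which mirrors the note in the commit message below.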

Attachment: v34-0008-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 8738b401b76591af115449af51f27d9587a09030 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 04:10:57 -0500
Subject: [PATCH v34] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.

Note: The tablesync worker slot always has two_phase disabled, regardless of the option.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index db5e59f..dbe2a43 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -166,8 +166,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index ca78d39..886839e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -67,6 +67,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c21..5f4e191 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1149,7 +1149,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 1696454..b0745d5 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -64,7 +64,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -105,6 +106,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -210,6 +216,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -355,6 +370,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -379,7 +396,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -447,6 +465,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -720,6 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -730,7 +751,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -769,6 +791,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -787,7 +816,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -832,7 +862,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -875,7 +906,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 24f8b3e..1f404cd 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4f57a8a..2ec526b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2795,6 +2795,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		(!am_tablesync_worker() && newsub->twophase != MySubscription->twophase) ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3441,6 +3442,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase && !am_tablesync_worker();
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7cf2951..7e42a70 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 8b1e5cc..a7079bc 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4222,6 +4222,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4265,9 +4266,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4288,6 +4294,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4313,6 +4320,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4381,6 +4390,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index d7f77f1..3dae9ce 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -630,6 +630,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 14150d0..47306a2 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5997,7 +5997,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6023,13 +6023,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3fa02af..e07eed0 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -53,6 +53,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -90,6 +92,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 13ea3b7..4f5aec9 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39..f96c891 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 2fa9bce..23d876e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,42 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 14fa0b2..2a0b366 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -147,6 +147,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

v34-0009-Support-2PC-consistent-snapshot-isolation-tests.patch (application/octet-stream)
From 91d7dc2ecd2f7388b2b8c0a8122ef4546442b824 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 04:12:53 -0500
Subject: [PATCH v34] Support 2PC consistent snapshot isolation tests.

Added isolation test-case to test that if a consistent snapshot is created
between a PREPARE and a COMMIT PREPARED, then the whole transaction is decoded
on COMMIT PREPARED.
---
 contrib/test_decoding/Makefile                     |  3 +-
 .../test_decoding/expected/twophase_snapshot.out   | 43 +++++++++++++++++++
 contrib/test_decoding/specs/twophase_snapshot.spec | 49 ++++++++++++++++++++++
 3 files changed, 94 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/twophase_snapshot.out
 create mode 100644 contrib/test_decoding/specs/twophase_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 76d4a69..c5e28ce 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
+	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
+	twophase_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/twophase_snapshot.out b/contrib/test_decoding/expected/twophase_snapshot.out
new file mode 100644
index 0000000..53aaf01
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase_snapshot.out
@@ -0,0 +1,43 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s2b s2txid s1init s3b s3txid s2alter s2c s4b s4insert s4prepare s3c s1insert s1checkpoint s1start s4commit s1start
+step s2b: BEGIN;
+step s2txid: SELECT pg_current_xact_id() IS NULL;
+?column?       
+
+f              
+step s1init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); <waiting ...>
+step s3b: BEGIN;
+step s3txid: SELECT pg_current_xact_id() IS NULL;
+?column?       
+
+f              
+step s2alter: ALTER TABLE do_write ADD COLUMN addedbys2 int;
+step s2c: COMMIT;
+step s4b: BEGIN;
+step s4insert: INSERT INTO do_write DEFAULT VALUES;
+step s4prepare: PREPARE TRANSACTION 'test1';
+step s3c: COMMIT;
+step s1init: <... completed>
+?column?       
+
+init           
+step s1insert: INSERT INTO do_write DEFAULT VALUES;
+step s1checkpoint: CHECKPOINT;
+step s1start: SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');
+data           
+
+BEGIN          
+table public.do_write: INSERT: id[integer]:2 addedbys2[integer]:null
+COMMIT         
+step s4commit: COMMIT PREPARED 'test1';
+step s1start: SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');
+data           
+
+BEGIN          
+table public.do_write: INSERT: id[integer]:1 addedbys2[integer]:null
+PREPARE TRANSACTION 'test1'
+COMMIT PREPARED 'test1'
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/twophase_snapshot.spec b/contrib/test_decoding/specs/twophase_snapshot.spec
new file mode 100644
index 0000000..505e5e3
--- /dev/null
+++ b/contrib/test_decoding/specs/twophase_snapshot.spec
@@ -0,0 +1,49 @@
+# Test decoding of two-phase transactions during the build of a consistent snapshot.
+setup
+{
+    DROP TABLE IF EXISTS do_write;
+    CREATE TABLE do_write(id serial primary key);
+}
+
+teardown
+{
+    DROP TABLE do_write;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1init" {SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');}
+step "s1start" {SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');}
+step "s1insert" { INSERT INTO do_write DEFAULT VALUES; }
+step "s1checkpoint" { CHECKPOINT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2b" { BEGIN; }
+step "s2txid" { SELECT pg_current_xact_id() IS NULL; }
+step "s2alter" { ALTER TABLE do_write ADD COLUMN addedbys2 int; }
+step "s2c" { COMMIT; }
+
+
+session "s3"
+setup { SET synchronous_commit=on; }
+
+step "s3b" { BEGIN; }
+step "s3txid" { SELECT pg_current_xact_id() IS NULL; }
+step "s3c" { COMMIT; }
+
+session "s4"
+setup { SET synchronous_commit=on; }
+
+step "s4b" { BEGIN; }
+step "s4insert" { INSERT INTO do_write DEFAULT VALUES; }
+step "s4prepare" { PREPARE TRANSACTION 'test1'; }
+step "s4commit" { COMMIT PREPARED 'test1'; }
+
+# Force building of a consistent snapshot between a PREPARE and COMMIT PREPARED.
+# Ensure that the whole transaction is decoded fresh at the time of COMMIT PREPARED.
+permutation "s2b" "s2txid" "s1init" "s3b" "s3txid" "s2alter" "s2c" "s4b" "s4insert" "s4prepare" "s3c" "s1insert" "s1checkpoint" "s1start" "s4commit" "s1start"
-- 
1.8.3.1

v34-0010-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From 7a7da281edbbbc5acf642c81cd99822fbaa5b76c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 04:30:30 -0500
Subject: [PATCH v34] Support 2PC txn tests for concurrent aborts.

Add TAP tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  58 ++++++++++
 src/backend/replication/logical/reorderbuffer.c   |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c5e28ce..e0cd841 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -10,6 +10,8 @@ ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# Test logical decoding of two-phase (2PC) transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# Test logical decoding of streamed two-phase (2PC) transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+    PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 0576355..efe7f5c 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -174,6 +177,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -275,6 +279,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -471,6 +493,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -706,6 +755,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -918,6 +970,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -971,6 +1026,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bbefb68..1d43fcf 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2489,6 +2489,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
1.8.3.1
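
A note on the "check-xid-aborted" option parsing in the test_decoding.c hunk above: it calls strtoul() with a NULL end pointer and casts the result to TransactionId, so input such as "735abc" or a value wider than 32 bits would apparently be accepted or truncated rather than rejected. That may be acceptable for a test-only option. The following is a minimal standalone sketch of a stricter parse, not part of the patch. The names xid_t and parse_xid_strict are hypothetical; xid_t stands in for PostgreSQL's 32-bit TransactionId, and 0 stands in for InvalidTransactionId.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's TransactionId (a 32-bit unsigned integer). */
typedef uint32_t xid_t;

/*
 * Hypothetical helper: parse str as a transaction id.
 *
 * Returns true and stores the value in *result only if the entire string
 * is consumed, the value fits in 32 bits, and it is nonzero (0 plays the
 * role of InvalidTransactionId here). Base 0 is kept from the patch, so
 * hexadecimal and octal prefixes are accepted. strtoul() itself also
 * accepts leading whitespace and a sign; this sketch does not filter
 * those out.
 */
static bool
parse_xid_strict(const char *str, xid_t *result)
{
	char	   *end;
	unsigned long val;

	assert(result != NULL);

	if (str == NULL || *str == '\0')
		return false;

	errno = 0;
	val = strtoul(str, &end, 0);

	/* Reject overflow, no digits at all, or trailing garbage. */
	if (errno != 0 || end == str || *end != '\0')
		return false;

	/* Reject the invalid xid and anything that does not fit in 32 bits. */
	if (val == 0 || val > UINT32_MAX)
		return false;

	*result = (xid_t) val;
	return true;
}
```

With this helper, "735" and "0x10" parse successfully, while "735abc", "0", the empty string, and "99999999999" are rejected.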

#169Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#168)

On Wed, Dec 23, 2020 at 3:08 PM Ajin Cherian <itsajin@gmail.com> wrote:

Can you please update the patch for the points we agreed upon?

Changed and attached.

Thanks, I have looked at these patches again, and it seems patches 0001
to 0004 are in good shape; among those,
v33-0001-Extend-the-output-plugin-API-to-allow-decoding-o is good to
go. So, I am planning to push the first patch (0001*) sometime next
week unless you or someone else has any comments on it.

--
With Regards,
Amit Kapila.

#170osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#169)
RE: [HACKERS] logical decoding of two-phase transactions

Hi, Amit-San

On Thursday, Dec 24, 2020 2:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 23, 2020 at 3:08 PM Ajin Cherian <itsajin@gmail.com> wrote:

Can you please update the patch for the points we agreed upon?

Changed and attached.

Thanks, I have looked at these patches again and it seems patches 0001 to
0004 are in good shape, and among those
v33-0001-Extend-the-output-plugin-API-to-allow-decoding-o is good to go.
So, I am planning to push the first patch (0001*) sometime next week
unless you or someone else has any comments on it.

I agree with this from the perspective of memory-management code quality.

I reviewed the v33 patchset using valgrind and concluded that it has
no problems in terms of memory management.
The same conclusion should apply to v34, because the differences between the two versions are very small.

I compared the valgrind log files from master with those from master with the v33 patchset applied.
I ran the tests in both contrib/test_decoding and src/test/subscription under valgrind.

The first reason for this conclusion is that
I did not find any memcheck error reports in the log files.
I took the error message expressions listed in the valgrind documentation [1] and grepped the logs for them,
but there were no matches.

Secondly, I examined the call stacks reported for valgrind's three types of memory leak,
"Definitely lost", "Indirectly lost" and "Possibly lost", and
found that the patchset did not add any new cause of memory leak.
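
For readers who want to repeat this kind of check, the grep step can be sketched roughly as below. This is only an illustration, not the script used in the review: the function name find_memcheck_errors is hypothetical, and the pattern list is an assumed subset of the phrases in the Memcheck manual's error-message section.

```python
import re

# Phrases from the Memcheck manual's error-message section. This is an
# illustrative subset (an assumption), not the exact list used in the review.
MEMCHECK_PATTERNS = [
    r"Invalid read of size",
    r"Invalid write of size",
    r"Conditional jump or move depends on uninitialised value",
    r"Use of uninitialised value",
    r"points to uninitialised byte",
    r"Invalid free\(\)",
    r"Mismatched free\(\)",
    r"Source and destination overlap",
]
_MEMCHECK_RE = re.compile("|".join(MEMCHECK_PATTERNS))


def find_memcheck_errors(lines):
    """Return (line_number, text) pairs for lines that match a Memcheck phrase."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if _MEMCHECK_RE.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Hypothetical log excerpt; real valgrind logs would be read from files.
    sample = [
        "==123== Memcheck, a memory error detector\n",
        "==123== Invalid read of size 4\n",
        "==123== ERROR SUMMARY: 1 errors from 1 contexts\n",
    ]
    for number, text in find_memcheck_errors(sample):
        print(f"{number}: {text}")
```

An empty result for every log file corresponds to the "no matches" outcome described above. Leak records ("definitely lost" and so on) are left out of the pattern list on purpose, since those need a comparison against the master logs rather than a plain match.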

[1]: https://valgrind.org/docs/manual/mc-manual.html#mc-manual.errormsgs

Best Regards,
Takamichi Osumi

#171Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Ajin Cherian (#168)

Hi Ajin,

On Wed, Dec 23, 2020 at 6:38 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Dec 22, 2020 at 8:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 22, 2020 at 2:51 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Sat, Dec 19, 2020 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I have changed the rollback_prepare API as discussed above and
accordingly handle the case where rollback is received without prepare
in apply_handle_rollback_prepared.

I have reviewed and tested your new patchset. I agree with all the
changes that you have made, and I have tested quite a few scenarios,
which seem to be working as expected.
No major comments but some minor observations:

Patch 1:
logical.c: 984
Comment should be "rollback prepared" rather than "abort prepared".

Agreed.

Changed.

Patch 2:
decode.c: 737: The comments in the header of DecodePrepare seem out of
place, I think here it should describe what the function does rather
than what it does not.

Hmm, I have written it because it is important to explain the theory
of concurrent aborts as that is not quite obvious. Also, the
functionality is quite similar to DecodeCommit and the comments inside
the function explain clearly if there is any difference, so I am not sure
what more we can write. Do you have any suggestions?

I have slightly re-worded it. Have a look.

reorderbuffer.c: 2422: It looks like pgindent has mangled the
comments, the numbering is no longer aligned.

Yeah, I had also noticed that but not sure if there is a better
alternative because we don't want to change it after each pgindent
run. We might want to use (a), (b) .. notation instead but otherwise,
there is no big problem with how it is.

Leaving this as is.

Patch 5:
worker.c: 753: Typo: change "dont" to "don't"

Okay.

Changed.

Patch 6: logicaldecoding.sgml
logicaldecoding example is no longer correct. This was true prior to
the changes done to replay prepared transactions after a restart. Now
the whole transaction will get decoded again after the commit
prepared.

postgres=# COMMIT PREPARED 'test_prepared1';
COMMIT PREPARED
postgres=# select * from
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'two-phase-commit', '1');
    lsn    | xid |                    data
-----------+-----+--------------------------------------------
 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
(1 row)

Agreed.

Changed.

Patch 8:
worker.c: 2798 :
worker.c: 3445 : disabling two-phase in tablesync worker.
considering new design of multiple commits in tablesync, do we need
to disable two-phase in tablesync?

No, but let Peter's patch get committed then we can change it.

OK, leaving it.

Can you please update the patch for the points we agreed upon?

Changed and attached.

Thank you for updating the patches!

I realized that this patch is not registered yet for the next
CommitFest[1] that starts in a couple of days. I found the old entry
of this patch[2] but it's marked as "Returned with feedback". Although
this patch is being reviewed actively, I suggest adding it before
2021-01-01 AoE[3] so that cfbot can also test your patch.

Regards,

[1]: https://commitfest.postgresql.org/31/
[2]: https://commitfest.postgresql.org/22/944/
[3]: https://en.wikipedia.org/wiki/Anywhere_on_Earth

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#172Ajin Cherian
itsajin@gmail.com
In reply to: Masahiko Sawada (#171)
1 attachment(s)

Hi Sawada-san,

I think Amit has a plan to commit this patch-set in phases. I will
leave it to him to decide because I think he has a plan.
I took time to refactor the test_decoding isolation test for
consistent snapshot so that it uses just 3 sessions rather than 4.
Posting an updated patch-0009.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v34-0009-Support-2PC-consistent-snapshot-isolation-tests.patch (application/octet-stream)
From cd2b4d0f8506840c0247edd18181c1aba74f636d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 29 Dec 2020 02:18:36 -0500
Subject: [PATCH v34] Support 2PC consistent snapshot isolation tests.

Added isolation test-case to test that if a consistent snapshot is created
between a PREPARE and a COMMIT PREPARED, then the whole transaction is decoded
on COMMIT PREPARED.
---
 contrib/test_decoding/Makefile                     |  3 +-
 .../test_decoding/expected/twophase_snapshot.out   | 43 +++++++++++++++++++++
 contrib/test_decoding/specs/twophase_snapshot.spec | 44 ++++++++++++++++++++++
 3 files changed, 89 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/twophase_snapshot.out
 create mode 100644 contrib/test_decoding/specs/twophase_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 76d4a69..c5e28ce 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
+	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
+	twophase_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/twophase_snapshot.out b/contrib/test_decoding/expected/twophase_snapshot.out
new file mode 100644
index 0000000..0d38958
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase_snapshot.out
@@ -0,0 +1,43 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2b s2txid s1init s3b s3txid s2alter s2c s2b s2insert s2prepare s3c s1insert s1checkpoint s1start s2commit s1start
+step s2b: BEGIN;
+step s2txid: SELECT pg_current_xact_id() IS NULL;
+?column?       
+
+f              
+step s1init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); <waiting ...>
+step s3b: BEGIN;
+step s3txid: SELECT pg_current_xact_id() IS NULL;
+?column?       
+
+f              
+step s2alter: ALTER TABLE do_write ADD COLUMN addedbys2 int;
+step s2c: COMMIT;
+step s2b: BEGIN;
+step s2insert: INSERT INTO do_write DEFAULT VALUES;
+step s2prepare: PREPARE TRANSACTION 'test1';
+step s3c: COMMIT;
+step s1init: <... completed>
+?column?       
+
+init           
+step s1insert: INSERT INTO do_write DEFAULT VALUES;
+step s1checkpoint: CHECKPOINT;
+step s1start: SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');
+data           
+
+BEGIN          
+table public.do_write: INSERT: id[integer]:2 addedbys2[integer]:null
+COMMIT         
+step s2commit: COMMIT PREPARED 'test1';
+step s1start: SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');
+data           
+
+BEGIN          
+table public.do_write: INSERT: id[integer]:1 addedbys2[integer]:null
+PREPARE TRANSACTION 'test1'
+COMMIT PREPARED 'test1'
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/twophase_snapshot.spec b/contrib/test_decoding/specs/twophase_snapshot.spec
new file mode 100644
index 0000000..8856b27
--- /dev/null
+++ b/contrib/test_decoding/specs/twophase_snapshot.spec
@@ -0,0 +1,44 @@
+# Test decoding of two-phase transactions during the build of a consistent snapshot.
+setup
+{
+    DROP TABLE IF EXISTS do_write;
+    CREATE TABLE do_write(id serial primary key);
+}
+
+teardown
+{
+    DROP TABLE do_write;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1init" {SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');}
+step "s1start" {SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');}
+step "s1insert" { INSERT INTO do_write DEFAULT VALUES; }
+step "s1checkpoint" { CHECKPOINT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2b" { BEGIN; }
+step "s2txid" { SELECT pg_current_xact_id() IS NULL; }
+step "s2alter" { ALTER TABLE do_write ADD COLUMN addedbys2 int; }
+step "s2c" { COMMIT; }
+step "s2insert" { INSERT INTO do_write DEFAULT VALUES; }
+step "s2prepare" { PREPARE TRANSACTION 'test1'; }
+step "s2commit" { COMMIT PREPARED 'test1'; }
+
+
+session "s3"
+setup { SET synchronous_commit=on; }
+
+step "s3b" { BEGIN; }
+step "s3txid" { SELECT pg_current_xact_id() IS NULL; }
+step "s3c" { COMMIT; }
+
+# Force building of a consistent snapshot between a PREPARE and COMMIT PREPARED.
+# Ensure that the whole transaction is decoded fresh at the time of COMMIT PREPARED.
+permutation "s2b" "s2txid" "s1init" "s3b" "s3txid" "s2alter" "s2c" "s2b" "s2insert" "s2prepare" "s3c" "s1insert" "s1checkpoint" "s1start" "s2commit" "s1start"
-- 
1.8.3.1

#173Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#172)

On Tue, Dec 29, 2020 at 3:15 PM Ajin Cherian <itsajin@gmail.com> wrote:

Hi Sawada-san,

I think Amit has a plan to commit this patch-set in phases.

I have pushed the first patch and I would like to make a few changes
in the second patch after which I will post the new version. I'll try
to do that tomorrow if possible and register the patch.

I will
leave it to him to decide because I think he has a plan.
I took time to refactor the test_decoding isolation test for
consistent snapshot so that it uses just 3 sessions rather than 4.
Posting an updated patch-0009.

Thanks, I will look into this.

--
With Regards,
Amit Kapila.

#174Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#173)
8 attachment(s)

On Wed, Dec 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 29, 2020 at 3:15 PM Ajin Cherian <itsajin@gmail.com> wrote:

Hi Sawada-san,

I think Amit has a plan to commit this patch-set in phases.

I have pushed the first patch and I would like to make a few changes
in the second patch after which I will post the new version. I'll try
to do that tomorrow if possible and register the patch.

Please find attached a rebased version of this patch-set. I have made
a number of changes in
v35-0001-Allow-decoding-at-prepare-time-in-ReorderBuffer:

1. Centralized the logic to decide whether to perform decoding at
prepare time in the FilterPrepare function.
2. Changed the comments atop DecodePrepare. I didn't much like the
comments as changed by Ajin in the last patch.
3. Merged the doc changes patch after some changes, mostly cosmetic.

I am planning to commit the first patch in this series early next week
after reading it once more.

--
With Regards,
Amit Kapila.

Attachments:

v35-0001-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch (application/octet-stream)
From 5b564481ae243a62cb213e168628c82bc2510d52 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 01:19:55 -0500
Subject: [PATCH v35 1/8] Allow decoding at prepare time in ReorderBuffer.

This patch allows PREPARE-time decoding of two-phase transactions (if the
output plugin supports this capability), in which case the transactions
are replayed at PREPARE and then committed later when COMMIT PREPARED
arrives.

Now that we decode the changes before the commit, concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We detect such failures with a special sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK introduced by commit 7259736a6e and stop
decoding the remaining changes. Then we roll back the changes when rollback
prepared is encountered.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, Arseny Sher, and Dilip Kumar
Tested-by: Takamichi Osumi
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 contrib/test_decoding/expected/twophase.out        | 235 +++++++++++
 contrib/test_decoding/expected/twophase_stream.out | 147 +++++++
 contrib/test_decoding/sql/twophase.sql             | 112 ++++++
 contrib/test_decoding/sql/twophase_stream.sql      |  45 +++
 doc/src/sgml/logicaldecoding.sgml                  | 104 ++++-
 src/backend/replication/logical/decode.c           | 286 ++++++++++++--
 src/backend/replication/logical/logical.c          |   9 -
 src/backend/replication/logical/reorderbuffer.c    | 432 +++++++++++++++++----
 src/backend/replication/logical/snapbuild.c        |   7 +
 src/include/replication/reorderbuffer.h            |  33 +-
 11 files changed, 1296 insertions(+), 116 deletions(-)
 create mode 100644 contrib/test_decoding/expected/twophase.out
 create mode 100644 contrib/test_decoding/expected/twophase_stream.out
 create mode 100644 contrib/test_decoding/sql/twophase.sql
 create mode 100644 contrib/test_decoding/sql/twophase_stream.sql

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a4c76f..76d4a69 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,7 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate stream stats
+	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
 
diff --git a/contrib/test_decoding/expected/twophase.out b/contrib/test_decoding/expected/twophase.out
new file mode 100644
index 0000000..f9f6bed
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase.out
@@ -0,0 +1,235 @@
+-- Test prepared transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+-- Test that decoding happens at PREPARE time when two-phase-commit is enabled.
+-- Decoding after COMMIT PREPARED must have all the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+(4 rows)
+
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:1
+ table public.test_prepared1: INSERT: id[integer]:2
+ PREPARE TRANSACTION 'test_prepared#1'
+ COMMIT PREPARED 'test_prepared#1'
+(5 rows)
+
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:3
+ PREPARE TRANSACTION 'test_prepared#2'
+(3 rows)
+
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                data                 
+-------------------------------------
+ ROLLBACK PREPARED 'test_prepared#2'
+(1 row)
+
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation     | locktype |        mode         
+-----------------+----------+---------------------
+ test_prepared_1 | relation | RowExclusiveLock
+ test_prepared_1 | relation | AccessExclusiveLock
+(2 rows)
+
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+(3 rows)
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                        data                        
+----------------------------------------------------
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:5
+ COMMIT
+(3 rows)
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                  data                                   
+-------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:4 data[text]:'frakbar'
+ PREPARE TRANSACTION 'test_prepared#3'
+ COMMIT PREPARED 'test_prepared#3'
+(4 rows)
+
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                
+--------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:6 data[text]:null
+ COMMIT
+ BEGIN
+ table public.test_prepared2: INSERT: id[integer]:7
+ COMMIT
+(6 rows)
+
+-- Check 'CLUSTER' (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+    relation    | locktype |        mode         
+----------------+----------+---------------------
+ test_prepared1 | relation | RowExclusiveLock
+ test_prepared1 | relation | ShareLock
+ test_prepared1 | relation | AccessExclusiveLock
+(3 rows)
+
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The
+-- call should return within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+(4 rows)
+
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                   data                                    
+---------------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:8 data[text]:'othercol'
+ table public.test_prepared1: INSERT: id[integer]:9 data[text]:'othercol2'
+ PREPARE TRANSACTION 'test_prepared_lock'
+ COMMIT PREPARED 'test_prepared_lock'
+(5 rows)
+
+-- Test savepoints and sub-xacts. Creating savepoints will create
+-- sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+(3 rows)
+
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                            data                            
+------------------------------------------------------------
+ BEGIN
+ table public.test_prepared_savepoint: INSERT: a[integer]:1
+ PREPARE TRANSACTION 'test_prepared_savepoint'
+ COMMIT PREPARED 'test_prepared_savepoint'
+(4 rows)
+
+-- Test that a GID containing "_nodecode" gets decoded at commit prepared time.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+                                data                                 
+---------------------------------------------------------------------
+ BEGIN
+ table public.test_prepared1: INSERT: id[integer]:20 data[text]:null
+ COMMIT
+(3 rows)
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/expected/twophase_stream.out b/contrib/test_decoding/expected/twophase_stream.out
new file mode 100644
index 0000000..3acc4acd3
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase_stream.out
@@ -0,0 +1,147 @@
+-- Test streaming of two-phase commits
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+----------
+ init
+(1 row)
+
+CREATE TABLE stream_test(data text);
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+ data 
+------
+(0 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK TO s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+ opening a streamed block for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ streaming change for transaction
+ closing a streamed block for transaction
+ preparing streamed transaction 'test1'
+(24 rows)
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED and the other changes in the transaction
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ PREPARE TRANSACTION 'test1'
+ COMMIT PREPARED 'test1'
+(23 rows)
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with
+-- filtered gid. gids with '_nodecode' will not be decoded at prepare time.
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+ ?column? 
+----------
+ msg5
+(1 row)
+
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                           data                           
+----------------------------------------------------------
+ streaming message: transactional: 1 prefix: test, sz: 50
+(1 row)
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+                            data                             
+-------------------------------------------------------------
+ BEGIN
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa1'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa2'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa3'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa4'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa5'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa6'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa7'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa8'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa9'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa10'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa11'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa12'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa13'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa14'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa15'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa16'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa17'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa18'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa19'
+ table public.stream_test: INSERT: data[text]:'aaaaaaaaaa20'
+ COMMIT
+(22 rows)
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
+ pg_drop_replication_slot 
+--------------------------
+ 
+(1 row)
+
diff --git a/contrib/test_decoding/sql/twophase.sql b/contrib/test_decoding/sql/twophase.sql
new file mode 100644
index 0000000..894e4f5
--- /dev/null
+++ b/contrib/test_decoding/sql/twophase.sql
@@ -0,0 +1,112 @@
+-- Test prepared transactions. When two-phase-commit is enabled, transactions are
+-- decoded at PREPARE time rather than at COMMIT PREPARED time.
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE test_prepared1(id integer primary key);
+CREATE TABLE test_prepared2(id integer primary key);
+
+-- Test that decoding happens at PREPARE time when two-phase-commit is enabled.
+-- Decoding after COMMIT PREPARED must have all the commands in the transaction.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (1);
+INSERT INTO test_prepared1 VALUES (2);
+-- should show nothing because the xact has not been prepared yet.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+PREPARE TRANSACTION 'test_prepared#1';
+-- should show both the above inserts and the PREPARE TRANSACTION.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared#1';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that rollback of a prepared xact is decoded.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (3);
+PREPARE TRANSACTION 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+ROLLBACK PREPARED 'test_prepared#2';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test prepare of a xact containing ddl. Leaving xact uncommitted for next test.
+BEGIN;
+ALTER TABLE test_prepared1 ADD COLUMN data text;
+INSERT INTO test_prepared1 VALUES (4, 'frakbar');
+PREPARE TRANSACTION 'test_prepared#3';
+-- confirm that exclusive lock from the ALTER command is held on test_prepared1 table
+SELECT 'test_prepared_1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The insert should show the newly altered column but not the DDL.
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that we decode correctly while an uncommitted prepared xact
+-- with ddl exists.
+--
+-- Use a separate table for the concurrent transaction because the lock from
+-- the ALTER will stop us inserting into the other one.
+--
+INSERT INTO test_prepared2 VALUES (5);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+COMMIT PREPARED 'test_prepared#3';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+-- make sure stuff still works
+INSERT INTO test_prepared1 VALUES (6);
+INSERT INTO test_prepared2 VALUES (7);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Check 'CLUSTER' (as operation that hold exclusive lock) doesn't block
+-- logical decoding.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (8, 'othercol');
+CLUSTER test_prepared1 USING test_prepared1_pkey;
+INSERT INTO test_prepared1 VALUES (9, 'othercol2');
+PREPARE TRANSACTION 'test_prepared_lock';
+
+SELECT 'test_prepared1' AS relation, locktype, mode
+FROM pg_locks
+WHERE locktype = 'relation'
+  AND relation = 'test_prepared1'::regclass;
+-- The above CLUSTER command shouldn't cause a timeout on 2pc decoding. The
+-- call should return within a second.
+SET statement_timeout = '1s';
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+RESET statement_timeout;
+COMMIT PREPARED 'test_prepared_lock';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test savepoints and sub-xacts. Creating savepoints will create
+-- sub-xacts implicitly.
+BEGIN;
+CREATE TABLE test_prepared_savepoint (a int);
+INSERT INTO test_prepared_savepoint VALUES (1);
+SAVEPOINT test_savepoint;
+INSERT INTO test_prepared_savepoint VALUES (2);
+ROLLBACK TO SAVEPOINT test_savepoint;
+PREPARE TRANSACTION 'test_prepared_savepoint';
+-- should show only 1, not 2
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_savepoint';
+-- consume the commit
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test that a GID containing "_nodecode" gets decoded at commit prepared time.
+BEGIN;
+INSERT INTO test_prepared1 VALUES (20);
+PREPARE TRANSACTION 'test_prepared_nodecode';
+-- should show nothing
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+COMMIT PREPARED 'test_prepared_nodecode';
+-- should be decoded now
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- Test 8:
+-- cleanup and make sure results are also empty
+DROP TABLE test_prepared1;
+DROP TABLE test_prepared2;
+-- show results. There should be nothing to show
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
+
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/twophase_stream.sql b/contrib/test_decoding/sql/twophase_stream.sql
new file mode 100644
index 0000000..e9dd44f
--- /dev/null
+++ b/contrib/test_decoding/sql/twophase_stream.sql
@@ -0,0 +1,45 @@
+-- Test streaming of two-phase commits
+
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+
+CREATE TABLE stream_test(data text);
+
+-- consume DDL
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK TO s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1';
+-- should show the inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1';
+--should show the COMMIT PREPARED and the other changes in the transaction
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+-- streaming test with sub-transaction and PREPARE/COMMIT PREPARED but with
+-- filtered gid. gids with '_nodecode' will not be decoded at prepare time.
+BEGIN;
+SAVEPOINT s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+ROLLBACK to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+PREPARE TRANSACTION 'test1_nodecode';
+-- should NOT show inserts after a ROLLBACK
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+COMMIT PREPARED 'test1_nodecode';
+-- should show the inserts but not show a COMMIT PREPARED but a COMMIT
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
+
+DROP TABLE stream_test;
+SELECT pg_drop_replication_slot('regression_slot');
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index d63f90f..deb28a8 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -165,7 +165,58 @@ COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
-  </sect1>
+
+  <para>
+  The following example shows how the SQL interface can be used to decode
+  prepared transactions. Before you use two-phase commit commands, you must set
+  <varname>max_prepared_transactions</varname> to at least 1. You must also set
+  the option 'two-phase-commit' to 1 when calling
+  <function>pg_logical_slot_get_changes</function>. Note that the entire
+  transaction is streamed after the commit if it has not already been decoded.
+  </para>
+<programlisting>
+postgres=# BEGIN;
+postgres=*# INSERT INTO data(data) VALUES('5');
+postgres=*# PREPARE TRANSACTION 'test_prepared1';
+
+postgres=# SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/1689DC0 | 529 | BEGIN 529
+ 0/1689DC0 | 529 | table public.data: INSERT: id[integer]:3 data[text]:'5'
+ 0/1689FC0 | 529 | PREPARE TRANSACTION 'test_prepared1', txid 529
+(3 rows)
+
+postgres=# COMMIT PREPARED 'test_prepared1';
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                    data                    
+-----------+-----+--------------------------------------------
+ 0/1689DC0 | 529 | BEGIN 529
+ 0/1689DC0 | 529 | table public.data: INSERT: id[integer]:3 data[text]:'5'
+ 0/1689FC0 | 529 | PREPARE TRANSACTION 'test_prepared1', txid 529
+ 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
+(4 rows)
+
+postgres=# -- you can also roll back a prepared transaction
+postgres=# BEGIN;
+postgres=*# INSERT INTO data(data) VALUES('6');
+postgres=*# PREPARE TRANSACTION 'test_prepared2';
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                          data                           
+-----------+-----+---------------------------------------------------------
+ 0/168A180 | 530 | BEGIN 530
+ 0/168A1E8 | 530 | table public.data: INSERT: id[integer]:4 data[text]:'6'
+ 0/168A430 | 530 | PREPARE TRANSACTION 'test_prepared2', txid 530
+(3 rows)
+
+postgres=# ROLLBACK PREPARED 'test_prepared2';
+postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1');
+    lsn    | xid |                     data                     
+-----------+-----+----------------------------------------------
+ 0/168A4B8 | 530 | ROLLBACK PREPARED 'test_prepared2', txid 530
+(1 row)
+</programlisting>
+</sect1>
 
   <sect1 id="logicaldecoding-explanation">
    <title>Logical Decoding Concepts</title>
@@ -1126,4 +1177,55 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
    </para>
 
   </sect1>
+
+  <sect1 id="logicaldecoding-two-phase-commits">
+   <title>Two-phase commit support for Logical Decoding</title>
+
+   <para>
+    With the basic output plugin callbacks (e.g., <function>begin_cb</function>,
+    <function>change_cb</function>, <function>commit_cb</function> and
+    <function>message_cb</function>) two-phase commit commands like
+    <command>PREPARE TRANSACTION</command>, <command>COMMIT PREPARED</command>
+    and <command>ROLLBACK PREPARED</command> are not decoded. While the
+    <command>PREPARE TRANSACTION</command> is ignored,
+    <command>COMMIT PREPARED</command> is decoded as a <command>COMMIT</command>
+    and <command>ROLLBACK PREPARED</command> is decoded as a
+    <command>ROLLBACK</command>.
+   </para>
+
+   <para>
+    To support streaming of two-phase commands, an output plugin needs to provide
+    additional callbacks. Several two-phase commit callbacks are required
+    (<function>begin_prepare_cb</function>,
+    <function>prepare_cb</function>, <function>commit_prepared_cb</function>,
+    <function>rollback_prepared_cb</function> and
+    <function>stream_prepare_cb</function>), and one is optional
+    (<function>filter_prepare_cb</function>).
+   </para>
+
+   <para>
+    If the output plugin callbacks for decoding two-phase commit commands are
+    provided, then on <command>PREPARE TRANSACTION</command>, the changes of
+    that transaction are decoded, passed to the output plugin, and the
+    <function>prepare_cb</function> callback is invoked. This differs from the
+    basic decoding setup where changes are only passed to the output plugin
+    when a transaction is committed. The start of a prepared transaction is
+    indicated by the <function>begin_prepare_cb</function> callback.
+   </para>
+
+   <para>
+    When a prepared transaction is rolled back using
+    <command>ROLLBACK PREPARED</command>, the
+    <function>rollback_prepared_cb</function> callback is invoked; when the
+    prepared transaction is committed using <command>COMMIT PREPARED</command>,
+    the <function>commit_prepared_cb</function> callback is invoked.
+   </para>
+
+   <para>
+    Optionally, the output plugin can specify a name pattern in
+    <function>filter_prepare_cb</function>; transactions whose GID contains
+    that pattern will not be decoded as two-phase commit transactions.
+   </para>
+
+  </sect1>
  </chapter>
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee9..3c09738 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -67,13 +67,24 @@ static void DecodeMultiInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf
 static void DecodeSpecConfirm(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 
 static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						 xl_xact_parsed_commit *parsed, TransactionId xid);
+						 xl_xact_parsed_commit *parsed, TransactionId xid,
+						 bool two_phase);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-						xl_xact_parsed_abort *parsed, TransactionId xid);
+						xl_xact_parsed_abort *parsed, TransactionId xid,
+						bool two_phase);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+						  xl_xact_parsed_prepare *parsed);
+
 
 /* common function to decode tuples */
 static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
 
+/* helper functions for decoding transactions */
+static inline bool FilterPrepare(LogicalDecodingContext *ctx, const char *gid);
+static bool DecodeTXNNeedSkip(LogicalDecodingContext *ctx,
+							  XLogRecordBuffer *buf, Oid dbId,
+							  RepOriginId origin_id);
+
 /*
  * Take every XLogReadRecord()ed record and perform the actions required to
  * decode it using the output plugin already setup in the logical decoding
@@ -244,6 +255,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_commit *xlrec;
 				xl_xact_parsed_commit parsed;
 				TransactionId xid;
+				bool		two_phase = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
 				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -253,7 +265,15 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeCommit(ctx, buf, &parsed, xid);
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_COMMIT_PREPARED)
+					two_phase = !(FilterPrepare(ctx, parsed.twophase_gid));
+
+				DecodeCommit(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
 		case XLOG_XACT_ABORT:
@@ -262,6 +282,7 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				xl_xact_abort *xlrec;
 				xl_xact_parsed_abort parsed;
 				TransactionId xid;
+				bool		two_phase = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
 				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
@@ -271,7 +292,15 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				else
 					xid = parsed.twophase_xid;
 
-				DecodeAbort(ctx, buf, &parsed, xid);
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (info == XLOG_XACT_ABORT_PREPARED)
+					two_phase = !(FilterPrepare(ctx, parsed.twophase_gid));
+
+				DecodeAbort(ctx, buf, &parsed, xid, two_phase);
 				break;
 			}
 		case XLOG_XACT_ASSIGNMENT:
@@ -312,17 +341,30 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			}
 			break;
 		case XLOG_XACT_PREPARE:
+			{
+				xl_xact_parsed_prepare parsed;
+				xl_xact_prepare *xlrec;
 
-			/*
-			 * Currently decoding ignores PREPARE TRANSACTION and will just
-			 * decode the transaction when the COMMIT PREPARED is sent or
-			 * throw away the transaction's contents when a ROLLBACK PREPARED
-			 * is received. In the future we could add code to expose prepared
-			 * transactions in the changestream allowing for a kind of
-			 * distributed 2PC.
-			 */
-			ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
-			break;
+				/* ok, parse it */
+				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
+				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+								   xlrec, &parsed);
+
+				/*
+				 * We would like to process the transaction in a two-phase
+				 * manner iff output plugin supports two-phase commits and
+				 * doesn't filter the transaction at prepare time.
+				 */
+				if (FilterPrepare(ctx, parsed.twophase_gid))
+				{
+					ReorderBufferProcessXid(reorder, parsed.twophase_xid,
+											buf->origptr);
+					break;
+				}
+
+				DecodePrepare(ctx, buf, &parsed);
+				break;
+			}
 		default:
 			elog(ERROR, "unexpected RM_XACT_ID record type: %u", info);
 	}
@@ -520,6 +562,32 @@ DecodeHeapOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	}
 }
 
+/*
+ * Ask output plugin whether we want to skip this PREPARE and send
+ * this transaction as a regular commit later.
+ */
+static inline bool
+FilterPrepare(LogicalDecodingContext *ctx, const char *gid)
+{
+	/*
+	 * Skip if decoding of two-phase transactions at PREPARE time is not
+	 * enabled. In that case, all two-phase transactions are considered
+	 * filtered out and will be applied as regular transactions at COMMIT
+	 * PREPARED.
+	 */
+	if (!ctx->twophase)
+		return true;
+
+	/*
+	 * The filter_prepare callback is optional. When not supplied, all
+	 * prepared transactions should go through.
+	 */
+	if (ctx->callbacks.filter_prepare_cb == NULL)
+		return false;
+
+	return filter_prepare_cb_wrapper(ctx, gid);
+}
+
 static inline bool
 FilterByOrigin(LogicalDecodingContext *ctx, RepOriginId origin_id)
 {
@@ -582,10 +650,15 @@ DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 /*
  * Consolidated commit record handling between the different form of commit
  * records.
+ *
+ * 'two_phase' indicates that caller wants to process the transaction in two
+ * phases, first process prepare if not already done and then process
+ * commit_prepared.
  */
 static void
 DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			 xl_xact_parsed_commit *parsed, TransactionId xid)
+			 xl_xact_parsed_commit *parsed, TransactionId xid,
+			 bool two_phase)
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz commit_time = parsed->xact_time;
@@ -606,15 +679,6 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * the reorderbuffer to forget the content of the (sub-)transactions
 	 * if not.
 	 *
-	 * There can be several reasons we might not be interested in this
-	 * transaction:
-	 * 1) We might not be interested in decoding transactions up to this
-	 *	  LSN. This can happen because we previously decoded it and now just
-	 *	  are restarting or if we haven't assembled a consistent snapshot yet.
-	 * 2) The transaction happened in another database.
-	 * 3) The output plugin is not interested in the origin.
-	 * 4) We are doing fast-forwarding
-	 *
 	 * We can't just use ReorderBufferAbort() here, because we need to execute
 	 * the transaction's invalidations.  This currently won't be needed if
 	 * we're just skipping over the transaction because currently we only do
@@ -627,9 +691,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	 * relevant syscaches.
 	 * ---
 	 */
-	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
-		(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
-		ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
 	{
 		for (i = 0; i < parsed->nsubxacts; i++)
 		{
@@ -647,34 +709,163 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 								 buf->origptr, buf->endptr);
 	}
 
+	/*
+	 * Send the final commit record if the transaction data is already
+	 * decoded, otherwise, process the entire transaction.
+	 */
+	if (two_phase)
+	{
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									commit_time, origin_id, origin_lsn,
+									parsed->twophase_gid, true);
+	}
+	else
+	{
+		ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+							commit_time, origin_id, origin_lsn);
+	}
+
+	/*
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
+	 */
+	UpdateDecodingStats(ctx);
+}
+
+/*
+ * Decode PREPARE record. Similar logic as in DecodeCommit.
+ *
+ * Note that we don't skip prepare even if we have detected a concurrent abort
+ * because it is quite possible that we had already sent some changes before we
+ * detected the abort, in which case we need to abort those changes in the subscriber.
+ * To abort such changes, we do send the prepare and then the rollback prepared
+ * which is what happened on the publisher-side as well. Now, we can invent a
+ * new abort API wherein in such cases we send abort and skip sending prepared
+ * and rollback prepared but then it is not that straightforward because we
+ * might have streamed this transaction by that time in which case it is
+ * handled when the rollback is encountered. It is not impossible to optimize
+ * the concurrent abort case but it can introduce design complexity w.r.t
+ * handling different cases so leaving it for now as it doesn't seem worth it.
+ */
+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+			  xl_xact_parsed_prepare *parsed)
+{
+	SnapBuild  *builder = ctx->snapshot_builder;
+	XLogRecPtr	origin_lsn = parsed->origin_lsn;
+	TimestampTz prepare_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	int			i;
+	TransactionId xid = parsed->twophase_xid;
+
+	if (parsed->origin_timestamp != 0)
+		prepare_time = parsed->origin_timestamp;
+
+	/*
+	 * Remember the prepare info for a txn so that it can be used later in
+	 * commit prepared if required. See ReorderBufferFinishPrepared.
+	 */
+	if (!ReorderBufferRememberPrepareInfo(ctx->reorder, xid, buf->origptr,
+										  buf->endptr, prepare_time, origin_id,
+										  origin_lsn))
+		return;
+
+	/* We can't start streaming unless a consistent state is reached. */
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
+	{
+		ReorderBufferSkipPrepare(ctx->reorder, xid);
+		return;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See
+	 * DecodeTXNNeedSkip for the reasons why we sometimes want to skip the
+	 * transaction.
+	 *
+	 * We can't call ReorderBufferForget as we did in DecodeCommit as the txn
+	 * hasn't yet been committed, removing this txn before a commit might
+	 * result in the computation of an incorrect restart_lsn. See
+	 * SnapBuildProcessRunningXacts. But we need to process cache
+	 * invalidations if there are any for the reasons mentioned in
+	 * DecodeCommit.
+	 */
+	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
+	{
+		ReorderBufferSkipPrepare(ctx->reorder, xid);
+		ReorderBufferInvalidate(ctx->reorder, xid, buf->origptr);
+		return;
+	}
+
+	/* Tell the reorderbuffer about the surviving subtransactions. */
+	for (i = 0; i < parsed->nsubxacts; i++)
+	{
+		ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+								 buf->origptr, buf->endptr);
+	}
+
 	/* replay actions of all transaction + subtransactions in order */
-	ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
-						commit_time, origin_id, origin_lsn);
+	ReorderBufferPrepare(ctx->reorder, xid, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction commit/abort. It is not clear
-	 * that sending more or less frequently than this would be better.
+	 * Update the decoding stats at transaction prepare/commit/abort. It is
+	 * not clear that sending more or less frequently than this would be
+	 * better.
 	 */
 	UpdateDecodingStats(ctx);
 }
 
+
 /*
  * Get the data from the various forms of abort records and pass it on to
- * snapbuild.c and reorderbuffer.c
+ * snapbuild.c and reorderbuffer.c.
+ *
+ * 'two_phase' indicates that a prepared transaction is to be finished.
  */
 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-			xl_xact_parsed_abort *parsed, TransactionId xid)
+			xl_xact_parsed_abort *parsed, TransactionId xid,
+			bool two_phase)
 {
 	int			i;
+	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
+	TimestampTz abort_time = parsed->xact_time;
+	RepOriginId origin_id = XLogRecGetOrigin(buf->record);
+	bool		skip_xact;
 
-	for (i = 0; i < parsed->nsubxacts; i++)
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		origin_lsn = parsed->origin_lsn;
+		abort_time = parsed->origin_timestamp;
+	}
+
+	/*
+	 * Check whether we need to process this transaction. See
+	 * DecodeTXNNeedSkip for the reasons why we sometimes want to skip the
+	 * transaction.
+	 */
+	skip_xact = DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id);
+
+	/*
+	 * Send the final rollback record for a prepared transaction unless we
+	 * need to skip it. For non-two-phase xacts, simply forget the xact.
+	 */
+	if (two_phase && !skip_xact)
 	{
-		ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-						   buf->record->EndRecPtr);
+		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+									abort_time, origin_id, origin_lsn,
+									parsed->twophase_gid, false);
 	}
+	else
+	{
+		for (i = 0; i < parsed->nsubxacts; i++)
+		{
+			ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
+							   buf->record->EndRecPtr);
+		}
 
-	ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+		ReorderBufferAbort(ctx->reorder, xid, buf->record->EndRecPtr);
+	}
 
 	/* update the decoding stats */
 	UpdateDecodingStats(ctx);
@@ -1080,3 +1271,24 @@ DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tuple)
 	header->t_infomask2 = xlhdr.t_infomask2;
 	header->t_hoff = xlhdr.t_hoff;
 }
+
+/*
+ * Check whether we are interested in this specific transaction.
+ *
+ * There can be several reasons we might not be interested in this
+ * transaction:
+ * 1) We might not be interested in decoding transactions up to this
+ *	  LSN. This can happen because we previously decoded it and now just
+ *	  are restarting or if we haven't assembled a consistent snapshot yet.
+ * 2) The transaction happened in another database.
+ * 3) The output plugin is not interested in the origin.
+ * 4) We are doing fast-forwarding
+ */
+static bool
+DecodeTXNNeedSkip(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+				  Oid txn_dbid, RepOriginId origin_id)
+{
+	return (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+			(txn_dbid != InvalidOid && txn_dbid != ctx->slot->data.database) ||
+			ctx->fast_forward || FilterByOrigin(ctx, origin_id));
+}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6e3de92..5c2c404 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1083,15 +1083,6 @@ filter_prepare_cb_wrapper(LogicalDecodingContext *ctx, const char *gid)
 
 	Assert(!ctx->fast_forward);
 
-	/*
-	 * Skip if decoding of two-phase transactions at PREPARE time is not
-	 * enabled. In that case, all two-phase transactions are considered
-	 * filtered out and will be applied as regular transactions at COMMIT
-	 * PREPARED.
-	 */
-	if (!ctx->twophase)
-		return true;
-
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "filter_prepare";
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 6b0a59e..bbefb68 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+									 bool txn_prepared);
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
@@ -422,6 +423,12 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* free data that's contained */
 
+	if (txn->gid != NULL)
+	{
+		pfree(txn->gid);
+		txn->gid = NULL;
+	}
+
 	if (txn->tuplecid_hash != NULL)
 	{
 		hash_destroy(txn->tuplecid_hash);
@@ -1516,12 +1523,18 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }
 
 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them.  Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after
+ * streaming or decoding them at PREPARE. Keep the remaining info -
+ * transactions, tuplecids, invalidations and snapshots.
+ *
+ * We additionally remove tuplecids after decoding the transaction at prepare
+ * time as we only need to perform invalidation at rollback or commit prepared.
+ *
+ * 'txn_prepared' indicates that we have decoded the transaction at prepare
+ * time.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
 
@@ -1540,7 +1553,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(rbtxn_is_known_subxact(subtxn));
 		Assert(subtxn->nsubtxns == 0);
 
-		ReorderBufferTruncateTXN(rb, subtxn);
+		ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 	}
 
 	/* cleanup changes in the txn */
@@ -1574,9 +1587,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	 * about the toplevel xact (we send the XID in all messages), but we never
 	 * stream XIDs of empty subxacts.
 	 */
-	if ((!txn->toptxn) || (txn->nentries_mem != 0))
+	if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 		txn->txn_flags |= RBTXN_IS_STREAMED;
 
+	if (txn_prepared)
+	{
+		/*
+		 * If this is a prepared txn, cleanup the tuplecids we stored for
+		 * decoding catalog snapshot access. They are always stored in the
+		 * toplevel transaction.
+		 */
+		dlist_foreach_modify(iter, &txn->tuplecids)
+		{
+			ReorderBufferChange *change;
+
+			change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+			/* Check we're not mixing changes from different transactions. */
+			Assert(change->txn == txn);
+			Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+			/* Remove the change from its containing list. */
+			dlist_delete(&change->node);
+
+			ReorderBufferReturnChange(rb, change, true);
+		}
+	}
+
 	/*
 	 * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
 	 * memory. We could also keep the hash table and update it with new ctid
@@ -1756,9 +1793,10 @@ ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
 }
 
 /*
- * If the transaction was (partially) streamed, we need to commit it in a
- * 'streamed' way.  That is, we first stream the remaining part of the
- * transaction, and then invoke stream_commit message.
+ * If the transaction was (partially) streamed, we need to prepare or commit
+ * it in a 'streamed' way.  That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_prepare or stream_commit message as per
+ * the case.
  */
 static void
 ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1768,29 +1806,49 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	ReorderBufferStreamTXN(rb, txn);
 
-	rb->stream_commit(rb, txn, txn->final_lsn);
+	if (rbtxn_prepared(txn))
+	{
+		/*
+		 * Note, we send stream prepare even if a concurrent abort is
+		 * detected. See DecodePrepare for more information.
+		 */
+		rb->stream_prepare(rb, txn, txn->final_lsn);
 
-	ReorderBufferCleanupTXN(rb, txn);
+		/*
+		 * This is a PREPARED transaction, part of a two-phase commit. The
+		 * full cleanup will happen as part of the COMMIT PREPAREDs, so now
+		 * just truncate txn by removing changes and tuple_cids.
+		 */
+		ReorderBufferTruncateTXN(rb, txn, true);
+		/* Reset the CheckXidAlive */
+		CheckXidAlive = InvalidTransactionId;
+	}
+	else
+	{
+		rb->stream_commit(rb, txn, txn->final_lsn);
+		ReorderBufferCleanupTXN(rb, txn);
+	}
 }
 
 /*
  * Set xid to detect concurrent aborts.
  *
- * While streaming an in-progress transaction there is a possibility that the
- * (sub)transaction might get aborted concurrently.  In such case if the
- * (sub)transaction has catalog update then we might decode the tuple using
- * wrong catalog version.  For example, suppose there is one catalog tuple with
- * (xmin: 500, xmax: 0).  Now, the transaction 501 updates the catalog tuple
- * and after that we will have two tuples (xmin: 500, xmax: 501) and
- * (xmin: 501, xmax: 0).  Now, if 501 is aborted and some other transaction
- * say 502 updates the same catalog tuple then the first tuple will be changed
- * to (xmin: 500, xmax: 502).  So, the problem is that when we try to decode
- * the tuple inserted/updated in 501 after the catalog update, we will see the
- * catalog tuple with (xmin: 500, xmax: 502) as visible because it will
- * consider that the tuple is deleted by xid 502 which is not visible to our
- * snapshot.  And when we will try to decode with that catalog tuple, it can
- * lead to a wrong result or a crash.  So, it is necessary to detect
- * concurrent aborts to allow streaming of in-progress transactions.
+ * While streaming an in-progress transaction or decoding a prepared
+ * transaction there is a possibility that the (sub)transaction might get
+ * aborted concurrently.  In such case if the (sub)transaction has catalog
+ * update then we might decode the tuple using wrong catalog version.  For
+ * example, suppose there is one catalog tuple with (xmin: 500, xmax: 0).  Now,
+ * the transaction 501 updates the catalog tuple and after that we will have
+ * two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).  Now, if 501 is
+ * aborted and some other transaction say 502 updates the same catalog tuple
+ * then the first tuple will be changed to (xmin: 500, xmax: 502).  So, the
+ * problem is that when we try to decode the tuple inserted/updated in 501
+ * after the catalog update, we will see the catalog tuple with (xmin: 500,
+ * xmax: 502) as visible because it will consider that the tuple is deleted by
+ * xid 502 which is not visible to our snapshot.  And when we will try to
+ * decode with that catalog tuple, it can lead to a wrong result or a crash.
+ * So, it is necessary to detect concurrent aborts to allow streaming of
+ * in-progress transactions or decoding of prepared transactions.
  *
  * For detecting the concurrent abort we set CheckXidAlive to the current
  * (sub)transaction's xid for which this change belongs to.  And, during
@@ -1799,7 +1857,10 @@ ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
  * and discard the already streamed changes on such an error.  We might have
  * already streamed some of the changes for the aborted (sub)transaction, but
  * that is fine because when we decode the abort we will stream abort message
- * to truncate the changes in the subscriber.
+ * to truncate the changes in the subscriber. Similarly, for prepared
+ * transactions, we stop decoding if concurrent abort is detected and then
+ * rollback the changes when rollback prepared is encountered. See
+ * DecodePrepare.
  */
 static inline void
 SetupCheckXidLive(TransactionId xid)
@@ -1901,7 +1962,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					  ReorderBufferChange *specinsert)
 {
 	/* Discard the changes that we just streamed */
-	ReorderBufferTruncateTXN(rb, txn);
+	ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 
 	/* Free all resources allocated for toast reconstruction */
 	ReorderBufferToastReset(rb, txn);
@@ -1913,15 +1974,19 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		specinsert = NULL;
 	}
 
-	/* Stop the stream. */
-	rb->stream_stop(rb, txn, last_lsn);
-
-	/* Remember the command ID and snapshot for the streaming run */
-	ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	/*
+	 * For the streaming case, stop the stream and remember the command ID and
+	 * snapshot for the streaming run.
+	 */
+	if (rbtxn_is_streamed(txn))
+	{
+		rb->stream_stop(rb, txn, last_lsn);
+		ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
+	}
 }
 
 /*
- * Helper function for ReorderBufferCommit and ReorderBufferStreamTXN.
+ * Helper function for ReorderBufferReplay and ReorderBufferStreamTXN.
  *
  * Send data of a transaction (and its subtransactions) to the
  * output plugin. We iterate over the top and subtransactions (using a k-way
@@ -1974,9 +2039,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		else
 			StartTransactionCommand();
 
-		/* We only need to send begin/commit for non-streamed transactions. */
+		/*
+		 * We only need to send begin/begin-prepare for non-streamed
+		 * transactions.
+		 */
 		if (!streaming)
-			rb->begin(rb, txn);
+		{
+			if (rbtxn_prepared(txn))
+				rb->begin_prepare(rb, txn);
+			else
+				rb->begin(rb, txn);
+		}
 
 		ReorderBufferIterTXNInit(rb, txn, &iterstate);
 		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
@@ -2007,8 +2080,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 			prev_lsn = change->lsn;
 
-			/* Set the current xid to detect concurrent aborts. */
-			if (streaming)
+			/*
+			 * Set the current xid to detect concurrent aborts. This is
+			 * required for the cases when we decode the changes before the
+			 * COMMIT record is processed.
+			 */
+			if (streaming || rbtxn_prepared(change->txn))
 			{
 				curtxn = change->txn;
 				SetupCheckXidLive(curtxn->xid);
@@ -2299,7 +2376,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 		else
-			rb->commit(rb, txn, commit_lsn);
+		{
+			/*
+			 * Call either PREPARE (for two-phase transactions) or COMMIT (for
+			 * regular ones).
+			 */
+			if (rbtxn_prepared(txn))
+				rb->prepare(rb, txn, commit_lsn);
+			else
+				rb->commit(rb, txn, commit_lsn);
+		}
 
 		/* this is just a sanity check against bad output plugin behaviour */
 		if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
@@ -2333,15 +2419,22 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			RollbackAndReleaseCurrentSubTransaction();
 
 		/*
-		 * If we are streaming the in-progress transaction then discard the
-		 * changes that we just streamed, and mark the transactions as
-		 * streamed (if they contained changes). Otherwise, remove all the
-		 * changes and deallocate the ReorderBufferTXN.
+		 * We are here due to one of the four reasons: 1. Decoding an
+		 * in-progress txn. 2. Decoding a prepared txn. 3. Decoding of a
+		 * prepared txn that was (partially) streamed. 4. Decoding a committed
+		 * txn.
+		 *
+		 * For 1, we allow truncation of txn data by removing the changes
+		 * already streamed but still keeping other things like invalidations,
+		 * snapshot, and tuplecids. For 2 and 3, we indicate
+		 * ReorderBufferTruncateTXN to do more elaborate truncation of txn
+		 * data as the entire transaction has been decoded except for commit.
+		 * For 4, as the entire txn has been decoded, we can fully clean up
+		 * the TXN reorder buffer.
 		 */
-		if (streaming)
+		if (streaming || rbtxn_prepared(txn))
 		{
-			ReorderBufferTruncateTXN(rb, txn);
-
+			ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
 			/* Reset the CheckXidAlive */
 			CheckXidAlive = InvalidTransactionId;
 		}
@@ -2374,17 +2467,20 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 		/*
 		 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
-		 * abort of the (sub)transaction we are streaming. We need to do the
-		 * cleanup and return gracefully on this error, see SetupCheckXidLive.
+		 * abort of the (sub)transaction we are streaming or preparing. We
+		 * need to do the cleanup and return gracefully on this error, see
+		 * SetupCheckXidLive.
 		 */
 		if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
 		{
 			/*
-			 * This error can only occur when we are sending the data in
-			 * streaming mode and the streaming is not finished yet.
+			 * This error can occur either when we are sending the data in
+			 * streaming mode and the streaming is not finished yet or when we
+			 * are sending the data out on a PREPARE during a two-phase
+			 * commit.
 			 */
-			Assert(streaming);
-			Assert(stream_started);
+			Assert(streaming || rbtxn_prepared(txn));
+			Assert(stream_started || rbtxn_prepared(txn));
 
 			/* Cleanup the temporary error state. */
 			FlushErrorState();
@@ -2414,26 +2510,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
  * transaction with ReorderBufferAssignChild.
  *
- * This interface is called once a toplevel commit is read for both streamed
- * as well as non-streamed transactions.
+ * This interface is called once a prepare or toplevel commit is read for both
+ * streamed as well as non-streamed transactions.
  */
-void
-ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+static void
+ReorderBufferReplay(ReorderBufferTXN *txn,
+					ReorderBuffer *rb, TransactionId xid,
 					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 					TimestampTz commit_time,
 					RepOriginId origin_id, XLogRecPtr origin_lsn)
 {
-	ReorderBufferTXN *txn;
 	Snapshot	snapshot_now;
 	CommandId	command_id = FirstCommandId;
 
-	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
-								false);
-
-	/* unknown transaction, nothing to replay */
-	if (txn == NULL)
-		return;
-
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
 	txn->commit_time = commit_time;
@@ -2463,7 +2552,13 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	if (txn->base_snapshot == NULL)
 	{
 		Assert(txn->ninvalidations == 0);
-		ReorderBufferCleanupTXN(rb, txn);
+
+		/*
+		 * Removing this txn before a commit might result in the computation
+		 * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
+		 */
+		if (!rbtxn_prepared(txn))
+			ReorderBufferCleanupTXN(rb, txn);
 		return;
 	}
 
@@ -2475,6 +2570,178 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
+ * Commit a transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+					XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+					TimestampTz commit_time,
+					RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
+						origin_id, origin_lsn);
+}
+
+/*
+ * Record the prepare information for a transaction.
+ */
+bool
+ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
+								 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+								 TimestampTz prepare_time,
+								 RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return false;
+
+	/*
+	 * Remember the prepare information to be later used by commit prepared in
+	 * case we skip doing prepare.
+	 */
+	txn->final_lsn = prepare_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = prepare_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	return true;
+}
+
+/* Remember that we have skipped prepare */
+void
+ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
+}
+
+/*
+ * Prepare a two-phase transaction.
+ *
+ * See comments for ReorderBufferReplay().
+ */
+void
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+					 char *gid)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown transaction, nothing to replay */
+	if (txn == NULL)
+		return;
+
+	txn->txn_flags |= RBTXN_PREPARE;
+	txn->gid = pstrdup(gid);
+
+	/* The prepare info must have been updated in txn by now. */
+	Assert(txn->final_lsn != InvalidXLogRecPtr);
+
+	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
+						txn->commit_time, txn->origin_id, txn->origin_lsn);
+}
+
+/*
+ * This is used to handle COMMIT/ROLLBACK PREPARED.
+ */
+void
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+							TimestampTz commit_time, RepOriginId origin_id,
+							XLogRecPtr origin_lsn, char *gid, bool is_commit)
+{
+	ReorderBufferTXN *txn;
+	XLogRecPtr	prepare_end_lsn;
+	TimestampTz	prepare_time;
+
+	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);
+
+	/* unknown transaction, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * By this time the txn has the prepare record information, remember it to
+	 * be later used for rollback.
+	 */
+	prepare_end_lsn = txn->end_lsn;
+	prepare_time = txn->commit_time;
+
+	/* add the gid in the txn */
+	txn->gid = pstrdup(gid);
+
+	/*
+	 * It is possible that this transaction is not decoded at prepare time
+	 * either because by that time we didn't have a consistent snapshot or it
+	 * was decoded earlier but we have restarted. We can't distinguish between
+	 * those two cases so we send the prepare in both the cases and let
+	 * downstream decide whether to process or skip it. We don't need to
+	 * decode the xact for aborts if it is not done already.
+	 */
+	if (!rbtxn_prepared(txn) && is_commit)
+	{
+		txn->txn_flags |= RBTXN_PREPARE;
+
+		/*
+		 * The prepare info must have been updated in txn even if we skip
+		 * prepare.
+		 */
+		Assert(txn->final_lsn != InvalidXLogRecPtr);
+
+		/*
+		 * By this time the txn has the prepare record information and it is
+		 * important to use that so that downstream gets the accurate
+		 * information. If we instead passed the commit information here,
+		 * the downstream could behave as if it had already replayed commit
+		 * prepared after the restart.
+		 */
+		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
+							txn->commit_time, txn->origin_id, txn->origin_lsn);
+	}
+
+	txn->final_lsn = commit_lsn;
+	txn->end_lsn = end_lsn;
+	txn->commit_time = commit_time;
+	txn->origin_id = origin_id;
+	txn->origin_lsn = origin_lsn;
+
+	if (is_commit)
+		rb->commit_prepared(rb, txn, commit_lsn);
+	else
+		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
+
+	/* cleanup: make sure there's no cache pollution */
+	ReorderBufferExecuteInvalidations(txn->ninvalidations,
+									  txn->invalidations);
+	ReorderBufferCleanupTXN(rb, txn);
+}
+
+/*
  * Abort a transaction that possibly has previous changes. Needs to be first
  * called for subtransactions and then for the toplevel xid.
  *
@@ -2606,6 +2873,39 @@ ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
 }
 
 /*
+ * Invalidate cache for those transactions that need to be skipped just in case
+ * catalogs were manipulated as part of the transaction.
+ *
+ * Note that this is a special-purpose function for prepared transactions where
+ * we don't want to clean up the TXN even when we decide to skip it. See
+ * DecodePrepare.
+ */
+void
+ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
+{
+	ReorderBufferTXN *txn;
+
+	txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
+								false);
+
+	/* unknown, nothing to do */
+	if (txn == NULL)
+		return;
+
+	/*
+	 * Process cache invalidation messages if there are any. Even if we're not
+	 * interested in the transaction's contents, it could have manipulated the
+	 * catalog and we need to update the caches according to that.
+	 */
+	if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
+		ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
+										   txn->invalidations);
+	else
+		Assert(txn->ninvalidations == 0);
+}
+
+
+/*
  * Execute invalidations happening outside the context of a decoded
  * transaction. That currently happens either for xid-less commits
  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 9d5d68f..dc3ef74 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -834,6 +834,13 @@ SnapBuildDistributeNewCatalogSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, txn->xid))
 			continue;
 
+		/*
+		 * We don't need to add snapshot to prepared transactions as they
+		 * should not see the new catalog contents.
+		 */
+		if (rbtxn_prepared(txn) || rbtxn_skip_prepared(txn))
+			continue;
+
 		elog(DEBUG2, "adding a new snapshot to %u at %X/%X",
 			 txn->xid, (uint32) (lsn >> 32), (uint32) lsn);
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 1e60afe..6d63338 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -174,6 +174,8 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED         0x0010
 #define RBTXN_HAS_TOAST_INSERT    0x0020
 #define RBTXN_HAS_SPEC_INSERT     0x0040
+#define RBTXN_PREPARE             0x0080
+#define RBTXN_SKIPPED_PREPARE	  0x0100
 
 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +235,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )
 
+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Was the prepare for this transaction skipped? */
+#define rbtxn_skip_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -258,10 +272,11 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	first_lsn;
 
 	/* ----
-	 * LSN of the record that lead to this xact to be committed or
+	 * LSN of the record that led to this xact to be prepared or committed or
 	 * aborted. This can be a
 	 * * plain commit record
 	 * * plain commit record, of a parent transaction
+	 * * prepared transaction
 	 * * prepared transaction commit
 	 * * plain abort record
 	 * * prepared transaction abort
@@ -293,7 +308,8 @@ typedef struct ReorderBufferTXN
 	XLogRecPtr	origin_lsn;
 
 	/*
-	 * Commit time, only known when we read the actual commit record.
+	 * Commit or Prepare time, only known when we read the actual commit or
+	 * prepare record.
 	 */
 	TimestampTz commit_time;
 
@@ -625,12 +641,18 @@ void		ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+										TimestampTz commit_time,
+										RepOriginId origin_id, XLogRecPtr origin_lsn,
+										char *gid, bool is_commit);
 void		ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn);
 void		ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
 									 XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
 void		ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 void		ReorderBufferAbortOld(ReorderBuffer *, TransactionId xid);
 void		ReorderBufferForget(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+void		ReorderBufferInvalidate(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
 
 void		ReorderBufferSetBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
 void		ReorderBufferAddSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
@@ -644,10 +666,17 @@ void		ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr l
 void		ReorderBufferImmediateInvalidation(ReorderBuffer *, uint32 ninvalidations,
 											   SharedInvalidationMessage *invalidations);
 void		ReorderBufferProcessXid(ReorderBuffer *, TransactionId xid, XLogRecPtr lsn);
+
 void		ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLogRecPtr lsn);
 bool		ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool		ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
 
+bool		ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
+											 XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
+											 TimestampTz prepare_time,
+											 RepOriginId origin_id, XLogRecPtr origin_lsn);
+void		ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid);
+void		ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuffer *);
 TransactionId ReorderBufferGetOldestXmin(ReorderBuffer *rb);
 
-- 
1.8.3.1
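
For readers following the control flow of ReorderBufferFinishPrepared() in the patch above, here is a minimal standalone sketch of its decision logic. It is not the patch's code: SketchTXN, SketchTrace, trace_call and sketch_finish_prepared are hypothetical names invented for illustration, LSNs are plain integers with 0 standing in for InvalidXLogRecPtr, and the output plugin callbacks are replaced by a one-character trace ('P' for replaying the prepare, 'C' for commit_prepared, 'R' for rollback_prepared). Only the RBTXN_PREPARE value is taken from the reorderbuffer.h hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Flag value copied from the patch's reorderbuffer.h hunk. */
#define RBTXN_PREPARE 0x0080

/* Hypothetical, heavily reduced stand-in for ReorderBufferTXN. */
typedef struct SketchTXN
{
	uint32_t	txn_flags;
	uint64_t	final_lsn;	/* holds the prepare LSN until the finish record */
} SketchTXN;

#define sketch_prepared(txn) (((txn)->txn_flags & RBTXN_PREPARE) != 0)

/* Hypothetical trace: one character per simulated output-plugin callback. */
typedef struct SketchTrace
{
	char		calls[8];
	int			ncalls;
} SketchTrace;

static void
trace_call(SketchTrace *trace, char c)
{
	assert(trace->ncalls < (int) sizeof(trace->calls) - 1);
	trace->calls[trace->ncalls++] = c;
	trace->calls[trace->ncalls] = '\0';
}

/*
 * Mirrors the shape of the COMMIT/ROLLBACK PREPARED handling: if the
 * transaction was never decoded at prepare time and we are committing,
 * replay the prepare first (using the remembered prepare LSN), then emit
 * the finishing callback.  Aborts do not trigger a replay.
 */
static void
sketch_finish_prepared(SketchTXN *txn, SketchTrace *trace,
					   uint64_t commit_lsn, bool is_commit)
{
	if (!sketch_prepared(txn) && is_commit)
	{
		txn->txn_flags |= RBTXN_PREPARE;

		/* The prepare info must have been remembered earlier. */
		assert(txn->final_lsn != 0);
		trace_call(trace, 'P');
	}

	/* Only now overwrite the prepare info with the finish record's info. */
	txn->final_lsn = commit_lsn;
	trace_call(trace, is_commit ? 'C' : 'R');
}
```

With this sketch, a commit of a transaction whose prepare was skipped yields the trace "PC", a commit of an already-decoded prepared transaction yields "C", and a rollback yields "R" regardless of whether the prepare was decoded.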

v35-0002-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From e9b582054acfc45176faaab1f65d481a2d05335c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 01:39:44 -0500
Subject: [PATCH v35 2/8] Refactor spool-file logic in worker.c.

This patch is a pure refactoring that isolates the streaming spool-file
processing in a separate function. A later patch to support prepared
transaction apply will require this common processing logic to be called
from another place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3874939..4f75e85 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -924,30 +926,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -955,7 +948,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -970,7 +963,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1045,6 +1038,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v35-0003-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From 829b2f1ea04ad9d384cd01d4bf92fea5de4d988a Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 02:36:02 -0500
Subject: [PATCH v35 3/8] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replica origin replay progress for
2PC, but it was incomplete. It did not properly track the progress for
rollback prepared; in particular, it did not update the recovery code.
Additionally, we need to allow tracking it on subscriber nodes where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 873bf9b..fe10809 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2277,6 +2277,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2299,6 +2307,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9cd0b7c..b7470ce 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5720,8 +5720,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5927,7 +5926,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5976,6 +5976,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6017,7 +6024,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6025,7 +6033,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1
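
The redo path in the patch above calls replorigin_advance() with the
"backward" argument set to false. A minimal sketch of that forward-only
rule follows. SketchLsn, SketchOriginState and sketch_origin_advance are
hypothetical stand-ins rather than the real origin.c definitions, and only
the remote_lsn comparison is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t SketchLsn;

/* Toy replacement for the per-origin replication state. */
typedef struct SketchOriginState
{
    SketchLsn   remote_lsn;     /* last remote position recorded as applied */
} SketchOriginState;

/*
 * Record apply progress for an origin.  Unless go_backward is set, an
 * older remote_lsn never overwrites a newer one.  Returns true if the
 * stored position was overwritten.
 */
static bool
sketch_origin_advance(SketchOriginState *state, SketchLsn remote_lsn,
                      bool go_backward)
{
    assert(state != NULL);

    if (go_backward || state->remote_lsn < remote_lsn)
    {
        state->remote_lsn = remote_lsn;
        return true;
    }
    return false;
}
```

With go_backward false, replaying an abort-prepared record whose origin_lsn
is older than the stored position leaves the stored position untouched.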

v35-0004-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 1e273c48a6a2556fa59c0a24f7dfc0885e197331 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 02:38:34 -0500
Subject: [PATCH v35 4/8] Add support for apply at prepare time to built-in
 logical replication.

To support streaming transactions at prepare time in built-in logical
replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* Allow skipping a prepared transaction that is already prepared. We skip
only when the GID, origin_lsn, and origin_timestamp of the prepared xact
all match, to avoid matching a prepared xact from a different node. A
repeated prepare can arrive when the server or apply worker restarts after
a prepared transaction.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase, and
we don't have a replication connection open, so we don't know where to
send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
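
The skip rule described in the commit message above (match on GID,
origin_lsn and origin_timestamp together) can be sketched as below. This is
a toy model under stated assumptions: SketchPreparedXact and
sketch_lookup_gxact are hypothetical names, the array stands in for the
shared TwoPhaseState list, and no locking or 2PC file I/O is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of one prepared transaction. */
typedef struct SketchPreparedXact
{
    bool        valid;              /* false until the prepare is complete */
    const char *gid;                /* global transaction identifier */
    uint64_t    origin_lsn;         /* end LSN of the remote PREPARE */
    int64_t     origin_timestamp;   /* remote prepare time */
} SketchPreparedXact;

/*
 * Return true only if a valid entry matches the GID and both origin
 * fields.  A matching GID alone is not enough, because an unrelated
 * prepared xact may reuse the same GID.
 */
static bool
sketch_lookup_gxact(const SketchPreparedXact *xacts, size_t nxacts,
                    const char *gid, uint64_t prepare_end_lsn,
                    int64_t prepare_timestamp)
{
    assert(gid != NULL);

    for (size_t i = 0; i < nxacts; i++)
    {
        const SketchPreparedXact *x = &xacts[i];

        /* Ignore not-yet-valid entries and other GIDs. */
        if (!x->valid || strcmp(x->gid, gid) != 0)
            continue;

        if (x->origin_lsn == prepare_end_lsn &&
            x->origin_timestamp == prepare_timestamp)
            return true;
    }
    return false;
}
```

In this model, an apply worker that sees BEGIN PREPARE for an entry that
already matches on all three fields would skip the changes up to the
following PREPARE.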
 src/backend/access/transam/twophase.c       |  74 ++++++-
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 260 +++++++++++++++++++++-
 src/backend/replication/logical/worker.c    | 330 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 172 ++++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 ++++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 895 insertions(+), 40 deletions(-)
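
For readers following the proto.c changes below, here is a rough sketch of
the PREPARE message body as logicalrep_write_prepare() lays it out: one
flags byte, three 64-bit fields in network byte order, then the
NUL-terminated GID. The leading message type byte is left out because the
dispatcher consumes it before the reader runs. All names here
(SketchPrepareMsg, sketch_write_prepare, sketch_read_prepare,
SKETCH_GID_MAX and its value) are hypothetical; the real code uses
StringInfo and the pq_send*/pq_getmsg* helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GID_MAX 200      /* hypothetical buffer size for the GID */

typedef struct SketchPrepareMsg
{
    uint64_t    prepare_lsn;
    uint64_t    end_lsn;
    int64_t     prepare_time;
    char        gid[SKETCH_GID_MAX];
} SketchPrepareMsg;

/* Store a 64-bit value big-endian (network byte order). */
static size_t
put_u64(uint8_t *buf, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        buf[i] = (uint8_t) (v >> (56 - 8 * i));
    return 8;
}

/* Load a big-endian 64-bit value. */
static uint64_t
get_u64(const uint8_t *buf)
{
    uint64_t    v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | buf[i];
    return v;
}

/* Returns the number of bytes written, or 0 if the buffer is too small. */
static size_t
sketch_write_prepare(uint8_t *buf, size_t buflen, const SketchPrepareMsg *msg)
{
    size_t      gidlen;
    size_t      off = 0;

    assert(buf != NULL && msg != NULL);
    gidlen = strlen(msg->gid) + 1;  /* include the terminating NUL */

    if (buflen < 1 + 3 * 8 + gidlen)
        return 0;

    buf[off++] = 0;                 /* flags, always zero for now */
    off += put_u64(buf + off, msg->prepare_lsn);
    off += put_u64(buf + off, msg->end_lsn);
    off += put_u64(buf + off, (uint64_t) msg->prepare_time);
    memcpy(buf + off, msg->gid, gidlen);
    return off + gidlen;
}

/* Returns 0 on success, -1 on a malformed or oversized message. */
static int
sketch_read_prepare(const uint8_t *buf, size_t len, SketchPrepareMsg *msg)
{
    const uint8_t *gid;
    const uint8_t *nul;
    size_t      gidlen;

    assert(buf != NULL && msg != NULL);

    if (len < 1 + 3 * 8 + 1 || buf[0] != 0)
        return -1;                  /* too short, or unrecognized flags */

    msg->prepare_lsn = get_u64(buf + 1);
    msg->end_lsn = get_u64(buf + 9);
    msg->prepare_time = (int64_t) get_u64(buf + 17);

    gid = buf + 25;
    nul = memchr(gid, 0, len - 25);
    if (nul == NULL)
        return -1;                  /* GID is not NUL-terminated */
    gidlen = (size_t) (nul - gid) + 1;
    if (gidlen > sizeof(msg->gid))
        return -1;                  /* GID does not fit the destination */

    memcpy(msg->gid, gid, gidlen);
    return 0;
}
```

One design choice in the sketch: the reader checks the GID length against
the destination buffer before copying, whereas the readers in this patch
version copy with strcpy() into a pre-allocated buffer.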

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index fe10809..71cca00 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1133,9 +1133,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
@@ -2446,3 +2446,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts to be common, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 15ab8e7..dd33469 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -957,8 +957,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index fdb3118..1047385 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,264 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4f75e85..4f57a8a 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* true while skipping an already-prepared transaction */
+bool		skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,264 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/* The synchronization worker runs in single transaction. */
+	if (IsTransactionState() && !am_tablesync_worker())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1027,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it will
 	 * be committed after processing all the messages. We need the transaction
@@ -800,6 +1080,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -835,6 +1121,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1053,6 +1345,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1176,6 +1474,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1297,6 +1598,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1454,6 +1758,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1823,6 +2130,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1979,6 +2289,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 49d25b0..7cf2951 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,27 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -378,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -766,17 +835,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -857,6 +917,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1171,3 +1249,31 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 	while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
 		entry->replicate_valid = false;
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr	origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 2ca71c3..5afb977 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 1f2535d..13ea3b7 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure holds the
+ * information for both prepare and commit prepared messages. For commit
+ * prepared, prepare_lsn and preparetime store the commit LSN and the
+ * commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * (prepare_end_lsn and preparetime) is used to check whether the downstream
+ * has received this prepared transaction. If it has, it can apply the
+ * rollback; otherwise, it can skip the rollback operation. The gid alone is
+ * not sufficient because the downstream node can have a prepared transaction
+ * with the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 6d63338..4b92e68 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9cd047ba..ecba4ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1339,12 +1339,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
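
A side note on the LogicalRepRollbackPreparedTxnData comment in the patch above: it says the gid alone is not enough to decide whether the downstream should apply a ROLLBACK PREPARED, because an unrelated local prepared transaction may use the same identifier. The sketch below illustrates only that idea in standalone C. It is not how LookupGXact() is implemented; every Sketch* name, the array-based lookup, and the SKETCH_GIDSIZE value are assumptions made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for PostgreSQL's GIDSIZE, XLogRecPtr and TimestampTz. */
#define SKETCH_GIDSIZE 200
typedef uint64_t SketchLsn;
typedef int64_t SketchTimestamp;

/* What the downstream remembers about one locally prepared transaction. */
typedef struct SketchPreparedXact
{
	char		gid[SKETCH_GIDSIZE];
	SketchLsn	prepare_end_lsn;	/* upstream LSN at the end of PREPARE */
	SketchTimestamp prepare_time;	/* upstream PREPARE timestamp */
} SketchPreparedXact;

/*
 * Return true only if a prepared transaction with this gid exists AND it
 * carries the same upstream prepare end LSN and prepare timestamp. A gid
 * match alone is deliberately not treated as sufficient.
 */
static bool
sketch_should_apply_rollback(const SketchPreparedXact *xacts, size_t nxacts,
							 const char *gid, SketchLsn prepare_end_lsn,
							 SketchTimestamp prepare_time)
{
	for (size_t i = 0; i < nxacts; i++)
	{
		if (strcmp(xacts[i].gid, gid) == 0 &&
			xacts[i].prepare_end_lsn == prepare_end_lsn &&
			xacts[i].prepare_time == prepare_time)
			return true;
	}
	return false;
}
```

With this shape, a rollback message whose gid matches a local prepared transaction but whose prepare LSN or timestamp differs is skipped rather than applied, which is the behaviour the comment describes.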

v35-0005-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From f1e5e111e5726bdfc1a9c2b017ebf62c261f70cb Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 04:09:22 -0500
Subject: [PATCH v35 5/8] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v35-0006-Support-2PC-txn-Subscription-option.patch
From a734140f5fa0793fc3f034b15439b5231756689f Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 04:10:57 -0500
Subject: [PATCH v35 6/8] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.

Note: The tablesync worker slot always has two_phase disabled, regardless of the option.
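
The gating described above can be summarised in a small standalone sketch.
It is not code from the patch: the sketch_* names and the TwoPhaseDecision
enum are hypothetical, and the two constants only mirror the values this
patch uses (server version 140000 and LOGICALREP_PROTO_2PC_VERSION_NUM = 2).
It assumes the subscriber asks for two_phase only from a non-tablesync worker
talking to a v14+ publisher, and that the publisher then rejects the request
if the protocol version is too low or the plugin lacks 2PC support.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constants mirroring the values used in this patch. */
#define SKETCH_PROTO_2PC_VERSION_NUM 2
#define SKETCH_MIN_SERVER_VERSION 140000

/* Hypothetical result type; the real code raises ERROR instead. */
typedef enum
{
	TWOPHASE_OFF,			/* not requested: decode at COMMIT as usual */
	TWOPHASE_ON,			/* requested and allowed */
	TWOPHASE_ERR_PROTO,		/* protocol version too old */
	TWOPHASE_ERR_PLUGIN		/* output plugin lacks 2PC support */
} TwoPhaseDecision;

/*
 * Subscriber side: would the worker add "two_phase 'on'" to the
 * START_REPLICATION command?  Tablesync workers never request it.
 */
static bool
sketch_request_twophase(bool sub_twophase, bool is_tablesync_worker,
						int publisher_version)
{
	return sub_twophase &&
		!is_tablesync_worker &&
		publisher_version >= SKETCH_MIN_SERVER_VERSION;
}

/*
 * Publisher side: decide what to do with the request, in the same order
 * of checks as the pgoutput startup hunk in this patch.
 */
static TwoPhaseDecision
sketch_decide_twophase(bool requested, int proto_version,
					   bool plugin_supports_2pc)
{
	if (!requested)
		return TWOPHASE_OFF;
	if (proto_version < SKETCH_PROTO_2PC_VERSION_NUM)
		return TWOPHASE_ERR_PROTO;
	if (!plugin_supports_2pc)
		return TWOPHASE_ERR_PLUGIN;
	return TWOPHASE_ON;
}
```

For example, a tablesync worker on a subscription created WITH (two_phase = on)
would get false from sketch_request_twophase(), so the publisher never sees the
option for that slot.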
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index db5e59f..dbe2a43 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -166,8 +166,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index ca78d39..886839e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -67,6 +67,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c21..5f4e191 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1149,7 +1149,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 1696454..b0745d5 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -64,7 +64,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -105,6 +106,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -210,6 +216,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -355,6 +370,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -379,7 +396,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -447,6 +465,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -720,6 +739,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -730,7 +751,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -769,6 +791,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -787,7 +816,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -832,7 +862,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -875,7 +906,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 24f8b3e..1f404cd 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4f57a8a..2ec526b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2795,6 +2795,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		(!am_tablesync_worker() && newsub->twophase != MySubscription->twophase) ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3441,6 +3442,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase && !am_tablesync_worker();
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7cf2951..7e42a70 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 1ab98a2..fe90d28 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4222,6 +4222,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4265,9 +4266,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4288,6 +4294,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4313,6 +4320,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4381,6 +4390,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index d7f77f1..3dae9ce 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -630,6 +630,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 14150d0..47306a2 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5997,7 +5997,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6023,13 +6023,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 3fa02af..e07eed0 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -53,6 +53,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -90,6 +92,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 13ea3b7..4f5aec9 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39..f96c891 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 2fa9bce..23d876e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,42 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 14fa0b2..2a0b366 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -147,6 +147,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

Attachment: v35-0007-Support-2PC-consistent-snapshot-isolation-tests.patch (application/octet-stream)
From ffed6ab6f80abedaf9227ddb445741892501e8e7 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 29 Dec 2020 02:18:36 -0500
Subject: [PATCH v35 7/8] Support 2PC consistent snapshot isolation tests.

Added isolation test-case to test that if a consistent snapshot is created
between a PREPARE and a COMMIT PREPARED, then the whole transaction is decoded
on COMMIT PREPARED.
---
 contrib/test_decoding/Makefile                     |  3 +-
 .../test_decoding/expected/twophase_snapshot.out   | 43 +++++++++++++++++++++
 contrib/test_decoding/specs/twophase_snapshot.spec | 44 ++++++++++++++++++++++
 3 files changed, 89 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/twophase_snapshot.out
 create mode 100644 contrib/test_decoding/specs/twophase_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 76d4a69..c5e28ce 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
+	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
+	twophase_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/twophase_snapshot.out b/contrib/test_decoding/expected/twophase_snapshot.out
new file mode 100644
index 0000000..0d38958
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase_snapshot.out
@@ -0,0 +1,43 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2b s2txid s1init s3b s3txid s2alter s2c s2b s2insert s2prepare s3c s1insert s1checkpoint s1start s2commit s1start
+step s2b: BEGIN;
+step s2txid: SELECT pg_current_xact_id() IS NULL;
+?column?       
+
+f              
+step s1init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); <waiting ...>
+step s3b: BEGIN;
+step s3txid: SELECT pg_current_xact_id() IS NULL;
+?column?       
+
+f              
+step s2alter: ALTER TABLE do_write ADD COLUMN addedbys2 int;
+step s2c: COMMIT;
+step s2b: BEGIN;
+step s2insert: INSERT INTO do_write DEFAULT VALUES;
+step s2prepare: PREPARE TRANSACTION 'test1';
+step s3c: COMMIT;
+step s1init: <... completed>
+?column?       
+
+init           
+step s1insert: INSERT INTO do_write DEFAULT VALUES;
+step s1checkpoint: CHECKPOINT;
+step s1start: SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');
+data           
+
+BEGIN          
+table public.do_write: INSERT: id[integer]:2 addedbys2[integer]:null
+COMMIT         
+step s2commit: COMMIT PREPARED 'test1';
+step s1start: SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');
+data           
+
+BEGIN          
+table public.do_write: INSERT: id[integer]:1 addedbys2[integer]:null
+PREPARE TRANSACTION 'test1'
+COMMIT PREPARED 'test1'
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/twophase_snapshot.spec b/contrib/test_decoding/specs/twophase_snapshot.spec
new file mode 100644
index 0000000..8856b27
--- /dev/null
+++ b/contrib/test_decoding/specs/twophase_snapshot.spec
@@ -0,0 +1,44 @@
+# Test decoding of two-phase transactions during the build of a consistent snapshot.
+setup
+{
+    DROP TABLE IF EXISTS do_write;
+    CREATE TABLE do_write(id serial primary key);
+}
+
+teardown
+{
+    DROP TABLE do_write;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1init" {SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');}
+step "s1start" {SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');}
+step "s1insert" { INSERT INTO do_write DEFAULT VALUES; }
+step "s1checkpoint" { CHECKPOINT; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2b" { BEGIN; }
+step "s2txid" { SELECT pg_current_xact_id() IS NULL; }
+step "s2alter" { ALTER TABLE do_write ADD COLUMN addedbys2 int; }
+step "s2c" { COMMIT; }
+step "s2insert" { INSERT INTO do_write DEFAULT VALUES; }
+step "s2prepare" { PREPARE TRANSACTION 'test1'; }
+step "s2commit" { COMMIT PREPARED 'test1'; }
+
+
+session "s3"
+setup { SET synchronous_commit=on; }
+
+step "s3b" { BEGIN; }
+step "s3txid" { SELECT pg_current_xact_id() IS NULL; }
+step "s3c" { COMMIT; }
+
+# Force building of a consistent snapshot between a PREPARE and COMMIT PREPARED.
+# Ensure that the whole transaction is decoded fresh at the time of COMMIT PREPARED.
+permutation "s2b" "s2txid" "s1init" "s3b" "s3txid" "s2alter" "s2c" "s2b" "s2insert" "s2prepare" "s3c" "s1insert" "s1checkpoint" "s1start" "s2commit" "s1start"
-- 
1.8.3.1

Attachment: v35-0008-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From d28861b4c12bf6d51f2f67f025a93ffe4c6d8087 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Dec 2020 04:30:30 -0500
Subject: [PATCH v35 8/8] Support 2PC txn tests for concurrent aborts.

Add tap tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  58 ++++++++++
 src/backend/replication/logical/reorderbuffer.c   |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c5e28ce..e0cd841 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -10,6 +10,8 @@ ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 0576355..efe7f5c 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -174,6 +177,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -275,6 +279,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -471,6 +493,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -706,6 +755,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -918,6 +970,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -971,6 +1026,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bbefb68..1d43fcf 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2489,6 +2489,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
1.8.3.1
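
The check-xid-aborted handling in the test_decoding hunk above converts the
option string with strtoul() and rejects a value that is not a valid xid. A
minimal standalone sketch of that parsing step follows. It is an illustration
only, not code from the patch: parse_xid_option() is a hypothetical helper,
and INVALID_XID is a local stand-in for PostgreSQL's InvalidTransactionId.
Unlike the patch, which passes NULL as the end pointer, this sketch also
rejects trailing garbage and values that do not fit in 32 bits; that stricter
behaviour is an assumption of the sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's InvalidTransactionId (assumption of this sketch). */
#define INVALID_XID ((uint32_t) 0)

/*
 * Hypothetical helper, not part of the patch: parse an xid option value.
 * Returns true and stores the xid in *xid on success, false otherwise.
 */
static bool
parse_xid_option(const char *value, uint32_t *xid)
{
	char	   *end;
	unsigned long parsed;

	/* A missing or empty value is an error, as in the patch. */
	if (value == NULL || *value == '\0')
		return false;

	errno = 0;
	parsed = strtoul(value, &end, 0);	/* base 0, as in the patch */

	/* Reject range errors and trailing characters such as "12abc". */
	if (errno != 0 || *end != '\0')
		return false;

	/* Transaction ids are 32 bits wide. */
	if (parsed > UINT32_MAX)
		return false;

	/* Zero is the invalid xid and cannot be waited on. */
	if ((uint32_t) parsed == INVALID_XID)
		return false;

	*xid = (uint32_t) parsed;
	return true;
}
```

Because the base is 0, a hexadecimal value such as "0x10" is accepted as well
as a decimal one, which matches how the patch calls strtoul().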

#175Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#174)

On Thu, Dec 31, 2020 at 10:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 29, 2020 at 3:15 PM Ajin Cherian <itsajin@gmail.com> wrote:

Hi Sawada-san,

I think Amit has a plan to commit this patch-set in phases.

I have pushed the first patch and I would like to make a few changes
in the second patch after which I will post the new version. I'll try
to do that tomorrow if possible and register the patch.

Please find attached a rebased version of this patch-set.

Registered in CF (https://commitfest.postgresql.org/31/2914/).

--
With Regards,
Amit Kapila.

#176Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#174)

On Thu, Dec 31, 2020 at 4:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3. Merged the doc changes patch after some changes mostly cosmetic.

Some minor comments here:

v35-0001 - logicaldecoding.sgml

In the example section:
Change "The following example shows SQL interface can be used to
decode prepared transactions."
to "The following example shows the SQL interface that can be used to
decode prepared transactions."

Then in "Two-phase commit support for Logical Decoding" page:
Change "To support streaming of two-phase commands, an output plugin
needs to provide the additional callbacks."
to "To support streaming of two-phase commands, an output plugin needs
to provide additional callbacks."

Other than that, I have no more comments.

regards,
Ajin Cherian
Fujitsu Australia

#177Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#176)

On Thu, Dec 31, 2020 at 12:31 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Dec 31, 2020 at 4:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

3. Merged the doc changes patch after some changes mostly cosmetic.

Some minor comments here:

v35-0001 - logicaldecoding.sgml

In the example section:
Change "The following example shows SQL interface can be used to
decode prepared transactions."
to "The following example shows the SQL interface that can be used to
decode prepared transactions."

Then in "Two-phase commit support for Logical Decoding" page:
Change "To support streaming of two-phase commands, an output plugin
needs to provide the additional callbacks."
to "To support streaming of two-phase commands, an output plugin needs
to provide additional callbacks."

Other than that, I have no more comments.

Thanks, I have pushed the 0001* patch after making the above and a few
other cosmetic modifications.

--
With Regards,
Amit Kapila.

#178Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#172)

On Tue, Dec 29, 2020 at 3:15 PM Ajin Cherian <itsajin@gmail.com> wrote:

Hi Sawada-san,

I think Amit has a plan to commit this patch-set in phases. I will
leave it to him to decide because I think he has a plan.
I took time to refactor the test_decoding isolation test for
consistent snapshot so that it uses just 3 sessions rather than 4.
Posting an updated patch-0009

I have reviewed this test case patch and have the below comments:

1.
+step "s1checkpoint" { CHECKPOINT; }
...
+step "s2alter" { ALTER TABLE do_write ADD COLUMN addedbys2 int; }

I don't see the need for the above steps and we should be able to
generate the required scenario without these as well. Is there any
reason to keep those?

2.
"s3c""s1insert"

space is missing between these two.

3.
+# Force building of a consistent snapshot between a PREPARE and
COMMIT PREPARED.
+# Ensure that the whole transaction is decoded fresh at the time of
COMMIT PREPARED.
+permutation "s2b" "s2txid" "s1init" "s3b" "s3txid" "s2alter" "s2c"
"s2b" "s2insert" "s2prepare" "s3c""s1insert" "s1checkpoint" "s1start"
"s2commit" "s1start"

I think we can update the above comments to indicate how and which
important steps help us to realize the required scenario. See
subxact_without_top.spec for reference.

4.
+step "s2c" { COMMIT; }
...
+step "s2prepare" { PREPARE TRANSACTION 'test1'; }
+step "s2commit" { COMMIT PREPARED 'test1'; }

s2c and s2commit are confusing names, as both sound like they do the
same thing. How about changing s2commit to s2cp and s2prepare to
s2p?

--
With Regards,
Amit Kapila.

#179Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#178)
1 attachment(s)

On Tue, Jan 5, 2021 at 5:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have reviewed this test case patch and have the below comments:

1.
+step "s1checkpoint" { CHECKPOINT; }
...
+step "s2alter" { ALTER TABLE do_write ADD COLUMN addedbys2 int; }

I don't see the need for the above steps and we should be able to
generate the required scenario without these as well. Is there any
reason to keep those?

Removed.

2.
"s3c""s1insert"

space is missing between these two.

Updated.

3.
+# Force building of a consistent snapshot between a PREPARE and
COMMIT PREPARED.
+# Ensure that the whole transaction is decoded fresh at the time of
COMMIT PREPARED.
+permutation "s2b" "s2txid" "s1init" "s3b" "s3txid" "s2alter" "s2c"
"s2b" "s2insert" "s2prepare" "s3c""s1insert" "s1checkpoint" "s1start"
"s2commit" "s1start"

I think we can update the above comments to indicate how and which
important steps help us to realize the required scenario. See
subxact_without_top.spec for reference.

Added more comments to explain the state change of logical decoding.

4.
+step "s2c" { COMMIT; }
...
+step "s2prepare" { PREPARE TRANSACTION 'test1'; }
+step "s2commit" { COMMIT PREPARED 'test1'; }

s2c and s2commit are confusing names, as both sound like they do the
same thing. How about changing s2commit to s2cp and s2prepare to
s2p?

Updated.

I've addressed the above comments and the patch is attached. I've
called it v36-0007.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v36-0007-Support-2PC-consistent-snapshot-isolation-tests.patch (application/octet-stream)
From 201936ed1238951e8a13b67da2507e1ea3769537 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 5 Jan 2021 03:30:11 -0500
Subject: [PATCH v36] Support 2PC consistent snapshot isolation tests.

Added isolation test-case to test that if a consistent snapshot is created
between a PREPARE and a COMMIT PREPARED, then the whole transaction is decoded
on COMMIT PREPARED.
---
 contrib/test_decoding/Makefile                     |  3 +-
 .../test_decoding/expected/twophase_snapshot.out   | 41 +++++++++++++++++++
 contrib/test_decoding/specs/twophase_snapshot.spec | 47 ++++++++++++++++++++++
 3 files changed, 90 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/twophase_snapshot.out
 create mode 100644 contrib/test_decoding/specs/twophase_snapshot.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 76d4a69..c5e28ce 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,8 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream
+	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
+	twophase_snapshot
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/twophase_snapshot.out b/contrib/test_decoding/expected/twophase_snapshot.out
new file mode 100644
index 0000000..3b7f23b
--- /dev/null
+++ b/contrib/test_decoding/expected/twophase_snapshot.out
@@ -0,0 +1,41 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s2b s2txid s1init s3b s3txid s2c s2b s2insert s2p s3c s1insert s1start s2cp s1start
+step s2b: BEGIN;
+step s2txid: SELECT pg_current_xact_id() IS NULL;
+?column?       
+
+f              
+step s1init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); <waiting ...>
+step s3b: BEGIN;
+step s3txid: SELECT pg_current_xact_id() IS NULL;
+?column?       
+
+f              
+step s2c: COMMIT;
+step s2b: BEGIN;
+step s2insert: INSERT INTO do_write DEFAULT VALUES;
+step s2p: PREPARE TRANSACTION 'test1';
+step s3c: COMMIT;
+step s1init: <... completed>
+?column?       
+
+init           
+step s1insert: INSERT INTO do_write DEFAULT VALUES;
+step s1start: SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');
+data           
+
+BEGIN          
+table public.do_write: INSERT: id[integer]:2
+COMMIT         
+step s2cp: COMMIT PREPARED 'test1';
+step s1start: SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');
+data           
+
+BEGIN          
+table public.do_write: INSERT: id[integer]:1
+PREPARE TRANSACTION 'test1'
+COMMIT PREPARED 'test1'
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/twophase_snapshot.spec b/contrib/test_decoding/specs/twophase_snapshot.spec
new file mode 100644
index 0000000..1127d89
--- /dev/null
+++ b/contrib/test_decoding/specs/twophase_snapshot.spec
@@ -0,0 +1,47 @@
+# Test decoding of two-phase transactions during the build of a consistent snapshot.
+setup
+{
+    DROP TABLE IF EXISTS do_write;
+    CREATE TABLE do_write(id serial primary key);
+}
+
+teardown
+{
+    DROP TABLE do_write;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1init" {SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');}
+step "s1start" {SELECT data  FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'two-phase-commit', '1');}
+step "s1insert" { INSERT INTO do_write DEFAULT VALUES; }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2b" { BEGIN; }
+step "s2txid" { SELECT pg_current_xact_id() IS NULL; }
+step "s2c" { COMMIT; }
+step "s2insert" { INSERT INTO do_write DEFAULT VALUES; }
+step "s2p" { PREPARE TRANSACTION 'test1'; }
+step "s2cp" { COMMIT PREPARED 'test1'; }
+
+
+session "s3"
+setup { SET synchronous_commit=on; }
+
+step "s3b" { BEGIN; }
+step "s3txid" { SELECT pg_current_xact_id() IS NULL; }
+step "s3c" { COMMIT; }
+
+# 's1init' step will initialize the replication slot and cause logical decoding to wait
+# in initial starting point till the in-progress transaction in s2 is committed.
+# 's2c' step will cause logical decoding to go to initial consistent point and wait for
+# in-progress transaction in s3 to commit.
+# 's3c' step will cause logical decoding to find a consistent point while the transaction 
+# in s2 is prepared and not yet committed.
+# After the 's2cp' step, ensure that the whole transaction is decoded fresh at 's1start'.
+permutation "s2b" "s2txid" "s1init" "s3b" "s3txid" "s2c" "s2b" "s2insert" "s2p" "s3c" "s1insert" "s1start" "s2cp" "s1start"
-- 
1.8.3.1

#180Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#179)

On Tue, Jan 5, 2021 at 2:11 PM Ajin Cherian <itsajin@gmail.com> wrote:

I've addressed the above comments and the patch is attached. I've
called it v36-0007.

Thanks, I have pushed this after minor wordsmithing.

--
With Regards,
Amit Kapila.

#181Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#167)

On Tue, Dec 22, 2020 at 3:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 22, 2020 at 2:51 PM Ajin Cherian <itsajin@gmail.com> wrote:

Other than this I've noticed a few typos that are not in the patch but
in the surrounding code.
logical.c: 1383: Comment should mention stream_commit_cb not stream_abort_cb.
decode.c: 686 - Extra "it's" here: "because it's it happened"

Anything not related to this patch, please post in a separate email.

Pushed the fix for the typos reported above.

--
With Regards,
Amit Kapila.

#182Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#180)

On Tue, Jan 5, 2021 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 5, 2021 at 2:11 PM Ajin Cherian <itsajin@gmail.com> wrote:

I've addressed the above comments and the patch is attached. I've
called it v36-0007.

Thanks, I have pushed this after minor wordsmithing.

The test case is failing on one of the build farm machines. See email
from Tom Lane [1]. The symptom clearly shows that we are decoding
empty xacts which can happen due to background activity by autovacuum.
I think we need a fix similar to what we have done in
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=82a0ba7707e010a29f5fe1a0020d963c82b8f1cb.

I'll try to reproduce and provide a fix for this later today or tomorrow.

[1]: /messages/by-id/363512.1610171267@sss.pgh.pa.us

--
With Regards,
Amit Kapila.

#183Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#182)

On Sat, Jan 9, 2021 at 12:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 5, 2021 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 5, 2021 at 2:11 PM Ajin Cherian <itsajin@gmail.com> wrote:

I've addressed the above comments and the patch is attached. I've
called it v36-0007.

Thanks, I have pushed this after minor wordsmithing.

The test case is failing on one of the build farm machines. See email
from Tom Lane [1]. The symptom clearly shows that we are decoding
empty xacts which can happen due to background activity by autovacuum.
I think we need a fix similar to what we have done in
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=82a0ba7707e010a29f5fe1a0020d963c82b8f1cb.

I'll try to reproduce and provide a fix for this later today or tomorrow.

I have pushed the fix.

--
With Regards,
Amit Kapila.

#184Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#183)
7 attachment(s)

Please find attached the new patch set v37.

This patch set v37* is now rebased to use the most recent tablesync
patch from the other thread [1].
Note that v37-0001 is an exact copy of
v17-0001-tablesync-Solution1.patch.

Details of how the v37* patches relate to earlier patches are shown below:

======
v35-0001 -> committed -> NA
v17-0001-Tablesync-Solution1 -> (copy from [1]) -> v37-0001
v35-0002 -> (unchanged) -> v37-0002
v35-0003 -> (unchanged) -> v37-0003
v35-0004 -> (modify code, apply_handle_prepare changed for tablesync
worker) -> v37-0004
v35-0005 -> (unchanged) -> v37-0005
v35-0006 -> (modify code, twophase mode is now same for
tablesync/apply slots) -> v37-0006
v35-0007 -> v36-0007 -> committed -> NA
v35-0008 -> (unchanged) -> v37-0007
======

----
[1]: /messages/by-id/CAA4eK1KHJxaZS-fod-0fey=0tq3=Gkn4ho=8N4-5HWiCfu0H1A@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v37-0001-Tablesync-Solution1.patch (application/octet-stream)
From 3ac1b872eaab5532a8b36e85b12d2322a57d2230 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 20 Jan 2021 12:25:16 +1100
Subject: [PATCH v37] Tablesync Solution1.

DO NOT COMMIT THIS CODE.

This is v17 of the tablesync patch.
Please see [1] for the latest version of this patch to be committed.
[1] https://www.postgresql.org/message-id/flat/CAHut%2BPt9%2Bg8qQR0kMC85nY-O4uDQxXboamZAYhHbvkebzC9fAQ%40mail.gmail.com#1eefd9184ff1a2b94342c5c1ff6ef669
---
 doc/src/sgml/catalogs.sgml                  |   1 +
 doc/src/sgml/logical-replication.sgml       |  17 +-
 doc/src/sgml/ref/drop_subscription.sgml     |   6 +-
 src/backend/commands/subscriptioncmds.c     | 166 ++++++++------
 src/backend/replication/logical/origin.c    |   2 +-
 src/backend/replication/logical/tablesync.c | 340 ++++++++++++++++++++++++----
 src/backend/replication/logical/worker.c    |  27 +--
 src/backend/tcop/postgres.c                 |   6 +
 src/include/catalog/pg_subscription_rel.h   |   2 +
 src/include/replication/logicalworker.h     |   2 +
 src/include/replication/slot.h              |   3 +
 src/test/subscription/t/004_sync.pl         |  96 +++++++-
 12 files changed, 540 insertions(+), 128 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 43d7a1a..82e74e1 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7662,6 +7662,7 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
        State code:
        <literal>i</literal> = initialize,
        <literal>d</literal> = data is being copied,
+       <literal>f</literal> = finished table copy,
        <literal>s</literal> = synchronized,
        <literal>r</literal> = ready (normal replication)
       </para></entry>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index a560ad6..20cdd57 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -248,7 +248,17 @@
 
    <para>
     As mentioned earlier, each (active) subscription receives changes from a
-    replication slot on the remote (publishing) side.  Normally, the remote
+    replication slot on the remote (publishing) side.
+   </para>
+   <para>
+    Additional table synchronization slots are normally transient, created
+    internally and dropped automatically when they are no longer needed.
+    These table synchronization slots have generated names:
+    <quote><literal>pg_%u_sync_%u</literal></quote> (parameters: Subscription
+    <parameter>oid</parameter>, Table <parameter>relid</parameter>)
+   </para>
+   <para>
+    Normally, the remote
     replication slot is created automatically when the subscription is created
     using <command>CREATE SUBSCRIPTION</command> and it is dropped
     automatically when the subscription is dropped using <command>DROP
@@ -294,8 +304,9 @@
        using <command>ALTER SUBSCRIPTION</command> before attempting to drop
        the subscription.  If the remote database instance no longer exists, no
        further action is then necessary.  If, however, the remote database
-       instance is just unreachable, the replication slot should then be
-       dropped manually; otherwise it would continue to reserve WAL and might
+       instance is just unreachable, the replication slot (and any still 
+       remaining table synchronization slots) should then be
+       dropped manually; otherwise it/they would continue to reserve WAL and might
        eventually cause the disk to fill up.  Such cases should be carefully
        investigated.
       </para>
diff --git a/doc/src/sgml/ref/drop_subscription.sgml b/doc/src/sgml/ref/drop_subscription.sgml
index adbdeaf..aee9615 100644
--- a/doc/src/sgml/ref/drop_subscription.sgml
+++ b/doc/src/sgml/ref/drop_subscription.sgml
@@ -79,7 +79,8 @@ DROP SUBSCRIPTION [ IF EXISTS ] <replaceable class="parameter">name</replaceable
   <para>
    When dropping a subscription that is associated with a replication slot on
    the remote host (the normal state), <command>DROP SUBSCRIPTION</command>
-   will connect to the remote host and try to drop the replication slot as
+   will connect to the remote host and try to drop the replication slot (and
+   any remaining table synchronization slots) as
    part of its operation.  This is necessary so that the resources allocated
    for the subscription on the remote host are released.  If this fails,
    either because the remote host is not reachable or because the remote
@@ -89,7 +90,8 @@ DROP SUBSCRIPTION [ IF EXISTS ] <replaceable class="parameter">name</replaceable
    executing <literal>ALTER SUBSCRIPTION ... SET (slot_name = NONE)</literal>.
    After that, <command>DROP SUBSCRIPTION</command> will no longer attempt any
    actions on a remote host.  Note that if the remote replication slot still
-   exists, it should then be dropped manually; otherwise it will continue to
+   exists, it (and any related table synchronization slots) should then be
+   dropped manually; otherwise it/they will continue to
    reserve WAL and might eventually cause the disk to fill up.  See
    also <xref linkend="logical-replication-subscription-slot"/>.
   </para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 082f785..03cf91e 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -34,6 +34,7 @@
 #include "nodes/makefuncs.h"
 #include "replication/logicallauncher.h"
 #include "replication/origin.h"
+#include "replication/slot.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "replication/worker_internal.h"
@@ -928,7 +929,6 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 	char	   *err = NULL;
 	RepOriginId originid;
 	WalReceiverConn *wrconn = NULL;
-	StringInfoData cmd;
 	Form_pg_subscription form;
 
 	/*
@@ -1015,101 +1015,133 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 
 	ReleaseSysCache(tup);
 
-	/*
-	 * Stop all the subscription workers immediately.
-	 *
-	 * This is necessary if we are dropping the replication slot, so that the
-	 * slot becomes accessible.
-	 *
-	 * It is also necessary if the subscription is disabled and was disabled
-	 * in the same transaction.  Then the workers haven't seen the disabling
-	 * yet and will still be running, leading to hangs later when we want to
-	 * drop the replication origin.  If the subscription was disabled before
-	 * this transaction, then there shouldn't be any workers left, so this
-	 * won't make a difference.
-	 *
-	 * New workers won't be started because we hold an exclusive lock on the
-	 * subscription till the end of the transaction.
-	 */
-	LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
-	subworkers = logicalrep_workers_find(subid, false);
-	LWLockRelease(LogicalRepWorkerLock);
-	foreach(lc, subworkers)
+	PG_TRY();
 	{
-		LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
-
-		logicalrep_worker_stop(w->subid, w->relid);
-	}
-	list_free(subworkers);
-
-	/* Clean up dependencies */
-	deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
-
-	/* Remove any associated relation synchronization states. */
-	RemoveSubscriptionRel(subid, InvalidOid);
+		/*
+		 * Stop all the subscription workers immediately.
+		 *
+		 * This is necessary if we are dropping the replication slot, so that
+		 * the slot becomes accessible.
+		 *
+		 * It is also necessary if the subscription is disabled and was
+		 * disabled in the same transaction.  Then the workers haven't seen
+		 * the disabling yet and will still be running, leading to hangs later
+		 * when we want to drop the replication origin.  If the subscription
+		 * was disabled before this transaction, then there shouldn't be any
+		 * workers left, so this won't make a difference.
+		 *
+		 * New workers won't be started because we hold an exclusive lock on
+		 * the subscription till the end of the transaction.
+		 */
+		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+		subworkers = logicalrep_workers_find(subid, false);
+		LWLockRelease(LogicalRepWorkerLock);
+		foreach(lc, subworkers)
+		{
+			LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
 
-	/* Remove the origin tracking if exists. */
-	snprintf(originname, sizeof(originname), "pg_%u", subid);
-	originid = replorigin_by_name(originname, true);
-	if (originid != InvalidRepOriginId)
-		replorigin_drop(originid, false);
+			logicalrep_worker_stop(w->subid, w->relid);
+		}
+		list_free(subworkers);
+
+		/* Clean up dependencies. */
+		deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
+
+		/* Remove any associated relation synchronization states. */
+		RemoveSubscriptionRel(subid, InvalidOid);
+
+		/* Remove the origin tracking if exists. */
+		snprintf(originname, sizeof(originname), "pg_%u", subid);
+		originid = replorigin_by_name(originname, true);
+		if (originid != InvalidRepOriginId)
+			replorigin_drop(originid, false);
+
+		/*
+		 * If there is a slot associated with the subscription, then drop the
+		 * replication slot at the publisher node using the replication
+		 * connection.
+		 */
+		if (slotname)
+		{
+			load_file("libpqwalreceiver", false);
 
-	/*
-	 * If there is no slot associated with the subscription, we can finish
-	 * here.
-	 */
-	if (!slotname)
+			wrconn = walrcv_connect(conninfo, true, subname, &err);
+			if (wrconn == NULL)
+				ereport(ERROR,
+						(errmsg("could not connect to publisher when attempting to "
+								"drop the replication slot \"%s\"", slotname),
+						 errdetail("The error was: %s", err),
+				/* translator: %s is an SQL ALTER command */
+						 errhint("Use %s to disassociate the subscription from the slot.",
+								 "ALTER SUBSCRIPTION ... SET (slot_name = NONE)")));
+
+			ReplicationSlotDropAtPubNode(wrconn, slotname, false);
+		}
+	}
+	PG_FINALLY();
 	{
+		if (wrconn)
+			walrcv_disconnect(wrconn);
+
 		table_close(rel, NoLock);
-		return;
 	}
+	PG_END_TRY();
+}
+
+/*
+ * Drop the replication slot at the publisher node using the replication connection.
+ *
+ * missing_ok - if true then only issue WARNING message if the slot cannot be deleted.
+ */
+void
+ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok)
+{
+	StringInfoData cmd;
+
+	Assert(wrconn);
 
-	/*
-	 * Otherwise drop the replication slot at the publisher node using the
-	 * replication connection.
-	 */
 	load_file("libpqwalreceiver", false);
 
 	initStringInfo(&cmd);
 	appendStringInfo(&cmd, "DROP_REPLICATION_SLOT %s WAIT", quote_identifier(slotname));
 
-	wrconn = walrcv_connect(conninfo, true, subname, &err);
-	if (wrconn == NULL)
-		ereport(ERROR,
-				(errmsg("could not connect to publisher when attempting to "
-						"drop the replication slot \"%s\"", slotname),
-				 errdetail("The error was: %s", err),
-		/* translator: %s is an SQL ALTER command */
-				 errhint("Use %s to disassociate the subscription from the slot.",
-						 "ALTER SUBSCRIPTION ... SET (slot_name = NONE)")));
-
 	PG_TRY();
 	{
 		WalRcvExecResult *res;
 
 		res = walrcv_exec(wrconn, cmd.data, 0, NULL);
 
-		if (res->status != WALRCV_OK_COMMAND)
-			ereport(ERROR,
+		if (res->status == WALRCV_OK_COMMAND)
+		{
+			/* NOTICE. Success. */
+			ereport(NOTICE,
+					(errmsg("dropped replication slot \"%s\" on publisher",
+							slotname)));
+		}
+		else if (res->status == WALRCV_ERROR && missing_ok)
+		{
+			/* WARNING. Error, but missing_ok = true. */
+			ereport(WARNING,
 					(errmsg("could not drop the replication slot \"%s\" on publisher",
 							slotname),
 					 errdetail("The error was: %s", res->err)));
+		}
 		else
-			ereport(NOTICE,
-					(errmsg("dropped replication slot \"%s\" on publisher",
-							slotname)));
+		{
+			/* ERROR. */
+			ereport(ERROR,
+					(errmsg("could not drop the replication slot \"%s\" on publisher",
+							slotname),
+					 errdetail("The error was: %s", res->err)));
+		}
 
 		walrcv_clear_result(res);
 	}
 	PG_FINALLY();
 	{
-		walrcv_disconnect(wrconn);
+		pfree(cmd.data);
 	}
 	PG_END_TRY();
-
-	pfree(cmd.data);
-
-	table_close(rel, NoLock);
 }
 
 /*
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 77781d0..304c879 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -357,7 +357,7 @@ restart:
 		if (state->roident == roident)
 		{
 			/* found our slot, is it busy? */
-			if (state->acquired_by != 0)
+			if (state->acquired_by != 0 && state->acquired_by != MyProcPid)
 			{
 				ConditionVariable *cv;
 
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 863d196..ec85c08 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -31,8 +31,10 @@
  *		 table state to INIT.
  *	   - Tablesync worker starts; changes table state from INIT to DATASYNC while
  *		 copying.
- *	   - Tablesync worker finishes the copy and sets table state to SYNCWAIT;
- *		 waits for state change.
+ *	   - Tablesync worker does initial table copy; there is a FINISHEDCOPY state to
+ *		 indicate when the copy phase has completed, so if the worker crashes
+ *		 before reaching SYNCDONE the copy will not be re-attempted.
+ *	   - Tablesync worker then sets table state to SYNCWAIT; waits for state change.
  *	   - Apply worker periodically checks for tables in SYNCWAIT state.  When
  *		 any appear, it sets the table state to CATCHUP and starts loop-waiting
  *		 until either the table state is set to SYNCDONE or the sync worker
@@ -48,8 +50,8 @@
  *		 point it sets state to READY and stops tracking.  Again, there might
  *		 be zero changes in between.
  *
- *	  So the state progression is always: INIT -> DATASYNC -> SYNCWAIT ->
- *	  CATCHUP -> SYNCDONE -> READY.
+ *	  So the state progression is always: INIT -> DATASYNC ->
+ *	  (sync worker FINISHEDCOPY) -> SYNCWAIT -> CATCHUP -> SYNCDONE -> READY.
  *
  *	  The catalog pg_subscription_rel is used to keep information about
  *	  subscribed tables and their state.  Some transient state during data
@@ -59,6 +61,7 @@
  *	  Example flows look like this:
  *	   - Apply is in front:
  *		  sync:8
+ *			-> set in catalog FINISHEDCOPY
  *			-> set in memory SYNCWAIT
  *		  apply:10
  *			-> set in memory CATCHUP
@@ -74,6 +77,7 @@
  *
  *	   - Sync is in front:
  *		  sync:10
+ *			-> set in catalog FINISHEDCOPY
  *			-> set in memory SYNCWAIT
  *		  apply:8
  *			-> set in memory CATCHUP
@@ -98,11 +102,16 @@
 #include "miscadmin.h"
 #include "parser/parse_relation.h"
 #include "pgstat.h"
+#include "postmaster/interrupt.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
+#include "replication/logicalworker.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
+#include "replication/slot.h"
+#include "replication/origin.h"
 #include "storage/ipc.h"
+#include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -260,6 +269,92 @@ invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 }
 
 /*
+ * The sync worker cleans up any slot / origin resources it may have created.
+ * This function is called from ProcessInterrupts() as a result of the
+ * tablesync worker being signalled.
+ */
+void
+tablesync_cleanup_at_interrupt(void)
+{
+	bool		drop_slot_needed;
+	char		originname[NAMEDATALEN] = {0};
+	RepOriginId originid;
+	TimeLineID	tli;
+	Oid			subid = MySubscription->oid;
+	Oid			relid = MyLogicalRepWorker->relid;
+
+	elog(DEBUG1,
+		 "tablesync_cleanup_at_interrupt for relid = %u",
+		 MyLogicalRepWorker->relid);
+
+	/*
+	 * Cleanup the tablesync slot, if needed.
+	 *
+	 * If state is SYNCDONE or READY then the slot has already been dropped.
+	 */
+	drop_slot_needed =
+		wrconn != NULL &&
+		MyLogicalRepWorker->relstate != SUBREL_STATE_SYNCDONE &&
+		MyLogicalRepWorker->relstate != SUBREL_STATE_READY;
+
+	if (drop_slot_needed)
+	{
+		char		syncslotname[NAMEDATALEN] = {0};
+		bool		missing_ok = true;	/* no ERROR if slot is missing. */
+
+		/*
+		 * End wal streaming so the wrconn can be re-used to drop the slot.
+		 */
+		PG_TRY();
+		{
+			walrcv_endstreaming(wrconn, &tli);
+		}
+		PG_CATCH();
+		{
+			/*
+			 * It is possible that the walrcv_startstreaming was not yet
+			 * called (e.g. the interrupt initiating this cleanup may have
+			 * happened during the table COPY phase) so suppress any error
+			 * here to cope with that scenario.
+			 */
+		}
+		PG_END_TRY();
+
+		ReplicationSlotNameForTablesync(subid, relid, syncslotname);
+
+		ReplicationSlotDropAtPubNode(wrconn, syncslotname, missing_ok);
+	}
+
+	/*
+	 * Remove the tablesync's origin tracking if it exists.
+	 *
+	 * The origin APIs must be called within a transaction, and this
+	 * transaction will be ended within the finish_sync_worker().
+	 */
+	if (!IsTransactionState())
+	{
+		StartTransactionCommand();
+	}
+	snprintf(originname, sizeof(originname), "pg_%u_%u", subid, relid);
+	originid = replorigin_by_name(originname, true);
+	if (originid != InvalidRepOriginId)
+	{
+		replorigin_drop(originid, false);
+
+		/*
+		 * CommitTransactionCommand would normally attempt to advance the
+		 * origin, but now that the origin has been dropped that would fail,
+		 * so we need to reset the replorigin_session here to prevent this
+		 * error happening.
+		 */
+		replorigin_session_reset();
+		replorigin_session_origin = InvalidRepOriginId;
+	}
+
+	finish_sync_worker();		/* doesn't return. */
+}
+
+/*
  * Handle table synchronization cooperation from the synchronization
  * worker.
  *
@@ -270,30 +365,58 @@ invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 static void
 process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
-	Assert(IsTransactionState());
+	bool		sync_done = false;
+	Oid			subid = MySubscription->oid;
+	Oid			relid = MyLogicalRepWorker->relid;
 
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+	sync_done = MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
+		current_lsn >= MyLogicalRepWorker->relstate_lsn;
+	SpinLockRelease(&MyLogicalRepWorker->relmutex);
 
-	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
-		current_lsn >= MyLogicalRepWorker->relstate_lsn)
+	if (sync_done)
 	{
 		TimeLineID	tli;
+		char		syncslotname[NAMEDATALEN] = {0};
 
+		/* End wal streaming so wrconn can be re-used to drop the slot. */
+		walrcv_endstreaming(wrconn, &tli);
+
+		/*
+		 * Cleanup the tablesync slot.
+		 */
+		ReplicationSlotNameForTablesync(subid, relid, syncslotname);
+
+		elog(DEBUG1,
+			 "process_syncing_tables_for_sync: dropping the tablesync slot \"%s\".",
+			 syncslotname);
+		ReplicationSlotDropAtPubNode(wrconn, syncslotname, false);
+
+		/*
+		 * Change state to SYNCDONE.
+		 */
+		SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 		MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
 		MyLogicalRepWorker->relstate_lsn = current_lsn;
 
 		SpinLockRelease(&MyLogicalRepWorker->relmutex);
 
+		/*
+		 * UpdateSubscriptionRelState must be called within a transaction.
+		 * That transaction will be ended within the finish_sync_worker().
+		 */
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+		}
+
 		UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
 								   MyLogicalRepWorker->relid,
 								   MyLogicalRepWorker->relstate,
 								   MyLogicalRepWorker->relstate_lsn);
 
-		walrcv_endstreaming(wrconn, &tli);
 		finish_sync_worker();
 	}
-	else
-		SpinLockRelease(&MyLogicalRepWorker->relmutex);
 }
 
 /*
@@ -412,6 +535,33 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 					started_tx = true;
 				}
 
+				/*
+				 * Remove the tablesync origin tracking if it exists.
+				 *
+				 * The normal case origin drop is done here instead of in the
+				 * process_syncing_tables_for_sync function because if the
+				 * tablesync worker process attempted to drop its own origin
+				 * then that would prevent the origin from advancing properly
+				 * on commit TX.
+				 */
+				{
+					char		originname[NAMEDATALEN];
+					RepOriginId originid;
+
+					snprintf(originname, sizeof(originname), "pg_%u_%u", MyLogicalRepWorker->subid, rstate->relid);
+					originid = replorigin_by_name(originname, true);
+					if (OidIsValid(originid))
+					{
+						elog(DEBUG1,
+							 "process_syncing_tables_for_apply: dropping tablesync origin tracking for \"%s\".",
+							 originname);
+						replorigin_drop(originid, false);
+					}
+				}
+
+				/*
+				 * Update the state only after the origin cleanup.
+				 */
 				UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
 										   rstate->relid, rstate->state,
 										   rstate->lsn);
@@ -808,6 +958,30 @@ copy_table(Relation rel)
 }
 
 /*
+ * Determine the tablesync slot name.
+ *
+ * The name must not exceed NAMEDATALEN - 1 because of remote node constraints
+ * on slot name length.
+ *
+ * The slot name is either written to the supplied buffer or palloc'ed in the
+ * current memory context (if the buffer is NULL).
+ */
+char *
+ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslotname)
+{
+	if (syncslotname)
+	{
+		sprintf(syncslotname, "pg_%u_sync_%u", suboid, relid);
+	}
+	else
+	{
+		syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);
+	}
+
+	return syncslotname;
+}
+
+/*
  * Start syncing the table in the sync worker.
  *
  * If nothing needs to be done to sync the table, we exit the worker without
@@ -824,6 +998,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	XLogRecPtr	relstate_lsn;
 	Relation	rel;
 	WalRcvExecResult *res;
+	char		originname[NAMEDATALEN];
+	RepOriginId originid;
 
 	/* Check the state of the table synchronization. */
 	StartTransactionCommand();
@@ -849,19 +1025,11 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 			finish_sync_worker();	/* doesn't return */
 	}
 
-	/*
-	 * To build a slot name for the sync work, we are limited to NAMEDATALEN -
-	 * 1 characters.  We cut the original slot name to NAMEDATALEN - 28 chars
-	 * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0').  (It's actually the
-	 * NAMEDATALEN on the remote that matters, but this scheme will also work
-	 * reasonably if that is different.)
-	 */
-	StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small");	/* for sanity */
-	slotname = psprintf("%.*s_%u_sync_%u",
-						NAMEDATALEN - 28,
-						MySubscription->slotname,
-						MySubscription->oid,
-						MyLogicalRepWorker->relid);
+	/* Calculate the name of the tablesync slot. */
+	slotname = ReplicationSlotNameForTablesync(
+											   MySubscription->oid,
+											   MyLogicalRepWorker->relid,
+											   NULL);	/* use palloc */
 
 	/*
 	 * Here we use the slot name instead of the subscription name as the
@@ -874,7 +1042,30 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 				(errmsg("could not connect to the publisher: %s", err)));
 
 	Assert(MyLogicalRepWorker->relstate == SUBREL_STATE_INIT ||
-		   MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC);
+		   MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC ||
+		   MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY);
+
+	/* Assign the origin tracking record name. */
+	snprintf(originname, sizeof(originname), "pg_%u_%u", MySubscription->oid, MyLogicalRepWorker->relid);
+
+	if (MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY)
+	{
+		/*
+		 * The COPY phase was previously done, but tablesync then crashed/etc
+		 * before it was able to finish normally.
+		 */
+		StartTransactionCommand();
+
+		/*
+		 * The origin tracking name must already exist (missing_ok=false).
+		 */
+		originid = replorigin_by_name(originname, false);
+		replorigin_session_setup(originid);
+		replorigin_session_origin = originid;
+		*origin_startpos = replorigin_session_get_progress(false);
+
+		goto copy_table_done;
+	}
 
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 	MyLogicalRepWorker->relstate = SUBREL_STATE_DATASYNC;
@@ -890,9 +1081,6 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
-	/*
-	 * We want to do the table data sync in a single transaction.
-	 */
 	StartTransactionCommand();
 
 	/*
@@ -918,29 +1106,99 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	walrcv_clear_result(res);
 
 	/*
-	 * Create a new temporary logical decoding slot.  This slot will be used
+	 * Create a new permanent logical decoding slot. This slot will be used
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, true,
+	walrcv_create_slot(wrconn, slotname, false,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
-	/* Now do the initial data copy */
-	PushActiveSnapshot(GetTransactionSnapshot());
-	copy_table(rel);
-	PopActiveSnapshot();
+	/*
+	 * Be sure to remove the newly created tablesync slot if the COPY fails.
+	 */
+	PG_TRY();
+	{
+		/* Now do the initial data copy */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		copy_table(rel);
+		PopActiveSnapshot();
+
+		res = walrcv_exec(wrconn, "COMMIT", 0, NULL);
+		if (res->status != WALRCV_OK_COMMAND)
+			ereport(ERROR,
+					(errmsg("table copy could not finish transaction on publisher"),
+					 errdetail("The error was: %s", res->err)));
+		walrcv_clear_result(res);
+
+		table_close(rel, NoLock);
+
+		/* Make the copy visible. */
+		CommandCounterIncrement();
+	}
+	PG_CATCH();
+	{
+		/*
+		 * If something failed during copy table then cleanup the created
+		 * slot.
+		 */
+		elog(DEBUG1,
+			 "LogicalRepSyncTableStart: tablesync copy failed. Dropping the tablesync slot \"%s\".",
+			 slotname);
+		ReplicationSlotDropAtPubNode(wrconn, slotname, false);
 
-	res = walrcv_exec(wrconn, "COMMIT", 0, NULL);
-	if (res->status != WALRCV_OK_COMMAND)
+		pfree(slotname);
+		slotname = NULL;
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/* Setup replication origin tracking. */
+	originid = replorigin_by_name(originname, true);
+	if (!OidIsValid(originid))
+	{
+		/*
+		 * Origin tracking does not exist, so create it now.
+		 *
+		 * Then advance to the LSN got from walrcv_create_slot. This is WAL
+		 * logged for the purpose of recovery. Locks are to prevent the
+		 * replication origin from vanishing while advancing.
+		 */
+		LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+		originid = replorigin_create(originname);
+		replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
+						   true /* go backward */ , true /* WAL log */ );
+		UnlockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+
+		replorigin_session_setup(originid);
+		replorigin_session_origin = originid;
+	}
+	else
+	{
 		ereport(ERROR,
-				(errmsg("table copy could not finish transaction on publisher"),
-				 errdetail("The error was: %s", res->err)));
-	walrcv_clear_result(res);
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("replication origin \"%s\" already exists",
+						originname)));
+	}
 
-	table_close(rel, NoLock);
+	/*
+	 * Update the persisted state to indicate the COPY phase is done; make it
+	 * visible to others.
+	 */
+	UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
+							   MyLogicalRepWorker->relid,
+							   SUBREL_STATE_FINISHEDCOPY,
+							   MyLogicalRepWorker->relstate_lsn);
 
-	/* Make the copy visible. */
-	CommandCounterIncrement();
+copy_table_done:
+
+	elog(DEBUG1,
+		 "LogicalRepSyncTableStart: '%s' origin_startpos lsn %X/%X",
+		 originname,
+		 (uint32) (*origin_startpos >> 32),
+		 (uint32) *origin_startpos);
+
+	CommitTransactionCommand();
 
 	/*
 	 * We are done with the initial data synchronization, update the state.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f2b2549..6482dd6 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -807,12 +807,8 @@ apply_handle_stream_stop(StringInfo s)
 	/* We must be in a valid transaction state */
 	Assert(IsTransactionState());
 
-	/* The synchronization worker runs in single transaction. */
-	if (!am_tablesync_worker())
-	{
-		/* Commit the per-stream transaction */
-		CommitTransactionCommand();
-	}
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
 
 	in_streamed_transaction = false;
 
@@ -889,9 +885,7 @@ apply_handle_stream_abort(StringInfo s)
 			/* Cleanup the subxact info */
 			cleanup_subxact_info();
 
-			/* The synchronization worker runs in single transaction */
-			if (!am_tablesync_worker())
-				CommitTransactionCommand();
+			CommitTransactionCommand();
 			return;
 		}
 
@@ -918,8 +912,7 @@ apply_handle_stream_abort(StringInfo s)
 		/* write the updated subxact list */
 		subxact_info_write(MyLogicalRepWorker->subid, xid);
 
-		if (!am_tablesync_worker())
-			CommitTransactionCommand();
+		CommitTransactionCommand();
 	}
 }
 
@@ -1062,8 +1055,7 @@ apply_handle_stream_commit(StringInfo s)
 static void
 apply_handle_commit_internal(StringInfo s, LogicalRepCommitData* commit_data)
 {
-	/* The synchronization worker runs in single transaction. */
-	if (IsTransactionState() && !am_tablesync_worker())
+	if (IsTransactionState())
 	{
 		/*
 		 * Update origin state so we can restart streaming from correct
@@ -3112,3 +3104,12 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Is current process a logical replication tablesync worker?
+ */
+bool
+IsLogicalWorkerTablesync(void)
+{
+	return am_tablesync_worker();
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8dab9fd..2a0565e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3086,9 +3086,15 @@ ProcessInterrupts(void)
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					 errmsg("terminating autovacuum process due to administrator command")));
 		else if (IsLogicalWorker())
+		{
+			/* Tablesync workers do their own cleanups. */
+			if (IsLogicalWorkerTablesync())
+				tablesync_cleanup_at_interrupt(); /* does not return. */
+
 			ereport(FATAL,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					 errmsg("terminating logical replication worker due to administrator command")));
+		}
 		else if (IsLogicalLauncher())
 		{
 			ereport(DEBUG1,
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 06663b9..9027c42 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -61,6 +61,8 @@ DECLARE_UNIQUE_INDEX(pg_subscription_rel_srrelid_srsubid_index, 6117, on pg_subs
 #define SUBREL_STATE_INIT		'i' /* initializing (sublsn NULL) */
 #define SUBREL_STATE_DATASYNC	'd' /* data is being synchronized (sublsn
 									 * NULL) */
+#define SUBREL_STATE_FINISHEDCOPY 'f'	/* tablesync copy phase is completed
+										 * (sublsn NULL) */
 #define SUBREL_STATE_SYNCDONE	's' /* synchronization finished in front of
 									 * apply (sublsn set) */
 #define SUBREL_STATE_READY		'r' /* ready (sublsn set) */
diff --git a/src/include/replication/logicalworker.h b/src/include/replication/logicalworker.h
index 2ad61a0..085916c 100644
--- a/src/include/replication/logicalworker.h
+++ b/src/include/replication/logicalworker.h
@@ -15,5 +15,7 @@
 extern void ApplyWorkerMain(Datum main_arg);
 
 extern bool IsLogicalWorker(void);
+extern bool IsLogicalWorkerTablesync(void);
+extern void tablesync_cleanup_at_interrupt(void);
 
 #endif							/* LOGICALWORKER_H */
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 53f636c..5f52335 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -15,6 +15,7 @@
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "replication/walreceiver.h"
 
 /*
  * Behaviour of replication slots, upon release or crash.
@@ -211,6 +212,8 @@ extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
 extern void ReplicationSlotsDropDBSlots(Oid dboid);
 extern void InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno);
 extern ReplicationSlot *SearchNamedReplicationSlot(const char *name);
+extern char *ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslotname);
+extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..ba96d1d 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -3,7 +3,9 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 10;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
 
 # Initialize publisher node
 my $node_publisher = get_new_node('publisher');
@@ -149,7 +151,99 @@ $result = $node_subscriber->safe_psql('postgres',
 is($result, qq(20),
 	'changes for table added after subscription initialized replicated');
 
+##
+## slot integrity
+##
+## Manually create a slot with the same name that tablesync will want.
+## Expect tablesync ERROR when clash is detected.
+## Then remove the slot so tablesync can proceed.
+## Expect tablesync can now finish normally.
+##
+
+# drop the subscription
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+# empty the table tab_rep
+$node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep;");
+
+# empty the table tab_rep_next
+$node_subscriber->safe_psql('postgres', "DELETE FROM tab_rep_next;");
+
+# recreate the subscription again, but leave it disabled so that we can get the OID
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub
+	with (enabled = false)"
+);
+
+# need to create the name of the tablesync slot, for this we need the subscription OID
+# and the table OID.
+my $subid = $node_subscriber->safe_psql('postgres',
+	"SELECT oid FROM pg_subscription WHERE subname = 'tap_sub';");
+is(looks_like_number($subid), qq(1), 'get the subscription OID');
+
+my $relid = $node_subscriber->safe_psql('postgres',
+	"SELECT 'tab_rep_next'::regclass::oid");
+is(looks_like_number($relid), qq(1), 'get the table OID');
+
+# name of the tablesync slot is pg_'suboid'_sync_'tableoid'.
+my $slotname = 'pg_' . $subid . '_' . 'sync_' . $relid;
+
+# temporarily, create a slot having the same name of the tablesync slot.
+$node_publisher->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('$slotname', 'pgoutput', false);");
+
+# enable the subscription
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION tap_sub ENABLE"
+);
+
+# check for occurrence of the expected error
+poll_output_until("replication slot \"$slotname\" already exists")
+    or die "no error stop for the pre-existing slot";
+
+# now drop the offending slot, the tablesync should recover.
+$node_publisher->safe_psql('postgres',
+	"SELECT pg_drop_replication_slot('$slotname');");
+
+# wait for sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_rep_next");
+is($result, qq(20),
+	'data for table added after subscription initialized are now synced');
+
+# Cleanup
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 
 $node_subscriber->stop('fast');
 $node_publisher->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 10 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_subscriber->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within about 10 seconds. Give up
+    return 0;
+}
-- 
1.8.3.1
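
The tablesync slot naming used by the patch above ("pg_%u_sync_%u", limited to NAMEDATALEN - 1 characters) can be sketched stand-alone as below. This is only an illustration, not the patch's code: SKETCH_NAMEDATALEN, SketchOid and sketch_tablesync_slot_name() are hypothetical stand-ins, 64 is assumed for NAMEDATALEN (the default build value), and unlike the patch's sprintf() call this variant takes the buffer size and reports truncation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: stand-in for PostgreSQL's NAMEDATALEN (64 in default builds). */
#define SKETCH_NAMEDATALEN 64

/* Assumption: stand-in for PostgreSQL's Oid (an unsigned 32-bit value). */
typedef unsigned int SketchOid;

/*
 * Hypothetical helper: format "pg_<suboid>_sync_<relid>" into buf.
 *
 * Returns the length of the name (excluding the terminating NUL), or -1 if
 * the name would not fit into buflen bytes.
 */
static int
sketch_tablesync_slot_name(SketchOid suboid, SketchOid relid,
						   char *buf, size_t buflen)
{
	int			n;

	assert(buf != NULL);

	n = snprintf(buf, buflen, "pg_%u_sync_%u", suboid, relid);

	/* snprintf returns the length it wanted to write; detect truncation. */
	if (n < 0 || (size_t) n >= buflen)
		return -1;

	return n;
}
```

Two 32-bit OIDs print as at most 10 digits each, so the longest possible name is 3 + 10 + 6 + 10 = 29 characters, which fits comfortably within NAMEDATALEN - 1. That is presumably why the old scheme of truncating the subscription's slot name is no longer needed once the name is built from OIDs only.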

v37-0003-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From 8b50e4aba261061fafefac1325cde38302017a17 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 20 Jan 2021 13:11:03 +1100
Subject: [PATCH v37] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replica origin replay progress for
2PC, but it was not complete. It did not properly track the progress for
ROLLBACK PREPARED; in particular, it did not update the recovery code.
Additionally, we need to allow tracking it on subscriber nodes where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index fc18b77..609cf02 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2277,6 +2277,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2299,6 +2307,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a2068e3..5619547 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5720,8 +5720,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5927,7 +5926,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5976,6 +5976,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6017,7 +6024,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6025,7 +6033,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1
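
The "are we replaying remote actions?" test that the patch above adds to RecordTransactionAbortPrepared() can be sketched in isolation as below. This is a rough illustration under stated assumptions, not the backend code: SketchRepOriginId and the two SKETCH_* constants are hypothetical stand-ins for RepOriginId, InvalidRepOriginId and DoNotReplicateId, and their values here (0 and the maximum 16-bit value) are assumed rather than taken from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: stand-in for RepOriginId, taken here to be a 16-bit id. */
typedef uint16_t SketchRepOriginId;

/* Assumption: stand-ins for InvalidRepOriginId and DoNotReplicateId. */
#define SKETCH_INVALID_ORIGIN	((SketchRepOriginId) 0)
#define SKETCH_DO_NOT_REPLICATE	((SketchRepOriginId) UINT16_MAX)

/*
 * Hypothetical helper: true only when the session has a real replication
 * origin set up, i.e. it is neither unset nor the "do not replicate" marker.
 * Only in that case should origin progress be advanced on abort prepared.
 */
static bool
sketch_replaying_remote_actions(SketchRepOriginId session_origin)
{
	return session_origin != SKETCH_INVALID_ORIGIN &&
		session_origin != SKETCH_DO_NOT_REPLICATE;
}
```

The point of keeping both comparisons is that an ordinary local session (no origin) and a session explicitly marked as not to be replicated both skip the replorigin_session_advance() call; only an apply-style session with a concrete origin id moves the origin's LSNs forward.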

v37-0002-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From a684571901a1b2170150d6ea0eb0f07578251144 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 20 Jan 2021 13:08:44 +1100
Subject: [PATCH v37] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6482dd6..07edffd 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v37-0004-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 0b91a6e890c148959d5c15659f697a72c9b5024e Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 20 Jan 2021 15:53:11 +1100
Subject: [PATCH v37] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* We allow skipping prepared transactions if they are already prepared.
We do ensure that we skip only when the GID, origin_lsn, and
origin_timestamp of a prepared xact matches to avoid the possibility of
a match of prepared xact from two different nodes. This can happen when
the server or apply worker restarts after a prepared transaction.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  74 ++++++-
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 260 +++++++++++++++++++++-
 src/backend/replication/logical/worker.c    | 329 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 172 ++++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 ++++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 894 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 609cf02..6e5fe25 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1133,9 +1133,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
@@ -2446,3 +2446,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if a prepared xact with the given GID, LSN and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that originated from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers, nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 304c879..a979171 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -957,8 +957,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..ca83c94 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,264 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 07edffd..e13d0ec 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* set while skipping a transaction that was already prepared */
+bool		skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,263 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1026,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it will
 	 * be committed after processing all the messages. We need the transaction
@@ -800,6 +1079,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -831,6 +1116,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1046,6 +1337,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1168,6 +1465,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1289,6 +1589,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1447,6 +1750,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1816,6 +2122,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1972,6 +2281,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2f01137..b916665 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,27 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -378,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -776,17 +845,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -867,6 +927,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1181,3 +1259,31 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 	while ((entry = (RelationSyncEntry *) hash_seq_search(&status)) != NULL)
 		entry->replicate_valid = false;
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..40417e6 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure holds the
+ * information for both PREPARE and COMMIT PREPARED. For COMMIT PREPARED,
+ * prepare_lsn and preparetime store the commit LSN and the commit time
+ * instead.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * (prepare_end_lsn and preparetime) is used to check whether the downstream
+ * has received this prepared transaction. If it has, it can apply the
+ * rollback; otherwise, it can skip the rollback operation. The gid alone is
+ * not sufficient because the downstream node can have a prepared transaction
+ * with the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bab31bf..6bb162e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 943142c..b9c1087 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1340,12 +1340,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
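
The comment on LogicalRepRollbackPreparedTxnData above says the gid alone is not enough for the downstream to decide whether to apply a ROLLBACK PREPARED. Below is a minimal standalone sketch of that check, not the patch's LookupGXact() implementation. The Sketch* types, the array of locally prepared transactions and sketch_lookup_gxact() are all hypothetical stand-ins, assumed only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the PostgreSQL typedefs; these are NOT the real definitions. */
typedef uint64_t SketchLsn;			/* plays the role of XLogRecPtr */
typedef int64_t SketchTimestamp;	/* plays the role of TimestampTz */
#define SKETCH_GIDSIZE 200

/* Hypothetical record of one transaction prepared on the subscriber. */
typedef struct SketchPreparedXact
{
	char		gid[SKETCH_GIDSIZE];
	SketchLsn	origin_prepare_end_lsn; /* prepare end LSN on the publisher */
	SketchTimestamp origin_prepare_time;	/* prepare time on the publisher */
} SketchPreparedXact;

/*
 * Return true only if a locally prepared transaction matches the gid AND the
 * publisher-side prepare LSN AND the prepare time. A gid match alone is
 * rejected, because an unrelated local transaction may reuse the identifier.
 */
bool
sketch_lookup_gxact(const SketchPreparedXact *xacts, size_t nxacts,
					const char *gid, SketchLsn prepare_end_lsn,
					SketchTimestamp prepare_time)
{
	assert(gid != NULL);

	for (size_t i = 0; i < nxacts; i++)
	{
		if (strcmp(xacts[i].gid, gid) == 0 &&
			xacts[i].origin_prepare_end_lsn == prepare_end_lsn &&
			xacts[i].origin_prepare_time == prepare_time)
			return true;
	}

	return false;
}
```

An apply worker following this idea would run the rollback only when the lookup returns true, and skip the message otherwise.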

Attachment: v37-0005-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From ce790e90b09eb9c3a2b2e26459680d59fa600221 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 20 Jan 2021 16:50:41 +1100
Subject: [PATCH v37] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
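
The row count that the nested ROLLBACK TO SAVEPOINT test above expects (3) can be checked by hand. The following is only a rough standalone sketch, not part of the patch: it models test_tab as a presence bitmap over the key column and replays the statements of that test. The type and function names are hypothetical, and the keys of the two pre-existing rows (1 and 2) are an assumption inferred from generate_series starting at 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_KEY 10000			/* sketch-sized bound; keys 1..9999 fit */

/* Hypothetical stand-in for test_tab: present[a] is true if key a exists. */
typedef struct
{
	bool		present[MAX_KEY];
} Table;

static int
count_rows(const Table *t)
{
	int			n = 0;

	for (int a = 0; a < MAX_KEY; a++)
		if (t->present[a])
			n++;
	return n;
}

/* Replay the statements of the test and return the surviving row count. */
int
rows_after_nested_rollback(void)
{
	static Table tab;
	static Table savepoint;

	memset(&tab, 0, sizeof(tab));

	/* Step 0: two rows left by the previous test (keys 1 and 2 assumed). */
	tab.present[1] = true;
	tab.present[2] = true;

	/* Step 1: INSERT INTO test_tab VALUES (9999, 'foobar') */
	tab.present[9999] = true;

	/* Step 2: SAVEPOINT sp_inner, modelled as a snapshot of the key set. */
	savepoint = tab;

	/* Step 3: INSERT ... generate_series(3, 5000) */
	for (int a = 3; a <= 5000; a++)
		tab.present[a] = true;

	/* The UPDATE only rewrites column b, so the key set is unchanged. */

	/* DELETE ... WHERE mod(a,3) = 0; this also removes key 9999. */
	for (int a = 1; a < MAX_KEY; a++)
		if (a % 3 == 0)
			tab.present[a] = false;
	assert(!tab.present[9999]);

	/* Step 4: ROLLBACK TO SAVEPOINT sp_inner restores the snapshot. */
	tab = savepoint;

	/* Step 5: COMMIT PREPARED makes the restored state visible. */
	return count_rows(&tab);
}
```

Calling rows_after_nested_rollback() from a small main() gives the count to compare against the test's expectation. Note that the DELETE inside the savepoint also hits the (9999, 'foobar') row, since 9999 is divisible by 3, so the test only sees that row on the subscribers because the rollback to the savepoint undoes the delete as well.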

Attachment: v37-0006-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 749f998acaa8745a4597745dfd1c1d33852220be Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 20 Jan 2021 17:23:47 +1100
Subject: [PATCH v37] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
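Note for readers: the option semantics described in the commit message (boolean, default off, rejected when given twice) can be sketched outside the backend as below. This is a simplified illustration under stated assumptions, not the patch's code: the Option struct and parse_two_phase() are hypothetical names, and where the patch's parse_subscription_options() raises an ERROR ("conflicting or redundant options"), the sketch just returns -1.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a parsed WITH (...) list entry. */
typedef struct
{
	const char *name;
	bool		value;
} Option;

/*
 * Scan the option list for "two_phase".
 * Returns 0 on success and -1 if the option appears more than once.
 * *two_phase is false when the option is absent (the default is off).
 */
int
parse_two_phase(const Option *opts, int nopts, bool *two_phase)
{
	bool		given = false;

	*two_phase = false;			/* default is off */

	for (int i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "two_phase") != 0)
			continue;			/* other options are out of scope here */

		if (given)
			return -1;			/* conflicting or redundant options */

		given = true;
		*two_phase = opts[i].value;
	}
	return 0;
}
```

Tracking a separate "given" flag, rather than testing the value itself, is what lets a caller distinguish "two_phase = off" from "option not specified", which matters for ALTER SUBSCRIPTION where only explicitly given options should be updated.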
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index db5e59f..dbe2a43 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -166,8 +166,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction prepared
+          on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 44cb285..8f09d1d 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -67,6 +67,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd..55dd8da 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1167,7 +1167,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 03cf91e..3551d8f 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -65,7 +65,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -106,6 +107,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -211,6 +217,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -356,6 +371,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -380,7 +397,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -448,6 +466,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -721,6 +740,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -731,7 +752,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -770,6 +792,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -788,7 +817,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -833,7 +863,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -876,7 +907,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				AlterSubscription_refresh(sub, copy_data);
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e958274..a87b9a9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e13d0ec..dc71f96 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2787,6 +2787,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3433,6 +3434,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index b916665..a891c1b7 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 798d145..c638338 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 1290f96..07c3ad8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index caf9756..41a34e0 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -5988,7 +5988,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6014,13 +6014,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index e361802..950db60 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -53,6 +53,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -90,6 +92,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 40417e6..8e94a26 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4313f51..3d278ca 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 2fa9bce..23d876e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,42 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 14fa0b2..2a0b366 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -147,6 +147,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

v37-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From eb67b9dda4d75aab26bb096fd20334477c5eeb02 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 20 Jan 2021 18:08:10 +1100
Subject: [PATCH v37] Support 2PC txn tests for concurrent aborts.

Add TAP tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  58 ++++++++++
 src/backend/replication/logical/reorderbuffer.c   |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c5e28ce..e0cd841 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -10,6 +10,8 @@ ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+#Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+#Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 929255e..3fa172a 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -174,6 +177,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -275,6 +279,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -471,6 +493,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -706,6 +755,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -918,6 +970,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -971,6 +1026,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5a62ab8..4a4a9ed 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2489,6 +2489,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
1.8.3.1
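
As an aside on the "check-xid-aborted" option handling in test_decoding.c
above: the patch parses the value with strtoul() and a NULL end pointer, so
trailing garbage such as '735abc' would be accepted as 735. Below is a
minimal standalone sketch of a stricter parse, not the patch's code. The
names xid_t, INVALID_XID and parse_xid_option are hypothetical stand-ins
(for TransactionId and InvalidTransactionId) so that the snippet compiles
outside the PostgreSQL tree.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins: a 32-bit xid where 0 means "invalid". */
typedef uint32_t xid_t;
#define INVALID_XID ((xid_t) 0)

/*
 * Parse an xid option value using base 0, as the patch does.
 * Returns true and stores the value in *out on success; returns false for
 * NULL or empty input, trailing garbage, out-of-range values, or xid 0.
 */
static bool
parse_xid_option(const char *arg, xid_t *out)
{
	char	   *end;
	unsigned long val;

	if (arg == NULL || *arg == '\0')
		return false;

	errno = 0;
	val = strtoul(arg, &end, 0);

	/* Reject conversion errors and anything left over after the number. */
	if (errno != 0 || end == arg || *end != '\0')
		return false;

	/* Reject values that do not fit in 32 bits, and the invalid xid. */
	if (val > UINT32_MAX || (xid_t) val == INVALID_XID)
		return false;

	*out = (xid_t) val;
	return true;
}
```

In the plugin itself a failed parse would presumably still go through the
existing ereport(ERROR, ...) path; only the validation step differs.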

#185Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#184)
7 attachment(s)

PSA the new patch set v38*.

This patch set has been rebased to use the most recent tablesync patch
from the other thread [1]
(i.e., note that v38-0001 is an exact copy of that thread's tablesync
patch v31).

----
[1]: /messages/by-id/CAA4eK1KHJxaZS-fod-0fey=0tq3=Gkn4ho=8N4-5HWiCfu0H1A@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v38-0002-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From 8cb175ccc55f9c1581a1313be1bc611af0610e56 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Feb 2021 17:49:57 +1100
Subject: [PATCH v38] Refactor spool-file logic in worker.c.

This patch is a pure refactoring: it isolates the streaming spool-file
processing in a separate function. A later patch that adds support for
applying prepared transactions needs to call this common processing logic
from another place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfc924c..b50f962 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v38-0004-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 63efb92bb96de12d78a82b4c941ebff21aa4ff7d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Feb 2021 18:38:57 +1100
Subject: [PATCH v38] Add support for apply at prepare time to built-in logical
  replication.

To add support for streaming transactions at prepare time to built-in
logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* Allow skipping a prepared transaction if it has already been prepared.
We skip only when the GID, origin_lsn, and origin_timestamp of a prepared
xact all match, to avoid matching a prepared xact that came from a
different node. An already-prepared transaction can be received again when
the server or apply worker restarts after a prepared transaction.
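
The skip rule in the last item can be illustrated with a small standalone
sketch. This is only an illustration under simplifying assumptions, not the
LookupGXact() implementation in this patch: PreparedEntry and
prepared_already_exists are hypothetical names, the entries live in a plain
array, and no locking or 2PC file I/O is modelled.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one prepared-transaction entry. */
typedef struct
{
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* LSN where the remote prepare ended */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
	bool		valid;				/* false while the entry is being set up */
} PreparedEntry;

/*
 * Return true only if a valid entry matches on all three of GID,
 * origin LSN and origin timestamp. A GID match alone is not enough,
 * because an unrelated prepared xact may use the same GID.
 */
static bool
prepared_already_exists(const PreparedEntry *entries, int n,
						const char *gid, uint64_t prepare_end_lsn,
						int64_t origin_prepare_timestamp)
{
	for (int i = 0; i < n; i++)
	{
		const PreparedEntry *e = &entries[i];

		/* Ignore not-yet-valid entries and entries with another GID. */
		if (!e->valid || strcmp(e->gid, gid) != 0)
			continue;

		if (e->origin_lsn == prepare_end_lsn &&
			e->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```

With this rule, a prepare that is re-sent after a restart matches its
earlier copy and can be skipped, while a local prepared xact that merely
shares the GID does not match.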

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase, and
moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  74 ++++++-
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 260 +++++++++++++++++++++-
 src/backend/replication/logical/worker.c    | 329 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 172 ++++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 ++++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 894 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 609cf02..6e5fe25 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1133,9 +1133,9 @@ EndPrepare(GlobalTransaction gxact)
 	gxact->prepare_start_lsn = ProcLastRecPtr;
 
 	/*
-	 * Mark the prepared transaction as valid.  As soon as xact.c marks
-	 * MyProc as not running our XID (which it will do immediately after
-	 * this function returns), others can commit/rollback the xact.
+	 * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+	 * as not running our XID (which it will do immediately after this
+	 * function returns), others can commit/rollback the xact.
 	 *
 	 * NB: a side effect of this is to make a dummy ProcArray entry for the
 	 * prepared XID.  This must happen before we clear the XID from MyProc /
@@ -2446,3 +2446,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match origin_lsn and origin_timestamp
+ * of the prepared xact to avoid matching a prepared xact that came from a
+ * different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
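
Note for reviewers (not part of the patch): the matching rule described in
the LookupGXact header comment above can be sketched in isolation as
follows. This is a simplified model under stated assumptions: it keeps the
origin LSN and timestamp in memory, whereas the patch reads them from the
2PC file or WAL while holding TwoPhaseStateLock. DemoPreparedXact and
demo_lookup_gxact are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a prepared-xact entry. */
typedef struct DemoPreparedXact
{
    bool        valid;              /* false while the prepare is in progress */
    const char *gid;                /* global transaction identifier */
    uint64_t    origin_lsn;         /* end LSN of the remote PREPARE, 0 if local */
    int64_t     origin_timestamp;   /* remote prepare time, 0 if local */
} DemoPreparedXact;

/*
 * Return true only if a valid entry matches the GID and also carries the
 * same origin LSN and origin timestamp. A GID match alone is not enough,
 * because a locally prepared xact may use the same GID.
 */
static bool
demo_lookup_gxact(const DemoPreparedXact *xacts, size_t n, const char *gid,
                  uint64_t prepare_end_lsn, int64_t origin_prepare_timestamp)
{
    assert(gid != NULL);

    for (size_t i = 0; i < n; i++)
    {
        const DemoPreparedXact *x = &xacts[i];

        /* Ignore not-yet-valid entries and entries with a different GID. */
        if (!x->valid || strcmp(x->gid, gid) != 0)
            continue;

        if (x->origin_lsn == prepare_end_lsn &&
            x->origin_timestamp == origin_prepare_timestamp)
            return true;
    }
    return false;
}
```

In this model a locally prepared transaction that happens to share the GID
is not reported as a match, which is the case the header comment is
guarding against.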
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 685eaa6..73b420a 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -974,8 +974,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..ca83c94 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 
 	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);
 
-	/* send the flags field (unused for now) */
+	/* send the flags field */
 	pq_sendbyte(out, flags);
 
 	/* send fields */
@@ -106,6 +106,264 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b50f962..e01d02e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* for skipping a prepared transaction */
+bool		skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,263 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1026,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it
 	 * will be committed after processing all the messages. We need the
@@ -800,6 +1079,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -831,6 +1116,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1046,6 +1337,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1168,6 +1465,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1289,6 +1589,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1447,6 +1750,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1816,6 +2122,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1972,6 +2281,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 79765f9..c33ea25 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,27 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -378,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -776,17 +845,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -867,6 +927,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1192,3 +1270,31 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..40417e6 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure holds the
+ * information for prepare, stream prepare, and commit prepared messages.
+ * For commit prepared, prepare_lsn and preparetime store the commit LSN and
+ * commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. prepare_end_lsn and
+ * preparetime are used to check whether the downstream has received this
+ * prepared transaction: if it has, it can apply the rollback; otherwise it
+ * can skip the rollback operation. The gid alone is not sufficient because
+ * the downstream node can have an unrelated prepared transaction with the
+ * same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bab31bf..6bb162e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bab4f3a..048681c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

Attachment: v38-0005-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 1d806c001d62d456bf5875dca4224fa741c9cd01 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Feb 2021 18:49:30 +1100
Subject: [PATCH v38] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and not streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
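
As a quick sanity check on the row counts these streaming tests expect (3334, 3335 and 3), the workload can be replayed on a plain set of primary keys. This is only a sketch outside the patch: the helper name run_2pc_workload is made up for illustration, and it models nothing but which keys exist, not locking or visibility.

```python
# Model test_tab's primary keys as a set and replay the statements
# that the tests run inside the prepared transaction.
def run_2pc_workload(keys):
    """Return the key set after the INSERT/UPDATE/DELETE of the 2PC tx."""
    # INSERT ... FROM generate_series(3, 5000)
    keys = set(keys) | set(range(3, 5001))
    # The UPDATE only rewrites column b, so the key set is unchanged.
    # DELETE ... WHERE mod(a,3) = 0
    return {a for a in keys if a % 3 != 0}

initial = {1, 2}                      # rows left after DELETE ... WHERE a > 2
committed = run_2pc_workload(initial)

print(len(committed))                 # COMMIT PREPARED case
print(len(committed | {99999}))       # COMMIT PREPARED plus the outside INSERT
print(len(initial | {99999}))         # ROLLBACK PREPARED plus the outside INSERT
print(5 in committed)                 # a = 5 is not a multiple of 3, so it survives
```

The first three printed values correspond to the qq(3334|...), qq(3335|...) and qq(3|...) expectations in the tests above.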
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
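
For readers less familiar with savepoints, the nested-transaction case in the test above (21 survives, 22 does not) can be pictured with a toy model. This is a hedged sketch only: the Tx class and its methods are hypothetical names for illustration and say nothing about how PostgreSQL implements subtransactions or decodes them.

```python
class Tx:
    """Toy transaction: a list of pending inserts plus named savepoints."""

    def __init__(self):
        self.pending = []
        self.savepoints = {}

    def insert(self, value):
        self.pending.append(value)

    def savepoint(self, name):
        # Remember how many inserts existed when the savepoint was taken.
        self.savepoints[name] = len(self.pending)

    def rollback_to(self, name):
        # Discard every insert made after the savepoint.
        del self.pending[self.savepoints[name]:]


tx = Tx()
tx.insert(21)
tx.savepoint("sp_inner")
tx.insert(22)
tx.rollback_to("sp_inner")

# What PREPARE TRANSACTION 'outer' would carry in this model.
prepared = list(tx.pending)
print(prepared)
```

Only the work done before sp_inner is left to be prepared and later committed, which is why the test expects a = 21 on both subscribers and no a = 22.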
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

Attachment: v38-0001-Tablesync-V31.patch (application/octet-stream)
From 4ace9858ffab9dcc8eb57ae63421abef4a276695 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Feb 2021 17:35:42 +1100
Subject: [PATCH v38] Tablesync V31

DO NOT COMMIT THIS CODE.

This is v31 of the tablesync patch. Please see [1] for the latest version of this patch to be committed.

[1] https://www.postgresql.org/message-id/flat/CAA4eK1JHBqwtGcdjRnaCoD%2B_1G87pFnVw3AJjyBGx%2BYTN%3DuZTg%40mail.gmail.com#55fd22d19c95080996c99393c1cb2fad
---
 doc/src/sgml/catalogs.sgml                         |   1 +
 doc/src/sgml/logical-replication.sgml              |  59 ++-
 doc/src/sgml/ref/alter_subscription.sgml           |  18 +
 doc/src/sgml/ref/drop_subscription.sgml            |   6 +-
 src/backend/access/transam/xact.c                  |  11 -
 src/backend/catalog/pg_subscription.c              |  39 ++
 src/backend/commands/subscriptioncmds.c            | 467 ++++++++++++++++-----
 .../libpqwalreceiver/libpqwalreceiver.c            |   8 +
 src/backend/replication/logical/launcher.c         | 147 -------
 src/backend/replication/logical/tablesync.c        | 236 +++++++++--
 src/backend/replication/logical/worker.c           |  18 +-
 src/backend/tcop/utility.c                         |   3 +-
 src/include/catalog/pg_subscription_rel.h          |   2 +
 src/include/commands/subscriptioncmds.h            |   2 +-
 src/include/replication/logicallauncher.h          |   2 -
 src/include/replication/slot.h                     |   3 +
 src/include/replication/walreceiver.h              |   1 +
 src/include/replication/worker_internal.h          |   3 +-
 src/test/regress/expected/subscription.out         |  21 +
 src/test/regress/sql/subscription.sql              |  22 +
 src/test/subscription/t/004_sync.pl                |  21 +-
 src/tools/pgindent/typedefs.list                   |   2 +-
 22 files changed, 767 insertions(+), 325 deletions(-)
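
Note on the tablesync slot names: the logical-replication.sgml hunk in this
patch documents the generated name format as "pg_%u_sync_%u_%llu"
(subscription oid, table relid, system identifier). A minimal sketch of how
such a name could be built is shown here. It is an illustration only:
format_tablesync_slot_name() and SLOT_NAME_BUFLEN are hypothetical names
invented for this note, not the patch's ReplicationSlotNameForTablesync(),
and the buffer size of 64 is an assumption meant to mirror NAMEDATALEN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed buffer size for this sketch; intended to mirror NAMEDATALEN. */
#define SLOT_NAME_BUFLEN 64

/*
 * Hypothetical helper (not part of the patch): build a tablesync slot name
 * using the documented format "pg_%u_sync_%u_%llu".
 *
 * Returns 0 on success, or -1 if the name does not fit into the buffer.
 */
static int
format_tablesync_slot_name(unsigned int suboid, unsigned int relid,
						   unsigned long long sysid,
						   char *buf, size_t buflen)
{
	int			n;

	assert(buf != NULL && buflen > 0);

	/* snprintf always NUL-terminates when buflen > 0 */
	n = snprintf(buf, buflen, "pg_%u_sync_%u_%llu", suboid, relid, sysid);

	/* n is the length that would have been written, excluding the NUL */
	if (n < 0 || (size_t) n >= buflen)
		return -1;

	return 0;
}
```

A caller would pass a buffer of SLOT_NAME_BUFLEN bytes and check the return
value before using the name, for example when looking for leftover slots on
the publisher that have to be dropped manually.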

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index ea222c0..692ad65 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7673,6 +7673,7 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
        State code:
        <literal>i</literal> = initialize,
        <literal>d</literal> = data is being copied,
+       <literal>f</literal> = finished table copy,
        <literal>s</literal> = synchronized,
        <literal>r</literal> = ready (normal replication)
       </para></entry>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index a560ad6..d0742f2 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -186,9 +186,10 @@
 
   <para>
    Each subscription will receive changes via one replication slot (see
-   <xref linkend="streaming-replication-slots"/>).  Additional temporary
-   replication slots may be required for the initial data synchronization
-   of pre-existing table data.
+   <xref linkend="streaming-replication-slots"/>).  Additional replication
+   slots may be required for the initial data synchronization of
+   pre-existing table data and those will be dropped at the end of data
+   synchronization.
   </para>
 
   <para>
@@ -248,13 +249,23 @@
 
    <para>
     As mentioned earlier, each (active) subscription receives changes from a
-    replication slot on the remote (publishing) side.  Normally, the remote
-    replication slot is created automatically when the subscription is created
-    using <command>CREATE SUBSCRIPTION</command> and it is dropped
-    automatically when the subscription is dropped using <command>DROP
-    SUBSCRIPTION</command>.  In some situations, however, it can be useful or
-    necessary to manipulate the subscription and the underlying replication
-    slot separately.  Here are some scenarios:
+    replication slot on the remote (publishing) side.
+   </para>
+   <para>
+    Additional table synchronization slots are normally transient, created
+    internally to perform initial table synchronization and dropped
+    automatically when they are no longer needed. These table synchronization
+    slots have generated names: <quote><literal>pg_%u_sync_%u_%llu</literal></quote>
+    (parameters: Subscription <parameter>oid</parameter>,
+    Table <parameter>relid</parameter>, system identifier <parameter>sysid</parameter>)
+   </para>
+   <para>
+    Normally, the remote replication slot is created automatically when the
+    subscription is created using <command>CREATE SUBSCRIPTION</command> and it
+    is dropped automatically when the subscription is dropped using
+    <command>DROP SUBSCRIPTION</command>.  In some situations, however, it can
+    be useful or necessary to manipulate the subscription and the underlying
+    replication slot separately.  Here are some scenarios:
 
     <itemizedlist>
      <listitem>
@@ -294,8 +305,9 @@
        using <command>ALTER SUBSCRIPTION</command> before attempting to drop
        the subscription.  If the remote database instance no longer exists, no
        further action is then necessary.  If, however, the remote database
-       instance is just unreachable, the replication slot should then be
-       dropped manually; otherwise it would continue to reserve WAL and might
+       instance is just unreachable, the replication slot (and any still
+       remaining table synchronization slots) should then be
+       dropped manually; otherwise it/they would continue to reserve WAL and might
        eventually cause the disk to fill up.  Such cases should be carefully
        investigated.
       </para>
@@ -468,16 +480,19 @@
   <sect2 id="logical-replication-snapshot">
     <title>Initial Snapshot</title>
     <para>
-      The initial data in existing subscribed tables are snapshotted and
-      copied in a parallel instance of a special kind of apply process.
-      This process will create its own temporary replication slot and
-      copy the existing data. Once existing data is copied, the worker
-      enters synchronization mode, which ensures that the table is brought
-      up to a synchronized state with the main apply process by streaming
-      any changes that happened during the initial data copy using standard
-      logical replication. Once the synchronization is done, the control
-      of the replication of the table is given back to the main apply
-      process where the replication continues as normal.
+     The initial data in existing subscribed tables are snapshotted and
+     copied in a parallel instance of a special kind of apply process.
+     This process will create its own replication slot and copy the existing
+     data.  As soon as the copy is finished the table contents will become
+     visible to other backends.  Once existing data is copied, the worker
+     enters synchronization mode, which ensures that the table is brought
+     up to a synchronized state with the main apply process by streaming
+     any changes that happened during the initial data copy using standard
+     logical replication.  During this synchronization phase, the changes
+     are applied and committed in the same order as they happened on the
+     publisher.  Once the synchronization is done, the control of the
+     replication of the table is given back to the main apply process where
+     the replication continues as normal.
     </para>
   </sect2>
  </sect1>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index db5e59f..bcb0acf 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -48,6 +48,24 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    (Currently, all subscription owners must be superusers, so the owner checks
    will be bypassed in practice.  But this might change in the future.)
   </para>
+
+  <para>
+   When refreshing a publication we remove the relations that are no longer
+   part of the publication and we also remove the tablesync slots if there are
+   any. It is necessary to remove tablesync slots so that the resources
+   allocated for the subscription on the remote host are released. If, due to
+   a network breakdown or some other error, <productname>PostgreSQL</productname>
+   is unable to remove the slots, an ERROR will be reported. To proceed in this
+   situation, the user needs to either retry the operation or disassociate the
+   slot from the subscription and drop the subscription as explained in
+   <xref linkend="sql-dropsubscription"/>.
+  </para>
+
+  <para>
+   Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
+   <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
+   option as true cannot be executed inside a transaction block.
+  </para>
  </refsect1>
 
  <refsect1>
diff --git a/doc/src/sgml/ref/drop_subscription.sgml b/doc/src/sgml/ref/drop_subscription.sgml
index adbdeaf..aee9615 100644
--- a/doc/src/sgml/ref/drop_subscription.sgml
+++ b/doc/src/sgml/ref/drop_subscription.sgml
@@ -79,7 +79,8 @@ DROP SUBSCRIPTION [ IF EXISTS ] <replaceable class="parameter">name</replaceable
   <para>
    When dropping a subscription that is associated with a replication slot on
    the remote host (the normal state), <command>DROP SUBSCRIPTION</command>
-   will connect to the remote host and try to drop the replication slot as
+   will connect to the remote host and try to drop the replication slot (and
+   any remaining table synchronization slots) as
    part of its operation.  This is necessary so that the resources allocated
    for the subscription on the remote host are released.  If this fails,
    either because the remote host is not reachable or because the remote
@@ -89,7 +90,8 @@ DROP SUBSCRIPTION [ IF EXISTS ] <replaceable class="parameter">name</replaceable
    executing <literal>ALTER SUBSCRIPTION ... SET (slot_name = NONE)</literal>.
    After that, <command>DROP SUBSCRIPTION</command> will no longer attempt any
    actions on a remote host.  Note that if the remote replication slot still
-   exists, it should then be dropped manually; otherwise it will continue to
+   exists, it (and any related table synchronization slots) should then be
+   dropped manually; otherwise it/they will continue to
    reserve WAL and might eventually cause the disk to fill up.  See
    also <xref linkend="logical-replication-subscription-slot"/>.
   </para>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a2068e3..3c8b4eb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2432,15 +2432,6 @@ PrepareTransaction(void)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot PREPARE a transaction that has exported snapshots")));
 
-	/*
-	 * Don't allow PREPARE but for transaction that has/might kill logical
-	 * replication workers.
-	 */
-	if (XactManipulatesLogicalReplicationWorkers())
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("cannot PREPARE a transaction that has manipulated logical replication workers")));
-
 	/* Prevent cancel/die interrupt while cleaning up */
 	HOLD_INTERRUPTS();
 
@@ -4899,7 +4890,6 @@ CommitSubTransaction(void)
 	AtEOSubXact_HashTables(true, s->nestingLevel);
 	AtEOSubXact_PgStat(true, s->nestingLevel);
 	AtSubCommit_Snapshot(s->nestingLevel);
-	AtEOSubXact_ApplyLauncher(true, s->nestingLevel);
 
 	/*
 	 * We need to restore the upper transaction's read-only state, in case the
@@ -5059,7 +5049,6 @@ AbortSubTransaction(void)
 		AtEOSubXact_HashTables(false, s->nestingLevel);
 		AtEOSubXact_PgStat(false, s->nestingLevel);
 		AtSubAbort_Snapshot(s->nestingLevel);
-		AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
 	}
 
 	/*
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 44cb285..750ec2a 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -29,6 +29,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
 #include "utils/pg_lsn.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
@@ -337,6 +338,13 @@ GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn)
 	char		substate;
 	bool		isnull;
 	Datum		d;
+	Relation	rel;
+
+	/*
+	 * This is to avoid the race condition with AlterSubscription which tries
+	 * to remove this relstate.
+	 */
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
 
 	/* Try finding the mapping. */
 	tup = SearchSysCache2(SUBSCRIPTIONRELMAP,
@@ -363,6 +371,8 @@ GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn)
 	/* Cleanup */
 	ReleaseSysCache(tup);
 
+	table_close(rel, AccessShareLock);
+
 	return substate;
 }
 
@@ -403,6 +413,35 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	scan = table_beginscan_catalog(rel, nkeys, skey);
 	while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
 	{
+		Form_pg_subscription_rel subrel;
+
+		subrel = (Form_pg_subscription_rel) GETSTRUCT(tup);
+
+		/*
+		 * We don't allow dropping the relation mapping when the table
+		 * synchronization is in progress unless the caller updates the
+		 * corresponding subscription as well. This is to ensure that we don't
+		 * leave tablesync slots or origins in the system when the
+		 * corresponding table is dropped.
+		 */
+		if (!OidIsValid(subid) && subrel->srsubstate != SUBREL_STATE_READY)
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("could not drop relation mapping for subscription \"%s\"",
+							get_subscription_name(subrel->srsubid, false)),
+					 errdetail("Table synchronization for relation \"%s\" is in progress and is in state \"%c\".",
+							   get_rel_name(relid), subrel->srsubstate),
+
+			/*
+			 * translator: first %s is a SQL ALTER command and second %s is a
+			 * SQL DROP command
+			 */
+					 errhint("Use %s to enable subscription if not already enabled or use %s to drop the subscription.",
+							 "ALTER SUBSCRIPTION ... ENABLE",
+							 "DROP SUBSCRIPTION ...")));
+		}
+
 		CatalogTupleDelete(rel, &tup->t_self);
 	}
 	table_endscan(scan);
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5ccbc9d..7996f84 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -34,6 +34,7 @@
 #include "nodes/makefuncs.h"
 #include "replication/logicallauncher.h"
 #include "replication/origin.h"
+#include "replication/slot.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "replication/worker_internal.h"
@@ -46,6 +47,8 @@
 #include "utils/syscache.h"
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
+static void ReportSlotConnectionError(List *rstates, Oid subid, char *slotname, char *err);
+
 
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
@@ -566,107 +569,207 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
 	Oid		   *pubrel_local_oids;
 	ListCell   *lc;
 	int			off;
+	int			remove_rel_len;
+	Relation	rel = NULL;
+	typedef struct SubRemoveRels
+	{
+		Oid			relid;
+		char		state;
+	} SubRemoveRels;
+	SubRemoveRels *sub_remove_rels;
 
 	/* Load the library providing us libpq calls. */
 	load_file("libpqwalreceiver", false);
 
-	/* Try to connect to the publisher. */
-	wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);
-	if (!wrconn)
-		ereport(ERROR,
-				(errmsg("could not connect to the publisher: %s", err)));
-
-	/* Get the table list from publisher. */
-	pubrel_names = fetch_table_list(wrconn, sub->publications);
+	PG_TRY();
+	{
+		/* Try to connect to the publisher. */
+		wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);
+		if (!wrconn)
+			ereport(ERROR,
+					(errmsg("could not connect to the publisher: %s", err)));
 
-	/* We are done with the remote side, close connection. */
-	walrcv_disconnect(wrconn);
+		/* Get the table list from publisher. */
+		pubrel_names = fetch_table_list(wrconn, sub->publications);
 
-	/* Get local table list. */
-	subrel_states = GetSubscriptionRelations(sub->oid);
+		/* Get local table list. */
+		subrel_states = GetSubscriptionRelations(sub->oid);
 
-	/*
-	 * Build qsorted array of local table oids for faster lookup. This can
-	 * potentially contain all tables in the database so speed of lookup is
-	 * important.
-	 */
-	subrel_local_oids = palloc(list_length(subrel_states) * sizeof(Oid));
-	off = 0;
-	foreach(lc, subrel_states)
-	{
-		SubscriptionRelState *relstate = (SubscriptionRelState *) lfirst(lc);
+		/*
+		 * Build qsorted array of local table oids for faster lookup. This can
+		 * potentially contain all tables in the database so speed of lookup
+		 * is important.
+		 */
+		subrel_local_oids = palloc(list_length(subrel_states) * sizeof(Oid));
+		off = 0;
+		foreach(lc, subrel_states)
+		{
+			SubscriptionRelState *relstate = (SubscriptionRelState *) lfirst(lc);
 
-		subrel_local_oids[off++] = relstate->relid;
-	}
-	qsort(subrel_local_oids, list_length(subrel_states),
-		  sizeof(Oid), oid_cmp);
+			subrel_local_oids[off++] = relstate->relid;
+		}
+		qsort(subrel_local_oids, list_length(subrel_states),
+			  sizeof(Oid), oid_cmp);
+
+		/*
+		 * Rels that we want to remove from subscription and drop any slots
+		 * and origins corresponding to them.
+		 */
+		sub_remove_rels = palloc(list_length(subrel_states) * sizeof(SubRemoveRels));
+
+		/*
+		 * Walk over the remote tables and try to match them to locally known
+		 * tables. If the table is not known locally create a new state for
+		 * it.
+		 *
+		 * Also builds array of local oids of remote tables for the next step.
+		 */
+		off = 0;
+		pubrel_local_oids = palloc(list_length(pubrel_names) * sizeof(Oid));
+
+		foreach(lc, pubrel_names)
+		{
+			RangeVar   *rv = (RangeVar *) lfirst(lc);
+			Oid			relid;
 
-	/*
-	 * Walk over the remote tables and try to match them to locally known
-	 * tables. If the table is not known locally create a new state for it.
-	 *
-	 * Also builds array of local oids of remote tables for the next step.
-	 */
-	off = 0;
-	pubrel_local_oids = palloc(list_length(pubrel_names) * sizeof(Oid));
+			relid = RangeVarGetRelid(rv, AccessShareLock, false);
 
-	foreach(lc, pubrel_names)
-	{
-		RangeVar   *rv = (RangeVar *) lfirst(lc);
-		Oid			relid;
+			/* Check for supported relkind. */
+			CheckSubscriptionRelkind(get_rel_relkind(relid),
+									 rv->schemaname, rv->relname);
 
-		relid = RangeVarGetRelid(rv, AccessShareLock, false);
+			pubrel_local_oids[off++] = relid;
 
-		/* Check for supported relkind. */
-		CheckSubscriptionRelkind(get_rel_relkind(relid),
-								 rv->schemaname, rv->relname);
+			if (!bsearch(&relid, subrel_local_oids,
+						 list_length(subrel_states), sizeof(Oid), oid_cmp))
+			{
+				AddSubscriptionRelState(sub->oid, relid,
+										copy_data ? SUBREL_STATE_INIT : SUBREL_STATE_READY,
+										InvalidXLogRecPtr);
+				ereport(DEBUG1,
+						(errmsg("table \"%s.%s\" added to subscription \"%s\"",
+								rv->schemaname, rv->relname, sub->name)));
+			}
+		}
 
-		pubrel_local_oids[off++] = relid;
+		/*
+		 * Next remove state for tables we should not care about anymore using
+		 * the data we collected above
+		 */
+		qsort(pubrel_local_oids, list_length(pubrel_names),
+			  sizeof(Oid), oid_cmp);
 
-		if (!bsearch(&relid, subrel_local_oids,
-					 list_length(subrel_states), sizeof(Oid), oid_cmp))
+		remove_rel_len = 0;
+		for (off = 0; off < list_length(subrel_states); off++)
 		{
-			AddSubscriptionRelState(sub->oid, relid,
-									copy_data ? SUBREL_STATE_INIT : SUBREL_STATE_READY,
-									InvalidXLogRecPtr);
-			ereport(DEBUG1,
-					(errmsg("table \"%s.%s\" added to subscription \"%s\"",
-							rv->schemaname, rv->relname, sub->name)));
-		}
-	}
+			Oid			relid = subrel_local_oids[off];
 
-	/*
-	 * Next remove state for tables we should not care about anymore using the
-	 * data we collected above
-	 */
-	qsort(pubrel_local_oids, list_length(pubrel_names),
-		  sizeof(Oid), oid_cmp);
+			if (!bsearch(&relid, pubrel_local_oids,
+						 list_length(pubrel_names), sizeof(Oid), oid_cmp))
+			{
+				char		state;
+				XLogRecPtr	statelsn;
+
+				/*
+				 * Lock pg_subscription_rel with AccessExclusiveLock to
+				 * prevent any race conditions with the apply worker
+				 * re-launching workers at the same time this code is trying
+				 * to remove those tables.
+				 *
+				 * Even if a new worker for this particular rel is restarted, it
+				 * won't be able to make any progress as we hold an exclusive
+				 * lock on pg_subscription_rel till the transaction ends. It
+				 * will simply exit as there is no corresponding rel entry.
+				 *
+				 * This locking also ensures that the state of rels won't
+				 * change till we are done with this refresh operation.
+				 */
+				if (!rel)
+					rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+
+				/* Last known rel state. */
+				state = GetSubscriptionRelState(sub->oid, relid, &statelsn);
+
+				sub_remove_rels[remove_rel_len].relid = relid;
+				sub_remove_rels[remove_rel_len++].state = state;
+
+				RemoveSubscriptionRel(sub->oid, relid);
+
+				logicalrep_worker_stop(sub->oid, relid);
+
+				/*
+				 * For READY state, we would have already dropped the
+				 * tablesync origin.
+				 */
+				if (state != SUBREL_STATE_READY)
+				{
+					char		originname[NAMEDATALEN];
+
+					/*
+					 * Drop the tablesync's origin tracking if it exists.
+					 *
+					 * It is possible that the origin is not yet created for
+					 * the tablesync worker; this can happen for the states
+					 * before SUBREL_STATE_FINISHEDCOPY. The apply worker can
+					 * also concurrently try to drop the origin, so by this time
+					 * the origin might already be removed. For these reasons,
+					 * we pass missing_ok = true.
+					 */
+					ReplicationOriginNameForTablesync(sub->oid, relid, originname);
+					replorigin_drop_by_name(originname, true, false);
+				}
 
-	for (off = 0; off < list_length(subrel_states); off++)
-	{
-		Oid			relid = subrel_local_oids[off];
+				ereport(DEBUG1,
+						(errmsg("table \"%s.%s\" removed from subscription \"%s\"",
+								get_namespace_name(get_rel_namespace(relid)),
+								get_rel_name(relid),
+								sub->name)));
+			}
+		}
 
-		if (!bsearch(&relid, pubrel_local_oids,
-					 list_length(pubrel_names), sizeof(Oid), oid_cmp))
+		/*
+		 * Drop the tablesync slots associated with removed tables. This has
+		 * to be at the end because otherwise if there is an error while doing
+		 * the database operations we won't be able to roll back dropped slots.
+		 */
+		for (off = 0; off < remove_rel_len; off++)
 		{
-			RemoveSubscriptionRel(sub->oid, relid);
-
-			logicalrep_worker_stop_at_commit(sub->oid, relid);
-
-			ereport(DEBUG1,
-					(errmsg("table \"%s.%s\" removed from subscription \"%s\"",
-							get_namespace_name(get_rel_namespace(relid)),
-							get_rel_name(relid),
-							sub->name)));
+			if (sub_remove_rels[off].state != SUBREL_STATE_READY &&
+				sub_remove_rels[off].state != SUBREL_STATE_SYNCDONE)
+			{
+				char		syncslotname[NAMEDATALEN] = {0};
+
+				/*
+				 * For READY/SYNCDONE states we know the tablesync slot has
+				 * already been dropped by the tablesync worker.
+				 *
+				 * For other states, there is no certainty, maybe the slot
+				 * does not exist yet. Also, if we fail after removing some of
+				 * the slots, next time, it will again try to drop already
+				 * dropped slots and fail. For these reasons, we allow
+				 * missing_ok = true for the drop.
+				 */
+				ReplicationSlotNameForTablesync(sub->oid, sub_remove_rels[off].relid, syncslotname);
+				ReplicationSlotDropAtPubNode(wrconn, syncslotname, true);
+			}
 		}
 	}
+	PG_FINALLY();
+	{
+		if (wrconn)
+			walrcv_disconnect(wrconn);
+	}
+	PG_END_TRY();
+
+	if (rel)
+		table_close(rel, NoLock);
 }
 
 /*
  * Alter the existing subscription.
  */
 ObjectAddress
-AlterSubscription(AlterSubscriptionStmt *stmt)
+AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 {
 	Relation	rel;
 	ObjectAddress myself;
@@ -848,6 +951,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
+
 					/* Make sure refresh sees the new list of publications. */
 					sub->publications = stmt->publication;
 
@@ -877,6 +982,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL, NULL,	/* no "binary" */
 										   NULL, NULL); /* no "streaming" */
 
+				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
+
 				AlterSubscription_refresh(sub, copy_data);
 
 				break;
@@ -927,8 +1034,8 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 	char		originname[NAMEDATALEN];
 	char	   *err = NULL;
 	WalReceiverConn *wrconn = NULL;
-	StringInfoData cmd;
 	Form_pg_subscription form;
+	List	   *rstates;
 
 	/*
 	 * Lock pg_subscription with AccessExclusiveLock to ensure that the
@@ -1041,6 +1148,36 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 	}
 	list_free(subworkers);
 
+	/*
+	 * Cleanup of tablesync replication origins.
+	 *
+	 * Any READY-state relations would already have dealt with clean-ups.
+	 *
+	 * Note that the state can't change because we have already stopped both
+	 * the apply and tablesync workers and they can't restart because of
+	 * exclusive lock on the subscription.
+	 */
+	rstates = GetSubscriptionNotReadyRelations(subid);
+	foreach(lc, rstates)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+		Oid			relid = rstate->relid;
+
+		/* Only cleanup resources of tablesync workers */
+		if (!OidIsValid(relid))
+			continue;
+
+		/*
+		 * Drop the tablesync's origin tracking if it exists.
+		 *
+		 * It is possible that the origin is not yet created for the tablesync
+		 * worker, so we pass missing_ok = true. This can happen for the states
+		 * before SUBREL_STATE_FINISHEDCOPY.
+		 */
+		ReplicationOriginNameForTablesync(subid, relid, originname);
+		replorigin_drop_by_name(originname, true, false);
+	}
+
 	/* Clean up dependencies */
 	deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
 
@@ -1055,30 +1192,110 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 	 * If there is no slot associated with the subscription, we can finish
 	 * here.
 	 */
-	if (!slotname)
+	if (!slotname && rstates == NIL)
 	{
 		table_close(rel, NoLock);
 		return;
 	}
 
 	/*
-	 * Otherwise drop the replication slot at the publisher node using the
-	 * replication connection.
+	 * Try to acquire the connection necessary for dropping slots.
+	 *
+	 * Note: If the slotname is NONE/NULL then we allow the command to finish
+	 * and users need to manually clean up the apply and tablesync worker slots
+	 * later.
+	 *
+	 * This has to be at the end because otherwise if there is an error while
+	 * doing the database operations we won't be able to roll back dropped
+	 * slots.
 	 */
 	load_file("libpqwalreceiver", false);
 
-	initStringInfo(&cmd);
-	appendStringInfo(&cmd, "DROP_REPLICATION_SLOT %s WAIT", quote_identifier(slotname));
-
 	wrconn = walrcv_connect(conninfo, true, subname, &err);
 	if (wrconn == NULL)
-		ereport(ERROR,
-				(errmsg("could not connect to publisher when attempting to "
-						"drop the replication slot \"%s\"", slotname),
-				 errdetail("The error was: %s", err),
-		/* translator: %s is an SQL ALTER command */
-				 errhint("Use %s to disassociate the subscription from the slot.",
-						 "ALTER SUBSCRIPTION ... SET (slot_name = NONE)")));
+	{
+		if (!slotname)
+		{
+			/* be tidy */
+			list_free(rstates);
+			table_close(rel, NoLock);
+			return;
+		}
+		else
+		{
+			ReportSlotConnectionError(rstates, subid, slotname, err);
+		}
+	}
+
+	PG_TRY();
+	{
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+			Oid			relid = rstate->relid;
+
+			/* Only cleanup resources of tablesync workers */
+			if (!OidIsValid(relid))
+				continue;
+
+			/*
+			 * Drop the tablesync slots associated with removed tables.
+			 *
+			 * For SYNCDONE/READY states, the tablesync slot is known to have
+			 * already been dropped by the tablesync worker.
+			 *
+			 * For other states, there is no certainty, maybe the slot does
+			 * not exist yet. Also, if we fail after removing some of the
+			 * slots, next time, it will again try to drop already dropped
+			 * slots and fail. For these reasons, we allow missing_ok = true
+			 * for the drop.
+			 */
+			if (rstate->state != SUBREL_STATE_SYNCDONE)
+			{
+				char		syncslotname[NAMEDATALEN] = {0};
+
+				ReplicationSlotNameForTablesync(subid, relid, syncslotname);
+				ReplicationSlotDropAtPubNode(wrconn, syncslotname, true);
+			}
+		}
+
+		list_free(rstates);
+
+		/*
+		 * If there is a slot associated with the subscription, then drop the
+		 * replication slot at the publisher.
+		 */
+		if (slotname)
+			ReplicationSlotDropAtPubNode(wrconn, slotname, false);
+
+	}
+	PG_FINALLY();
+	{
+		walrcv_disconnect(wrconn);
+	}
+	PG_END_TRY();
+
+	table_close(rel, NoLock);
+}
+
+/*
+ * Drop the replication slot at the publisher node using the replication
+ * connection.
+ *
+ * missing_ok - if true then only issue a WARNING message if the slot doesn't
+ * exist.
+ */
+void
+ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok)
+{
+	StringInfoData cmd;
+
+	Assert(wrconn);
+
+	load_file("libpqwalreceiver", false);
+
+	initStringInfo(&cmd);
+	appendStringInfo(&cmd, "DROP_REPLICATION_SLOT %s WAIT", quote_identifier(slotname));
 
 	PG_TRY();
 	{
@@ -1086,27 +1303,39 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 
 		res = walrcv_exec(wrconn, cmd.data, 0, NULL);
 
-		if (res->status != WALRCV_OK_COMMAND)
-			ereport(ERROR,
+		if (res->status == WALRCV_OK_COMMAND)
+		{
+			/* NOTICE. Success. */
+			ereport(NOTICE,
+					(errmsg("dropped replication slot \"%s\" on publisher",
+							slotname)));
+		}
+		else if (res->status == WALRCV_ERROR &&
+				 missing_ok &&
+				 res->sqlstate == ERRCODE_UNDEFINED_OBJECT)
+		{
+			/* WARNING. Error, but missing_ok = true. */
+			ereport(WARNING,
 					(errmsg("could not drop the replication slot \"%s\" on publisher",
 							slotname),
 					 errdetail("The error was: %s", res->err)));
+		}
 		else
-			ereport(NOTICE,
-					(errmsg("dropped replication slot \"%s\" on publisher",
-							slotname)));
+		{
+			/* ERROR. */
+			ereport(ERROR,
+					(errmsg("could not drop the replication slot \"%s\" on publisher",
+							slotname),
+					 errdetail("The error was: %s", res->err)));
+		}
 
 		walrcv_clear_result(res);
 	}
 	PG_FINALLY();
 	{
-		walrcv_disconnect(wrconn);
+		pfree(cmd.data);
 	}
 	PG_END_TRY();
-
-	pfree(cmd.data);
-
-	table_close(rel, NoLock);
 }
 
 /*
@@ -1275,3 +1504,45 @@ fetch_table_list(WalReceiverConn *wrconn, List *publications)
 
 	return tablelist;
 }
+
+/*
+ * This is to report the connection failure while dropping replication slots.
+ * Here, we report a WARNING for all tablesync slots so that the user can
+ * drop them manually, if required.
+ */
+static void
+ReportSlotConnectionError(List *rstates, Oid subid, char *slotname, char *err)
+{
+	ListCell   *lc;
+
+	foreach(lc, rstates)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+		Oid			relid = rstate->relid;
+
+		/* Only cleanup resources of tablesync workers */
+		if (!OidIsValid(relid))
+			continue;
+
+		/*
+		 * Caller needs to ensure that relstate doesn't change underneath us.
+		 * See DropSubscription where we get the relstates.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE)
+		{
+			char		syncslotname[NAMEDATALEN] = {0};
+
+			ReplicationSlotNameForTablesync(subid, relid, syncslotname);
+			elog(WARNING, "could not drop tablesync replication slot \"%s\"",
+				 syncslotname);
+		}
+	}
+
+	ereport(ERROR,
+			(errmsg("could not connect to publisher when attempting to "
+					"drop the replication slot \"%s\"", slotname),
+			 errdetail("The error was: %s", err),
+	/* translator: %s is an SQL ALTER command */
+			 errhint("Use %s to disassociate the subscription from the slot.",
+					 "ALTER SUBSCRIPTION ... SET (slot_name = NONE)")));
+}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e958274..7714696 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -982,6 +982,7 @@ libpqrcv_exec(WalReceiverConn *conn, const char *query,
 {
 	PGresult   *pgres = NULL;
 	WalRcvExecResult *walres = palloc0(sizeof(WalRcvExecResult));
+	char	   *diag_sqlstate;
 
 	if (MyDatabaseId == InvalidOid)
 		ereport(ERROR,
@@ -1025,6 +1026,13 @@ libpqrcv_exec(WalReceiverConn *conn, const char *query,
 		case PGRES_BAD_RESPONSE:
 			walres->status = WALRCV_ERROR;
 			walres->err = pchomp(PQerrorMessage(conn->streamConn));
+			diag_sqlstate = PQresultErrorField(pgres, PG_DIAG_SQLSTATE);
+			if (diag_sqlstate)
+				walres->sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+												 diag_sqlstate[1],
+												 diag_sqlstate[2],
+												 diag_sqlstate[3],
+												 diag_sqlstate[4]);
 			break;
 	}
 
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 186514c..58082dd 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -73,20 +73,6 @@ typedef struct LogicalRepWorkerId
 	Oid			relid;
 } LogicalRepWorkerId;
 
-typedef struct StopWorkersData
-{
-	int			nestDepth;		/* Sub-transaction nest level */
-	List	   *workers;		/* List of LogicalRepWorkerId */
-	struct StopWorkersData *parent; /* This need not be an immediate
-									 * subtransaction parent */
-} StopWorkersData;
-
-/*
- * Stack of StopWorkersData elements. Each stack element contains the workers
- * to be stopped for that subtransaction.
- */
-static StopWorkersData *on_commit_stop_workers = NULL;
-
 static void ApplyLauncherWakeup(void);
 static void logicalrep_launcher_onexit(int code, Datum arg);
 static void logicalrep_worker_onexit(int code, Datum arg);
@@ -547,51 +533,6 @@ logicalrep_worker_stop(Oid subid, Oid relid)
 }
 
 /*
- * Request worker for specified sub/rel to be stopped on commit.
- */
-void
-logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
-{
-	int			nestDepth = GetCurrentTransactionNestLevel();
-	LogicalRepWorkerId *wid;
-	MemoryContext oldctx;
-
-	/* Make sure we store the info in context that survives until commit. */
-	oldctx = MemoryContextSwitchTo(TopTransactionContext);
-
-	/* Check that previous transactions were properly cleaned up. */
-	Assert(on_commit_stop_workers == NULL ||
-		   nestDepth >= on_commit_stop_workers->nestDepth);
-
-	/*
-	 * Push a new stack element if we don't already have one for the current
-	 * nestDepth.
-	 */
-	if (on_commit_stop_workers == NULL ||
-		nestDepth > on_commit_stop_workers->nestDepth)
-	{
-		StopWorkersData *newdata = palloc(sizeof(StopWorkersData));
-
-		newdata->nestDepth = nestDepth;
-		newdata->workers = NIL;
-		newdata->parent = on_commit_stop_workers;
-		on_commit_stop_workers = newdata;
-	}
-
-	/*
-	 * Finally add a new worker into the worker list of the current
-	 * subtransaction.
-	 */
-	wid = palloc(sizeof(LogicalRepWorkerId));
-	wid->subid = subid;
-	wid->relid = relid;
-	on_commit_stop_workers->workers =
-		lappend(on_commit_stop_workers->workers, wid);
-
-	MemoryContextSwitchTo(oldctx);
-}
-
-/*
  * Wake up (using latch) any logical replication worker for specified sub/rel.
  */
 void
@@ -820,109 +761,21 @@ ApplyLauncherShmemInit(void)
 }
 
 /*
- * Check whether current transaction has manipulated logical replication
- * workers.
- */
-bool
-XactManipulatesLogicalReplicationWorkers(void)
-{
-	return (on_commit_stop_workers != NULL);
-}
-
-/*
  * Wakeup the launcher on commit if requested.
  */
 void
 AtEOXact_ApplyLauncher(bool isCommit)
 {
-
-	Assert(on_commit_stop_workers == NULL ||
-		   (on_commit_stop_workers->nestDepth == 1 &&
-			on_commit_stop_workers->parent == NULL));
-
 	if (isCommit)
 	{
-		ListCell   *lc;
-
-		if (on_commit_stop_workers != NULL)
-		{
-			List	   *workers = on_commit_stop_workers->workers;
-
-			foreach(lc, workers)
-			{
-				LogicalRepWorkerId *wid = lfirst(lc);
-
-				logicalrep_worker_stop(wid->subid, wid->relid);
-			}
-		}
-
 		if (on_commit_launcher_wakeup)
 			ApplyLauncherWakeup();
 	}
 
-	/*
-	 * No need to pfree on_commit_stop_workers.  It was allocated in
-	 * transaction memory context, which is going to be cleaned soon.
-	 */
-	on_commit_stop_workers = NULL;
 	on_commit_launcher_wakeup = false;
 }
 
 /*
- * On commit, merge the current on_commit_stop_workers list into the
- * immediate parent, if present.
- * On rollback, discard the current on_commit_stop_workers list.
- * Pop out the stack.
- */
-void
-AtEOSubXact_ApplyLauncher(bool isCommit, int nestDepth)
-{
-	StopWorkersData *parent;
-
-	/* Exit immediately if there's no work to do at this level. */
-	if (on_commit_stop_workers == NULL ||
-		on_commit_stop_workers->nestDepth < nestDepth)
-		return;
-
-	Assert(on_commit_stop_workers->nestDepth == nestDepth);
-
-	parent = on_commit_stop_workers->parent;
-
-	if (isCommit)
-	{
-		/*
-		 * If the upper stack element is not an immediate parent
-		 * subtransaction, just decrement the notional nesting depth without
-		 * doing any real work.  Else, we need to merge the current workers
-		 * list into the parent.
-		 */
-		if (!parent || parent->nestDepth < nestDepth - 1)
-		{
-			on_commit_stop_workers->nestDepth--;
-			return;
-		}
-
-		parent->workers =
-			list_concat(parent->workers, on_commit_stop_workers->workers);
-	}
-	else
-	{
-		/*
-		 * Abandon everything that was done at this nesting level.  Explicitly
-		 * free memory to avoid a transaction-lifespan leak.
-		 */
-		list_free_deep(on_commit_stop_workers->workers);
-	}
-
-	/*
-	 * We have taken care of the current subtransaction workers list for both
-	 * abort or commit. So we are ready to pop the stack.
-	 */
-	pfree(on_commit_stop_workers);
-	on_commit_stop_workers = parent;
-}
-
-/*
  * Request wakeup of the launcher on commit of the transaction.
  *
  * This is used to send launcher signal to stop sleeping and process the
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ccbdbcf..19cc804 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -31,8 +31,11 @@
  *		 table state to INIT.
  *	   - Tablesync worker starts; changes table state from INIT to DATASYNC while
  *		 copying.
- *	   - Tablesync worker finishes the copy and sets table state to SYNCWAIT;
- *		 waits for state change.
+ *	   - Tablesync worker does initial table copy; there is a FINISHEDCOPY (sync
+ *		 worker specific) state to indicate when the copy phase has completed, so
+ *		 if the worker crashes with this (non-memory) state then the copy will not
+ *		 be re-attempted.
+ *	   - Tablesync worker then sets table state to SYNCWAIT; waits for state change.
  *	   - Apply worker periodically checks for tables in SYNCWAIT state.  When
  *		 any appear, it sets the table state to CATCHUP and starts loop-waiting
  *		 until either the table state is set to SYNCDONE or the sync worker
@@ -48,8 +51,8 @@
  *		 point it sets state to READY and stops tracking.  Again, there might
  *		 be zero changes in between.
  *
- *	  So the state progression is always: INIT -> DATASYNC -> SYNCWAIT ->
- *	  CATCHUP -> SYNCDONE -> READY.
+ *	  So the state progression is always: INIT -> DATASYNC -> FINISHEDCOPY
+ *	  -> SYNCWAIT -> CATCHUP -> SYNCDONE -> READY.
  *
  *	  The catalog pg_subscription_rel is used to keep information about
  *	  subscribed tables and their state.  The catalog holds all states
@@ -58,6 +61,7 @@
  *	  Example flows look like this:
  *	   - Apply is in front:
  *		  sync:8
+ *			-> set in catalog FINISHEDCOPY
  *			-> set in memory SYNCWAIT
  *		  apply:10
  *			-> set in memory CATCHUP
@@ -73,6 +77,7 @@
  *
  *	   - Sync is in front:
  *		  sync:10
+ *			-> set in catalog FINISHEDCOPY
  *			-> set in memory SYNCWAIT
  *		  apply:8
  *			-> set in memory CATCHUP
@@ -101,7 +106,10 @@
 #include "replication/logicalrelation.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
+#include "replication/slot.h"
+#include "replication/origin.h"
 #include "storage/ipc.h"
+#include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -269,26 +277,52 @@ invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 static void
 process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
-	Assert(IsTransactionState());
-
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
 		TimeLineID	tli;
+		char		syncslotname[NAMEDATALEN] = {0};
 
 		MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
 		MyLogicalRepWorker->relstate_lsn = current_lsn;
 
 		SpinLockRelease(&MyLogicalRepWorker->relmutex);
 
+		/*
+		 * UpdateSubscriptionRelState must be called within a transaction.
+		 * That transaction will be ended within the finish_sync_worker().
+		 */
+		if (!IsTransactionState())
+			StartTransactionCommand();
+
 		UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
 								   MyLogicalRepWorker->relid,
 								   MyLogicalRepWorker->relstate,
 								   MyLogicalRepWorker->relstate_lsn);
 
+		/* End wal streaming so wrconn can be re-used to drop the slot. */
 		walrcv_endstreaming(wrconn, &tli);
+
+		/*
+		 * Cleanup the tablesync slot.
+		 *
+		 * This has to be done after updating the state because otherwise if
+		 * there is an error while doing the database operations we won't be
+		 * able to roll back the dropped slot.
+		 */
+		ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
+										MyLogicalRepWorker->relid,
+										syncslotname);
+
+		/*
+		 * It is important to give an error if we are unable to drop the slot,
+		 * otherwise, it won't be dropped till the corresponding subscription
+		 * is dropped. So passing missing_ok = false.
+		 */
+		ReplicationSlotDropAtPubNode(wrconn, syncslotname, false);
+
 		finish_sync_worker();
 	}
 	else
@@ -403,6 +437,8 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 			 */
 			if (current_lsn >= rstate->lsn)
 			{
+				char		originname[NAMEDATALEN];
+
 				rstate->state = SUBREL_STATE_READY;
 				rstate->lsn = current_lsn;
 				if (!started_tx)
@@ -411,6 +447,27 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 					started_tx = true;
 				}
 
+				/*
+				 * Remove the tablesync origin tracking if it exists.
+				 *
+				 * The normal case origin drop is done here instead of in the
+				 * process_syncing_tables_for_sync function because the origin
+				 * cannot be dropped while the process owning it is still
+				 * alive.
+				 *
+				 * There is a chance that the user is concurrently performing
+				 * refresh for the subscription where we remove the table
+				 * state and its origin and by this time the origin might be
+				 * already removed. So passing missing_ok = true.
+				 */
+				ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
+												  rstate->relid,
+												  originname);
+				replorigin_drop_by_name(originname, true, false);
+
+				/*
+				 * Update the state to READY only after the origin cleanup.
+				 */
 				UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
 										   rstate->relid, rstate->state,
 										   rstate->lsn);
@@ -806,6 +863,50 @@ copy_table(Relation rel)
 }
 
 /*
+ * Determine the tablesync slot name.
+ *
+ * The name must not exceed NAMEDATALEN - 1 because of remote node constraints
+ * on slot name length. We append system_identifier to avoid slot_name
+ * collision with subscriptions in other clusters. With the current scheme
+ * pg_%u_sync_%u_UINT64_FORMAT (3 + 10 + 6 + 10 + 20 + '\0'), the maximum
+ * length of slot_name will be 50.
+ *
+ * The returned slot name is either:
+ * - stored in the supplied buffer (syncslotname), or
+ * - palloc'ed in current memory context (if syncslotname = NULL).
+ *
+ * Note: We don't use the subscription slot name as part of tablesync slot name
+ * because we are responsible for cleaning up these slots and it could become
+ * impossible to recalculate what name to cleanup if the subscription slot name
+ * had changed.
+ */
+char *
+ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
+								char syncslotname[NAMEDATALEN])
+{
+	if (syncslotname)
+		sprintf(syncslotname, "pg_%u_sync_%u_" UINT64_FORMAT, suboid, relid,
+				GetSystemIdentifier());
+	else
+		syncslotname = psprintf("pg_%u_sync_%u_" UINT64_FORMAT, suboid, relid,
+								GetSystemIdentifier());
+
+	return syncslotname;
+}
+
+/*
+ * Form the origin name for tablesync.
+ *
+ * Return the name in the supplied buffer.
+ */
+void
+ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
+								  char originname[NAMEDATALEN])
+{
+	snprintf(originname, NAMEDATALEN, "pg_%u_%u", suboid, relid);
+}
+
+/*
  * Start syncing the table in the sync worker.
  *
  * If nothing needs to be done to sync the table, we exit the worker without
@@ -822,6 +923,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	XLogRecPtr	relstate_lsn;
 	Relation	rel;
 	WalRcvExecResult *res;
+	char		originname[NAMEDATALEN];
+	RepOriginId originid;
 
 	/* Check the state of the table synchronization. */
 	StartTransactionCommand();
@@ -847,19 +950,10 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 			finish_sync_worker();	/* doesn't return */
 	}
 
-	/*
-	 * To build a slot name for the sync work, we are limited to NAMEDATALEN -
-	 * 1 characters.  We cut the original slot name to NAMEDATALEN - 28 chars
-	 * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0').  (It's actually the
-	 * NAMEDATALEN on the remote that matters, but this scheme will also work
-	 * reasonably if that is different.)
-	 */
-	StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small");	/* for sanity */
-	slotname = psprintf("%.*s_%u_sync_%u",
-						NAMEDATALEN - 28,
-						MySubscription->slotname,
-						MySubscription->oid,
-						MyLogicalRepWorker->relid);
+	/* Calculate the name of the tablesync slot. */
+	slotname = ReplicationSlotNameForTablesync(MySubscription->oid,
+											   MyLogicalRepWorker->relid,
+											   NULL /* use palloc */ );
 
 	/*
 	 * Here we use the slot name instead of the subscription name as the
@@ -872,7 +966,50 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 				(errmsg("could not connect to the publisher: %s", err)));
 
 	Assert(MyLogicalRepWorker->relstate == SUBREL_STATE_INIT ||
-		   MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC);
+		   MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC ||
+		   MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY);
+
+	/* Assign the origin tracking record name. */
+	ReplicationOriginNameForTablesync(MySubscription->oid,
+									  MyLogicalRepWorker->relid,
+									  originname);
+
+	if (MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC)
+	{
+		/*
+		 * We have previously errored out before finishing the copy so the
+		 * replication slot might exist. We want to remove the slot if it
+		 * already exists and proceed.
+		 *
+		 * XXX We could instead try to drop the slot at the time of the
+		 * earlier failure, but for that we might need to clean up the copy
+		 * state, as it might be in the middle of fetching the rows. Also, if
+		 * there was a network breakdown then the drop wouldn't have succeeded,
+		 * so trying it on the next attempt seems like a better bet.
+		 */
+		ReplicationSlotDropAtPubNode(wrconn, slotname, true);
+	}
+	else if (MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY)
+	{
+		/*
+		 * The COPY phase was previously done, but tablesync then crashed
+		 * before it was able to finish normally.
+		 */
+		StartTransactionCommand();
+
+		/*
+		 * The origin tracking name must already exist. It was created the
+		 * first time this tablesync was launched.
+		 */
+		originid = replorigin_by_name(originname, false);
+		replorigin_session_setup(originid);
+		replorigin_session_origin = originid;
+		*origin_startpos = replorigin_session_get_progress(false);
+
+		CommitTransactionCommand();
+
+		goto copy_table_done;
+	}
 
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 	MyLogicalRepWorker->relstate = SUBREL_STATE_DATASYNC;
@@ -888,9 +1025,6 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
-	/*
-	 * We want to do the table data sync in a single transaction.
-	 */
 	StartTransactionCommand();
 
 	/*
@@ -916,13 +1050,46 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	walrcv_clear_result(res);
 
 	/*
-	 * Create a new temporary logical decoding slot.  This slot will be used
+	 * Create a new permanent logical decoding slot. This slot will be used
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, true,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
+	/*
+	 * Setup replication origin tracking. The purpose of doing this before the
+	 * copy is to avoid doing the copy again due to any error in setting up
+	 * origin tracking.
+	 */
+	originid = replorigin_by_name(originname, true);
+	if (!OidIsValid(originid))
+	{
+		/*
+		 * Origin tracking does not exist, so create it now.
+		 *
+		 * Then advance to the LSN obtained from walrcv_create_slot. This is
+		 * WAL logged for the purpose of recovery. Locks are to prevent the
+		 * replication origin from vanishing while advancing.
+		 */
+		originid = replorigin_create(originname);
+
+		LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+		replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
+						   true /* go backward */ , true /* WAL log */ );
+		UnlockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+
+		replorigin_session_setup(originid);
+		replorigin_session_origin = originid;
+	}
+	else
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("replication origin \"%s\" already exists",
+						originname)));
+	}
+
 	/* Now do the initial data copy */
 	PushActiveSnapshot(GetTransactionSnapshot());
 	copy_table(rel);
@@ -941,6 +1108,25 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	CommandCounterIncrement();
 
 	/*
+	 * Update the persisted state to indicate the COPY phase is done; make it
+	 * visible to others.
+	 */
+	UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
+							   MyLogicalRepWorker->relid,
+							   SUBREL_STATE_FINISHEDCOPY,
+							   MyLogicalRepWorker->relstate_lsn);
+
+	CommitTransactionCommand();
+
+copy_table_done:
+
+	elog(DEBUG1,
+		 "LogicalRepSyncTableStart: '%s' origin_startpos lsn %X/%X",
+		 originname,
+		 (uint32) (*origin_startpos >> 32),
+		 (uint32) *origin_startpos);
+
+	/*
 	 * We are done with the initial data synchronization, update the state.
 	 */
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index eb7db89..cfc924c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -807,12 +807,8 @@ apply_handle_stream_stop(StringInfo s)
 	/* We must be in a valid transaction state */
 	Assert(IsTransactionState());
 
-	/* The synchronization worker runs in single transaction. */
-	if (!am_tablesync_worker())
-	{
-		/* Commit the per-stream transaction */
-		CommitTransactionCommand();
-	}
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
 
 	in_streamed_transaction = false;
 
@@ -889,9 +885,7 @@ apply_handle_stream_abort(StringInfo s)
 			/* Cleanup the subxact info */
 			cleanup_subxact_info();
 
-			/* The synchronization worker runs in single transaction */
-			if (!am_tablesync_worker())
-				CommitTransactionCommand();
+			CommitTransactionCommand();
 			return;
 		}
 
@@ -918,8 +912,7 @@ apply_handle_stream_abort(StringInfo s)
 		/* write the updated subxact list */
 		subxact_info_write(MyLogicalRepWorker->subid, xid);
 
-		if (!am_tablesync_worker())
-			CommitTransactionCommand();
+		CommitTransactionCommand();
 	}
 }
 
@@ -1062,8 +1055,7 @@ apply_handle_stream_commit(StringInfo s)
 static void
 apply_handle_commit_internal(StringInfo s, LogicalRepCommitData *commit_data)
 {
-	/* The synchronization worker runs in single transaction. */
-	if (IsTransactionState() && !am_tablesync_worker())
+	if (IsTransactionState())
 	{
 		/*
 		 * Update origin state so we can restart streaming from correct
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1d81071..05bb698 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1786,7 +1786,8 @@ ProcessUtilitySlow(ParseState *pstate,
 				break;
 
 			case T_AlterSubscriptionStmt:
-				address = AlterSubscription((AlterSubscriptionStmt *) parsetree);
+				address = AlterSubscription((AlterSubscriptionStmt *) parsetree,
+											isTopLevel);
 				break;
 
 			case T_DropSubscriptionStmt:
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 2bea2c5..ed94f57 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -61,6 +61,8 @@ DECLARE_UNIQUE_INDEX_PKEY(pg_subscription_rel_srrelid_srsubid_index, 6117, on pg
 #define SUBREL_STATE_INIT		'i' /* initializing (sublsn NULL) */
 #define SUBREL_STATE_DATASYNC	'd' /* data is being synchronized (sublsn
 									 * NULL) */
+#define SUBREL_STATE_FINISHEDCOPY 'f'	/* tablesync copy phase is completed
+										 * (sublsn NULL) */
 #define SUBREL_STATE_SYNCDONE	's' /* synchronization finished in front of
 									 * apply (sublsn set) */
 #define SUBREL_STATE_READY		'r' /* ready (sublsn set) */
diff --git a/src/include/commands/subscriptioncmds.h b/src/include/commands/subscriptioncmds.h
index a818650..3b926f3 100644
--- a/src/include/commands/subscriptioncmds.h
+++ b/src/include/commands/subscriptioncmds.h
@@ -20,7 +20,7 @@
 
 extern ObjectAddress CreateSubscription(CreateSubscriptionStmt *stmt,
 										bool isTopLevel);
-extern ObjectAddress AlterSubscription(AlterSubscriptionStmt *stmt);
+extern ObjectAddress AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel);
 extern void DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel);
 
 extern ObjectAddress AlterSubscriptionOwner(const char *name, Oid newOwnerId);
diff --git a/src/include/replication/logicallauncher.h b/src/include/replication/logicallauncher.h
index 421ec15..301e494 100644
--- a/src/include/replication/logicallauncher.h
+++ b/src/include/replication/logicallauncher.h
@@ -22,9 +22,7 @@ extern Size ApplyLauncherShmemSize(void);
 extern void ApplyLauncherShmemInit(void);
 
 extern void ApplyLauncherWakeupAtCommit(void);
-extern bool XactManipulatesLogicalReplicationWorkers(void);
 extern void AtEOXact_ApplyLauncher(bool isCommit);
-extern void AtEOSubXact_ApplyLauncher(bool isCommit, int nestDepth);
 
 extern bool IsLogicalLauncher(void);
 
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 53f636c..5f52335 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -15,6 +15,7 @@
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "replication/walreceiver.h"
 
 /*
  * Behaviour of replication slots, upon release or crash.
@@ -211,6 +212,8 @@ extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
 extern void ReplicationSlotsDropDBSlots(Oid dboid);
 extern void InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno);
 extern ReplicationSlot *SearchNamedReplicationSlot(const char *name);
+extern char *ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslotname);
+extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4313f51..a97a59a 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -210,6 +210,7 @@ typedef enum
 typedef struct WalRcvExecResult
 {
 	WalRcvExecStatus status;
+	int			sqlstate;
 	char	   *err;
 	Tuplestorestate *tuplestore;
 	TupleDesc	tupledesc;
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index d046022..4a5adc2 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -77,13 +77,14 @@ extern List *logicalrep_workers_find(Oid subid, bool only_running);
 extern void logicalrep_worker_launch(Oid dbid, Oid subid, const char *subname,
 									 Oid userid, Oid relid);
 extern void logicalrep_worker_stop(Oid subid, Oid relid);
-extern void logicalrep_worker_stop_at_commit(Oid subid, Oid relid);
 extern void logicalrep_worker_wakeup(Oid subid, Oid relid);
 extern void logicalrep_worker_wakeup_ptr(LogicalRepWorker *worker);
 
 extern int	logicalrep_sync_worker_count(Oid subid);
 
+extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid, char *originname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 2fa9bce..7802279 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -201,6 +201,27 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=postgres' PUBLICATION mypub
+       WITH (enabled = true, create_slot = false, copy_data = false);
+-- fail - ALTER SUBSCRIPTION with refresh is not allowed in a transaction
+-- block or function
+BEGIN;
+ALTER SUBSCRIPTION regress_testsub SET PUBLICATION mypub WITH (refresh = true);
+ERROR:  ALTER SUBSCRIPTION with refresh cannot run inside a transaction block
+END;
+BEGIN;
+ALTER SUBSCRIPTION regress_testsub REFRESH PUBLICATION;
+ERROR:  ALTER SUBSCRIPTION ... REFRESH cannot run inside a transaction block
+END;
+CREATE FUNCTION func() RETURNS VOID AS
+$$ ALTER SUBSCRIPTION regress_testsub SET PUBLICATION mypub WITH (refresh = true) $$ LANGUAGE SQL;
+SELECT func();
+ERROR:  ALTER SUBSCRIPTION with refresh cannot be executed from a function
+CONTEXT:  SQL function "func" statement 1
+ALTER SUBSCRIPTION regress_testsub DISABLE;
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+DROP FUNCTION func;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 14fa0b2..ca0d782 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -147,6 +147,28 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=postgres' PUBLICATION mypub
+       WITH (enabled = true, create_slot = false, copy_data = false);
+
+-- fail - ALTER SUBSCRIPTION with refresh is not allowed in a transaction
+-- block or function
+BEGIN;
+ALTER SUBSCRIPTION regress_testsub SET PUBLICATION mypub WITH (refresh = true);
+END;
+
+BEGIN;
+ALTER SUBSCRIPTION regress_testsub REFRESH PUBLICATION;
+END;
+
+CREATE FUNCTION func() RETURNS VOID AS
+$$ ALTER SUBSCRIPTION regress_testsub SET PUBLICATION mypub WITH (refresh = true) $$ LANGUAGE SQL;
+SELECT func();
+
+ALTER SUBSCRIPTION regress_testsub DISABLE;
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+DROP FUNCTION func;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..c792668 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 8;
 
 # Initialize publisher node
 my $node_publisher = get_new_node('publisher');
@@ -149,7 +149,26 @@ $result = $node_subscriber->safe_psql('postgres',
 is($result, qq(20),
 	'changes for table added after subscription initialized replicated');
 
+# clean up
+$node_publisher->safe_psql('postgres', "DROP TABLE tab_rep_next");
+$node_subscriber->safe_psql('postgres', "DROP TABLE tab_rep_next");
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 
+# Table tab_rep already has the same records on both publisher and subscriber
+# at this time. Recreate the subscription, which will do the initial copy of
+# the table again and fail due to a unique constraint violation.
+$node_subscriber->safe_psql('postgres',
+	 "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub");
+
+$result = $node_subscriber->poll_query_until('postgres', $started_query)
+    or die "Timed out while waiting for subscriber to start sync";
+
+# DROP SUBSCRIPTION must clean up slots on the publisher side when the
+# subscriber is stuck on data copy because of a constraint violation.
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'DROP SUBSCRIPTION during error can clean up the slots on the publisher');
+
 $node_subscriber->stop('fast');
 $node_publisher->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d540fe..bab4f3a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2397,7 +2397,6 @@ StdAnalyzeData
 StdRdOptions
 Step
 StopList
-StopWorkersData
 StrategyNumber
 StreamCtl
 StreamXidHash
@@ -2408,6 +2407,7 @@ SubLink
 SubLinkType
 SubPlan
 SubPlanState
+SubRemoveRels
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
-- 
1.8.3.1

v38-0003-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From 6d92d84a68697062b0a91b90a9cb6c2bdd159608 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Feb 2021 17:56:54 +1100
Subject: [PATCH v38] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replication origin replay progress
for 2PC, but it was incomplete. It did not properly track the progress for
ROLLBACK PREPARED; in particular, it did not update the recovery code.
Additionally, we need to allow tracking it on subscriber nodes, where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index fc18b77..609cf02 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2277,6 +2277,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2299,6 +2307,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3c8b4eb..7f5e678 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5709,8 +5709,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5916,7 +5915,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5965,6 +5965,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6006,7 +6013,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6014,7 +6022,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1

v38-0006-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 718098b9bd4403d2849dbd1d8b73fa4ab7d8b177 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Feb 2021 19:54:38 +1100
Subject: [PATCH v38] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)
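
The option handling in this patch follows one pattern in two places: reject a repeated "two_phase" option, default to off, and then allow it only when the protocol version and the decoding context both support it. The following is a rough standalone sketch of that decision, not the patch's code. Option, TwoPhaseResult, decide_twophase and PROTO_2PC_VERSION are hypothetical names, and error reporting via ereport() is replaced by return codes. Unlike pgoutput, which rejects unrecognized options, the sketch simply skips them.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for one entry of the options list. */
typedef struct Option
{
	const char *name;
	bool		value;
} Option;

#define PROTO_2PC_VERSION 2		/* mirrors LOGICALREP_PROTO_2PC_VERSION_NUM */

typedef enum
{
	TWOPHASE_OFF,				/* not requested: decode at COMMIT as before */
	TWOPHASE_ON,				/* requested and supported */
	TWOPHASE_ERR_DUPLICATE,		/* "conflicting or redundant options" */
	TWOPHASE_ERR_PROTOCOL,		/* proto_version too old */
	TWOPHASE_ERR_PLUGIN			/* decoding context cannot do 2PC */
} TwoPhaseResult;

static TwoPhaseResult
decide_twophase(const Option *opts, int nopts,
				int proto_version, bool plugin_supports_2pc)
{
	bool		given = false;
	bool		enable = false; /* default is off */

	assert(nopts == 0 || opts != NULL);

	for (int i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "two_phase") != 0)
			continue;			/* other options are out of scope here */
		if (given)
			return TWOPHASE_ERR_DUPLICATE;
		given = true;
		enable = opts[i].value;
	}

	if (!enable)
		return TWOPHASE_OFF;
	if (proto_version < PROTO_2PC_VERSION)
		return TWOPHASE_ERR_PROTOCOL;
	if (!plugin_supports_2pc)
		return TWOPHASE_ERR_PLUGIN;
	return TWOPHASE_ON;
}
```

The order of the checks matters: an explicit two_phase = off never fails, even against an old protocol version, because the version and plugin checks run only once the option is actually enabled.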

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index bcb0acf..7610ab2 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -184,8 +184,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 750ec2a..ceeae36 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd..55dd8da 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1167,7 +1167,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 7996f84..2f56fab 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -358,6 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +399,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +468,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -823,6 +842,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -833,7 +854,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -872,6 +894,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -890,7 +919,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -935,7 +965,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -980,7 +1011,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 7714696..c602c3e 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e01d02e..6146b77 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2787,6 +2787,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3433,6 +3434,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c33ea25..3cd228d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default,
+		 * in which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d99b61e..c16811f 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 1290f96..07c3ad8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 40417e6..8e94a26 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..41e0d8c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 7802279..c2fa9fc 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -222,6 +222,29 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index ca0d782..1da95a4 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -169,6 +169,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

v38-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From 8f2466cb5d90f6f763913500936dfb162845808d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Feb 2021 21:14:33 +1100
Subject: [PATCH v38] Support 2PC txn tests for concurrent aborts.

Add tap tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  58 ++++++++++
 src/backend/replication/logical/reorderbuffer.c   |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c5e28ce..e0cd841 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -10,6 +10,8 @@ ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 929255e..3fa172a 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -174,6 +177,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -275,6 +279,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -471,6 +493,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -706,6 +755,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -918,6 +970,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -971,6 +1026,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5a62ab8..4a4a9ed 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2489,6 +2489,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
1.8.3.1
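
A side note on the "check-xid-aborted" parsing in pg_decode_startup() above: it passes NULL as the end pointer to strtoul(), so a value such as "735abc" would be accepted as 735. The following is only a standalone sketch of a stricter check, not part of the patch. The TransactionId typedef and InvalidTransactionId macro are local stand-ins for the PostgreSQL definitions, and parse_xid_option() is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Local stand-ins for the PostgreSQL definitions (assumptions for this sketch). */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/*
 * Hypothetical helper: parse an xid option string.
 * Returns true and stores the value in *xid only if the whole string was
 * consumed, no range error occurred, the value fits in 32 bits, and the
 * value is not InvalidTransactionId.
 */
static bool
parse_xid_option(const char *str, TransactionId *xid)
{
	char	   *end;
	unsigned long val;

	assert(str != NULL && xid != NULL);

	errno = 0;
	val = strtoul(str, &end, 0);

	/* Reject range errors, empty input, and trailing characters. */
	if (errno != 0 || end == str || *end != '\0')
		return false;

	/* Reject values that do not fit in a 32-bit xid. */
	if (val > UINT32_MAX)
		return false;

	if ((TransactionId) val == InvalidTransactionId)
		return false;

	*xid = (TransactionId) val;
	return true;
}
```

In the plugin, a false return would map onto the existing ereport(ERROR, ...) branch; since this option exists only for tests, the looser check in the patch may well be considered good enough.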

#186Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#185)

On Wed, Feb 10, 2021 at 3:59 PM Peter Smith <smithpb2250@gmail.com> wrote:

PSA the new patch set v38*.

This patch set has been rebased to use the most recent tablesync patch
from other thread [1]
(i.e. notice that v38-0001 is an exact copy of that thread's tablesync
patch v31)

I see one problem which might lead to prepared xacts being skipped for
some of the subscriptions. The problem is that we skip prepared
xacts based on GID, and the same prepared transaction can arrive on the
subscriber via different subscriptions. And even if we hadn't
skipped the prepared xact, it would have led to an error "transaction
identifier "p1" is already in use". See the scenario below:

On Publisher:
===========
CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
CREATE TABLE mytbl2(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
postgres=# BEGIN;
BEGIN
postgres=*# INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT 0 1
postgres=*# INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
INSERT 0 1
postgres=*# COMMIT;
COMMIT
postgres=# BEGIN;
BEGIN
postgres=*# INSERT INTO mytbl2(somedata, text) VALUES (1, 1);
INSERT 0 1
postgres=*# INSERT INTO mytbl2(somedata, text) VALUES (1, 2);
INSERT 0 1
postgres=*# Commit;
COMMIT
postgres=# CREATE PUBLICATION mypub1 FOR TABLE mytbl1;
CREATE PUBLICATION
postgres=# CREATE PUBLICATION mypub2 FOR TABLE mytbl2;
CREATE PUBLICATION

On Subscriber:
============
CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
CREATE TABLE mytbl2(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
postgres=# CREATE SUBSCRIPTION mysub1
postgres-# CONNECTION 'host=localhost port=5432 dbname=postgres'
postgres-# PUBLICATION mypub1;
NOTICE: created replication slot "mysub1" on publisher
CREATE SUBSCRIPTION
postgres=# CREATE SUBSCRIPTION mysub2
postgres-# CONNECTION 'host=localhost port=5432 dbname=postgres'
postgres-# PUBLICATION mypub2;
NOTICE: created replication slot "mysub2" on publisher
CREATE SUBSCRIPTION

On Publisher:
============
postgres=# Begin;
BEGIN
postgres=*# INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
INSERT 0 1
postgres=*# INSERT INTO mytbl2(somedata, text) VALUES (1, 3);
INSERT 0 1
postgres=*# Prepare Transaction 'myprep1';

After this step, wait for a few seconds and then perform Commit Prepared
'myprep1'; on the Publisher, and you will notice the following error in the
subscriber log: "ERROR: prepared transaction with identifier
"myprep1" does not exist"

One idea to avoid this is to use the subscription_id along with the GID
on the subscriber for prepared xacts. Let me know if you have any better
ideas to handle this.
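
The idea of combining the subscription id with the GID can be sketched in standalone C as below. This is only an illustration of the naming scheme, not code from any patch in this thread: the SubscriptionId typedef, the make_subscriber_gid() helper, and the "sub_<id>_<gid>" format are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the subscription's id type (assumption for this sketch). */
typedef uint32_t SubscriptionId;

/*
 * Hypothetical helper: build a subscriber-local GID from the subscription
 * id and the GID received from the publisher, so that two subscriptions
 * receiving the same publisher GID prepare under different identifiers.
 * Returns 0 on success, -1 if the GID is empty or the result does not fit.
 */
static int
make_subscriber_gid(SubscriptionId subid, const char *publisher_gid,
					char *buf, size_t buflen)
{
	int			n;

	assert(publisher_gid != NULL && buf != NULL && buflen > 0);

	if (strlen(publisher_gid) == 0)
		return -1;

	n = snprintf(buf, buflen, "sub_%u_%s",
				 (unsigned int) subid, publisher_gid);

	/* snprintf reports the untruncated length; treat truncation as failure. */
	if (n < 0 || (size_t) n >= buflen)
		return -1;

	return 0;
}
```

With this scheme, mysub1 and mysub2 in the scenario above would prepare the same publisher transaction 'myprep1' under two distinct local identifiers, so neither the GID-based skip nor the "already in use" error would be triggered by the other subscription. The same mapping would have to be applied when handling COMMIT PREPARED and ROLLBACK PREPARED.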

A few other minor comments on
v38-0004-Add-support-for-apply-at-prepare-time-to-built-i:
======================================================================
1.
- * Mark the prepared transaction as valid.  As soon as xact.c marks
- * MyProc as not running our XID (which it will do immediately after
- * this function returns), others can commit/rollback the xact.
+ * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+ * as not running our XID (which it will do immediately after this
+ * function returns), others can commit/rollback the xact.

Why is this change in this patch? Is it due to pgindent? If so, you need
to exclude this change.

2.
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,

pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);

- /* send the flags field (unused for now) */
+ /* send the flags field */
  pq_sendbyte(out, flags);

Is there a reason to change the above comment?

--
With Regards,
Amit Kapila.

#187Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#186)
7 attachment(s)

On Thu, Feb 11, 2021 at 12:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

A few other minor comments on
v38-0004-Add-support-for-apply-at-prepare-time-to-built-i:
======================================================================
1.
- * Mark the prepared transaction as valid.  As soon as xact.c marks
- * MyProc as not running our XID (which it will do immediately after
- * this function returns), others can commit/rollback the xact.
+ * Mark the prepared transaction as valid.  As soon as xact.c marks MyProc
+ * as not running our XID (which it will do immediately after this
+ * function returns), others can commit/rollback the xact.

Why is this change in this patch? Is it due to pgindent? If so, you need
to exclude this change.

Fixed in V39.

2.
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,

pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT);

- /* send the flags field (unused for now) */
+ /* send the flags field */
pq_sendbyte(out, flags);

Is there a reason to change the above comment?

Fixed in V39.

----------

Please find attached the new 2PC patch set v39*

This addresses the recent feedback comments (see above).

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v39-0005-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 8b7d45e9e97eccd7dbbaeaef705a901bdd390735 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 11 Feb 2021 18:20:26 +1100
Subject: [PATCH v39] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and not streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are not visible on the subscriber after rollback');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test (streaming = on)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
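
A note for anyone double-checking the expected counts in the streaming tests above: the qq(3334|3334|3334) result follows from replaying the test DML against the two pre-existing rows. The sketch below is only an illustration and is not part of the patch; expected_rows is a hypothetical helper, and it assumes the table starts with keys 1 and 2 as in the test setup.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the patch): replay the test's DML on an
# in-memory set of keys and return the number of surviving rows.
sub expected_rows
{
	my ($initial_rows, $first, $last) = @_;

	# Pre-existing rows have keys 1 .. $initial_rows (a = 1, 2 in the tests).
	my %rows = map { $_ => 1 } (1 .. $initial_rows);

	# INSERT INTO test_tab SELECT i, ... FROM generate_series($first, $last)
	$rows{$_} = 1 for ($first .. $last);

	# UPDATE ... WHERE mod(a,2) = 0 only rewrites column b, so the row count
	# is unchanged and nothing needs to be modelled here.

	# DELETE FROM test_tab WHERE mod(a,3) = 0
	my @doomed = grep { $_ % 3 == 0 } keys %rows;
	delete $rows{$_} for @doomed;

	return scalar keys %rows;
}

# 5000 keys in total, minus the 1666 multiples of 3 up to 4998.
print expected_rows(2, 3, 5000), "\n";    # prints 3334
```

The same reasoning explains the ROLLBACK PREPARED case (only the 2 initial rows remain) and the SAVEPOINT case (2 initial rows plus the single row inserted before the savepoint, so 3).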

Attachment: v39-0001-Tablesync-V31.patch (application/octet-stream)
From 4ace9858ffab9dcc8eb57ae63421abef4a276695 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Feb 2021 17:35:42 +1100
Subject: [PATCH v39] Tablesync V31

DO NOT COMMIT THIS CODE.

This is v31 of the tablesync patch. Please see [1] for the latest version of this patch to be committed.

[1] https://www.postgresql.org/message-id/flat/CAA4eK1JHBqwtGcdjRnaCoD%2B_1G87pFnVw3AJjyBGx%2BYTN%3DuZTg%40mail.gmail.com#55fd22d19c95080996c99393c1cb2fad
---
 doc/src/sgml/catalogs.sgml                         |   1 +
 doc/src/sgml/logical-replication.sgml              |  59 ++-
 doc/src/sgml/ref/alter_subscription.sgml           |  18 +
 doc/src/sgml/ref/drop_subscription.sgml            |   6 +-
 src/backend/access/transam/xact.c                  |  11 -
 src/backend/catalog/pg_subscription.c              |  39 ++
 src/backend/commands/subscriptioncmds.c            | 467 ++++++++++++++++-----
 .../libpqwalreceiver/libpqwalreceiver.c            |   8 +
 src/backend/replication/logical/launcher.c         | 147 -------
 src/backend/replication/logical/tablesync.c        | 236 +++++++++--
 src/backend/replication/logical/worker.c           |  18 +-
 src/backend/tcop/utility.c                         |   3 +-
 src/include/catalog/pg_subscription_rel.h          |   2 +
 src/include/commands/subscriptioncmds.h            |   2 +-
 src/include/replication/logicallauncher.h          |   2 -
 src/include/replication/slot.h                     |   3 +
 src/include/replication/walreceiver.h              |   1 +
 src/include/replication/worker_internal.h          |   3 +-
 src/test/regress/expected/subscription.out         |  21 +
 src/test/regress/sql/subscription.sql              |  22 +
 src/test/subscription/t/004_sync.pl                |  21 +-
 src/tools/pgindent/typedefs.list                   |   2 +-
 22 files changed, 767 insertions(+), 325 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index ea222c0..692ad65 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7673,6 +7673,7 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
        State code:
        <literal>i</literal> = initialize,
        <literal>d</literal> = data is being copied,
+       <literal>f</literal> = finished table copy,
        <literal>s</literal> = synchronized,
        <literal>r</literal> = ready (normal replication)
       </para></entry>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index a560ad6..d0742f2 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -186,9 +186,10 @@
 
   <para>
    Each subscription will receive changes via one replication slot (see
-   <xref linkend="streaming-replication-slots"/>).  Additional temporary
-   replication slots may be required for the initial data synchronization
-   of pre-existing table data.
+   <xref linkend="streaming-replication-slots"/>).  Additional replication
+   slots may be required for the initial data synchronization of
+   pre-existing table data and those will be dropped at the end of data
+   synchronization.
   </para>
 
   <para>
@@ -248,13 +249,23 @@
 
    <para>
     As mentioned earlier, each (active) subscription receives changes from a
-    replication slot on the remote (publishing) side.  Normally, the remote
-    replication slot is created automatically when the subscription is created
-    using <command>CREATE SUBSCRIPTION</command> and it is dropped
-    automatically when the subscription is dropped using <command>DROP
-    SUBSCRIPTION</command>.  In some situations, however, it can be useful or
-    necessary to manipulate the subscription and the underlying replication
-    slot separately.  Here are some scenarios:
+    replication slot on the remote (publishing) side.
+   </para>
+   <para>
+    Additional table synchronization slots are normally transient, created
+    internally to perform initial table synchronization and dropped
+    automatically when they are no longer needed. These table synchronization
+    slots have generated names: <quote><literal>pg_%u_sync_%u_%llu</literal></quote>
+    (parameters: Subscription <parameter>oid</parameter>,
+    Table <parameter>relid</parameter>, system identifier <parameter>sysid</parameter>).
+   </para>
+   <para>
+    Normally, the remote replication slot is created automatically when the
+    subscription is created using <command>CREATE SUBSCRIPTION</command> and it
+    is dropped automatically when the subscription is dropped using
+    <command>DROP SUBSCRIPTION</command>.  In some situations, however, it can
+    be useful or necessary to manipulate the subscription and the underlying
+    replication slot separately.  Here are some scenarios:
 
     <itemizedlist>
      <listitem>
@@ -294,8 +305,9 @@
        using <command>ALTER SUBSCRIPTION</command> before attempting to drop
        the subscription.  If the remote database instance no longer exists, no
        further action is then necessary.  If, however, the remote database
-       instance is just unreachable, the replication slot should then be
-       dropped manually; otherwise it would continue to reserve WAL and might
+       instance is just unreachable, the replication slot (and any
+       remaining table synchronization slots) should then be
+       dropped manually; otherwise they would continue to reserve WAL and might
        eventually cause the disk to fill up.  Such cases should be carefully
        investigated.
       </para>
@@ -468,16 +480,19 @@
   <sect2 id="logical-replication-snapshot">
     <title>Initial Snapshot</title>
     <para>
-      The initial data in existing subscribed tables are snapshotted and
-      copied in a parallel instance of a special kind of apply process.
-      This process will create its own temporary replication slot and
-      copy the existing data. Once existing data is copied, the worker
-      enters synchronization mode, which ensures that the table is brought
-      up to a synchronized state with the main apply process by streaming
-      any changes that happened during the initial data copy using standard
-      logical replication. Once the synchronization is done, the control
-      of the replication of the table is given back to the main apply
-      process where the replication continues as normal.
+     The initial data in existing subscribed tables are snapshotted and
+     copied in a parallel instance of a special kind of apply process.
+     This process will create its own replication slot and copy the existing
+     data.  As soon as the copy is finished the table contents will become
+     visible to other backends.  Once existing data is copied, the worker
+     enters synchronization mode, which ensures that the table is brought
+     up to a synchronized state with the main apply process by streaming
+     any changes that happened during the initial data copy using standard
+     logical replication.  During this synchronization phase, the changes
+     are applied and committed in the same order as they happened on the
+     publisher.  Once the synchronization is done, the control of the
+     replication of the table is given back to the main apply process where
+     the replication continues as normal.
     </para>
   </sect2>
  </sect1>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index db5e59f..bcb0acf 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -48,6 +48,24 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    (Currently, all subscription owners must be superusers, so the owner checks
    will be bypassed in practice.  But this might change in the future.)
   </para>
+
+  <para>
+   When refreshing a publication, the relations that are no longer part of
+   the publication are removed, along with their tablesync slots, if there
+   are any. The tablesync slots must be removed so that the resources
+   allocated for the subscription on the remote host are released. If, due to
+   a network breakdown or some other error, <productname>PostgreSQL</productname>
+   is unable to remove the slots, an error will be reported. To proceed in this
+   situation, the user must either retry the operation or disassociate the
+   slot from the subscription and drop the subscription as explained in
+   <xref linkend="sql-dropsubscription"/>.
+  </para>
+
+  <para>
+   The commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
+   <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with the refresh
+   option set to true cannot be executed inside a transaction block.
+  </para>
  </refsect1>
 
  <refsect1>
diff --git a/doc/src/sgml/ref/drop_subscription.sgml b/doc/src/sgml/ref/drop_subscription.sgml
index adbdeaf..aee9615 100644
--- a/doc/src/sgml/ref/drop_subscription.sgml
+++ b/doc/src/sgml/ref/drop_subscription.sgml
@@ -79,7 +79,8 @@ DROP SUBSCRIPTION [ IF EXISTS ] <replaceable class="parameter">name</replaceable
   <para>
    When dropping a subscription that is associated with a replication slot on
    the remote host (the normal state), <command>DROP SUBSCRIPTION</command>
-   will connect to the remote host and try to drop the replication slot as
+   will connect to the remote host and try to drop the replication slot (and
+   any remaining table synchronization slots) as
    part of its operation.  This is necessary so that the resources allocated
    for the subscription on the remote host are released.  If this fails,
    either because the remote host is not reachable or because the remote
@@ -89,7 +90,8 @@ DROP SUBSCRIPTION [ IF EXISTS ] <replaceable class="parameter">name</replaceable
    executing <literal>ALTER SUBSCRIPTION ... SET (slot_name = NONE)</literal>.
    After that, <command>DROP SUBSCRIPTION</command> will no longer attempt any
    actions on a remote host.  Note that if the remote replication slot still
-   exists, it should then be dropped manually; otherwise it will continue to
+   exists, it (and any related table synchronization slots) should then be
+   dropped manually; otherwise it/they will continue to
    reserve WAL and might eventually cause the disk to fill up.  See
    also <xref linkend="logical-replication-subscription-slot"/>.
   </para>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a2068e3..3c8b4eb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2432,15 +2432,6 @@ PrepareTransaction(void)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot PREPARE a transaction that has exported snapshots")));
 
-	/*
-	 * Don't allow PREPARE but for transaction that has/might kill logical
-	 * replication workers.
-	 */
-	if (XactManipulatesLogicalReplicationWorkers())
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("cannot PREPARE a transaction that has manipulated logical replication workers")));
-
 	/* Prevent cancel/die interrupt while cleaning up */
 	HOLD_INTERRUPTS();
 
@@ -4899,7 +4890,6 @@ CommitSubTransaction(void)
 	AtEOSubXact_HashTables(true, s->nestingLevel);
 	AtEOSubXact_PgStat(true, s->nestingLevel);
 	AtSubCommit_Snapshot(s->nestingLevel);
-	AtEOSubXact_ApplyLauncher(true, s->nestingLevel);
 
 	/*
 	 * We need to restore the upper transaction's read-only state, in case the
@@ -5059,7 +5049,6 @@ AbortSubTransaction(void)
 		AtEOSubXact_HashTables(false, s->nestingLevel);
 		AtEOSubXact_PgStat(false, s->nestingLevel);
 		AtSubAbort_Snapshot(s->nestingLevel);
-		AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
 	}
 
 	/*
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 44cb285..750ec2a 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -29,6 +29,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
 #include "utils/pg_lsn.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
@@ -337,6 +338,13 @@ GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn)
 	char		substate;
 	bool		isnull;
 	Datum		d;
+	Relation	rel;
+
+	/*
+	 * This is to avoid the race condition with AlterSubscription which tries
+	 * to remove this relstate.
+	 */
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
 
 	/* Try finding the mapping. */
 	tup = SearchSysCache2(SUBSCRIPTIONRELMAP,
@@ -363,6 +371,8 @@ GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn)
 	/* Cleanup */
 	ReleaseSysCache(tup);
 
+	table_close(rel, AccessShareLock);
+
 	return substate;
 }
 
@@ -403,6 +413,35 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	scan = table_beginscan_catalog(rel, nkeys, skey);
 	while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
 	{
+		Form_pg_subscription_rel subrel;
+
+		subrel = (Form_pg_subscription_rel) GETSTRUCT(tup);
+
+		/*
+		 * We don't allow dropping the relation mapping while table
+		 * synchronization is in progress unless the caller updates the
+		 * corresponding subscription as well. This is to ensure that we don't
+		 * leave tablesync slots or origins in the system when the
+		 * corresponding table is dropped.
+		 */
+		if (!OidIsValid(subid) && subrel->srsubstate != SUBREL_STATE_READY)
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("could not drop relation mapping for subscription \"%s\"",
+							get_subscription_name(subrel->srsubid, false)),
+					 errdetail("Table synchronization for relation \"%s\" is in progress and is in state \"%c\".",
+							   get_rel_name(relid), subrel->srsubstate),
+
+			/*
+			 * translator: first %s is a SQL ALTER command and second %s is a
+			 * SQL DROP command
+			 */
+					 errhint("Use %s to enable subscription if not already enabled or use %s to drop the subscription.",
+							 "ALTER SUBSCRIPTION ... ENABLE",
+							 "DROP SUBSCRIPTION ...")));
+		}
+
 		CatalogTupleDelete(rel, &tup->t_self);
 	}
 	table_endscan(scan);
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5ccbc9d..7996f84 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -34,6 +34,7 @@
 #include "nodes/makefuncs.h"
 #include "replication/logicallauncher.h"
 #include "replication/origin.h"
+#include "replication/slot.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "replication/worker_internal.h"
@@ -46,6 +47,8 @@
 #include "utils/syscache.h"
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
+static void ReportSlotConnectionError(List *rstates, Oid subid, char *slotname, char *err);
+
 
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
@@ -566,107 +569,207 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
 	Oid		   *pubrel_local_oids;
 	ListCell   *lc;
 	int			off;
+	int			remove_rel_len;
+	Relation	rel = NULL;
+	typedef struct SubRemoveRels
+	{
+		Oid			relid;
+		char		state;
+	} SubRemoveRels;
+	SubRemoveRels *sub_remove_rels;
 
 	/* Load the library providing us libpq calls. */
 	load_file("libpqwalreceiver", false);
 
-	/* Try to connect to the publisher. */
-	wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);
-	if (!wrconn)
-		ereport(ERROR,
-				(errmsg("could not connect to the publisher: %s", err)));
-
-	/* Get the table list from publisher. */
-	pubrel_names = fetch_table_list(wrconn, sub->publications);
+	PG_TRY();
+	{
+		/* Try to connect to the publisher. */
+		wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);
+		if (!wrconn)
+			ereport(ERROR,
+					(errmsg("could not connect to the publisher: %s", err)));
 
-	/* We are done with the remote side, close connection. */
-	walrcv_disconnect(wrconn);
+		/* Get the table list from publisher. */
+		pubrel_names = fetch_table_list(wrconn, sub->publications);
 
-	/* Get local table list. */
-	subrel_states = GetSubscriptionRelations(sub->oid);
+		/* Get local table list. */
+		subrel_states = GetSubscriptionRelations(sub->oid);
 
-	/*
-	 * Build qsorted array of local table oids for faster lookup. This can
-	 * potentially contain all tables in the database so speed of lookup is
-	 * important.
-	 */
-	subrel_local_oids = palloc(list_length(subrel_states) * sizeof(Oid));
-	off = 0;
-	foreach(lc, subrel_states)
-	{
-		SubscriptionRelState *relstate = (SubscriptionRelState *) lfirst(lc);
+		/*
+		 * Build qsorted array of local table oids for faster lookup. This can
+		 * potentially contain all tables in the database so speed of lookup
+		 * is important.
+		 */
+		subrel_local_oids = palloc(list_length(subrel_states) * sizeof(Oid));
+		off = 0;
+		foreach(lc, subrel_states)
+		{
+			SubscriptionRelState *relstate = (SubscriptionRelState *) lfirst(lc);
 
-		subrel_local_oids[off++] = relstate->relid;
-	}
-	qsort(subrel_local_oids, list_length(subrel_states),
-		  sizeof(Oid), oid_cmp);
+			subrel_local_oids[off++] = relstate->relid;
+		}
+		qsort(subrel_local_oids, list_length(subrel_states),
+			  sizeof(Oid), oid_cmp);
+
+		/*
+		 * Rels that we want to remove from subscription and drop any slots
+		 * and origins corresponding to them.
+		 */
+		sub_remove_rels = palloc(list_length(subrel_states) * sizeof(SubRemoveRels));
+
+		/*
+		 * Walk over the remote tables and try to match them to locally known
+		 * tables. If the table is not known locally create a new state for
+		 * it.
+		 *
+		 * Also builds array of local oids of remote tables for the next step.
+		 */
+		off = 0;
+		pubrel_local_oids = palloc(list_length(pubrel_names) * sizeof(Oid));
+
+		foreach(lc, pubrel_names)
+		{
+			RangeVar   *rv = (RangeVar *) lfirst(lc);
+			Oid			relid;
 
-	/*
-	 * Walk over the remote tables and try to match them to locally known
-	 * tables. If the table is not known locally create a new state for it.
-	 *
-	 * Also builds array of local oids of remote tables for the next step.
-	 */
-	off = 0;
-	pubrel_local_oids = palloc(list_length(pubrel_names) * sizeof(Oid));
+			relid = RangeVarGetRelid(rv, AccessShareLock, false);
 
-	foreach(lc, pubrel_names)
-	{
-		RangeVar   *rv = (RangeVar *) lfirst(lc);
-		Oid			relid;
+			/* Check for supported relkind. */
+			CheckSubscriptionRelkind(get_rel_relkind(relid),
+									 rv->schemaname, rv->relname);
 
-		relid = RangeVarGetRelid(rv, AccessShareLock, false);
+			pubrel_local_oids[off++] = relid;
 
-		/* Check for supported relkind. */
-		CheckSubscriptionRelkind(get_rel_relkind(relid),
-								 rv->schemaname, rv->relname);
+			if (!bsearch(&relid, subrel_local_oids,
+						 list_length(subrel_states), sizeof(Oid), oid_cmp))
+			{
+				AddSubscriptionRelState(sub->oid, relid,
+										copy_data ? SUBREL_STATE_INIT : SUBREL_STATE_READY,
+										InvalidXLogRecPtr);
+				ereport(DEBUG1,
+						(errmsg("table \"%s.%s\" added to subscription \"%s\"",
+								rv->schemaname, rv->relname, sub->name)));
+			}
+		}
 
-		pubrel_local_oids[off++] = relid;
+		/*
+		 * Next remove state for tables we should not care about anymore using
+		 * the data we collected above
+		 */
+		qsort(pubrel_local_oids, list_length(pubrel_names),
+			  sizeof(Oid), oid_cmp);
 
-		if (!bsearch(&relid, subrel_local_oids,
-					 list_length(subrel_states), sizeof(Oid), oid_cmp))
+		remove_rel_len = 0;
+		for (off = 0; off < list_length(subrel_states); off++)
 		{
-			AddSubscriptionRelState(sub->oid, relid,
-									copy_data ? SUBREL_STATE_INIT : SUBREL_STATE_READY,
-									InvalidXLogRecPtr);
-			ereport(DEBUG1,
-					(errmsg("table \"%s.%s\" added to subscription \"%s\"",
-							rv->schemaname, rv->relname, sub->name)));
-		}
-	}
+			Oid			relid = subrel_local_oids[off];
 
-	/*
-	 * Next remove state for tables we should not care about anymore using the
-	 * data we collected above
-	 */
-	qsort(pubrel_local_oids, list_length(pubrel_names),
-		  sizeof(Oid), oid_cmp);
+			if (!bsearch(&relid, pubrel_local_oids,
+						 list_length(pubrel_names), sizeof(Oid), oid_cmp))
+			{
+				char		state;
+				XLogRecPtr	statelsn;
+
+				/*
+				 * Lock pg_subscription_rel with AccessExclusiveLock to
+				 * prevent any race conditions with the apply worker
+				 * re-launching workers at the same time this code is trying
+				 * to remove those tables.
+				 *
+				 * Even if new worker for this particular rel is restarted it
+				 * won't be able to make any progress as we hold exclusive
+				 * lock on subscription_rel till the transaction end. It will
+				 * simply exit as there is no corresponding rel entry.
+				 *
+				 * This locking also ensures that the state of rels won't
+				 * change till we are done with this refresh operation.
+				 */
+				if (!rel)
+					rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+
+				/* Last known rel state. */
+				state = GetSubscriptionRelState(sub->oid, relid, &statelsn);
+
+				sub_remove_rels[remove_rel_len].relid = relid;
+				sub_remove_rels[remove_rel_len++].state = state;
+
+				RemoveSubscriptionRel(sub->oid, relid);
+
+				logicalrep_worker_stop(sub->oid, relid);
+
+				/*
+				 * For READY state, we would have already dropped the
+				 * tablesync origin.
+				 */
+				if (state != SUBREL_STATE_READY)
+				{
+					char		originname[NAMEDATALEN];
+
+					/*
+					 * Drop the tablesync's origin tracking if it exists.
+					 *
+					 * It is possible that the origin is not yet created for the
+					 * tablesync worker; this can happen for the states before
+					 * SUBREL_STATE_FINISHEDCOPY. The apply worker can also
+					 * concurrently try to drop the origin, so by this time
+					 * the origin might already be removed. For these reasons,
+					 * we pass missing_ok = true.
+					 */
+					ReplicationOriginNameForTablesync(sub->oid, relid, originname);
+					replorigin_drop_by_name(originname, true, false);
+				}
 
-	for (off = 0; off < list_length(subrel_states); off++)
-	{
-		Oid			relid = subrel_local_oids[off];
+				ereport(DEBUG1,
+						(errmsg("table \"%s.%s\" removed from subscription \"%s\"",
+								get_namespace_name(get_rel_namespace(relid)),
+								get_rel_name(relid),
+								sub->name)));
+			}
+		}
 
-		if (!bsearch(&relid, pubrel_local_oids,
-					 list_length(pubrel_names), sizeof(Oid), oid_cmp))
+		/*
+		 * Drop the tablesync slots associated with removed tables. This has
+		 * to be at the end because otherwise if there is an error while doing
+		 * the database operations we won't be able to rollback dropped slots.
+		 */
+		for (off = 0; off < remove_rel_len; off++)
 		{
-			RemoveSubscriptionRel(sub->oid, relid);
-
-			logicalrep_worker_stop_at_commit(sub->oid, relid);
-
-			ereport(DEBUG1,
-					(errmsg("table \"%s.%s\" removed from subscription \"%s\"",
-							get_namespace_name(get_rel_namespace(relid)),
-							get_rel_name(relid),
-							sub->name)));
+			if (sub_remove_rels[off].state != SUBREL_STATE_READY &&
+				sub_remove_rels[off].state != SUBREL_STATE_SYNCDONE)
+			{
+				char		syncslotname[NAMEDATALEN] = {0};
+
+				/*
+				 * For READY/SYNCDONE states we know the tablesync slot has
+				 * already been dropped by the tablesync worker.
+				 *
+				 * For other states, there is no certainty, maybe the slot
+				 * does not exist yet. Also, if we fail after removing some of
+				 * the slots, next time, it will again try to drop already
+				 * dropped slots and fail. For these reasons, we allow
+				 * missing_ok = true for the drop.
+				 */
+				ReplicationSlotNameForTablesync(sub->oid, sub_remove_rels[off].relid, syncslotname);
+				ReplicationSlotDropAtPubNode(wrconn, syncslotname, true);
+			}
 		}
 	}
+	PG_FINALLY();
+	{
+		if (wrconn)
+			walrcv_disconnect(wrconn);
+	}
+	PG_END_TRY();
+
+	if (rel)
+		table_close(rel, NoLock);
 }
 
 /*
  * Alter the existing subscription.
  */
 ObjectAddress
-AlterSubscription(AlterSubscriptionStmt *stmt)
+AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 {
 	Relation	rel;
 	ObjectAddress myself;
@@ -848,6 +951,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
+
 					/* Make sure refresh sees the new list of publications. */
 					sub->publications = stmt->publication;
 
@@ -877,6 +982,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 										   NULL, NULL,	/* no "binary" */
 										   NULL, NULL); /* no "streaming" */
 
+				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
+
 				AlterSubscription_refresh(sub, copy_data);
 
 				break;
@@ -927,8 +1034,8 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 	char		originname[NAMEDATALEN];
 	char	   *err = NULL;
 	WalReceiverConn *wrconn = NULL;
-	StringInfoData cmd;
 	Form_pg_subscription form;
+	List	   *rstates;
 
 	/*
 	 * Lock pg_subscription with AccessExclusiveLock to ensure that the
@@ -1041,6 +1148,36 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 	}
 	list_free(subworkers);
 
+	/*
+	 * Cleanup of tablesync replication origins.
+	 *
+	 * Any READY-state relations would already have dealt with clean-ups.
+	 *
+	 * Note that the state can't change because we have already stopped both
+	 * the apply and tablesync workers and they can't restart because of
+	 * exclusive lock on the subscription.
+	 */
+	rstates = GetSubscriptionNotReadyRelations(subid);
+	foreach(lc, rstates)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+		Oid			relid = rstate->relid;
+
+		/* Only cleanup resources of tablesync workers */
+		if (!OidIsValid(relid))
+			continue;
+
+		/*
+		 * Drop the tablesync's origin tracking if it exists.
+		 *
+		 * It is possible that the origin is not yet created for tablesync
+		 * worker so passing missing_ok = true. This can happen for the states
+		 * before SUBREL_STATE_FINISHEDCOPY.
+		 */
+		ReplicationOriginNameForTablesync(subid, relid, originname);
+		replorigin_drop_by_name(originname, true, false);
+	}
+
 	/* Clean up dependencies */
 	deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
 
@@ -1055,30 +1192,110 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 	 * If there is no slot associated with the subscription, we can finish
 	 * here.
 	 */
-	if (!slotname)
+	if (!slotname && rstates == NIL)
 	{
 		table_close(rel, NoLock);
 		return;
 	}
 
 	/*
-	 * Otherwise drop the replication slot at the publisher node using the
-	 * replication connection.
+	 * Try to acquire the connection necessary for dropping slots.
+	 *
+	 * Note: If the slotname is NONE/NULL then we allow the command to finish
+	 * and users need to manually clean up the apply and tablesync worker slots
+	 * later.
+	 *
+	 * This has to be at the end because otherwise if there is an error while
+	 * doing the database operations we won't be able to roll back dropped
+	 * slots.
 	 */
 	load_file("libpqwalreceiver", false);
 
-	initStringInfo(&cmd);
-	appendStringInfo(&cmd, "DROP_REPLICATION_SLOT %s WAIT", quote_identifier(slotname));
-
 	wrconn = walrcv_connect(conninfo, true, subname, &err);
 	if (wrconn == NULL)
-		ereport(ERROR,
-				(errmsg("could not connect to publisher when attempting to "
-						"drop the replication slot \"%s\"", slotname),
-				 errdetail("The error was: %s", err),
-		/* translator: %s is an SQL ALTER command */
-				 errhint("Use %s to disassociate the subscription from the slot.",
-						 "ALTER SUBSCRIPTION ... SET (slot_name = NONE)")));
+	{
+		if (!slotname)
+		{
+			/* be tidy */
+			list_free(rstates);
+			table_close(rel, NoLock);
+			return;
+		}
+		else
+		{
+			ReportSlotConnectionError(rstates, subid, slotname, err);
+		}
+	}
+
+	PG_TRY();
+	{
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+			Oid			relid = rstate->relid;
+
+			/* Only cleanup resources of tablesync workers */
+			if (!OidIsValid(relid))
+				continue;
+
+			/*
+			 * Drop the tablesync slots associated with removed tables.
+			 *
+			 * For SYNCDONE/READY states, the tablesync slot is known to have
+			 * already been dropped by the tablesync worker.
+			 *
+			 * For other states, there is no certainty, maybe the slot does
+			 * not exist yet. Also, if we fail after removing some of the
+			 * slots, next time, it will again try to drop already dropped
+			 * slots and fail. For these reasons, we allow missing_ok = true
+			 * for the drop.
+			 */
+			if (rstate->state != SUBREL_STATE_SYNCDONE)
+			{
+				char		syncslotname[NAMEDATALEN] = {0};
+
+				ReplicationSlotNameForTablesync(subid, relid, syncslotname);
+				ReplicationSlotDropAtPubNode(wrconn, syncslotname, true);
+			}
+		}
+
+		list_free(rstates);
+
+		/*
+		 * If there is a slot associated with the subscription, then drop the
+		 * replication slot at the publisher.
+		 */
+		if (slotname)
+			ReplicationSlotDropAtPubNode(wrconn, slotname, false);
+
+	}
+	PG_FINALLY();
+	{
+		walrcv_disconnect(wrconn);
+	}
+	PG_END_TRY();
+
+	table_close(rel, NoLock);
+}
+
+/*
+ * Drop the replication slot at the publisher node using the replication
+ * connection.
+ *
+ * missing_ok - if true then only issue a WARNING message if the slot doesn't
+ * exist.
+ */
+void
+ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok)
+{
+	StringInfoData cmd;
+
+	Assert(wrconn);
+
+	load_file("libpqwalreceiver", false);
+
+	initStringInfo(&cmd);
+	appendStringInfo(&cmd, "DROP_REPLICATION_SLOT %s WAIT", quote_identifier(slotname));
 
 	PG_TRY();
 	{
@@ -1086,27 +1303,39 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 
 		res = walrcv_exec(wrconn, cmd.data, 0, NULL);
 
-		if (res->status != WALRCV_OK_COMMAND)
-			ereport(ERROR,
+		if (res->status == WALRCV_OK_COMMAND)
+		{
+			/* NOTICE. Success. */
+			ereport(NOTICE,
+					(errmsg("dropped replication slot \"%s\" on publisher",
+							slotname)));
+		}
+		else if (res->status == WALRCV_ERROR &&
+				 missing_ok &&
+				 res->sqlstate == ERRCODE_UNDEFINED_OBJECT)
+		{
+			/* WARNING. Error, but missing_ok = true. */
+			ereport(WARNING,
 					(errmsg("could not drop the replication slot \"%s\" on publisher",
 							slotname),
 					 errdetail("The error was: %s", res->err)));
+		}
 		else
-			ereport(NOTICE,
-					(errmsg("dropped replication slot \"%s\" on publisher",
-							slotname)));
+		{
+			/* ERROR. */
+			ereport(ERROR,
+					(errmsg("could not drop the replication slot \"%s\" on publisher",
+							slotname),
+					 errdetail("The error was: %s", res->err)));
+		}
 
 		walrcv_clear_result(res);
 	}
 	PG_FINALLY();
 	{
-		walrcv_disconnect(wrconn);
+		pfree(cmd.data);
 	}
 	PG_END_TRY();
-
-	pfree(cmd.data);
-
-	table_close(rel, NoLock);
 }
 
 /*
@@ -1275,3 +1504,45 @@ fetch_table_list(WalReceiverConn *wrconn, List *publications)
 
 	return tablelist;
 }
+
+/*
+ * Report the connection failure while dropping replication slots.
+ * Here, we report a WARNING for all tablesync slots so that the user can drop
+ * them manually, if required.
+ */
+static void
+ReportSlotConnectionError(List *rstates, Oid subid, char *slotname, char *err)
+{
+	ListCell   *lc;
+
+	foreach(lc, rstates)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+		Oid			relid = rstate->relid;
+
+		/* Only cleanup resources of tablesync workers */
+		if (!OidIsValid(relid))
+			continue;
+
+		/*
+		 * Caller needs to ensure that relstate doesn't change underneath us.
+		 * See DropSubscription where we get the relstates.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE)
+		{
+			char		syncslotname[NAMEDATALEN] = {0};
+
+			ReplicationSlotNameForTablesync(subid, relid, syncslotname);
+			elog(WARNING, "could not drop tablesync replication slot \"%s\"",
+				 syncslotname);
+		}
+	}
+
+	ereport(ERROR,
+			(errmsg("could not connect to publisher when attempting to "
+					"drop the replication slot \"%s\"", slotname),
+			 errdetail("The error was: %s", err),
+	/* translator: %s is an SQL ALTER command */
+			 errhint("Use %s to disassociate the subscription from the slot.",
+					 "ALTER SUBSCRIPTION ... SET (slot_name = NONE)")));
+}
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index e958274..7714696 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -982,6 +982,7 @@ libpqrcv_exec(WalReceiverConn *conn, const char *query,
 {
 	PGresult   *pgres = NULL;
 	WalRcvExecResult *walres = palloc0(sizeof(WalRcvExecResult));
+	char	   *diag_sqlstate;
 
 	if (MyDatabaseId == InvalidOid)
 		ereport(ERROR,
@@ -1025,6 +1026,13 @@ libpqrcv_exec(WalReceiverConn *conn, const char *query,
 		case PGRES_BAD_RESPONSE:
 			walres->status = WALRCV_ERROR;
 			walres->err = pchomp(PQerrorMessage(conn->streamConn));
+			diag_sqlstate = PQresultErrorField(pgres, PG_DIAG_SQLSTATE);
+			if (diag_sqlstate)
+				walres->sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+												 diag_sqlstate[1],
+												 diag_sqlstate[2],
+												 diag_sqlstate[3],
+												 diag_sqlstate[4]);
 			break;
 	}
 
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 186514c..58082dd 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -73,20 +73,6 @@ typedef struct LogicalRepWorkerId
 	Oid			relid;
 } LogicalRepWorkerId;
 
-typedef struct StopWorkersData
-{
-	int			nestDepth;		/* Sub-transaction nest level */
-	List	   *workers;		/* List of LogicalRepWorkerId */
-	struct StopWorkersData *parent; /* This need not be an immediate
-									 * subtransaction parent */
-} StopWorkersData;
-
-/*
- * Stack of StopWorkersData elements. Each stack element contains the workers
- * to be stopped for that subtransaction.
- */
-static StopWorkersData *on_commit_stop_workers = NULL;
-
 static void ApplyLauncherWakeup(void);
 static void logicalrep_launcher_onexit(int code, Datum arg);
 static void logicalrep_worker_onexit(int code, Datum arg);
@@ -547,51 +533,6 @@ logicalrep_worker_stop(Oid subid, Oid relid)
 }
 
 /*
- * Request worker for specified sub/rel to be stopped on commit.
- */
-void
-logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
-{
-	int			nestDepth = GetCurrentTransactionNestLevel();
-	LogicalRepWorkerId *wid;
-	MemoryContext oldctx;
-
-	/* Make sure we store the info in context that survives until commit. */
-	oldctx = MemoryContextSwitchTo(TopTransactionContext);
-
-	/* Check that previous transactions were properly cleaned up. */
-	Assert(on_commit_stop_workers == NULL ||
-		   nestDepth >= on_commit_stop_workers->nestDepth);
-
-	/*
-	 * Push a new stack element if we don't already have one for the current
-	 * nestDepth.
-	 */
-	if (on_commit_stop_workers == NULL ||
-		nestDepth > on_commit_stop_workers->nestDepth)
-	{
-		StopWorkersData *newdata = palloc(sizeof(StopWorkersData));
-
-		newdata->nestDepth = nestDepth;
-		newdata->workers = NIL;
-		newdata->parent = on_commit_stop_workers;
-		on_commit_stop_workers = newdata;
-	}
-
-	/*
-	 * Finally add a new worker into the worker list of the current
-	 * subtransaction.
-	 */
-	wid = palloc(sizeof(LogicalRepWorkerId));
-	wid->subid = subid;
-	wid->relid = relid;
-	on_commit_stop_workers->workers =
-		lappend(on_commit_stop_workers->workers, wid);
-
-	MemoryContextSwitchTo(oldctx);
-}
-
-/*
  * Wake up (using latch) any logical replication worker for specified sub/rel.
  */
 void
@@ -820,109 +761,21 @@ ApplyLauncherShmemInit(void)
 }
 
 /*
- * Check whether current transaction has manipulated logical replication
- * workers.
- */
-bool
-XactManipulatesLogicalReplicationWorkers(void)
-{
-	return (on_commit_stop_workers != NULL);
-}
-
-/*
  * Wakeup the launcher on commit if requested.
  */
 void
 AtEOXact_ApplyLauncher(bool isCommit)
 {
-
-	Assert(on_commit_stop_workers == NULL ||
-		   (on_commit_stop_workers->nestDepth == 1 &&
-			on_commit_stop_workers->parent == NULL));
-
 	if (isCommit)
 	{
-		ListCell   *lc;
-
-		if (on_commit_stop_workers != NULL)
-		{
-			List	   *workers = on_commit_stop_workers->workers;
-
-			foreach(lc, workers)
-			{
-				LogicalRepWorkerId *wid = lfirst(lc);
-
-				logicalrep_worker_stop(wid->subid, wid->relid);
-			}
-		}
-
 		if (on_commit_launcher_wakeup)
 			ApplyLauncherWakeup();
 	}
 
-	/*
-	 * No need to pfree on_commit_stop_workers.  It was allocated in
-	 * transaction memory context, which is going to be cleaned soon.
-	 */
-	on_commit_stop_workers = NULL;
 	on_commit_launcher_wakeup = false;
 }
 
 /*
- * On commit, merge the current on_commit_stop_workers list into the
- * immediate parent, if present.
- * On rollback, discard the current on_commit_stop_workers list.
- * Pop out the stack.
- */
-void
-AtEOSubXact_ApplyLauncher(bool isCommit, int nestDepth)
-{
-	StopWorkersData *parent;
-
-	/* Exit immediately if there's no work to do at this level. */
-	if (on_commit_stop_workers == NULL ||
-		on_commit_stop_workers->nestDepth < nestDepth)
-		return;
-
-	Assert(on_commit_stop_workers->nestDepth == nestDepth);
-
-	parent = on_commit_stop_workers->parent;
-
-	if (isCommit)
-	{
-		/*
-		 * If the upper stack element is not an immediate parent
-		 * subtransaction, just decrement the notional nesting depth without
-		 * doing any real work.  Else, we need to merge the current workers
-		 * list into the parent.
-		 */
-		if (!parent || parent->nestDepth < nestDepth - 1)
-		{
-			on_commit_stop_workers->nestDepth--;
-			return;
-		}
-
-		parent->workers =
-			list_concat(parent->workers, on_commit_stop_workers->workers);
-	}
-	else
-	{
-		/*
-		 * Abandon everything that was done at this nesting level.  Explicitly
-		 * free memory to avoid a transaction-lifespan leak.
-		 */
-		list_free_deep(on_commit_stop_workers->workers);
-	}
-
-	/*
-	 * We have taken care of the current subtransaction workers list for both
-	 * abort or commit. So we are ready to pop the stack.
-	 */
-	pfree(on_commit_stop_workers);
-	on_commit_stop_workers = parent;
-}
-
-/*
  * Request wakeup of the launcher on commit of the transaction.
  *
  * This is used to send launcher signal to stop sleeping and process the
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ccbdbcf..19cc804 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -31,8 +31,11 @@
  *		 table state to INIT.
  *	   - Tablesync worker starts; changes table state from INIT to DATASYNC while
  *		 copying.
- *	   - Tablesync worker finishes the copy and sets table state to SYNCWAIT;
- *		 waits for state change.
+ *	   - Tablesync worker does initial table copy; there is a FINISHEDCOPY (sync
+ *		 worker specific) state to indicate when the copy phase has completed, so
+ *		 if the worker crashes with this (non-memory) state then the copy will not
+ *		 be re-attempted.
+ *	   - Tablesync worker then sets table state to SYNCWAIT; waits for state change.
  *	   - Apply worker periodically checks for tables in SYNCWAIT state.  When
  *		 any appear, it sets the table state to CATCHUP and starts loop-waiting
  *		 until either the table state is set to SYNCDONE or the sync worker
@@ -48,8 +51,8 @@
  *		 point it sets state to READY and stops tracking.  Again, there might
  *		 be zero changes in between.
  *
- *	  So the state progression is always: INIT -> DATASYNC -> SYNCWAIT ->
- *	  CATCHUP -> SYNCDONE -> READY.
+ *	  So the state progression is always: INIT -> DATASYNC -> FINISHEDCOPY
+ *	  -> SYNCWAIT -> CATCHUP -> SYNCDONE -> READY.
  *
  *	  The catalog pg_subscription_rel is used to keep information about
  *	  subscribed tables and their state.  The catalog holds all states
@@ -58,6 +61,7 @@
  *	  Example flows look like this:
  *	   - Apply is in front:
  *		  sync:8
+ *			-> set in catalog FINISHEDCOPY
  *			-> set in memory SYNCWAIT
  *		  apply:10
  *			-> set in memory CATCHUP
@@ -73,6 +77,7 @@
  *
  *	   - Sync is in front:
  *		  sync:10
+ *			-> set in catalog FINISHEDCOPY
  *			-> set in memory SYNCWAIT
  *		  apply:8
  *			-> set in memory CATCHUP
@@ -101,7 +106,10 @@
 #include "replication/logicalrelation.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
+#include "replication/slot.h"
+#include "replication/origin.h"
 #include "storage/ipc.h"
+#include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -269,26 +277,52 @@ invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 static void
 process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
-	Assert(IsTransactionState());
-
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
 		TimeLineID	tli;
+		char		syncslotname[NAMEDATALEN] = {0};
 
 		MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
 		MyLogicalRepWorker->relstate_lsn = current_lsn;
 
 		SpinLockRelease(&MyLogicalRepWorker->relmutex);
 
+		/*
+		 * UpdateSubscriptionRelState must be called within a transaction.
+		 * That transaction will be ended within finish_sync_worker().
+		 */
+		if (!IsTransactionState())
+			StartTransactionCommand();
+
 		UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
 								   MyLogicalRepWorker->relid,
 								   MyLogicalRepWorker->relstate,
 								   MyLogicalRepWorker->relstate_lsn);
 
+		/* End wal streaming so wrconn can be re-used to drop the slot. */
 		walrcv_endstreaming(wrconn, &tli);
+
+		/*
+		 * Cleanup the tablesync slot.
+		 *
+		 * This has to be done after updating the state because otherwise if
+		 * there is an error while doing the database operations we won't be
+		 * able to roll back the dropped slot.
+		 */
+		ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
+										MyLogicalRepWorker->relid,
+										syncslotname);
+
+		/*
+		 * It is important to give an error if we are unable to drop the slot,
+		 * otherwise, it won't be dropped till the corresponding subscription
+		 * is dropped. So passing missing_ok = false.
+		 */
+		ReplicationSlotDropAtPubNode(wrconn, syncslotname, false);
+
 		finish_sync_worker();
 	}
 	else
@@ -403,6 +437,8 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 			 */
 			if (current_lsn >= rstate->lsn)
 			{
+				char		originname[NAMEDATALEN];
+
 				rstate->state = SUBREL_STATE_READY;
 				rstate->lsn = current_lsn;
 				if (!started_tx)
@@ -411,6 +447,27 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 					started_tx = true;
 				}
 
+				/*
+				 * Remove the tablesync origin tracking if it exists.
+				 *
+				 * The normal case origin drop is done here instead of in the
+				 * process_syncing_tables_for_sync function because we don't
+				 * allow dropping the origin while the process owning the
+				 * origin is alive.
+				 *
+				 * There is a chance that the user is concurrently performing
+				 * refresh for the subscription where we remove the table
+				 * state and its origin and by this time the origin might be
+				 * already removed. So passing missing_ok = true.
+				 */
+				ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
+												  rstate->relid,
+												  originname);
+				replorigin_drop_by_name(originname, true, false);
+
+				/*
+				 * Update the state to READY only after the origin cleanup.
+				 */
 				UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
 										   rstate->relid, rstate->state,
 										   rstate->lsn);
@@ -806,6 +863,50 @@ copy_table(Relation rel)
 }
 
 /*
+ * Determine the tablesync slot name.
+ *
+ * The name must not exceed NAMEDATALEN - 1 because of remote node constraints
+ * on slot name length. We append system_identifier to avoid slot_name
+ * collision with subscriptions in other clusters. With the current scheme
+ * pg_%u_sync_%u_UINT64_FORMAT (3 + 10 + 6 + 10 + 1 + 20), the maximum
+ * length of slot_name will be 50, excluding the terminating '\0'.
+ *
+ * The returned slot name is either:
+ * - stored in the supplied buffer (syncslotname), or
+ * - palloc'ed in current memory context (if syncslotname = NULL).
+ *
+ * Note: We don't use the subscription slot name as part of tablesync slot name
+ * because we are responsible for cleaning up these slots and it could become
+ * impossible to recalculate what name to cleanup if the subscription slot name
+ * had changed.
+ */
+char *
+ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
+								char syncslotname[NAMEDATALEN])
+{
+	if (syncslotname)
+		sprintf(syncslotname, "pg_%u_sync_%u_" UINT64_FORMAT, suboid, relid,
+				GetSystemIdentifier());
+	else
+		syncslotname = psprintf("pg_%u_sync_%u_" UINT64_FORMAT, suboid, relid,
+								GetSystemIdentifier());
+
+	return syncslotname;
+}
+
+/*
+ * Form the origin name for tablesync.
+ *
+ * Return the name in the supplied buffer.
+ */
+void
+ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
+								  char originname[NAMEDATALEN])
+{
+	snprintf(originname, NAMEDATALEN, "pg_%u_%u", suboid, relid);
+}
+
+/*
  * Start syncing the table in the sync worker.
  *
  * If nothing needs to be done to sync the table, we exit the worker without
@@ -822,6 +923,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	XLogRecPtr	relstate_lsn;
 	Relation	rel;
 	WalRcvExecResult *res;
+	char		originname[NAMEDATALEN];
+	RepOriginId originid;
 
 	/* Check the state of the table synchronization. */
 	StartTransactionCommand();
@@ -847,19 +950,10 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 			finish_sync_worker();	/* doesn't return */
 	}
 
-	/*
-	 * To build a slot name for the sync work, we are limited to NAMEDATALEN -
-	 * 1 characters.  We cut the original slot name to NAMEDATALEN - 28 chars
-	 * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0').  (It's actually the
-	 * NAMEDATALEN on the remote that matters, but this scheme will also work
-	 * reasonably if that is different.)
-	 */
-	StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small");	/* for sanity */
-	slotname = psprintf("%.*s_%u_sync_%u",
-						NAMEDATALEN - 28,
-						MySubscription->slotname,
-						MySubscription->oid,
-						MyLogicalRepWorker->relid);
+	/* Calculate the name of the tablesync slot. */
+	slotname = ReplicationSlotNameForTablesync(MySubscription->oid,
+											   MyLogicalRepWorker->relid,
+											   NULL /* use palloc */ );
 
 	/*
 	 * Here we use the slot name instead of the subscription name as the
@@ -872,7 +966,50 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 				(errmsg("could not connect to the publisher: %s", err)));
 
 	Assert(MyLogicalRepWorker->relstate == SUBREL_STATE_INIT ||
-		   MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC);
+		   MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC ||
+		   MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY);
+
+	/* Assign the origin tracking record name. */
+	ReplicationOriginNameForTablesync(MySubscription->oid,
+									  MyLogicalRepWorker->relid,
+									  originname);
+
+	if (MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC)
+	{
+		/*
+		 * We have previously errored out before finishing the copy so the
+		 * replication slot might exist. We want to remove the slot if it
+		 * already exists and proceed.
+		 *
+		 * XXX We could instead try to drop the slot at the time of the
+		 * failure, but for that we might need to clean up the copy state as
+		 * it might be in the middle of fetching the rows. Also, if there is a
+		 * network breakdown then the drop wouldn't have succeeded, so trying
+		 * it next time seems like a better bet.
+		 */
+		ReplicationSlotDropAtPubNode(wrconn, slotname, true);
+	}
+	else if (MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY)
+	{
+		/*
+		 * The COPY phase was previously done, but tablesync then crashed
+		 * before it was able to finish normally.
+		 */
+		StartTransactionCommand();
+
+		/*
+		 * The origin tracking name must already exist. It was created the
+		 * first time this tablesync was launched.
+		 */
+		originid = replorigin_by_name(originname, false);
+		replorigin_session_setup(originid);
+		replorigin_session_origin = originid;
+		*origin_startpos = replorigin_session_get_progress(false);
+
+		CommitTransactionCommand();
+
+		goto copy_table_done;
+	}
 
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 	MyLogicalRepWorker->relstate = SUBREL_STATE_DATASYNC;
@@ -888,9 +1025,6 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
 
-	/*
-	 * We want to do the table data sync in a single transaction.
-	 */
 	StartTransactionCommand();
 
 	/*
@@ -916,13 +1050,46 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	walrcv_clear_result(res);
 
 	/*
-	 * Create a new temporary logical decoding slot.  This slot will be used
+	 * Create a new permanent logical decoding slot. This slot will be used
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, true,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
+	/*
+	 * Setup replication origin tracking. The purpose of doing this before the
+	 * copy is to avoid doing the copy again due to any error in setting up
+	 * origin tracking.
+	 */
+	originid = replorigin_by_name(originname, true);
+	if (!OidIsValid(originid))
+	{
+		/*
+		 * Origin tracking does not exist, so create it now.
+		 *
+		 * Then advance to the LSN obtained from walrcv_create_slot. This is
+		 * WAL logged for the purpose of recovery. Locks are to prevent the
+		 * replication origin from vanishing while advancing.
+		 */
+		originid = replorigin_create(originname);
+
+		LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+		replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
+						   true /* go backward */ , true /* WAL log */ );
+		UnlockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
+
+		replorigin_session_setup(originid);
+		replorigin_session_origin = originid;
+	}
+	else
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("replication origin \"%s\" already exists",
+						originname)));
+	}
+
 	/* Now do the initial data copy */
 	PushActiveSnapshot(GetTransactionSnapshot());
 	copy_table(rel);
@@ -941,6 +1108,25 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	CommandCounterIncrement();
 
 	/*
+	 * Update the persisted state to indicate the COPY phase is done; make it
+	 * visible to others.
+	 */
+	UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
+							   MyLogicalRepWorker->relid,
+							   SUBREL_STATE_FINISHEDCOPY,
+							   MyLogicalRepWorker->relstate_lsn);
+
+	CommitTransactionCommand();
+
+copy_table_done:
+
+	elog(DEBUG1,
+		 "LogicalRepSyncTableStart: '%s' origin_startpos lsn %X/%X",
+		 originname,
+		 (uint32) (*origin_startpos >> 32),
+		 (uint32) *origin_startpos);
+
+	/*
 	 * We are done with the initial data synchronization, update the state.
 	 */
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index eb7db89..cfc924c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -807,12 +807,8 @@ apply_handle_stream_stop(StringInfo s)
 	/* We must be in a valid transaction state */
 	Assert(IsTransactionState());
 
-	/* The synchronization worker runs in single transaction. */
-	if (!am_tablesync_worker())
-	{
-		/* Commit the per-stream transaction */
-		CommitTransactionCommand();
-	}
+	/* Commit the per-stream transaction */
+	CommitTransactionCommand();
 
 	in_streamed_transaction = false;
 
@@ -889,9 +885,7 @@ apply_handle_stream_abort(StringInfo s)
 			/* Cleanup the subxact info */
 			cleanup_subxact_info();
 
-			/* The synchronization worker runs in single transaction */
-			if (!am_tablesync_worker())
-				CommitTransactionCommand();
+			CommitTransactionCommand();
 			return;
 		}
 
@@ -918,8 +912,7 @@ apply_handle_stream_abort(StringInfo s)
 		/* write the updated subxact list */
 		subxact_info_write(MyLogicalRepWorker->subid, xid);
 
-		if (!am_tablesync_worker())
-			CommitTransactionCommand();
+		CommitTransactionCommand();
 	}
 }
 
@@ -1062,8 +1055,7 @@ apply_handle_stream_commit(StringInfo s)
 static void
 apply_handle_commit_internal(StringInfo s, LogicalRepCommitData *commit_data)
 {
-	/* The synchronization worker runs in single transaction. */
-	if (IsTransactionState() && !am_tablesync_worker())
+	if (IsTransactionState())
 	{
 		/*
 		 * Update origin state so we can restart streaming from correct
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1d81071..05bb698 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1786,7 +1786,8 @@ ProcessUtilitySlow(ParseState *pstate,
 				break;
 
 			case T_AlterSubscriptionStmt:
-				address = AlterSubscription((AlterSubscriptionStmt *) parsetree);
+				address = AlterSubscription((AlterSubscriptionStmt *) parsetree,
+											isTopLevel);
 				break;
 
 			case T_DropSubscriptionStmt:
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 2bea2c5..ed94f57 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -61,6 +61,8 @@ DECLARE_UNIQUE_INDEX_PKEY(pg_subscription_rel_srrelid_srsubid_index, 6117, on pg
 #define SUBREL_STATE_INIT		'i' /* initializing (sublsn NULL) */
 #define SUBREL_STATE_DATASYNC	'd' /* data is being synchronized (sublsn
 									 * NULL) */
+#define SUBREL_STATE_FINISHEDCOPY 'f'	/* tablesync copy phase is completed
+										 * (sublsn NULL) */
 #define SUBREL_STATE_SYNCDONE	's' /* synchronization finished in front of
 									 * apply (sublsn set) */
 #define SUBREL_STATE_READY		'r' /* ready (sublsn set) */
diff --git a/src/include/commands/subscriptioncmds.h b/src/include/commands/subscriptioncmds.h
index a818650..3b926f3 100644
--- a/src/include/commands/subscriptioncmds.h
+++ b/src/include/commands/subscriptioncmds.h
@@ -20,7 +20,7 @@
 
 extern ObjectAddress CreateSubscription(CreateSubscriptionStmt *stmt,
 										bool isTopLevel);
-extern ObjectAddress AlterSubscription(AlterSubscriptionStmt *stmt);
+extern ObjectAddress AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel);
 extern void DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel);
 
 extern ObjectAddress AlterSubscriptionOwner(const char *name, Oid newOwnerId);
diff --git a/src/include/replication/logicallauncher.h b/src/include/replication/logicallauncher.h
index 421ec15..301e494 100644
--- a/src/include/replication/logicallauncher.h
+++ b/src/include/replication/logicallauncher.h
@@ -22,9 +22,7 @@ extern Size ApplyLauncherShmemSize(void);
 extern void ApplyLauncherShmemInit(void);
 
 extern void ApplyLauncherWakeupAtCommit(void);
-extern bool XactManipulatesLogicalReplicationWorkers(void);
 extern void AtEOXact_ApplyLauncher(bool isCommit);
-extern void AtEOSubXact_ApplyLauncher(bool isCommit, int nestDepth);
 
 extern bool IsLogicalLauncher(void);
 
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 53f636c..5f52335 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -15,6 +15,7 @@
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "replication/walreceiver.h"
 
 /*
  * Behaviour of replication slots, upon release or crash.
@@ -211,6 +212,8 @@ extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive);
 extern void ReplicationSlotsDropDBSlots(Oid dboid);
 extern void InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno);
 extern ReplicationSlot *SearchNamedReplicationSlot(const char *name);
+extern char *ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char *syncslotname);
+extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, char *slotname, bool missing_ok);
 
 extern void StartupReplicationSlots(void);
 extern void CheckPointReplicationSlots(void);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4313f51..a97a59a 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -210,6 +210,7 @@ typedef enum
 typedef struct WalRcvExecResult
 {
 	WalRcvExecStatus status;
+	int			sqlstate;
 	char	   *err;
 	Tuplestorestate *tuplestore;
 	TupleDesc	tupledesc;
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index d046022..4a5adc2 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -77,13 +77,14 @@ extern List *logicalrep_workers_find(Oid subid, bool only_running);
 extern void logicalrep_worker_launch(Oid dbid, Oid subid, const char *subname,
 									 Oid userid, Oid relid);
 extern void logicalrep_worker_stop(Oid subid, Oid relid);
-extern void logicalrep_worker_stop_at_commit(Oid subid, Oid relid);
 extern void logicalrep_worker_wakeup(Oid subid, Oid relid);
 extern void logicalrep_worker_wakeup_ptr(LogicalRepWorker *worker);
 
 extern int	logicalrep_sync_worker_count(Oid subid);
 
+extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid, char *originname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 2fa9bce..7802279 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -201,6 +201,27 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=postgres' PUBLICATION mypub
+       WITH (enabled = true, create_slot = false, copy_data = false);
+-- fail - ALTER SUBSCRIPTION with refresh is not allowed in a transaction
+-- block or function
+BEGIN;
+ALTER SUBSCRIPTION regress_testsub SET PUBLICATION mypub WITH (refresh = true);
+ERROR:  ALTER SUBSCRIPTION with refresh cannot run inside a transaction block
+END;
+BEGIN;
+ALTER SUBSCRIPTION regress_testsub REFRESH PUBLICATION;
+ERROR:  ALTER SUBSCRIPTION ... REFRESH cannot run inside a transaction block
+END;
+CREATE FUNCTION func() RETURNS VOID AS
+$$ ALTER SUBSCRIPTION regress_testsub SET PUBLICATION mypub WITH (refresh = true) $$ LANGUAGE SQL;
+SELECT func();
+ERROR:  ALTER SUBSCRIPTION with refresh cannot be executed from a function
+CONTEXT:  SQL function "func" statement 1
+ALTER SUBSCRIPTION regress_testsub DISABLE;
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+DROP FUNCTION func;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 14fa0b2..ca0d782 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -147,6 +147,28 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 
 DROP SUBSCRIPTION regress_testsub;
 
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=postgres' PUBLICATION mypub
+       WITH (enabled = true, create_slot = false, copy_data = false);
+
+-- fail - ALTER SUBSCRIPTION with refresh is not allowed in a transaction
+-- block or function
+BEGIN;
+ALTER SUBSCRIPTION regress_testsub SET PUBLICATION mypub WITH (refresh = true);
+END;
+
+BEGIN;
+ALTER SUBSCRIPTION regress_testsub REFRESH PUBLICATION;
+END;
+
+CREATE FUNCTION func() RETURNS VOID AS
+$$ ALTER SUBSCRIPTION regress_testsub SET PUBLICATION mypub WITH (refresh = true) $$ LANGUAGE SQL;
+SELECT func();
+
+ALTER SUBSCRIPTION regress_testsub DISABLE;
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+DROP FUNCTION func;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/004_sync.pl b/src/test/subscription/t/004_sync.pl
index e111ab9..c792668 100644
--- a/src/test/subscription/t/004_sync.pl
+++ b/src/test/subscription/t/004_sync.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 8;
 
 # Initialize publisher node
 my $node_publisher = get_new_node('publisher');
@@ -149,7 +149,26 @@ $result = $node_subscriber->safe_psql('postgres',
 is($result, qq(20),
 	'changes for table added after subscription initialized replicated');
 
+# clean up
+$node_publisher->safe_psql('postgres', "DROP TABLE tab_rep_next");
+$node_subscriber->safe_psql('postgres', "DROP TABLE tab_rep_next");
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 
+# Table tab_rep already has the same records on both publisher and subscriber
+# at this time. Recreate the subscription, which will do the initial copy of
+# the table again and fail due to a unique constraint violation.
+$node_subscriber->safe_psql('postgres',
+	 "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub");
+
+$result = $node_subscriber->poll_query_until('postgres', $started_query)
+    or die "Timed out while waiting for subscriber to start sync";
+
+# DROP SUBSCRIPTION must clean up slots on the publisher side when the
+# subscriber is stuck on data copy because of a constraint violation.
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'DROP SUBSCRIPTION during error can clean up the slots on the publisher');
+
 $node_subscriber->stop('fast');
 $node_publisher->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d540fe..bab4f3a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2397,7 +2397,6 @@ StdAnalyzeData
 StdRdOptions
 Step
 StopList
-StopWorkersData
 StrategyNumber
 StreamCtl
 StreamXidHash
@@ -2408,6 +2407,7 @@ SubLink
 SubLinkType
 SubPlan
 SubPlanState
+SubRemoveRels
 SubTransactionId
 SubXactCallback
 SubXactCallbackItem
-- 
1.8.3.1

v39-0006-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 58500236c2a59e13ce4f1d4304fd36cb85d7009d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 11 Feb 2021 18:34:37 +1100
Subject: [PATCH v39] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index bcb0acf..7610ab2 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -184,8 +184,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 750ec2a..ceeae36 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd..55dd8da 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1167,7 +1167,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 7996f84..2f56fab 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -358,6 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +399,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +468,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -823,6 +842,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -833,7 +854,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -872,6 +894,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -890,7 +919,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -935,7 +965,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -980,7 +1011,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 7714696..c602c3e 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e01d02e..6146b77 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2787,6 +2787,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3433,6 +3434,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c33ea25..3cd228d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d99b61e..c16811f 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 1290f96..07c3ad8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 40417e6..8e94a26 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..41e0d8c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 7802279..c2fa9fc 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -222,6 +222,29 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index ca0d782..1da95a4 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -169,6 +169,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

v39-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From cd2f4377d9db4937f90ef3f87ddca6b13856fcc0 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 11 Feb 2021 18:53:56 +1100
Subject: [PATCH v39] Support 2PC txn tests for concurrent aborts.

Add tap tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  58 ++++++++++
 src/backend/replication/logical/reorderbuffer.c   |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c5e28ce..e0cd841 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -10,6 +10,8 @@ ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit after the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected text did not appear in the log within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+        logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected text did not appear in the log within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 929255e..3fa172a 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -174,6 +177,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -275,6 +279,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -471,6 +493,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -706,6 +755,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -918,6 +970,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -971,6 +1026,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5a62ab8..4a4a9ed 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2489,6 +2489,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
1.8.3.1
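
A side note on the check-xid-aborted parsing in pg_decode_startup() above: strtoul() with a NULL end pointer accepts trailing junk such as "123abc", and where unsigned long is 64 bits wide an out-of-range value is silently truncated by the cast to TransactionId. Below is a minimal standalone sketch of a stricter parse. It is not what the patch does; parse_xid_option(), xid_t and INVALID_XID are hypothetical names made up for illustration so that the snippet compiles outside the PostgreSQL tree.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for PostgreSQL's TransactionId definitions. */
typedef uint32_t xid_t;
#define INVALID_XID ((xid_t) 0)

/*
 * Parse an xid option string with strtoul(..., 0), as the patch does, but
 * additionally reject empty input, a leading sign or whitespace, trailing
 * junk, and values that do not fit in 32 bits.
 *
 * Returns true and stores the xid in *out on success; *out is left
 * untouched on failure.
 */
static bool
parse_xid_option(const char *str, xid_t *out)
{
	char	   *end;
	unsigned long val;

	assert(out != NULL);

	/* Insist on a leading digit; this also rules out "", "-1" and " 7". */
	if (str == NULL || !isdigit((unsigned char) str[0]))
		return false;

	errno = 0;
	val = strtoul(str, &end, 0);

	/* Overflow of unsigned long, no digits consumed, or trailing junk. */
	if (errno != 0 || end == str || *end != '\0')
		return false;

	/* Would be truncated by the cast to a 32-bit xid. */
	if (val > UINT32_MAX)
		return false;

	if ((xid_t) val == INVALID_XID)
		return false;

	*out = (xid_t) val;
	return true;
}
```

With this sketch, "742" and "0x10" parse to 742 and 16, while "0", "12abc" and "-1" are rejected. Whether the extra strictness is worth it for a test-only option is a judgment call.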

v39-0002-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From 8cb175ccc55f9c1581a1313be1bc611af0610e56 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Feb 2021 17:49:57 +1100
Subject: [PATCH v39] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfc924c..b50f962 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v39-0003-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From 6d92d84a68697062b0a91b90a9cb6c2bdd159608 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Feb 2021 17:56:54 +1100
Subject: [PATCH v39] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replication origin replay progress
for 2PC, but it was incomplete. It did not properly track the progress for
ROLLBACK PREPARED; in particular, it did not update the recovery code.
Additionally, we need to allow tracking it on subscriber nodes, where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index fc18b77..609cf02 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2277,6 +2277,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2299,6 +2307,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3c8b4eb..7f5e678 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5709,8 +5709,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5916,7 +5915,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5965,6 +5965,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6006,7 +6013,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6014,7 +6022,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1

v39-0004-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From f50b3b6936de8ed4dd54ad9e9eed39af2e209e68 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 11 Feb 2021 18:08:26 +1100
Subject: [PATCH v39] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* We allow skipping a prepared transaction if it is already prepared,
which can happen when the server or apply worker restarts after a
prepared transaction. We skip only when the GID, origin_lsn, and
origin_timestamp of the prepared xact all match, to avoid matching a
prepared xact from a different node.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  68 ++++++
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 258 ++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 329 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 172 ++++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 ++++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 890 insertions(+), 36 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 609cf02..1f530f4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2446,3 +2446,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid matching a prepared xact
+ * from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 685eaa6..73b420a 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -974,8 +974,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and other
+	 * transactions commit between the prepare and the commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..1585754 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,264 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b50f962..e01d02e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* for skipping prepared transaction */
+bool		skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,263 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1026,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it
 	 * will be committed after processing all the messages. We need the
@@ -800,6 +1079,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -831,6 +1116,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1046,6 +1337,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1168,6 +1465,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1289,6 +1589,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1447,6 +1750,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1816,6 +2122,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1972,6 +2281,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 79765f9..c33ea25 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,27 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -378,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -776,17 +845,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -867,6 +927,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1192,3 +1270,31 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..40417e6 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for both prepare and commit prepared. For commit
+ * prepared, prepare_lsn and preparetime store the commit lsn and the
+ * commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bab31bf..6bb162e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bab4f3a..048681c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

#188osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#187)
RE: [HACKERS] logical decoding of two-phase transactions

Hi

On Thursday, February 11, 2021 5:10 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the new 2PC patch set v39*

I have started reviewing the patch set,
so let me share the comments I have so far.

(1)

File : v39-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch
Modification :

@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
txndata->xact_wrote_changes = true;

+       /* For testing concurrent  aborts */
+       test_concurrent_aborts(data);
+
        class_form = RelationGetForm(relation);
        tupdesc = RelationGetDescr(relation);

Comment : There is unnecessary whitespace in comments like the one above in v37-007.
Please check pg_decode_change(), pg_decode_truncate(), and pg_decode_stream_truncate() as well.
I suggest running pgindent to align the code formatting.

(2)

File : v39-0006-Support-2PC-txn-Subscription-option.patch

@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
                        *streaming_given = true;
                        *streaming = defGetBoolean(defel);
                }
+               else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+               {
+                       if (*twophase_given)
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_SYNTAX_ERROR),
+                                                errmsg("conflicting or redundant options")));
+                       *twophase_given = true;
+                       *twophase = defGetBoolean(defel);
+               }

You can easily add a test for this error path in subscription.sql by specifying the two_phase option twice.
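
To make the behaviour under test concrete, here is a minimal standalone sketch of the duplicate-option check, not the patch's actual code. DemoOption, DemoParseResult, demo_parse_bool() and demo_parse_options() are hypothetical names invented for illustration; the real code uses DefElem, defGetBoolean() and ereport(ERROR, ...) inside parse_subscription_options().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a DefElem: one "name = value" option. */
typedef struct DemoOption
{
	const char *name;
	const char *value;
} DemoOption;

/* Hypothetical result codes; the real code raises errors via ereport(). */
typedef enum DemoParseResult
{
	DEMO_PARSE_OK,
	DEMO_PARSE_DUPLICATE,		/* "conflicting or redundant options" */
	DEMO_PARSE_UNRECOGNIZED,	/* unknown option name */
	DEMO_PARSE_BAD_VALUE		/* value is not a recognized boolean */
} DemoParseResult;

/* Simplified boolean parser that accepts only a few spellings. */
bool
demo_parse_bool(const char *value, bool *result)
{
	if (strcmp(value, "on") == 0 || strcmp(value, "true") == 0)
	{
		*result = true;
		return true;
	}
	if (strcmp(value, "off") == 0 || strcmp(value, "false") == 0)
	{
		*result = false;
		return true;
	}
	return false;
}

/*
 * Walk the option list once. A second "two_phase" entry is rejected even
 * if it repeats the same value, mirroring the *twophase_given check.
 */
DemoParseResult
demo_parse_options(const DemoOption *opts, size_t nopts, bool *twophase)
{
	bool		twophase_given = false;

	assert(opts != NULL || nopts == 0);
	assert(twophase != NULL);

	*twophase = false;			/* the default is off */

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "two_phase") == 0)
		{
			if (twophase_given)
				return DEMO_PARSE_DUPLICATE;
			twophase_given = true;
			if (!demo_parse_bool(opts[i].value, twophase))
				return DEMO_PARSE_BAD_VALUE;
		}
		else
			return DEMO_PARSE_UNRECOGNIZED;
	}
	return DEMO_PARSE_OK;
}
```

With this sketch, a list holding two_phase twice returns DEMO_PARSE_DUPLICATE, which corresponds to the "conflicting or redundant options" error a regression test would expect.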

When I find something else, I'll let you know.

Best Regards,
Takamichi Osumi

#189Amit Kapila
amit.kapila16@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#188)

On Fri, Feb 12, 2021 at 12:29 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

On Thursday, February 11, 2021 5:10 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the new 2PC patch set v39*

I have started reviewing the patch set,
so let me share the comments I have so far.

(1)

File : v39-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch
Modification :

@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
}
txndata->xact_wrote_changes = true;

+       /* For testing concurrent  aborts */
+       test_concurrent_aborts(data);
+
class_form = RelationGetForm(relation);
tupdesc = RelationGetDescr(relation);

Comment : There is unnecessary whitespace in comments like the one above in v37-007.
Please check pg_decode_change(), pg_decode_truncate(), and pg_decode_stream_truncate() as well.
I suggest running pgindent to align the code formatting.

This patch (v39-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch)
is mostly for dev-testing purposes. We don't intend to commit it, as it
has a lot of timing-dependent tests and I am not sure it is
valuable enough at this stage. So, we can ignore cosmetic comments on
this patch for now.

--
With Regards,
Amit Kapila.

#190Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#187)
6 attachment(s)

Please find attached the new patch set v40*

The tablesync patch [1] was already committed [2], so the v39-0001
patch is no longer required.

v40* has been rebased to HEAD.

----
[1]: /messages/by-id/CAA4eK1KHJxaZS-fod-0fey=0tq3=Gkn4ho=8N4-5HWiCfu0H1A@mail.gmail.com
[2]: https://github.com/postgres/postgres/commit/ce0fdbfe9722867b7fad4d3ede9b6a6bfc51fb4e

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v40-0005-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 101df53e36b0f5195344964425987631986a7a61 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 16 Feb 2021 07:52:43 +1100
Subject: [PATCH v40] Support 2PC txn - Subscription option.

This patch implements new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index bcb0acf..7610ab2 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -184,8 +184,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index c32fc81..98070a9 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd..55dd8da 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1167,7 +1167,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index e5ae453..5212ab8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -358,6 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +399,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +468,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -825,6 +844,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -835,7 +856,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -874,6 +896,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -892,7 +921,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +967,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1013,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 7714696..c602c3e 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e01d02e..6146b77 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2787,6 +2787,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3433,6 +3434,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c33ea25..3cd228d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 40417e6..8e94a26 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..41e0d8c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..8d24b2e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,29 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..5c79dbd 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
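
As a side note on the "two_phase requires a Boolean value" failure exercised in the regression test above, here is a minimal sketch of that kind of option check. It is not PostgreSQL's own parser; sketch_parse_bool_option is a hypothetical name, and the accepted spellings are a simplified, case-sensitive subset chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL's own parser): accept a few common
 * spellings of a Boolean option value.  Returns true and stores the parsed
 * value in *result on success; returns false for anything else, which is
 * where a caller would raise "two_phase requires a Boolean value".
 */
static bool
sketch_parse_bool_option(const char *value, bool *result)
{
	static const char *const truthy[] = {"true", "on", "1"};
	static const char *const falsy[] = {"false", "off", "0"};
	size_t		i;

	assert(value != NULL && result != NULL);

	for (i = 0; i < sizeof(truthy) / sizeof(truthy[0]); i++)
	{
		if (strcmp(value, truthy[i]) == 0)
		{
			*result = true;
			return true;
		}
	}

	for (i = 0; i < sizeof(falsy) / sizeof(falsy[0]); i++)
	{
		if (strcmp(value, falsy[i]) == 0)
		{
			*result = false;
			return true;
		}
	}

	/* Anything else, such as "foo", is rejected. */
	return false;
}
```

With this sketch, "on" and "true" (the spellings used in the TAP and regression tests) parse as true, "false" parses as false, and "foo" is rejected.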

v40-0001-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From 62d498f5861436123d3e657158961fa45be7b56f Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 16 Feb 2021 07:16:51 +1100
Subject: [PATCH v40] Refactor spool-file logic in worker.c.

This patch is a pure refactoring that isolates the streaming spool-file
processing in a separate function. A later patch to support prepared
transaction apply will need to call this common processing logic from
another place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfc924c..b50f962 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1
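
To make the shape of this refactoring easier to see in isolation, here is a small standalone sketch. All names are hypothetical, the "spool file" is just an array of integers, and the second caller only stands in for the stream-prepare handler that the later patch is expected to add; none of this is the worker.c code itself.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t SketchXid;
typedef uint64_t SketchLsn;

/* Stand-in for the worker's remote_final_lsn. */
static SketchLsn sketch_remote_final_lsn = 0;

/*
 * Common part: replay every spooled change and remember the final LSN.
 * The "spool file" here is just an array of deltas added to *target.
 * Returns how many changes were applied.
 */
static int
sketch_apply_spooled(SketchXid xid, SketchLsn lsn,
					 const int *spool, size_t nspool, long *target)
{
	int			nchanges = 0;
	size_t		i;

	(void) xid;					/* a real worker would use it to find the file */
	sketch_remote_final_lsn = lsn;

	for (i = 0; i < nspool; i++)
	{
		*target += spool[i];
		nchanges++;
	}

	return nchanges;
}

/* Caller 1: stream commit passes the commit LSN. */
static int
sketch_stream_commit(SketchXid xid, SketchLsn commit_lsn,
					 const int *spool, size_t nspool, long *target)
{
	return sketch_apply_spooled(xid, commit_lsn, spool, nspool, target);
}

/* Caller 2 (hypothetical): stream prepare passes the prepare LSN. */
static int
sketch_stream_prepare(SketchXid xid, SketchLsn prepare_lsn,
					  const int *spool, size_t nspool, long *target)
{
	return sketch_apply_spooled(xid, prepare_lsn, spool, nspool, target);
}
```

The only thing that differs between the two callers is which LSN they hand to the common function, which is why the helper takes the LSN as a parameter instead of reading it from the commit data.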

v40-0002-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From 1b2673ef03598d7be50f1b3b18b1582ea4289d5a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 16 Feb 2021 07:24:48 +1100
Subject: [PATCH v40] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replica origin replay progress for
2PC, but it was not complete. It did not properly track the progress for
rollback prepared; in particular, it did not update the code for recovery.
Additionally, we need to allow tracking it on subscriber nodes, where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 70d2257..8a4e149 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2284,6 +2284,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2306,6 +2314,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 17fbc41..fc94821 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5709,8 +5709,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5916,7 +5915,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5965,6 +5965,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6006,7 +6013,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6014,7 +6022,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1
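
For readers following the recovery change, the replorigin_advance() call added to xact_redo_abort passes false for the "backward" argument. Here is a minimal sketch of what that flag means for the tracked position, assuming a simplified state with only the remote LSN. The type and function names are hypothetical; the condition mirrors the one used in replorigin_advance().

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLsn;

/* Hypothetical, simplified origin state: only the remote position. */
typedef struct SketchOriginState
{
	SketchLsn	remote_lsn;		/* how far the upstream has been replayed */
} SketchOriginState;

/*
 * Advance the tracked remote position.  Unless go_backward is set, an
 * older value never overwrites a newer one that is already in memory.
 */
static void
sketch_origin_advance(SketchOriginState *state, SketchLsn remote_commit,
					  bool go_backward)
{
	if (go_backward || state->remote_lsn < remote_commit)
		state->remote_lsn = remote_commit;
}
```

So with go_backward = false, replaying an abort-prepared record whose origin LSN is older than what is already tracked leaves the progress untouched, while a newer origin LSN moves it forward.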

v40-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 90c64e766a520d86b5b3e38bbf72eb52e2ef4bd2 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 16 Feb 2021 07:39:58 +1100
Subject: [PATCH v40] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time to the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* We allow skipping prepared transactions if they are already prepared.
We skip only when the GID, origin_lsn, and origin_timestamp of a prepared
xact all match, to avoid the possibility of matching a prepared xact from
a different node. A repeated prepare can arrive when the server or apply
worker restarts after a prepared transaction.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  68 ++++++
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 258 ++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 329 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 172 ++++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 ++++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 890 insertions(+), 36 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8a4e149..262ceef 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2453,3 +2453,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if a prepared transaction with the given GID and LSN exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only the
+ * GID is not sufficient because a different prepared xact with the same GID
+ * can exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 685eaa6..73b420a 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -974,8 +974,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..1585754 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,264 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepare message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b50f962..e01d02e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* for skipping prepared transaction */
+bool        skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,263 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received the prepare because it occurred
+	 * before the walsender reached a consistent point, in which case we need
+	 * to skip the rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1026,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it
 	 * will be committed after processing all the messages. We need the
@@ -800,6 +1079,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -831,6 +1116,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1046,6 +1337,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1168,6 +1465,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1289,6 +1589,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1447,6 +1750,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1816,6 +2122,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1972,6 +2281,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 79765f9..c33ea25 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,27 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -378,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool        send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -776,17 +845,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/* Message boundary */
-		OutputPluginWrite(ctx, false);
-		OutputPluginPrepareWrite(ctx, true);
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -867,6 +927,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1192,3 +1270,31 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/* Message boundary */
+		OutputPluginWrite(ctx, false);
+		OutputPluginPrepareWrite(ctx, true);
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..40417e6 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure is used to
+ * hold the information for both prepare and commit prepared. For commit
+ * prepared, prepare_lsn and preparetime store the commit LSN and the
+ * commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bab31bf..6bb162e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bab4f3a..048681c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
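
The skip logic this patch adds to the apply worker can be hard to follow from the diff alone, so here is a rough standalone sketch of it. It is only a toy model: ApplyState, PreparedXact, lookup_gxact, replay_once and the handle_* functions are names made up for illustration and none of them exist in PostgreSQL. The sketch assumes that a prepared transaction is identified by gid plus prepare end LSN plus prepare timestamp, as the LookupGXact() calls in the patch suggest, and that a resent transaction arrives as BEGIN PREPARE, changes, PREPARE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_PREPARED 8

/* Hypothetical stand-in for one prepared transaction on the subscriber. */
typedef struct PreparedXact
{
	const char *gid;
	uint64_t	prepare_end_lsn;
	int64_t		prepare_time;
} PreparedXact;

/* Toy apply-worker state; not a PostgreSQL structure. */
typedef struct ApplyState
{
	PreparedXact prepared[MAX_PREPARED];
	int			nprepared;
	bool		skip_prepared_txn;
	int			rows_applied;
} ApplyState;

/* Match on gid, LSN and timestamp; the gid alone could be a local 2PC. */
bool
lookup_gxact(const ApplyState *st, const char *gid, uint64_t lsn, int64_t ts)
{
	for (int i = 0; i < st->nprepared; i++)
	{
		const PreparedXact *p = &st->prepared[i];

		if (strcmp(p->gid, gid) == 0 &&
			p->prepare_end_lsn == lsn && p->prepare_time == ts)
			return true;
	}
	return false;
}

/* BEGIN PREPARE: start skipping if this exact prepare was already applied. */
void
handle_begin_prepare(ApplyState *st, const char *gid, uint64_t lsn, int64_t ts)
{
	assert(!st->skip_prepared_txn);
	if (lookup_gxact(st, gid, lsn, ts))
		st->skip_prepared_txn = true;
}

/* A change message: silently ignored while skipping. */
void
handle_insert(ApplyState *st)
{
	if (st->skip_prepared_txn)
		return;
	st->rows_applied++;
}

/* PREPARE: either end the skip, or record the newly prepared transaction. */
void
handle_prepare(ApplyState *st, const char *gid, uint64_t lsn, int64_t ts)
{
	if (st->skip_prepared_txn)
	{
		st->skip_prepared_txn = false;
		return;
	}
	assert(st->nprepared < MAX_PREPARED);
	st->prepared[st->nprepared].gid = gid;
	st->prepared[st->nprepared].prepare_end_lsn = lsn;
	st->prepared[st->nprepared].prepare_time = ts;
	st->nprepared++;
}

/* Replay BEGIN PREPARE, one INSERT and PREPARE; return rows applied so far. */
int
replay_once(ApplyState *st, const char *gid, uint64_t lsn, int64_t ts)
{
	handle_begin_prepare(st, gid, lsn, ts);
	handle_insert(st);
	handle_prepare(st, gid, lsn, ts);
	return st->rows_applied;
}
```

Calling replay_once() twice with the same gid, LSN and timestamp on a zero-initialized ApplyState models the upstream resending a prepared transaction after a restart: the second call applies no rows and leaves skip_prepared_txn cleared again.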

Attachment: v40-0004-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From aa281769fa3166147a57ee204a410be161f23ca6 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 16 Feb 2021 07:42:13 +1100
Subject: [PATCH v40] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# Test logical replication of 2PC with streaming enabled.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After the crashed server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After the crashed server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
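
Note on the expected counts in 021_twophase_stream.pl: the value 3334 follows from the workload itself. The table starts with keys 1 and 2, the transaction inserts keys 3..5000, the UPDATE only rewrites column b, and the DELETE removes every key divisible by 3. The standalone sketch below reproduces that arithmetic without a server. It is not part of the patch, and the helper name expected_row_count is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the patch: model the number of rows left
# in test_tab once the streamed 2PC transaction has been committed.
sub expected_row_count
{
	my ($max_key) = @_;

	# Keys 1 and 2 pre-exist; the transaction inserts keys 3..$max_key.
	my @keys = (1 .. $max_key);

	# The UPDATE only changes column b, so it does not affect the count.
	# The DELETE removes every key where mod(a,3) = 0.
	my @remaining = grep { $_ % 3 != 0 } @keys;

	return scalar @remaining;
}

# The tests use generate_series(3, 5000), so the upper key is 5000.
print expected_row_count(5000), "\n";
```

With an upper key of 5000 this prints 3334. The tests that add one extra row outside the prepared transaction expect one more (3335), and the rollback tests expect only the original 2 rows.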
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
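
The expected result `3334|3334|3334` in the COMMIT PREPARED check of 023_twophase_cascade_stream.pl above follows from the test's own SQL. Keys 1 and 2 are the initial rows, keys 3..5000 are inserted, and every key divisible by 3 is then deleted. The UPDATE only rewrites column b, so it does not change the row count. A minimal sketch that re-derives the number (the function name `surviving_rows` is made up for illustration and is not part of the patch):

```c
#include <assert.h>

/*
 * Count the rows left in test_tab once the streamed transaction commits:
 * keys first_key..last_key exist, then "DELETE ... WHERE mod(a,3) = 0"
 * removes every key divisible by 3.
 */
int
surviving_rows(int first_key, int last_key)
{
	int count = 0;

	assert(first_key <= last_key);

	for (int a = first_key; a <= last_key; a++)
	{
		if (a % 3 != 0)
			count++;
	}
	return count;
}
```

Calling `surviving_rows(1, 5000)` gives 3334, which is the value the test expects for `count(*)`. The other two columns match because c and d are filled from the subscriber-side defaults on every row.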

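The test_decoding TAP tests in the next patch synchronize with the decoding backend by polling the server log for a message, retrying every 0.1 second for up to 180 seconds (the Perl helper `poll_output_until`). A rough C sketch of that bounded-polling pattern is below. It is only an illustration: the names `slurp` and `poll_file_until` are invented here, and it does a plain substring search where the Perl helper does a regex match.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Read a whole file into a NUL-terminated buffer; returns NULL on failure. */
char *
slurp(const char *path)
{
	FILE	   *fp = fopen(path, "rb");
	size_t		cap = 4096;
	size_t		len = 0;
	size_t		n;
	char	   *buf;

	if (fp == NULL)
		return NULL;
	buf = malloc(cap);
	if (buf == NULL)
	{
		fclose(fp);
		return NULL;
	}
	while ((n = fread(buf + len, 1, cap - len - 1, fp)) > 0)
	{
		len += n;
		if (cap - len <= 1)
		{
			/* Buffer is full (one byte is reserved for the NUL): grow it. */
			char	   *tmp = realloc(buf, cap * 2);

			if (tmp == NULL)
			{
				free(buf);
				fclose(fp);
				return NULL;
			}
			buf = tmp;
			cap *= 2;
		}
	}
	buf[len] = '\0';
	fclose(fp);
	return buf;
}

/*
 * Re-read the file until it contains needle, sleeping delay_us microseconds
 * between attempts. Returns true on a match, false once max_attempts
 * attempts have failed. An unreadable file counts as "no match yet".
 */
bool
poll_file_until(const char *path, const char *needle,
				int max_attempts, long delay_us)
{
	assert(path != NULL && needle != NULL);

	for (int attempt = 0; attempt < max_attempts; attempt++)
	{
		char	   *contents = slurp(path);
		bool		found = (contents != NULL && strstr(contents, needle) != NULL);
		struct timespec ts;

		free(contents);
		if (found)
			return true;

		ts.tv_sec = delay_us / 1000000L;
		ts.tv_nsec = (delay_us % 1000000L) * 1000L;
		nanosleep(&ts, NULL);
	}
	return false;
}
```

With `max_attempts = 1800` and `delay_us = 100000` this gives the same 180-second budget as the Perl helper. Because the wait ends as soon as the message shows up in the log, the test does not depend on a fixed sleep being long enough.
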
Attachment: v40-0006-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From 03f5b4852963c34dded67e4dfa7dc77baf993f0b Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 16 Feb 2021 08:05:18 +1100
Subject: [PATCH v40] Support 2PC txn tests for concurrent aborts.

Add tap tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  58 ++++++++++
 src/backend/replication/logical/reorderbuffer.c   |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c5e28ce..e0cd841 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -10,6 +10,8 @@ ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The output result didn't change in 180 seconds. Give up
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 929255e..3fa172a 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -174,6 +177,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -275,6 +279,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -471,6 +493,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -706,6 +755,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -918,6 +970,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -971,6 +1026,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5a62ab8..4a4a9ed 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2489,6 +2489,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
1.8.3.1
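
A side note on the check-xid-aborted parsing in the patch above: strtoul() called with a NULL end pointer accepts trailing garbage, and where unsigned long is 64 bits wide an oversized value is silently truncated by the cast to TransactionId. The following is a minimal standalone sketch of a stricter parse; it is not part of the patch. The names xid_t, INVALID_XID and parse_xid are hypothetical stand-ins for TransactionId, InvalidTransactionId and the option handling in pg_decode_startup().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for PostgreSQL's TransactionId definitions. */
typedef uint32_t xid_t;
#define INVALID_XID ((xid_t) 0)

/*
 * Parse a transaction id from text (base auto-detected, as with base 0
 * in the patch). Returns true and stores the value in *out on success.
 * Returns false for empty input, trailing garbage, out-of-range values,
 * or the invalid xid 0.
 */
bool
parse_xid(const char *text, xid_t *out)
{
	char	   *end = NULL;
	unsigned long val;

	assert(out != NULL);

	if (text == NULL || *text == '\0')
		return false;

	errno = 0;
	val = strtoul(text, &end, 0);

	if (errno != 0)				/* e.g. ERANGE on overflow */
		return false;
	if (end == text || *end != '\0')	/* no digits, or trailing junk */
		return false;
	if (val > UINT32_MAX)		/* does not fit in a 32-bit xid */
		return false;
	if ((xid_t) val == INVALID_XID)
		return false;

	*out = (xid_t) val;
	return true;
}
```

For test-only code this level of strictness may well be unnecessary; the sketch only shows which inputs the errno check alone would let through.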

#191Peter Smith
smithpb2250@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#188)

On Fri, Feb 12, 2021 at 5:59 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

(2)

File : v39-0006-Support-2PC-txn-Subscription-option.patch

@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
*streaming_given = true;
*streaming = defGetBoolean(defel);
}
+               else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+               {
+                       if (*twophase_given)
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_SYNTAX_ERROR),
+                                                errmsg("conflicting or redundant options")));
+                       *twophase_given = true;
+                       *twophase = defGetBoolean(defel);
+               }

You can add this test in subscription.sql easily with double twophase options.

Thanks for the feedback. You are right.

But in the pgoutput.c there are several other potential syntax errors
"conflicting or redundant options" which are just like this
"two_phase" one.
e.g. there is the same error for options "proto_version",
"publication_names", "binary", "streaming".

AFAIK none of those other syntax errors had any regression tests. That
is the reason why I did not include any new test for the "two_phase"
option.

So:
a) should I add a new test per your feedback comment, or
b) should I be consistent with the other similar errors, and not add the test?

Of course it is easy to add a new test if you think option (a) is best.

Thoughts?

-----
Kind Regards,
Peter Smith.
Fujitsu Australia
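
For reference, the check under discussion can be sketched outside PostgreSQL as follows. This is a simplified, hypothetical model: the Option struct, the ParseResult enum and parse_options() are illustrative names, not PostgreSQL APIs. Each option has a "given" flag, and seeing the same name twice is reported as the "conflicting or redundant options" case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a DefElem (name/value pair). */
typedef struct
{
	const char *name;
	bool		value;
} Option;

typedef enum
{
	PARSE_OK,
	PARSE_DUPLICATE,			/* "conflicting or redundant options" */
	PARSE_UNRECOGNIZED
} ParseResult;

/*
 * Walk the option list once. Both outputs default to false, mirroring the
 * defaults in the patch. A repeated option name is rejected.
 */
ParseResult
parse_options(const Option *opts, size_t nopts,
			  bool *streaming, bool *two_phase)
{
	bool		streaming_given = false;
	bool		two_phase_given = false;

	assert(streaming != NULL && two_phase != NULL);
	*streaming = false;
	*two_phase = false;

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "streaming") == 0)
		{
			if (streaming_given)
				return PARSE_DUPLICATE;
			streaming_given = true;
			*streaming = opts[i].value;
		}
		else if (strcmp(opts[i].name, "two_phase") == 0)
		{
			if (two_phase_given)
				return PARSE_DUPLICATE;
			two_phase_given = true;
			*two_phase = opts[i].value;
		}
		else
			return PARSE_UNRECOGNIZED;
	}

	return PARSE_OK;
}
```

A regression test for this path would only need to pass the same option twice and expect the error.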

#192osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#191)
RE: [HACKERS] logical decoding of two-phase transactions

Hi

On Tuesday, February 16, 2021 8:33 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Feb 12, 2021 at 5:59 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

(2)

File : v39-0006-Support-2PC-txn-Subscription-option.patch

@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
*streaming_given = true;
*streaming = defGetBoolean(defel);
}
+               else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+               {
+                       if (*twophase_given)
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_SYNTAX_ERROR),
+                                                errmsg("conflicting or redundant options")));
+                       *twophase_given = true;
+                       *twophase = defGetBoolean(defel);
+               }

You can add this test in subscription.sql easily with double twophase options.

Thanks for the feedback. You are right.

But in the pgoutput.c there are several other potential syntax errors
"conflicting or redundant options" which are just like this "two_phase" one.
e.g. there is the same error for options "proto_version", "publication_names",
"binary", "streaming".

AFAIK none of those other syntax errors had any regression tests. That is the
reason why I did not include any new test for the "two_phase"
option.

So:
a) should I add a new test per your feedback comment, or
b) should I be consistent with the other similar errors, and not add the test?

Of course it is easy to add a new test if you think option (a) is best.

Thoughts?

OK. Then we can assume that such tests were previously considered unnecessary
for the other options because the results are too obvious.
Let's choose (b) to keep the patch set consistent with the existing code.
Thanks.

Best Regards,
Takamichi Osumi

#193Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#190)
6 attachment(s)

Please find attached the new patch set v41*

(v40* needed to be rebased to current HEAD)

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v41-0001-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From c73e8dbbdc37d43a2c7bc56873b11eacc7e319e3 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Feb 2021 09:08:47 +1100
Subject: [PATCH v41] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfc924c..b50f962 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1
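
The shape of this refactoring can be illustrated with a small standalone sketch. It is only a toy model under stated assumptions: SpoolFile, ApplyState, replay_spooled_changes(), handle_stream_commit() and handle_stream_prepare() are hypothetical names, and the real apply_spooled_messages() reads serialized changes from a BufFile rather than from an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for a spool file of serialized changes. */
typedef struct
{
	const int  *changes;
	size_t		nchanges;
} SpoolFile;

/* Toy apply state: the running sum plays the role of the target table. */
typedef struct
{
	long		applied_sum;
	uint64_t	remote_final_lsn;
} ApplyState;

/*
 * Common replay logic shared by every caller. Records the final LSN,
 * applies each spooled change, and returns how many changes were applied.
 */
int
replay_spooled_changes(ApplyState *state, const SpoolFile *spool, uint64_t lsn)
{
	int			nchanges = 0;

	assert(state != NULL && spool != NULL);
	state->remote_final_lsn = lsn;

	for (size_t i = 0; i < spool->nchanges; i++)
	{
		state->applied_sum += spool->changes[i];
		nchanges++;
	}

	return nchanges;
}

/* Caller 1: replay at commit time. */
int
handle_stream_commit(ApplyState *state, const SpoolFile *spool,
					 uint64_t commit_lsn)
{
	return replay_spooled_changes(state, spool, commit_lsn);
}

/* Caller 2 (hypothetical): replay at prepare time reuses the same helper. */
int
handle_stream_prepare(ApplyState *state, const SpoolFile *spool,
					  uint64_t prepare_lsn)
{
	return replay_spooled_changes(state, spool, prepare_lsn);
}
```

Passing the LSN in as a parameter, rather than reading it from commit-specific data inside the helper, is what lets a second caller supply a different LSN.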

v41-0005-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From ffa49332b803fc148c46e3c4638012d17b20b8ab Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Feb 2021 10:53:46 +1100
Subject: [PATCH v41] Support 2PC txn - Subscription option.

This patch implements new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index bcb0acf..7610ab2 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -184,8 +184,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index c32fc81..98070a9 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd..55dd8da 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1167,7 +1167,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..a069c76 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -358,6 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +399,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +468,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -825,6 +844,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -835,7 +856,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -874,6 +896,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -892,7 +921,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +967,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1013,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 7714696..c602c3e 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e01d02e..6146b77 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2787,6 +2787,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3433,6 +3434,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2bf1295..3a1b404 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 40417e6..8e94a26 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..41e0d8c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..8d24b2e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,29 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..5c79dbd 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

Attachment: v41-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From ab58a07e00f581f141a5a86a93d4d7c57065e691 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Feb 2021 10:23:04 +1100
Subject: [PATCH v41] Add support for apply at prepare time to built-in logical
  replication.

To add support for streaming transactions at prepare time to the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* Allow skipping prepared transactions that are already prepared. We
skip only when the GID, origin_lsn, and origin_timestamp of a prepared
xact all match, to avoid matching a prepared xact from a different node.
A repeated prepare can arrive when the server or apply worker restarts
after a prepared transaction.
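
The matching rule in the last item can be sketched in isolation. This is
only an illustration under stated assumptions, not the patch's code:
PreparedXactSketch and lookup_prepared_sketch are hypothetical names, and
the struct merely stands in for the fields that LookupGXact reads from the
two-phase state. The point is that the GID alone is not treated as a match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the per-xact fields that get compared. */
typedef struct PreparedXactSketch
{
	const char *gid;			/* global transaction identifier */
	uint64_t	origin_lsn;		/* LSN where the remote prepare ended */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
	bool		valid;			/* false while the entry is being set up */
} PreparedXactSketch;

/*
 * Return true only if some valid entry matches on all three of GID,
 * origin LSN and origin timestamp.
 */
static bool
lookup_prepared_sketch(const PreparedXactSketch *xacts, size_t nxacts,
					   const char *gid, uint64_t prepare_end_lsn,
					   int64_t origin_prepare_timestamp)
{
	size_t		i;

	assert(gid != NULL);

	for (i = 0; i < nxacts; i++)
	{
		const PreparedXactSketch *x = &xacts[i];

		/* Ignore not-yet-valid entries and entries with another GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		/* Same GID: still require the origin LSN and timestamp to agree. */
		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == origin_prepare_timestamp)
			return true;
	}

	return false;
}
```

With this rule, a prepared xact that shares a GID but has a different
origin LSN or timestamp is not skipped; it is applied as a new transaction.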

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  68 ++++++
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 258 ++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 329 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 177 ++++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 ++++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 892 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8a4e149..262ceef 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2453,3 +2453,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if a prepared xact with the given GID, lsn and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts to be common, so we perform all I/O while
+			 * holding TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 685eaa6..73b420a 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -974,8 +974,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..1585754 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,264 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b50f962..e01d02e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* true while skipping an already-prepared transaction */
+bool		skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,263 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received the prepare because it
+	 * occurred before the walsender reached a consistent point, in which
+	 * case we need to skip the rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1026,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it
 	 * will be committed after processing all the messages. We need the
@@ -800,6 +1079,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -831,6 +1116,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1046,6 +1337,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1168,6 +1465,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1289,6 +1589,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1447,6 +1750,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1816,6 +2122,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1972,6 +2281,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2bf1295 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool        send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +845,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -870,6 +927,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1195,3 +1270,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..40417e6 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for prepare, and commit prepared transaction.
+ * prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bab31bf..6bb162e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bab4f3a..048681c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
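
The subscriber-side skip logic in the worker.c changes above (BEGIN PREPARE sets skip_prepared_txn when the GID is already prepared, the change handlers return early while it is set, and PREPARE clears it) can be sketched as a small standalone model. This is only an illustration under stated assumptions: ApplyState, gid_is_prepared and the on_* functions are hypothetical names that do not exist in the patch, GID_LEN is an illustrative size, and the lookup compares the GID alone, whereas the patch's LookupGXact also takes the prepare LSN and timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PREPARED 8		/* illustrative capacity, not from the patch */
#define GID_LEN 200			/* illustrative GID buffer size */

/* Hypothetical model of the apply worker state; zero-initialize before use. */
typedef struct ApplyState
{
	char		prepared[MAX_PREPARED][GID_LEN];	/* GIDs prepared locally */
	int			nprepared;
	bool		skipping;	/* models skip_prepared_txn */
	int			applied;	/* number of row changes applied */
} ApplyState;

/* Simplified lookup: matches on the GID only (an assumption of this sketch). */
bool
gid_is_prepared(const ApplyState *st, const char *gid)
{
	for (int i = 0; i < st->nprepared; i++)
	{
		if (strcmp(st->prepared[i], gid) == 0)
			return true;
	}
	return false;
}

/* BEGIN PREPARE: start skipping if this GID was already prepared. */
void
on_begin_prepare(ApplyState *st, const char *gid)
{
	st->skipping = gid_is_prepared(st, gid);
}

/* INSERT/UPDATE/DELETE/TRUNCATE: ignored while skipping. */
void
on_change(ApplyState *st)
{
	if (st->skipping)
		return;
	st->applied++;
}

/* PREPARE: either end the skip, or remember the GID as prepared. */
void
on_prepare(ApplyState *st, const char *gid)
{
	if (st->skipping)
	{
		st->skipping = false;
		return;
	}
	assert(st->nprepared < MAX_PREPARED);
	strncpy(st->prepared[st->nprepared], gid, GID_LEN - 1);
	st->prepared[st->nprepared][GID_LEN - 1] = '\0';
	st->nprepared++;
}
```

Feeding the same BEGIN PREPARE / change / PREPARE sequence twice for one GID applies the change only once in this model, which is the behaviour the patch aims for when the upstream resends a prepared transaction after a restart.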

v41-0002-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From 3bb5f1f37869711e55a84084efecf64500906b87 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Feb 2021 09:20:34 +1100
Subject: [PATCH v41] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replication origin replay progress
for 2PC, but it was incomplete. It did not properly track progress for
ROLLBACK PREPARED; in particular, it did not update the recovery code.
Additionally, we need to allow tracking on subscriber nodes, where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 70d2257..8a4e149 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2284,6 +2284,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2306,6 +2314,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 17fbc41..fc94821 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5709,8 +5709,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5916,7 +5915,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5965,6 +5965,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6006,7 +6013,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6014,7 +6022,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1
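
The redo change above calls replorigin_advance with the "backward" argument set to false when an abort record carries origin information. A minimal model of that forward-only behaviour is sketched below. It is a simplified illustration, not the server's implementation: Lsn, OriginProgress and origin_advance are hypothetical names, and locking, WAL logging and invalid-LSN handling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;		/* hypothetical stand-in for XLogRecPtr */

/* Hypothetical per-origin progress record. */
typedef struct OriginProgress
{
	Lsn			remote_lsn;	/* how far the remote stream has been applied */
	Lsn			local_lsn;	/* local WAL position that recorded it */
} OriginProgress;

/*
 * Advance progress for one origin. Unless go_backward is set, positions
 * only ever move forward, so replaying an older record is a no-op.
 */
void
origin_advance(OriginProgress *p, Lsn remote_lsn, Lsn local_lsn,
			   bool go_backward)
{
	if (go_backward || remote_lsn > p->remote_lsn)
		p->remote_lsn = remote_lsn;
	if (go_backward || local_lsn > p->local_lsn)
		p->local_lsn = local_lsn;
}
```

In this model, replaying a rollback-prepared record during recovery moves the origin forward to the rollback's position, so a restart does not request data the subscriber has already handled.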

v41-0004-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 312e527f9941c4a14a6b22ee0a43c170fd6db5a6 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Feb 2021 10:30:13 +1100
Subject: [PATCH v41] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and not streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are not visible on the subscriber after rollback');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test (with streaming = on)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then one server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then one server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction.
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v41-0006-Support-2PC-txn-tests-for-concurrent-aborts.patch (application/octet-stream)
From e63adb199d0f09438319adba9396c958f625f913 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Feb 2021 11:09:36 +1100
Subject: [PATCH v41] Support 2PC txn tests for concurrent aborts.

Add tap tests to test_decoding for testing concurrent aborts during 2PC.
---
 contrib/test_decoding/Makefile                    |   2 +
 contrib/test_decoding/t/001_twophase.pl           | 121 ++++++++++++++++++++
 contrib/test_decoding/t/002_twophase_streaming.pl | 133 ++++++++++++++++++++++
 contrib/test_decoding/test_decoding.c             |  58 ++++++++++
 src/backend/replication/logical/reorderbuffer.c   |   5 +
 5 files changed, 319 insertions(+)
 create mode 100644 contrib/test_decoding/t/001_twophase.pl
 create mode 100644 contrib/test_decoding/t/002_twophase_streaming.pl

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c5e28ce..e0cd841 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -10,6 +10,8 @@ ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot
 
+TAP_TESTS = 1
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/t/001_twophase.pl b/contrib/test_decoding/t/001_twophase.pl
new file mode 100644
index 0000000..3b3e7b8
--- /dev/null
+++ b/contrib/test_decoding/t/001_twophase.pl
@@ -0,0 +1,121 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE tab (a int PRIMARY KEY)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO tab SELECT generate_series(1,10)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13,14);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted"
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of prepared txn test_prepared_tab")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# Test 2:
+# Check that commit prepared is decoded properly on immediate restart
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab VALUES (11);
+    INSERT INTO tab VALUES (12);
+    ALTER TABLE tab ADD COLUMN b INT;
+    INSERT INTO tab VALUES (13, 11);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+# consume changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+$node_logical->stop('immediate');
+$node_logical->start;
+
+# commit post the restart
+$node_logical->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');");
+
+# check inserts are visible
+my $result = $node_logical->safe_psql('postgres', "SELECT count(*) FROM tab where a IN (11,12) OR b IN (11);");
+is($result, qq(3), 'Rows inserted via 2PC are visible on restart');
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/t/002_twophase_streaming.pl b/contrib/test_decoding/t/002_twophase_streaming.pl
new file mode 100644
index 0000000..15001c6
--- /dev/null
+++ b/contrib/test_decoding/t/002_twophase_streaming.pl
@@ -0,0 +1,133 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 2;
+use Time::HiRes qw(usleep);
+use Scalar::Util qw(looks_like_number);
+
+# Initialize node
+my $node_logical = get_new_node('logical');
+$node_logical->init(allows_streaming => 'logical');
+$node_logical->append_conf(
+        'postgresql.conf', qq(
+        max_prepared_transactions = 10
+		logical_decoding_work_mem = 64kB
+));
+$node_logical->start;
+
+# Create some pre-existing content on logical
+$node_logical->safe_psql('postgres', "CREATE TABLE stream_test (data text)");
+$node_logical->safe_psql('postgres',
+	"INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1,3) g(i)");
+$node_logical->safe_psql('postgres',
+	"SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');");
+
+# Test 1:
+# This test is specifically for testing concurrent abort while logical decode
+# is ongoing. We will pass in the xid of the 2PC to the plugin as an option.
+# On the receipt of a valid "check-xid-aborted", the change API in the test decoding
+# plugin will wait for it to be aborted.
+#
+# We will fire off a ROLLBACK from another session when this decode
+# is waiting.
+#
+# The status of "check-xid-aborted" will change from in-progress to not-committed
+# (hence aborted) and we will stop decoding because the subsequent
+# system catalog scan will error out.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+	savepoint s1;
+	SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+	INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+	TRUNCATE table stream_test;
+	rollback to s1;
+	INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
+	PREPARE TRANSACTION 'test1';");
+# get XID of the above two-phase transaction
+my $xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test1'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# start decoding the above by passing the "check-xid-aborted" option
+my $logical_connstr = $node_logical->connstr . ' dbname=postgres';
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test1';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+# consume any remaining changes
+$node_logical->safe_psql('postgres', "SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1','stream-changes', '1');");
+
+# Test 2:
+# Check concurrent aborts while decoding a TRUNCATE.
+
+$node_logical->safe_psql('postgres', "
+    BEGIN;
+    savepoint s1;
+    SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+    INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM generate_series(1, 35) g(i);
+    TRUNCATE table stream_test;
+    rollback to s1;
+    TRUNCATE table stream_test;
+    PREPARE TRANSACTION 'test2';");
+# get XID of the above two-phase transaction
+$xid2pc = $node_logical->safe_psql('postgres', "SELECT transaction FROM pg_prepared_xacts WHERE gid = 'test2'");
+is(looks_like_number($xid2pc), qq(1), 'Got a valid two-phase XID');
+
+# decode now, it should include an ABORT entry because of the ROLLBACK below
+system_log("psql -d \"$logical_connstr\" -c \"SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1', 'check-xid-aborted', '$xid2pc','stream-changes', '1');\" \&");
+
+# check that decode starts waiting for this $xid2pc
+poll_output_until("waiting for $xid2pc to abort")
+    or die "no wait happened for the abort";
+
+# rollback the prepared transaction
+$node_logical->safe_psql('postgres', "ROLLBACK PREPARED 'test2';");
+
+# check for occurrence of the log about stopping this decoding
+poll_output_until("stop decoding of txn $xid2pc")
+    or die "no decoding stop for the rollback";
+
+
+$node_logical->safe_psql('postgres', "SELECT pg_drop_replication_slot('regression_slot');");
+$node_logical->stop('fast');
+
+sub poll_output_until
+{
+    my ($expected) = @_;
+
+    $expected = 'xxxxxx' unless defined($expected); # default junk value
+
+    my $max_attempts = 180 * 10;
+    my $attempts     = 0;
+
+    my $output_file = '';
+    while ($attempts < $max_attempts)
+    {
+        $output_file = slurp_file($node_logical->logfile());
+
+        if ($output_file =~ $expected)
+        {
+            return 1;
+        }
+
+        # Wait 0.1 second before retrying.
+        usleep(100_000);
+        $attempts++;
+    }
+
+    # The expected output did not appear within 180 seconds. Give up.
+    return 0;
+}
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 929255e..3fa172a 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -11,11 +11,13 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include "miscadmin.h"
 
 #include "catalog/pg_type.h"
 
 #include "replication/logical.h"
 #include "replication/origin.h"
+#include "storage/procarray.h"
 
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -35,6 +37,7 @@ typedef struct
 	bool		include_timestamp;
 	bool		skip_empty_xacts;
 	bool		only_local;
+	TransactionId check_xid_aborted;	/* track abort of this txid */
 } TestDecodingData;
 
 /*
@@ -174,6 +177,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	data->include_timestamp = false;
 	data->skip_empty_xacts = false;
 	data->only_local = false;
+	data->check_xid_aborted = InvalidTransactionId;
 
 	ctx->output_plugin_private = data;
 
@@ -275,6 +279,24 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 								strVal(elem->arg), elem->defname)));
 		}
+		else if (strcmp(elem->defname, "check-xid-aborted") == 0)
+		{
+			if (elem->arg == NULL)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("check-xid-aborted needs an input value")));
+			else
+			{
+				errno = 0;
+				data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0);
+
+				if (errno || !TransactionIdIsValid(data->check_xid_aborted))
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+									strVal(elem->arg))));
+			}
+		}
 		else
 		{
 			ereport(ERROR,
@@ -471,6 +493,30 @@ pg_decode_filter(LogicalDecodingContext *ctx,
 	return false;
 }
 
+static void
+test_concurrent_aborts(TestDecodingData *data)
+{
+	/*
+	 * If check_xid_aborted is a valid xid, then it was passed in as an option
+	 * to check if the transaction having this xid would be aborted. This is
+	 * to test concurrent aborts.
+	 */
+	if (TransactionIdIsValid(data->check_xid_aborted))
+	{
+		elog(LOG, "waiting for %u to abort", data->check_xid_aborted);
+		while (TransactionIdIsInProgress(data->check_xid_aborted))
+		{
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(10000L);
+		}
+		if (!TransactionIdIsInProgress(data->check_xid_aborted) &&
+				!TransactionIdDidCommit(data->check_xid_aborted))
+			elog(LOG, "%u aborted", data->check_xid_aborted);
+
+		Assert(TransactionIdDidAbort(data->check_xid_aborted));
+	}
+}
+
 /*
  * Print literal `outputstr' already represented as string of type `typid'
  * into stringbuf `s'.
@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	class_form = RelationGetForm(relation);
 	tupdesc = RelationGetDescr(relation);
 
@@ -706,6 +755,9 @@ pg_decode_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -918,6 +970,9 @@ pg_decode_stream_change(LogicalDecodingContext *ctx,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* Test for concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
@@ -971,6 +1026,9 @@ pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	}
 	txndata->xact_wrote_changes = txndata->stream_wrote_changes = true;
 
+	/* For testing concurrent aborts */
+	test_concurrent_aborts(data);
+
 	OutputPluginPrepareWrite(ctx, true);
 	if (data->include_xids)
 		appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5a62ab8..4a4a9ed 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2489,6 +2489,11 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			curtxn->concurrent_abort = true;
 
 			/* Reset the TXN so that it is allowed to stream remaining data. */
+			if (rbtxn_prepared(txn))
+				elog(LOG, "stop decoding of prepared txn %s (%u)",
+					 txn->gid != NULL ? txn->gid : "", txn->xid);
+			else
+				elog(LOG, "stop decoding of txn %u", txn->xid);
 			ReorderBufferResetTXN(rb, txn, snapshot_now,
 								  command_id, prev_lsn,
 								  specinsert);
-- 
1.8.3.1

#194Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#193)

On Thu, Feb 18, 2021 at 5:48 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the new patch set v41*

I see one issue here. Currently, when we create a subscription, we
first launch the apply worker and create the main apply worker slot,
and then launch tablesync workers as required. Now, assume the apply
worker slot has been created and we then launch a tablesync worker,
which initiates the creation of its own slot (sync_slot). Suppose
that, on the publisher side, a transaction is prepared before that
slot reaches a consistent snapshot. This is the exact scenario we have
in twophase_snapshot.spec, where we skip the prepared xact for this
reason.

Because the WALSender corresponding to the apply worker is already
running, it is in a consistent state, so it can decode such a prepared
xact and send it to the subscriber. On the subscriber side, the apply
worker can skip applying the data-modification operations because the
corresponding rel is not yet in a ready state (see
should_apply_changes_for_rel and its callers), simply because the
corresponding tablesync worker has not finished yet. But the prepare
will still occur, and it will lead to a prepared transaction on the
subscriber.

In this situation, the tablesync worker has skipped the prepare because
the snapshot was not consistent, and then exited because it is in sync
with the apply worker. The apply worker has skipped the changes because
tablesync was in progress. Later, when the commit prepared arrives, the
apply worker will simply commit the previously prepared transaction,
and we will never see the prepared transaction's data.

So, the basic premise is that we can't allow tablesync workers to skip
prepared transactions (which can be processed by the apply worker) and
then process the later commits.

I have one idea to address this. When we get the first begin_prepare
in the apply worker, we can check whether any relations are in the
"not_ready" state and, if so, wait until all the relations are in sync
with the apply worker. This avoids the case where a tablesync worker
skips a prepared xact and the apply worker skips the same one.

Now, it is possible that some tablesync worker has copied the data and
moved its sync position ahead of the apply worker's current position.
In such a case, we need to process transactions in the apply worker
such that we can process commits, if any, and write prepared
transactions to a file. For prepared transactions, we can make a
decision only once their commit prepared has arrived.
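
The lost-data scenario and the proposed wait can be sketched as a toy
model. This is only a minimal illustration under stated assumptions,
not code from worker.c: ToySubscriber, ToyRelState, the toy_* functions
and the wait_for_sync flag are all hypothetical names, and the
"wait" is modelled by simply letting tablesync finish first.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sync state of the single table in this toy subscription. */
typedef enum { TOY_REL_NOT_READY, TOY_REL_READY } ToyRelState;

typedef struct
{
    ToyRelState rel_state;        /* has tablesync finished for the table? */
    int         rows_in_table;    /* committed rows visible on the subscriber */
    int         rows_in_prepared; /* rows held by the prepared transaction */
    bool        has_prepared_xact;
} ToySubscriber;

/*
 * Tablesync reaches its sync point. In this model it has skipped the
 * prepare (its snapshot was not yet consistent), so it copies none of
 * the prepared transaction's rows.
 */
void
toy_tablesync_finish(ToySubscriber *sub)
{
    sub->rel_state = TOY_REL_READY;
}

/* Apply worker handles begin_prepare .. prepare for a txn with nrows inserts. */
void
toy_apply_prepare(ToySubscriber *sub, int nrows, bool wait_for_sync)
{
    /* Proposed fix: do not proceed while any relation is not ready. */
    if (wait_for_sync && sub->rel_state != TOY_REL_READY)
        toy_tablesync_finish(sub);  /* stands in for blocking until sync is done */

    /* Changes for a relation that is not ready are skipped, but the prepare still happens. */
    sub->rows_in_prepared = (sub->rel_state == TOY_REL_READY) ? nrows : 0;
    sub->has_prepared_xact = true;
}

/* Apply worker handles commit prepared: whatever was prepared becomes visible. */
void
toy_apply_commit_prepared(ToySubscriber *sub)
{
    assert(sub->has_prepared_xact);
    sub->rows_in_table += sub->rows_in_prepared;
    sub->rows_in_prepared = 0;
    sub->has_prepared_xact = false;
}

/* Run the whole scenario and return the rows visible after commit prepared. */
int
toy_run_scenario(int nrows, bool wait_for_sync)
{
    ToySubscriber sub = { TOY_REL_NOT_READY, 0, 0, false };

    toy_apply_prepare(&sub, nrows, wait_for_sync);
    toy_tablesync_finish(&sub);     /* no-op if the apply worker already waited */
    toy_apply_commit_prepared(&sub);
    return sub.rows_in_table;
}
```

In this model, toy_run_scenario(3, false) returns 0 (both workers
skipped the data), while toy_run_scenario(3, true) returns 3.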

--
With Regards,
Amit Kapila.

#195Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#193)
5 attachment(s)

Please find attached the new patch set v42*

This removes the (development only) patch v41-0006, which was causing
some random cfbot failures.

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v42-0001-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From b81e3cae8cf27a9f3c985e24d3e51495be5ed612 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Feb 2021 09:08:47 +1100
Subject: [PATCH v42] Refactor spool-file logic in worker.c.

This patch is a refactoring only: it isolates the streaming spool-file
processing in a separate function. A later patch to support applying
prepared transactions will need to call this common processing logic
from another place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cfc924c..b50f962 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v42-0002-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From b119be8be46241ef454b3d3f0a51d154ba4fd178 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Feb 2021 09:20:34 +1100
Subject: [PATCH v42] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replication origin replay progress
for 2PC, but it was not complete. It failed to properly track the
progress for rollback prepared; in particular, it did not update the
recovery code. Additionally, we need to allow tracking it on subscriber
nodes, where wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 70d2257..8a4e149 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2284,6 +2284,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2306,6 +2314,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 17fbc41..fc94821 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5709,8 +5709,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5916,7 +5915,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5965,6 +5965,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6006,7 +6013,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6014,7 +6022,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1

v42-0005-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From d51da57cfcdd4364e611c1a5dcafe3a6d8268fc0 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Feb 2021 10:53:46 +1100
Subject: [PATCH v42] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index bcb0acf..7610ab2 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -184,8 +184,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index c32fc81..98070a9 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd..55dd8da 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1167,7 +1167,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..a069c76 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -358,6 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +399,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +468,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -825,6 +844,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -835,7 +856,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -874,6 +896,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -892,7 +921,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +967,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1013,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 7714696..c602c3e 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -429,6 +429,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e01d02e..6146b77 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2787,6 +2787,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3433,6 +3434,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2bf1295..3a1b404 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default,
+		 * in which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 40417e6..8e94a26 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..41e0d8c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..8d24b2e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,29 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..5c79dbd 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
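
For reference, the two_phase negotiation that pgoutput_startup performs in the
patch above can be sketched in isolation. This is only a minimal standalone
illustration under stated assumptions, not code from the patch:
decide_twophase, TwoPhaseDecision and PROTO_2PC_VERSION are hypothetical names,
and the enum values stand in for the ereport(ERROR, ...) calls the real code
makes.

```c
#include <stdbool.h>

/* Hypothetical stand-in for LOGICALREP_PROTO_2PC_VERSION_NUM (2 in this patch). */
#define PROTO_2PC_VERSION 2

/* Hypothetical result type; the patch raises errors instead of returning codes. */
typedef enum
{
	TWOPHASE_DISABLED,
	TWOPHASE_ENABLED,
	TWOPHASE_ERR_PROTO_TOO_OLD,
	TWOPHASE_ERR_PLUGIN_UNSUPPORTED
} TwoPhaseDecision;

/*
 * Mirror the order of checks in the patch: an unrequested option simply
 * disables two-phase decoding; otherwise the protocol version is checked
 * first, then whether the decoding context reports two-phase support.
 */
static TwoPhaseDecision
decide_twophase(bool requested, unsigned proto_version, bool plugin_supports)
{
	if (!requested)
		return TWOPHASE_DISABLED;
	if (proto_version < PROTO_2PC_VERSION)
		return TWOPHASE_ERR_PROTO_TOO_OLD;
	if (!plugin_supports)
		return TWOPHASE_ERR_PLUGIN_UNSUPPORTED;
	return TWOPHASE_ENABLED;
}
```

Note that the protocol check comes before the plugin check, so a subscriber
asking for two_phase with proto_version 1 would see the protocol error even if
the plugin side lacks support as well.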

v42-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From b7f2df5c96683bfcf72f7aa24cef401d5e82b36e Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Feb 2021 10:23:04 +1100
Subject: [PATCH v42] Add support for apply at prepare time to built-in logical
  replication.

To add support for streaming transactions at prepare time to built-in
logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* We allow skipping prepared transactions if they are already prepared.
We skip only when the GID, origin_lsn, and origin_timestamp of a prepared
xact all match, to avoid the possibility of matching a prepared xact from
a different node. This can happen when the server or apply worker restarts
after a prepared transaction.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  68 ++++++
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 258 ++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 329 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 177 ++++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 ++++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 892 insertions(+), 39 deletions(-)
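
The skip rule from the commit message (match on GID, origin_lsn and
origin_timestamp together) can be sketched outside the server roughly as
follows. This is a simplified illustration under stated assumptions, not the
patch's LookupGXact: PreparedXactSketch and lookup_prepared are hypothetical
names, the array stands in for TwoPhaseState->prepXacts, and locking and
reading the 2PC file header are left out entirely.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one prepared-transaction entry. */
typedef struct
{
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* LSN where the remote prepare ended */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
	bool		valid;				/* entry is fully prepared */
} PreparedXactSketch;

/*
 * Return true only if a valid entry matches on all three of GID,
 * origin LSN and origin timestamp. A GID match alone is not enough,
 * because an unrelated prepared xact may carry the same GID.
 */
static bool
lookup_prepared(const PreparedXactSketch *xacts, int n,
				const char *gid, uint64_t prepare_end_lsn,
				int64_t origin_prepare_timestamp)
{
	for (int i = 0; i < n; i++)
	{
		const PreparedXactSketch *x = &xacts[i];

		/* Ignore not-yet-valid entries and entries with a different GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```

With two entries sharing a GID but differing in origin LSN, only a lookup that
passes the matching LSN and timestamp pair succeeds, which is the property the
apply worker relies on when deciding to skip a prepare it has already applied.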

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8a4e149..262ceef 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2453,3 +2453,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if a prepared xact with the given GID, lsn and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 685eaa6..73b420a 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -974,8 +974,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and other
+	 * transactions commit between the prepare and the commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..1585754 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,264 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b50f962..e01d02e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* true while skipping the changes of an already-prepared transaction */
+bool		skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,263 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received the prepare because it occurred
+	 * before the walsender reached a consistent point, in which case we need
+	 * to skip the rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1026,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it
 	 * will be committed after processing all the messages. We need the
@@ -800,6 +1079,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -831,6 +1116,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1046,6 +1337,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1168,6 +1465,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1289,6 +1589,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1447,6 +1750,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1816,6 +2122,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1972,6 +2281,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2bf1295 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +845,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -870,6 +927,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1195,3 +1270,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..40417e6 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for prepare, and commit prepared transaction.
+ * prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bab31bf..6bb162e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bab4f3a..048681c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
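
For readers following the header changes above: the comment on
LogicalRepRollbackPreparedTxnData says the downstream uses the prepare end
LSN and prepare time, together with the gid, to decide whether it actually
received the prepared transaction that is now being rolled back. Below is a
minimal standalone sketch of that check. It is not the apply worker code from
this patchset. LocalPreparedTxn, RollbackPreparedMsg, prepared_txn_matches()
and should_apply_rollback_prepared() are hypothetical names made up for
illustration, and the typedefs and GIDSIZE value are stand-ins so the snippet
compiles outside the PostgreSQL tree.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the PostgreSQL definitions; assumptions for this sketch only. */
typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;
#define GIDSIZE 200

/* Hypothetical record of a prepared transaction the subscriber knows about. */
typedef struct LocalPreparedTxn
{
	XLogRecPtr	prepare_end_lsn;
	TimestampTz preparetime;
	char		gid[GIDSIZE];
} LocalPreparedTxn;

/* Hypothetical subset of the fields carried by a rollback prepared message. */
typedef struct RollbackPreparedMsg
{
	XLogRecPtr	prepare_end_lsn;
	TimestampTz preparetime;
	char		gid[GIDSIZE];
} RollbackPreparedMsg;

/*
 * The gid alone is not enough: an unrelated local prepared transaction may
 * use the same identifier, so the prepare LSN and time must match as well.
 */
static bool
prepared_txn_matches(const LocalPreparedTxn *local,
					 const RollbackPreparedMsg *msg)
{
	return strcmp(local->gid, msg->gid) == 0 &&
		local->prepare_end_lsn == msg->prepare_end_lsn &&
		local->preparetime == msg->preparetime;
}

/*
 * Return true if the rollback should be applied, i.e. some locally known
 * prepared transaction matches the message; otherwise it can be skipped.
 */
static bool
should_apply_rollback_prepared(const LocalPreparedTxn *txns, int ntxns,
							   const RollbackPreparedMsg *msg)
{
	for (int i = 0; i < ntxns; i++)
	{
		if (prepared_txn_matches(&txns[i], msg))
			return true;
	}
	return false;
}
```

With one local entry { 100, 5000, "gid_a" }, a message carrying the same
three values would be applied, while a message with the same gid but a
different prepare LSN or time would be skipped.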

Attachment: v42-0004-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From bcb61314a4a9c06227140ba7b1913d3ac6352e48 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Feb 2021 10:30:13 +1100
Subject: [PATCH v42] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and not streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test (streaming = on)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber (a = 1..5000 minus the 1666 multiples of 3 leaves 3334 rows)
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then one server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then one server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works when using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming enabled.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#196Markus Wanner
markus@bluegap.ch
In reply to: Amit Kapila (#177)

Hello Amit,

On 04.01.21 09:18, Amit Kapila wrote:

Thanks, I have pushed the 0001* patch after making the above and a few
other cosmetic modifications.

That commit added the following snippet to the top of
ReorderBufferFinishPrepared:

    txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);

    /* unknown transaction, nothing to do */
    if (txn == NULL)
        return;

Passing true for the create argument seems like an oversight. I think
this should pass false and not ever (have to) create a ReorderBufferTXN
entry.
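
A minimal, self-contained sketch of the point above, assuming a lookup helper with a create flag in the spirit of ReorderBufferTXNByXid. The Toy* types and toy_* functions are hypothetical stand-ins for illustration, not PostgreSQL code. With create = true the lookup allocates a missing entry, so the NULL check after it is not expected to fire; with create = false an unknown xid yields NULL and the early return becomes reachable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the reorderbuffer.c types. */
typedef uint32_t ToyXid;

typedef struct ToyTxn
{
	ToyXid		xid;
	bool		in_use;
} ToyTxn;

#define TOY_MAX_TXNS 8

typedef struct ToyBuffer
{
	ToyTxn		txns[TOY_MAX_TXNS];
} ToyBuffer;

/*
 * Look up the entry for xid. With create = false an unknown xid yields NULL.
 * With create = true a fresh entry is allocated, so the result is non-NULL
 * as long as a free slot exists.
 */
ToyTxn *
toy_txn_by_xid(ToyBuffer *buf, ToyXid xid, bool create)
{
	ToyTxn	   *free_slot = NULL;

	assert(buf != NULL);

	for (size_t i = 0; i < TOY_MAX_TXNS; i++)
	{
		ToyTxn	   *t = &buf->txns[i];

		if (t->in_use && t->xid == xid)
			return t;
		if (!t->in_use && free_slot == NULL)
			free_slot = t;
	}

	if (!create || free_slot == NULL)
		return NULL;

	free_slot->xid = xid;
	free_slot->in_use = true;
	return free_slot;
}

/* Returns true if a known transaction was finished, false if it was skipped. */
bool
toy_finish_prepared(ToyBuffer *buf, ToyXid xid)
{
	/* Pass create = false: finishing must never create an entry. */
	ToyTxn	   *txn = toy_txn_by_xid(buf, xid, false);

	/* unknown transaction, nothing to do */
	if (txn == NULL)
		return false;

	txn->in_use = false;		/* release the entry */
	return true;
}
```

Calling toy_finish_prepared() for an xid that was never registered leaves the buffer untouched, which is the behaviour the proposed change asks for.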

Regards

Markus

#197Amit Kapila
amit.kapila16@gmail.com
In reply to: Markus Wanner (#196)

On Mon, Feb 22, 2021 at 11:04 PM Markus Wanner <markus@bluegap.ch> wrote:

On 04.01.21 09:18, Amit Kapila wrote:

Thanks, I have pushed the 0001* patch after making the above and a few
other cosmetic modifications.

That commit added the following snippet to the top of
ReorderBufferFinishPrepared:

    txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);

    /* unknown transaction, nothing to do */
    if (txn == NULL)
        return;

Passing true for the create argument seems like an oversight. I think
this should pass false and not ever (have to) create a ReorderBufferTXN
entry.

Right, I'll push a fix for this. Thanks for the report!

--
With Regards,
Amit Kapila.

#198Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#197)

On Tue, Feb 23, 2021 at 7:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Feb 22, 2021 at 11:04 PM Markus Wanner <markus@bluegap.ch> wrote:

On 04.01.21 09:18, Amit Kapila wrote:

Thanks, I have pushed the 0001* patch after making the above and a few
other cosmetic modifications.

That commit added the following snippet to the top of
ReorderBufferFinishPrepared:

    txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, false);

    /* unknown transaction, nothing to do */
    if (txn == NULL)
        return;

Passing true for the create argument seems like an oversight. I think
this should pass false and not ever (have to) create a ReorderBufferTXN
entry.

Right, I'll push a fix for this.

Pushed!

--
With Regards,
Amit Kapila.

#199Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#195)
8 attachment(s)

Please find attached the latest patch set v43*

Differences from v42*

- Rebased to HEAD as of today

- Added new patch 0006 "Tablesync early exit" as discussed here [1]

- Added new patch 0007 "Fix apply worker prepare" as discussed here [2]

- Added new patch 0008 "Fix apply worker prepare (dev logs)" (to aid
testing of patch 0007)

~~

(The 0006 patch has a known whitespace problem. I will fix that next time)

----
[1]: /messages/by-id/CAHut+Ptjk-Qgd3R1a1_tr62CmiswcYphuv0pLmVA-+2s8r0Bkw@mail.gmail.com
[2]: /messages/by-id/CAA4eK1L=dhuCRvyDvrXX5wZgc7s1hLRD29CKCK6oaHtVCPgiFA@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v43-0001-Refactor-spool-file-logic-in-worker.c.patch
From 10e5067cc251a9bdf0f255600ff520ba25e71b88 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 19 Feb 2021 10:54:14 +1100
Subject: [PATCH v43] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..45ac498 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v43-0002-Track-replication-origin-progress-for-rollbacks.patch
From c2c94dbc21263bb80d82595556ef2ff2581ae8b3 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 19 Feb 2021 10:56:18 +1100
Subject: [PATCH v43] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replication origin replay progress
for 2PC, but it was incomplete. It did not properly track the progress
for rollback prepared; in particular, it did not update the recovery
code. Additionally, we need to allow tracking it on subscriber nodes
where wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20..6023e7c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2276,6 +2276,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2298,6 +2306,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df..acdb28d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5716,8 +5716,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5923,7 +5922,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5972,6 +5972,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6013,7 +6020,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6021,7 +6029,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1

v43-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch
From d37773aea7a7fae407c082873db587a74e0a16d8 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 19 Feb 2021 10:58:23 +1100
Subject: [PATCH v43] Add support for apply at prepare time to built-in logical
   replication.

To add support for streaming transactions at prepare time to built-in
logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* We allow skipping prepared transactions if they are already prepared,
which can happen when the server or apply worker restarts after a
prepared transaction. We skip only when the GID, origin_lsn, and
origin_timestamp of a prepared xact all match, to avoid matching a
prepared xact from a different node.
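
The matching rule in the last item can be sketched as follows. This is a simplified model, assuming an in-memory array of prepared transactions; ToyPreparedXact and toy_lookup_gxact() are hypothetical names for illustration, and the sketch skips the locking and the 2PC file/WAL reads that LookupGXact performs in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of one prepared transaction. */
typedef struct ToyPreparedXact
{
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* end of PREPARE on the origin node */
	int64_t		origin_timestamp;	/* prepare time on the origin node */
	bool		valid;				/* entry is fully prepared */
} ToyPreparedXact;

/*
 * Return true only if a valid entry matches on all three of GID, origin
 * LSN and origin timestamp. A GID match alone is not enough, because an
 * unrelated prepared transaction may carry the same GID.
 */
bool
toy_lookup_gxact(const ToyPreparedXact *xacts, size_t nxacts,
				 const char *gid, uint64_t prepare_end_lsn,
				 int64_t origin_prepare_timestamp)
{
	assert(gid != NULL);

	for (size_t i = 0; i < nxacts; i++)
	{
		const ToyPreparedXact *x = &xacts[i];

		/* Ignore not-yet-valid entries and other GIDs. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```

An apply worker that receives a prepare it has already applied would see true here and skip the transaction; a same-named transaction with a different origin LSN or timestamp is treated as new.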

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  68 ++++++
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 258 ++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 329 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 177 ++++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 ++++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 892 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char* buf;
+			TwoPhaseFileHeader* hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transactions commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..1585754 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,264 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepare message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 45ac498..6a5b23f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* for skipping prepared transaction */
+bool        skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,263 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1026,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it
 	 * will be committed after processing all the messages. We need the
@@ -800,6 +1079,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -831,6 +1116,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1046,6 +1337,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1168,6 +1465,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1289,6 +1589,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1447,6 +1750,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1816,6 +2122,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1972,6 +2281,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2bf1295 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +845,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -870,6 +927,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1195,3 +1270,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..40417e6 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure holds the
+ * information for both PREPARE and COMMIT PREPARED. For COMMIT PREPARED,
+ * prepare_lsn and preparetime store the commit LSN and the commit time,
+ * respectively.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * (prepare_end_lsn and preparetime) is used to check whether the downstream
+ * has received this prepared transaction: if so, it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bab31bf..6bb162e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bab4f3a..048681c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
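
For readers skimming the patch above, the new protocol message bytes it adds to LogicalRepMsgType can be summarized in a small standalone sketch. The byte values are copied from the logicalproto.h hunk; the enum name SketchMsgType and the helper sketch_twophase_msg_name() are hypothetical, exist only for illustration, and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Message bytes as added to LogicalRepMsgType by the patch above.
 * The enum and its member names are hypothetical stand-ins.
 */
typedef enum SketchMsgType
{
	SKETCH_MSG_BEGIN_PREPARE = 'b',
	SKETCH_MSG_PREPARE = 'P',
	SKETCH_MSG_COMMIT_PREPARED = 'K',
	SKETCH_MSG_ROLLBACK_PREPARED = 'r',
	SKETCH_MSG_STREAM_PREPARE = 'p'
} SketchMsgType;

/*
 * Hypothetical helper: map a message byte to a readable label.
 * Returns NULL for any byte that is not one of the five new messages.
 */
static const char *
sketch_twophase_msg_name(char action)
{
	switch (action)
	{
		case SKETCH_MSG_BEGIN_PREPARE:
			return "BEGIN PREPARE";
		case SKETCH_MSG_PREPARE:
			return "PREPARE";
		case SKETCH_MSG_COMMIT_PREPARED:
			return "COMMIT PREPARED";
		case SKETCH_MSG_ROLLBACK_PREPARED:
			return "ROLLBACK PREPARED";
		case SKETCH_MSG_STREAM_PREPARE:
			return "STREAM PREPARE";
		default:
			return NULL;
	}
}
```

Note that the bytes are case-sensitive: 'P' is PREPARE while 'p' is STREAM PREPARE, so a dispatcher has to compare the raw byte and must not fold case.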

Attachment: v43-0005-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 4fbcd535b6cc7c658b3a8d4faf21f74476672c95 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 19 Feb 2021 11:03:00 +1100
Subject: [PATCH v43] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option, "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)
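
The pgoutput.c part of this patch only honours two_phase when the requested protocol version is high enough and the decoding context supports it. The order of those checks can be sketched standalone as below. This is an illustration under assumptions, not the patch's code: sketch_decide_twophase() and the SketchTwoPhase enum are hypothetical names, the patch raises ereport(ERROR) where the sketch returns an error code, and the minimum protocol version is passed in as a parameter because the value of LOGICALREP_PROTO_2PC_VERSION_NUM is not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome codes; the patch raises ereport(ERROR) instead. */
typedef enum SketchTwoPhase
{
	SKETCH_2PC_DISABLED,
	SKETCH_2PC_ENABLED,
	SKETCH_2PC_ERR_PROTO_TOO_OLD,
	SKETCH_2PC_ERR_NO_PLUGIN_SUPPORT
} SketchTwoPhase;

/*
 * Mirror the check order used in pgoutput_startup():
 * 1. not requested             -> disabled, no error
 * 2. protocol version too old  -> error
 * 3. context lacks 2PC support -> error
 * 4. otherwise                 -> enabled
 */
static SketchTwoPhase
sketch_decide_twophase(bool requested, unsigned proto_version,
					   unsigned min_2pc_version, bool ctx_twophase)
{
	if (!requested)
		return SKETCH_2PC_DISABLED;
	if (proto_version < min_2pc_version)
		return SKETCH_2PC_ERR_PROTO_TOO_OLD;
	if (!ctx_twophase)
		return SKETCH_2PC_ERR_NO_PLUGIN_SUPPORT;
	return SKETCH_2PC_ENABLED;
}
```

Because the "not requested" case is tested first, a subscriber that never asks for two_phase is unaffected by the protocol version it negotiates.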

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..9c23497 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -184,8 +184,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd..55dd8da 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1167,7 +1167,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..a069c76 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -358,6 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +399,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +468,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -825,6 +844,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -835,7 +856,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -874,6 +896,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -892,7 +921,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +967,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1013,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..74787d1 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -427,6 +427,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6a5b23f..24e49f0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2786,6 +2786,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3432,6 +3433,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2bf1295..3a1b404 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by
+		 * default, in which case we just update the flag in the decoding
+		 * context. Otherwise we only allow it with a sufficient version of
+		 * the protocol, and when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 40417e6..8e94a26 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..41e0d8c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..8d24b2e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,29 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..5c79dbd 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
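
For readers who want to try the option gating that the patch above adds to pgoutput_startup() without building the server, here is a minimal standalone sketch. It is not the patch's code: the enum, the function name decide_twophase, and the SKETCH_PROTO_2PC_VERSION_NUM macro are hypothetical names introduced for illustration, and the two ereport(ERROR, ...) paths are modeled as return codes. The ctx_twophase flag stands in for ctx->twophase, which the patch's error message treats as "the output plugin supports two-phase".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for LOGICALREP_PROTO_2PC_VERSION_NUM (2 in the patch). */
#define SKETCH_PROTO_2PC_VERSION_NUM 2

/* Hypothetical result codes; the patch raises ereport(ERROR, ...) instead. */
typedef enum
{
	TWOPHASE_DISABLED,
	TWOPHASE_ENABLED,
	TWOPHASE_ERR_PROTO_TOO_OLD,
	TWOPHASE_ERR_PLUGIN_UNSUPPORTED
} TwoPhaseDecision;

/*
 * Model of the three-way check: was two_phase requested, is the protocol
 * version high enough, and does the decoding context allow it?
 * *ctx_twophase models ctx->twophase and is cleared when the option is off.
 */
TwoPhaseDecision
decide_twophase(bool enable_twophase, unsigned protocol_version,
				bool *ctx_twophase)
{
	assert(ctx_twophase != NULL);

	if (!enable_twophase)
	{
		/* Option not requested: make sure decoding does not use 2PC. */
		*ctx_twophase = false;
		return TWOPHASE_DISABLED;
	}

	/* Requested, but the negotiated protocol predates 2PC support. */
	if (protocol_version < SKETCH_PROTO_2PC_VERSION_NUM)
		return TWOPHASE_ERR_PROTO_TOO_OLD;

	/* Requested, protocol is fine, but the context cannot do 2PC. */
	if (!*ctx_twophase)
		return TWOPHASE_ERR_PLUGIN_UNSUPPORTED;

	return TWOPHASE_ENABLED;
}
```

Calling decide_twophase(true, 1, &flag) corresponds to a subscriber asking for two_phase with proto_version=1, which the patch rejects. Calling it with enable_twophase set to false always clears the flag, regardless of the protocol version.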

Attachment: v43-0004-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 4ff5f4af8fa2d03220029cedeab2838c2d9355b1 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 19 Feb 2021 11:01:39 +1100
Subject: [PATCH v43] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# Test logical replication of two-phase commit (2PC)
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
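
A note on the expected counts in 023_twophase_cascade_stream.pl: the value 3334 appears to follow directly from the test's own statements. Keys 1 and 2 pre-exist, keys 3..5000 are inserted, the DELETE removes every key divisible by 3 (1666 of them), and the UPDATE leaves the row count unchanged. The C sketch below is not part of the patch; the function name expected_rows_after_2pc is hypothetical and it only re-derives that number.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * Models the streamed transaction in 023_twophase_cascade_stream.pl:
 * keys 1..max_a exist after the INSERT (1 and 2 are the initial rows),
 * then every key divisible by 3 is deleted. The UPDATE in the test
 * rewrites column b only, so it does not affect the count.
 */
int
expected_rows_after_2pc(int max_a)
{
	int			count = 0;

	assert(max_a >= 2);

	for (int a = 1; a <= max_a; a++)
	{
		if (a % 3 != 0)
			count++;
	}

	return count;
}
```

With max_a = 5000 this gives 5000 - 1666 = 3334, which matches the qq(3334|3334|3334) expectation once the extra columns c and d are filled from their local defaults.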

v43-0006-Tablesync-early-exit.patch (application/octet-stream)
From 7a1c7795f4276541d5062a6ecd7d55ade9f117ee Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sun, 21 Feb 2021 08:12:27 +1100
Subject: [PATCH v43] Tablesync early exit.

Give the tablesync worker an opportunity to see if it can exit immediately
(because it has already caught-up) without it needing to process a message
first before discovering that.
---
 src/backend/replication/logical/worker.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 24e49f0..b80bf69 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2407,6 +2407,16 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	bool		ping_sent = false;
 	TimeLineID	tli;
 
+	if (am_tablesync_worker())
+	{
+		/*
+		 * Give the tablesync worker an opportunity to see if it can immediately
+		 * exit, instead of handling a message (which the apply worker could
+		 * handle) before discovering that.
+		 */
+		process_syncing_tables(last_received);
+	}
+
 	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
-- 
1.8.3.1
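
To illustrate what the early-exit check buys, here is a simplified model in C. It is only a sketch under stated assumptions: lsn_t stands in for XLogRecPtr, and sync_worker_can_exit and messages_handled_before_exit are hypothetical names, not functions from worker.c. The model assumes the sync worker is finished once its received LSN has reached the catch-up target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;			/* stand-in for XLogRecPtr (assumption) */

/* Simplified rule: done once everything up to the target was received. */
bool
sync_worker_can_exit(lsn_t last_received, lsn_t catchup_target)
{
	return last_received >= catchup_target;
}

/*
 * Count how many messages the modelled worker handles before it notices
 * that it can exit. Each message advances last_received by lsn_step.
 * With check_first = true the exit test also runs before the first message,
 * which is the behaviour the patch adds.
 */
int
messages_handled_before_exit(bool check_first, lsn_t last_received,
							 lsn_t catchup_target, lsn_t lsn_step)
{
	int			handled = 0;

	assert(lsn_step > 0);		/* otherwise the loop could never finish */

	if (check_first && sync_worker_can_exit(last_received, catchup_target))
		return 0;

	do
	{
		last_received += lsn_step;	/* handle one message */
		handled++;
	} while (!sync_worker_can_exit(last_received, catchup_target));

	return handled;
}
```

In this model a worker that starts out already caught up handles zero messages with the check and one message without it; when it still has work to do, both variants behave the same.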

v43-0007-Fix-apply-worker-empty-prepare.patch (application/octet-stream)
From d1813c3d6024e71fce5a5cb4228776078df98fd8 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 25 Feb 2021 16:41:01 +1100
Subject: [PATCH v43] Fix apply worker empty prepare.

By sad timing of apply/tablesync workers it is possible to have a "consistent snapshot" that spans prepare/commit in such a way that the tablesync did not do the prepare (because snapshot not consistent) and the apply worker does the prepare ('b') but it skips all the prepared operations [e.g. inserts] while the tablesync was still busy (see the condition of should_apply_changes_for_rel). Later, at the commit prepared time when the apply worker does the commit prepare ('K'), there is nothing committed (because the inserts were skipped earlier).

This patch implements a two-part fix as suggested [1] on hackers.

Part 1 - The begin_prepare handler of apply will always wait for any busy tablesync workers to achieve SYNCDONE/READY state.

Part 2 - If (after Part 1) the apply-worker's prepare is found to be lagging behind any of the sync-workers then the subsequent prepared operations will be spooled to a file to be replayed at commit_prepared time.
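
The two parts can be summarised as a single decision, sketched below in standalone C. This is an illustration only, not the code in the patch: the type and function names (table_sync_state, prepare_action, choose_prepare_action) are hypothetical, lsn_t stands in for XLogRecPtr, and the state letters 's' (SYNCDONE) and 'r' (READY) are the srsubstate values the TAP tests poll for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lsn_t;			/* stand-in for XLogRecPtr (assumption) */

typedef struct
{
	char		state;			/* srsubstate letter, e.g. 's' or 'r' */
	lsn_t		lsn;			/* LSN recorded for the table's sync */
} table_sync_state;

typedef enum
{
	PREPARE_WAIT,				/* Part 1: some tablesync is still busy */
	PREPARE_SPOOL,				/* Part 2: spool until commit prepared */
	PREPARE_APPLY				/* nothing overtook us, apply directly */
} prepare_action;

prepare_action
choose_prepare_action(const table_sync_state *tables, size_t ntables,
					  lsn_t begin_prepare_end_lsn)
{
	lsn_t		biggest = 0;

	assert(tables != NULL || ntables == 0);

	for (size_t i = 0; i < ntables; i++)
	{
		/* Part 1: any table not yet SYNCDONE/READY means we must wait. */
		if (tables[i].state != 's' && tables[i].state != 'r')
			return PREPARE_WAIT;

		if (tables[i].lsn > biggest)
			biggest = tables[i].lsn;
	}

	/* Part 2: spool if some tablesync went beyond this begin_prepare. */
	return (begin_prepare_end_lsn < biggest) ? PREPARE_SPOOL : PREPARE_APPLY;
}
```

In the patch itself the wait is a loop around BusyTablesyncs() and the comparison uses BiggestTablesyncLSN(); the sketch folds both into one pass purely for readability.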

Discussion:
[1] https://www.postgresql.org/message-id/CAA4eK1L%3DdhuCRvyDvrXX5wZgc7s1hLRD29CKCK6oaHtVCPgiFA%40mail.gmail.com
---
 src/backend/replication/logical/tablesync.c | 184 +++++++--
 src/backend/replication/logical/worker.c    | 595 +++++++++++++++++++++++++++-
 src/include/replication/worker_internal.h   |   3 +
 src/tools/pgindent/typedefs.list            |   1 +
 4 files changed, 747 insertions(+), 36 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..8b20519 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -115,7 +115,10 @@
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 
-static bool table_states_valid = false;
+static bool		table_states_valid = false;
+static List    *table_states_not_ready = NIL;
+static List	   *table_states_all = NIL;
+static void		FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1137,3 +1111,145 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * XXX - Is there a potential timing problem here - e.g. if signal arrives
+ * while executing this then maybe we will set table_states_valid without
+ * refetching them?
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have not yet reached SYNCDONE/READY state?
+ */
+bool
+BusyTablesyncs(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach (lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "BusyTablesyncs: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * XXX - When the process_syncing_tables_for_sync changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state even though there was none initially. That is why we need
+		 * to check for it below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "BusyTablesyncs: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach (lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b80bf69..d9b7cfa 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -212,6 +212,45 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A memory context for the prepare spooling
+ */
+static MemoryContext PsfContext = NULL;
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and the
+ * corresponding fileset handles of same.
+ */
+typedef struct PsfHashEntry
+{
+	char	name[MAXPGPATH];	/* Hash key --- must be first */
+	SharedFileSet *fileset;		/* shared file set for prepare spoolfile */
+} PsfHashEntry;
+
+/*
+ * Hash table for storing the Prepared spoolfile info along with shared fileset.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * The spoolfile handle is only valid between begin_prepare and prepare.
+ */
+static BufFile *psf_fd = NULL;
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_cleanup(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int  prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -759,6 +798,76 @@ apply_handle_begin_prepare(StringInfo s)
 		return;
 	}
 
+	/*
+	 * A Problem:
+	 *
+	 * By sad timing of apply/tablesync workers it is possible to have a
+	 * "consistent snapshot" that spans prepare/commit in such a way that the
+	 * tablesync did not do the prepare (because snapshot not consistent) and
+	 * the apply worker does the prepare ('b') but it skips all the prepared
+	 * operations [e.g. inserts] while the tablesync was still busy (see the
+	 * condition of should_apply_changes_for_rel). Later at the commit prepared
+	 * time when the apply worker does the commit prepare ('K'), there is
+	 * nothing in it (because the inserts were skipped earlier).
+	 *
+	 * The following code has a 2-part workaround for that scenario.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Workaround Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state before
+		 * letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, begin_data.end_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (BusyTablesyncs())
+		{
+			elog(DEBUG1, "apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
+			process_syncing_tables(begin_data.end_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Workaround Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went beyond
+		 * this begin_prepare LSN then all messages (until prepare) will be
+		 * saved to a spoolfile for replay later at commit_prepared time.
+		 */
+		if (begin_data.end_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true /* XXX - Add this line to force psf (for easier debugging) */
+#endif
+		   )
+		{
+			char psfpath[MAXPGPATH];
+
+			/* The begin_prepare's LSN has been overtaken. */
+
+			/*
+			 * We need a transaction for handling the buffile, used for serializing
+			 * prepared messages. This transaction lasts until the commit_prepared/
+			 * rollback_prepared.
+			 */
+			ensure_transaction();
+
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+		}
+	}
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	in_remote_transaction = true;
@@ -788,6 +897,27 @@ apply_handle_prepare(StringInfo s)
 		return;
 	}
 
+	if (psf_fd)
+	{
+		/*
+		 * The psf_fd is meaningful only between begin_prepare and prepared.
+		 * So close it now. If we had been writing any messages to the psf_fd
+		 * (the spoolfile) then those will be applied later during
+		 * handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		/*
+		 * And end the transaction that was created by begin_prepare for
+		 * working with the psf buffiles.
+		 */
+		Assert(IsTransactionState());
+		CommitTransactionCommand();
+
+		in_remote_transaction = false;
+		return;
+	}
+
 	Assert(prepare_data.prepare_lsn == remote_final_lsn);
 
 	if (IsTransactionState())
@@ -835,6 +965,7 @@ static void
 apply_handle_commit_prepared(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
+	char psfpath[MAXPGPATH];
 
 	/*
 	 * We don't expect any other transaction data while skipping a prepared
@@ -844,6 +975,55 @@ apply_handle_commit_prepared(StringInfo s)
 
 	logicalrep_read_commit_prepared(s, &prepare_data);
 
+	/*
+	 * If this prepare's messages were being spooled to a file,
+	 * then replay them all now, and afterwards cleanup the spoolfile.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int nchanges;
+
+		/*
+		 * 1. replay the spooled messages
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.end_lsn);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		prepare_spoolfile_cleanup(psfpath);
+
+		/*
+		 * 2. mark as PREPARED
+		 */
+
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+
 	/* there is no transaction when COMMIT PREPARED is called */
 	ensure_transaction();
 
@@ -874,6 +1054,8 @@ static void
 apply_handle_rollback_prepared(StringInfo s)
 {
 	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool using_psf;
+	char psfpath[MAXPGPATH];
 
 	/*
 	 * We don't expect any other transaction data while skipping a prepared
@@ -884,11 +1066,46 @@ apply_handle_rollback_prepared(StringInfo s)
 	logicalrep_read_rollback_prepared(s, &rollback_data);
 
 	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		if (psf_fd)
+		{
+			/* XXX - For some reason (possibly due to a bug?) it is currently
+			 * possible to get here, after a restart, when there was a
+			 * begin_prepare but there was NO prepare. Since there was no
+			 * prepare, the psf_fd and the transaction are still lingering so
+			 * they need to be cleaned up now.
+			 */
+			prepare_spoolfile_close();
+
+			/*
+			 * And end the transaction that was created by the begin_prepare
+			 * for working with psf buffiles.
+			 */
+			Assert(IsTransactionState());
+			AbortCurrentTransaction();
+		}
+
+		prepare_spoolfile_cleanup(psfpath);
+	}
+
+	/*
 	 * It is possible that we haven't received prepare because it occurred
 	 * before walsender reached a consistent point in which case we need to
 	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
 	 */
-	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
 	{
 		/*
@@ -941,6 +1158,26 @@ apply_handle_stream_prepare(StringInfo s)
 	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
 
 	/*
+	 * Wait for all the sync workers to reach a SYNCDONE/READY state.
+	 *
+	 * This is the same waiting logic as in the apply_handle_begin_prepare
+	 * function (see that function for more detailed comments about this).
+	 */
+	if (!am_tablesync_worker())
+	{
+		while (BusyTablesyncs())
+		{
+			process_syncing_tables(prepare_data.end_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+	}
+
+	/*
 	 *
 	 * --------------------------------------------------------------------------
 	 * 1. Replay all the spooled operations - Similar code as for
@@ -1007,6 +1244,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		psf_fd == NULL &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1468,6 +1706,9 @@ apply_handle_insert(StringInfo s)
 	if (skip_prepared_txn)
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1592,6 +1833,9 @@ apply_handle_update(StringInfo s)
 	if (skip_prepared_txn)
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1753,6 +1997,9 @@ apply_handle_delete(StringInfo s)
 	if (skip_prepared_txn)
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -2125,6 +2372,9 @@ apply_handle_truncate(StringInfo s)
 	if (skip_prepared_txn)
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -2418,6 +2668,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	}
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL			hash_ctl;
+		PsfHashEntry   *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2433,6 +2700,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 													"LogicalStreamingContext",
 													ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used when the prepare spooling is used. It
+	 * is reset at prepare commit/rollback time.
+	 */
+	PsfContext = AllocSetContextCreate(ApplyContext,
+									   "PsfContext",
+									   ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -2537,7 +2812,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && psf_fd == NULL)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -3462,3 +3737,319 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	if (psf_fd == NULL)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	MemoryContext	oldctx;
+	bool			found;
+	PsfHashEntry   *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(psf_fd == NULL);
+
+	/* create or find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  &found);
+
+	/*
+	 * Create/open the bufFiles under the Prepare Spoolfile Context so that we
+	 * have those files until prepare commit/rollback.
+	 */
+	oldctx = MemoryContextSwitchTo(PsfContext);
+
+	if (!found)
+	{
+		MemoryContext	savectx;
+		SharedFileSet  *fileset;
+
+		elog(DEBUG1, "Not found file \"%s\"", path);
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		psf_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember this path's fileset for the next time */
+		memcpy(hentry->name, path, MAXPGPATH);
+		hentry->fileset = fileset;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the beginning because we always want to
+		 * create/overwrite this file.
+		 */
+		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		psf_fd = BufFileOpenShared(hentry->fileset, path, O_RDWR);
+		BufFileSeek(psf_fd, 0, 0L, SEEK_SET);
+	}
+
+	MemoryContextSwitchTo(oldctx);
+
+	/* Sanity check */
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close()
+{
+	if (psf_fd)
+		BufFileClose(psf_fd);
+	psf_fd = NULL;
+}
+
+/*
+ * Delete the specified psf spoolfile.
+ */
+static void
+prepare_spoolfile_cleanup(char *path)
+{
+	PsfHashEntry   *hentry;
+
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* And remove the path entry from the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_REMOVE,
+										  NULL);
+
+	/* By this time we must have created the entry */
+	Assert(hentry != NULL);
+
+	/* Delete the file and release the fileset memory */
+	SharedFileSetDeleteAll(hentry->fileset);
+	pfree(hentry->fileset);
+	hentry->fileset = NULL;
+
+	/* Reset the memory context used during the file creation */
+	MemoryContextReset(PsfContext);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(psf_fd != NULL);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(psf_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(psf_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(psf_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Is there a prepare spoolfile for the specified gid?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool	found;
+
+	/* Find the prepare spoolfile entry in the psf_hash */
+	hash_search(psf_hash,
+				path,
+				HASH_FIND,
+				&found);
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ *
+ * [Note: this is mostly copied code from apply_spooled_messages function]
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)
+{
+	StringInfoData	s2;
+	int				nchanges = 0;
+	char		   *buffer = NULL;
+	MemoryContext	oldctx, oldctx2;
+	bool			found = false;
+	PsfHashEntry   *hentry;
+	BufFile		   *fd;
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* Open the spool file */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_FIND,
+										  &found);
+	Assert(found);
+	fd = BufFileOpenShared(hentry->fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	remote_final_lsn = lsn;
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int	nbytes;
+		int	len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldctx2);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly same lengths and padded
+	 * with '\0' so garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "%u-%s.prep_changes", subid, gid);
+}
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..602b4f7 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool BusyTablesyncs(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 048681c..b7da03b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1958,6 +1958,7 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
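
The on-disk record layout described in the comment above prepare_spoolfile_write (an int length that counts the action byte plus the payload, then the action byte, then the payload) can be illustrated with a small standalone sketch. This is not code from the patch: it uses a plain byte buffer instead of a BufFile, and the names spool_write and spool_read are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Append one record to buf at offset off: an int length (action byte plus
 * payload), then the action byte, then the payload bytes.
 * Returns the new write offset, or 0 if the record does not fit.
 * (Hypothetical helper; the patch writes to a BufFile instead.)
 */
size_t
spool_write(unsigned char *buf, size_t cap, size_t off,
			char action, const void *payload, int payload_len)
{
	int			len = payload_len + (int) sizeof(char);

	if (payload_len < 0 || off > cap ||
		cap - off < sizeof(len) + (size_t) len)
		return 0;

	memcpy(buf + off, &len, sizeof(len));
	off += sizeof(len);
	buf[off++] = (unsigned char) action;
	memcpy(buf + off, payload, (size_t) payload_len);

	return off + (size_t) payload_len;
}

/*
 * Read the record that starts at *off. On success, store the action byte
 * and a pointer to the payload, advance *off past the record, and return
 * the payload length. Return -1 at end of data or on a truncated record.
 */
int
spool_read(const unsigned char *buf, size_t used, size_t *off,
		   char *action, const unsigned char **payload)
{
	int			len;

	if (*off > used || used - *off < sizeof(len))
		return -1;

	memcpy(&len, buf + *off, sizeof(len));
	if (len < 1 || used - *off - sizeof(len) < (size_t) len)
		return -1;

	*action = (char) buf[*off + sizeof(len)];
	*payload = buf + *off + sizeof(len) + 1;
	*off += sizeof(len) + (size_t) len;

	return len - 1;
}
```

The reader checks the stored length before touching the payload, which mirrors the way prepare_spoolfile_replay_messages treats a short read as an error rather than as end of file.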

v43-0008-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From a2c4c4b6fd01fe00e8e87fd586e37e8d775a8c4b Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 25 Feb 2021 17:24:52 +1100
Subject: [PATCH v43] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 +++++++++++++----
 src/backend/replication/logical/worker.c    | 50 ++++++++++++++++++++---------
 2 files changed, 58 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 8b20519..5b5b910 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1131,6 +1137,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1153,6 +1161,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1160,12 +1169,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1179,6 +1193,8 @@ BusyTablesyncs()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> BusyTablesyncs");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1190,8 +1206,8 @@ BusyTablesyncs()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "BusyTablesyncs: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> BusyTablesyncs: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1208,6 +1224,7 @@ BusyTablesyncs()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> BusyTablesyncs: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1219,8 +1236,8 @@ BusyTablesyncs()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "BusyTablesyncs: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> BusyTablesyncs: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1246,8 +1263,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index d9b7cfa..addcbbf 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -820,14 +820,14 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state before
 		 * letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, begin_data.end_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, begin_data.end_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (BusyTablesyncs())
 		{
-			elog(DEBUG1, "apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
 
 			process_syncing_tables(begin_data.end_lsn);
 
@@ -985,6 +985,8 @@ apply_handle_commit_prepared(StringInfo s)
 	{
 		int nchanges;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * 1. replay the spooled messages
 		 */
@@ -992,8 +994,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.end_lsn);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		prepare_spoolfile_cleanup(psfpath);
@@ -1104,6 +1106,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf=%d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2533,18 +2536,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3751,6 +3758,11 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_fd ? "Do" : "Don't");
+
 	if (psf_fd == NULL)
 		return false;
 
@@ -3772,7 +3784,7 @@ prepare_spoolfile_create(char *path)
 	bool			found;
 	PsfHashEntry   *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(psf_fd == NULL);
 
@@ -3793,7 +3805,7 @@ prepare_spoolfile_create(char *path)
 		MemoryContext	savectx;
 		SharedFileSet  *fileset;
 
-		elog(DEBUG1, "Not found file \"%s\"", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		savectx = MemoryContextSwitchTo(ApplyContext);
 		fileset = palloc(sizeof(SharedFileSet));
 
@@ -3812,7 +3824,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the beginning because we always want to
 		 * create/overwrite this file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		elog(LOG, "!!> Found file \"%s\". Overwrite it.", path);
 		psf_fd = BufFileOpenShared(hentry->fileset, path, O_RDWR);
 		BufFileSeek(psf_fd, 0, 0L, SEEK_SET);
 	}
@@ -3829,6 +3841,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_fd)
 		BufFileClose(psf_fd);
 	psf_fd = NULL;
@@ -3842,6 +3855,8 @@ prepare_spoolfile_cleanup(char *path)
 {
 	PsfHashEntry   *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_cleanup: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3877,20 +3892,23 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_fd != NULL);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	BufFileWrite(psf_fd, &len, sizeof(len));
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	BufFileWrite(psf_fd, &action, sizeof(action));
 
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	BufFileWrite(psf_fd, &s->data[s->cursor], len);
 }
 
@@ -3908,8 +3926,8 @@ prepare_spoolfile_exists(char *path)
 				HASH_FIND,
 				&found);
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
 		 path,
 		 found ? "found" : "not found");
 
@@ -3932,8 +3950,8 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)
 	PsfHashEntry   *hentry;
 	BufFile		   *fd;
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3978,6 +3996,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)
 
 		/* read length of the on-disk record */
 		nbytes = BufFileRead(fd, &len, sizeof(len));
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3996,6 +4015,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		if (BufFileRead(fd, buffer, len) != len)
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -4018,7 +4038,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(LOG, "!!> replayed %d changes from file '%s'",
 				 nchanges, path);
 	}
 
@@ -4027,7 +4047,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)
 	pfree(buffer);
 	pfree(s2.data);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
-- 
1.8.3.1
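
For readers following the dev logs, the two-part workaround in apply_handle_begin_prepare boils down to a three-way decision: wait while any tablesync is still busy, spool if some tablesync LSN is beyond the begin_prepare end LSN, otherwise apply normally. The following is a simplified standalone sketch of that decision only. The type and function names are hypothetical, and the state letters 's' and 'r' are assumed here to stand for SYNCDONE and READY.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */

/* Simplified view of one subscription table's sync state (hypothetical). */
typedef struct TableSync
{
	char		state;			/* assumed: 's' = SYNCDONE, 'r' = READY */
	Lsn			lsn;			/* LSN recorded for that state */
} TableSync;

typedef enum PrepareAction
{
	PREPARE_WAIT,				/* some tablesync is still busy */
	PREPARE_SPOOL,				/* a tablesync overtook this begin_prepare */
	PREPARE_APPLY				/* regular apply path */
} PrepareAction;

/*
 * Decide what the apply worker should do with a begin_prepare whose end
 * LSN is begin_end_lsn, given the current state of all tablesyncs.
 */
PrepareAction
decide_prepare_action(const TableSync *tables, size_t ntables,
					  Lsn begin_end_lsn)
{
	Lsn			biggest = 0;

	for (size_t i = 0; i < ntables; i++)
	{
		/* Part 1: anything short of SYNCDONE/READY means keep waiting. */
		if (tables[i].state != 's' && tables[i].state != 'r')
			return PREPARE_WAIT;

		if (tables[i].lsn > biggest)
			biggest = tables[i].lsn;
	}

	/* Part 2: spool only if some tablesync went beyond this LSN. */
	return (begin_end_lsn < biggest) ? PREPARE_SPOOL : PREPARE_APPLY;
}
```

In the patch itself the wait is a loop around BusyTablesyncs() with a latch timeout, and the comparison uses BiggestTablesyncLSN(); the sketch folds both into one pure function so the branches are easy to match against the "!!>" log lines.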

#200onlinebusinessindia
businessgrowthnamanverma@gmail.com
In reply to: Peter Smith (#199)
Re: logical decoding of two-phase transactions

That's where you've misunderstood - it isn't committed yet. The point of
this change is to allow us to do logical decoding at the PREPARE TRANSACTION
point. The xact is not yet committed or rolled back.

Yes, I got that. I was looking for a why or an actual use-case.

Stas wants this for a conflict-free logical semi-synchronous replication
multi master solution.

This sentence is hard to decrypt, less so without "multi master", as the
concept basically applies to only one master node.

At PREPARE TRANSACTION time we replay the xact to
other nodes, each of which applies it and PREPARE TRANSACTION, then replies
to confirm it has successfully prepared the xact. When all nodes confirm the
xact is prepared it is safe for the origin node to COMMIT PREPARED. The
other nodes then see that the first node has committed and they commit too.

OK, this is the argument I was looking for. So in your schema the
origin node, the one generating the changes, is itself in charge of
deciding if the 2PC should work or not. There are two channels between
the origin node and the replicas replaying the logical changes, one is
for the logical decoder with a receiver, the second one is used to
communicate the WAL apply status. I thought about something like
postgres_fdw doing this job with a transaction that does writes across
several nodes, that's why I got confused about this feature.
Everything goes through one channel, so the failure handling is just
simplified.

Alternately if any node replies "could not replay xact" or "could not
prepare xact" the origin node knows to ROLLBACK PREPARED. All the other
nodes see that and rollback too.

The origin node could just issue the ROLLBACK or COMMIT and the
logical replicas would just apply this change.
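
The decision rule discussed above (COMMIT PREPARED only when every peer
confirms the prepare, ROLLBACK PREPARED otherwise) can be sketched roughly as
follows. This is only an illustration, not code from the patch: the PeerReply
and OriginDecision enums and the decide_outcome() function are hypothetical
names made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reply a peer sends after trying to replay and prepare. */
typedef enum PeerReply
{
	PEER_PREPARED,				/* xact replayed and prepared */
	PEER_REPLAY_FAILED,			/* "could not replay xact" */
	PEER_PREPARE_FAILED			/* "could not prepare xact" */
} PeerReply;

/* Hypothetical outcome chosen by the origin node. */
typedef enum OriginDecision
{
	DECIDE_COMMIT_PREPARED,
	DECIDE_ROLLBACK_PREPARED
} OriginDecision;

/*
 * Commit only if every peer reported a successful prepare; a single
 * failure reply forces a rollback of the prepared transaction.
 */
OriginDecision
decide_outcome(const PeerReply *replies, size_t npeers)
{
	size_t		i;

	assert(replies != NULL || npeers == 0);

	for (i = 0; i < npeers; i++)
	{
		if (replies[i] != PEER_PREPARED)
			return DECIDE_ROLLBACK_PREPARED;
	}

	return DECIDE_COMMIT_PREPARED;
}
```

In this sketch the origin node alone makes the decision; the peers simply
follow whatever COMMIT PREPARED or ROLLBACK PREPARED they later see decoded.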

To really make it rock solid you also have to send the old and new values of
a row, or have row versions, or send old row hashes. Something I also want
to have, but we can mostly get that already with REPLICA IDENTITY FULL.

On a primary key (or a unique index), the default replica identity is
enough I think.

It is of interest to me because schema changes in MM logical replication are
more challenging, awkward and restrictive without it. Optimistic conflict
resolution doesn't work well for schema changes and once the conflicting
schema changes are committed on different nodes there is no going back. So
you need your async system to have a global locking model for schema changes
to stop conflicts arising. Or expect the user not to do anything silly /
misunderstand anything and know all the relevant system limitations and
requirements... which we all know works just great in practice. You also
need a way to ensure that schema changes don't render
committed-but-not-yet-replayed row changes from other peers nonsensical. The
safest way is a barrier where all row changes committed on any node before
committing the schema change on the origin node must be fully replayed on
every other node, making an async MM system temporarily sync single master
(and requiring all nodes to be up and reachable). Otherwise you need a way
to figure out how to conflict-resolve incoming rows with missing columns /
added columns / changed types / renamed tables etc which is no fun and
nearly impossible in the general case.

That's one vision of things, FDW-like approaches would be a second,
but those are not able to pass down utility statements natively,
though this stuff can be done with the utility hook.

I think the purpose of having the GID available to the decoding output
plugin at PREPARE TRANSACTION time is that it can co-operate with a global
transaction manager that way. Each node can tell the GTM "I'm ready to
commit [X]". It is IMO not crucial since you can otherwise use a (node-id,
xid) tuple, but it'd be nice for coordinating with external systems,
simplifying inter node chatter, integrating logical decoding into bigger
systems with external transaction coordinators/arbitrators etc. It seems
pretty silly _not_ to have it really.

Well, Postgres-XC/XL save the 2PC GID for this purpose in the GTM,
this way the COMMIT/ABORT PREPARED can be issued from any node, and
there is a centralized conflict resolution, the latter coming at a
huge cost and becoming a major bottleneck for scaling performance.

Personally I don't think lack of access to the GID justifies blocking 2PC
logical decoding. It can be added separately. But it'd be nice to have
especially if it's cheap.

Having read this thread, I think it should be added.
--
Naman

-----
Online Business in India
--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

#201Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#199)

On Thu, Feb 25, 2021 at 12:32 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v43*

Differences from v42*

- Rebased to HEAD as @ today

- Added new patch 0006 "Tablesync early exit" as discussed here [1]

I feel we can start a separate thread for this as it can be done
independently unless there are reasons for not doing so.

- Added new patch 0007 "Fix apply worker prepare" as discussed here [2]

Few comments on v43-0007-Fix-apply-worker-empty-prepare:
================================================
1. The patch v43-0007-Fix-apply-worker-empty-prepare should be the fourth
patch in the series, immediately after the main apply worker patch.
2.
apply_handle_begin_prepare
{
..
+#if 0
+ || true /* XXX - Add this line to force psf (for easier debugging) */
+#endif

Please remove such debugging hacks.

3.
+ * [Note: this is mostly copied code from apply_spooled_messages function]
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)

I think we can try to unify the code in this function and
apply_spooled_messages. Basically, if we pass the sharedfileset handle
to apply_spooled_messages, then it should be possible to unify these
two functions.

4.
@@ -788,6 +897,27 @@ apply_handle_prepare(StringInfo s)
return;
}

+ if (psf_fd)
+ {
+ /*
+ * The psf_fd is meaningful only between begin_prepare and prepared.
+ * So close it now. If we had been writing any messages to the psf_fd
+ * (the spoolfile) then those will be applied later during
+ * handle_commit_prepared.
+ */
+ prepare_spoolfile_close();
+
+ /*
+ * And end the transaction that was created by begin_prepare for
+ * working with the psf buffiles.
+ */
+ Assert(IsTransactionState());
+ CommitTransactionCommand();
+
+ in_remote_transaction = false;
+ return;
+ }

Don't we need to write prepare to the spool file as well? Because, if
we do that then I think you don't need special handling in
apply_handle_commit_prepared where you are preparing the transaction
after replaying the messages from the spool file. I think in
apply_handle_commit_prepared while doing prepare, you have used
commit's lsn which is wrong and that will also be solved if you do
what I am suggesting.

5. You need to write/sync the spool file at prepare time because, if a
restart happens between prepare and commit prepared, the changes can be
lost and won't be resent by the publisher, assuming other transactions
commit between prepare and commit prepared. For the same reason, I am not
sure if we can just rely on the in-memory hash table for it
(prepare_spoolfile_exists). Sure, if it exists and there is no restart
then it would be cheap to check in the hash table but I don't think it is
guaranteed.
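
To illustrate the durability concern in point 5, here is a minimal sketch of
writing a spool file and syncing it before prepare is acknowledged, plus an
on-disk existence check that does not depend on in-memory state. It uses
plain POSIX calls rather than the BufFile API the patch actually works with,
and spool_write_durable() and spool_exists_on_disk() are hypothetical names
introduced only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "len" bytes to "path" and fsync the file
 * before returning, so the contents survive a restart that happens
 * after this call. Returns true on success.
 */
bool
spool_write_durable(const char *path, const void *data, size_t len)
{
	const char *p = data;
	size_t		left = len;
	int			fd;

	assert(path != NULL);
	assert(data != NULL || len == 0);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return false;

	while (left > 0)
	{
		ssize_t		n = write(fd, p, left);

		if (n < 0 && errno == EINTR)
			continue;			/* interrupted, retry the write */
		if (n <= 0)
		{
			close(fd);
			return false;
		}
		p += n;
		left -= (size_t) n;
	}

	/* Without this, the data may still sit only in the OS cache. */
	if (fsync(fd) != 0)
	{
		close(fd);
		return false;
	}

	return close(fd) == 0;
}

/*
 * Hypothetical helper: after a restart any in-memory hash table is
 * empty, so ask the filesystem whether the spool file is there.
 */
bool
spool_exists_on_disk(const char *path)
{
	assert(path != NULL);
	return access(path, F_OK) == 0;
}
```

A complete solution would likely also need to fsync the containing directory
so that the file's directory entry is durable; that step is omitted here.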

--
With Regards,
Amit Kapila.

#202Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#199)
8 attachment(s)

Please find attached the latest patch set v44*

Differences from v43*

* Rebased to HEAD as @ today

* Patch 0003 "Add support for apply at prepare time"
- minor code refactor
- minor comment changes

* Patch 0006 "Tablesync early exit"
- minor comment changes
- fix whitespace

* Patch 0007 "Fix apply worker empty prepare"
- minor comment changes
- pgindent changed lots of code formatting

-----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v44-0002-Track-replication-origin-progress-for-rollbacks.patchapplication/octet-stream; name=v44-0002-Track-replication-origin-progress-for-rollbacks.patchDownload
From 447b01a41adb4d790b71c76b27ea46dc0004490d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 09:46:37 +1100
Subject: [PATCH v44] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replica origin replay progress for
2PC, but it was not complete. It failed to properly track the progress for
rollback prepared; in particular, it did not update the code for recovery.
Additionally, we need to allow tracking it on subscriber nodes where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20..6023e7c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2276,6 +2276,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2298,6 +2306,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df..acdb28d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5716,8 +5716,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5923,7 +5922,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5972,6 +5972,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6013,7 +6020,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6021,7 +6029,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1

v44-0001-Refactor-spool-file-logic-in-worker.c.patchapplication/octet-stream; name=v44-0001-Refactor-spool-file-logic-in-worker.c.patchDownload
From 453186fe2a3bffaffe57e49b7e60cc0bed2f5acb Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 09:32:46 +1100
Subject: [PATCH v44] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..45ac498 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v44-0003-Add-support-for-apply-at-prepare-time-to-built-i.patchapplication/octet-stream; name=v44-0003-Add-support-for-apply-at-prepare-time-to-built-i.patchDownload
From 48a4e56dd1f57d2016f6d06bb2850dca2f41de82 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 11:31:59 +1100
Subject: [PATCH v44] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* We allow skipping prepared transactions if they are already prepared.
We do ensure that we skip only when the GID, origin_lsn, and
origin_timestamp of a prepared xact matches to avoid the possibility of
a match of prepared xact from two different nodes. This can happen when
the server or apply worker restarts after a prepared transaction.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  68 ++++++
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 258 ++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 329 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 177 ++++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 ++++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 892 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is	around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char* buf;
+			TwoPhaseFileHeader* hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transactions commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..1585754 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,264 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepare message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the output stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 45ac498..15b7c99 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* for skipping prepared transaction */
+bool        skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,263 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received the prepare because it occurred
+	 * before the walsender reached a consistent point, in which case we need to
+	 * skip the rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1026,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it
 	 * will be committed after processing all the messages. We need the
@@ -800,6 +1079,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -831,6 +1116,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1046,6 +1337,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1168,6 +1465,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1289,6 +1589,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1447,6 +1750,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1816,6 +2122,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1972,6 +2281,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2bf1295 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +845,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -870,6 +927,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1195,3 +1270,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..358b14a 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure is used to
+ * hold the information for both PREPARE and COMMIT PREPARED. For COMMIT
+ * PREPARED, prepare_lsn and preparetime store the commit LSN and the commit
+ * time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bab31bf..6bb162e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bab4f3a..048681c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
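
For anyone skimming the protocol changes in the patch above, the new message bytes added to LogicalRepMsgType and the handlers that apply_dispatch() routes them to can be summarised in a small standalone sketch. This is not the worker code: the Sketch* names and sketch_handler_name() are made up for illustration, and only the byte values and the handler names are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical mirror of the two-phase entries added to LogicalRepMsgType.
 * The byte values are copied from the logicalproto.h hunk.
 */
typedef enum SketchMsgType
{
	SKETCH_MSG_BEGIN_PREPARE = 'b',
	SKETCH_MSG_PREPARE = 'P',
	SKETCH_MSG_COMMIT_PREPARED = 'K',
	SKETCH_MSG_ROLLBACK_PREPARED = 'r',
	SKETCH_MSG_STREAM_PREPARE = 'p'
} SketchMsgType;

/*
 * Return the name of the apply handler that the patch's apply_dispatch()
 * calls for a given two-phase message byte, or NULL for any other byte.
 * (The real dispatcher calls the handler and raises an ERROR on unknown
 * bytes; returning a name just keeps this sketch standalone.)
 */
static const char *
sketch_handler_name(char action)
{
	switch (action)
	{
		case SKETCH_MSG_BEGIN_PREPARE:
			return "apply_handle_begin_prepare";
		case SKETCH_MSG_PREPARE:
			return "apply_handle_prepare";
		case SKETCH_MSG_COMMIT_PREPARED:
			return "apply_handle_commit_prepared";
		case SKETCH_MSG_ROLLBACK_PREPARED:
			return "apply_handle_rollback_prepared";
		case SKETCH_MSG_STREAM_PREPARE:
			return "apply_handle_stream_prepare";
		default:
			return NULL;
	}
}
```

Note that the bytes are case-sensitive: 'P' is PREPARE while 'p' is STREAM PREPARE, so a client decoding the stream has to compare the raw byte rather than a case-folded one.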

Attachment: v44-0005-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 009b8a33799bf37d7bb012b85d60c04f56cb3f04 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 12:29:03 +1100
Subject: [PATCH v44] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
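
As a rough illustration of the option handling, here is a standalone sketch modelled on the parse_subscription_options() hunk in this patch: the option defaults to off and specifying it twice is an error. SketchOption and sketch_parse_two_phase() are hypothetical names. The real code walks a DefElem list, uses defGetBoolean() and reports problems through ereport(); accepting only on/off/true/false below is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical name/value pair standing in for a DefElem. */
typedef struct SketchOption
{
	const char *name;
	const char *value;
} SketchOption;

/*
 * Scan the options for "two_phase". Returns 0 on success and -1 when the
 * option is given more than once ("conflicting or redundant options" in
 * the patch) or has a value this sketch does not understand. Other option
 * names are ignored here; the real parser handles or rejects them.
 */
static int
sketch_parse_two_phase(const SketchOption *opts, int nopts, bool *twophase)
{
	bool		given = false;

	*twophase = false;			/* default is off */

	for (int i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "two_phase") != 0)
			continue;

		if (given)
			return -1;			/* conflicting or redundant options */
		given = true;

		if (strcmp(opts[i].value, "on") == 0 ||
			strcmp(opts[i].value, "true") == 0)
			*twophase = true;
		else if (strcmp(opts[i].value, "off") == 0 ||
				 strcmp(opts[i].value, "false") == 0)
			*twophase = false;
		else
			return -1;			/* unrecognized boolean spelling */
	}

	return 0;
}
```

Tracking "given" separately from the value is what lets ALTER SUBSCRIPTION distinguish "option not mentioned" from "option explicitly set to off", which is why the patch threads both twophase_given and twophase through the callers.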
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..9c23497 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -184,8 +184,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd..55dd8da 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1167,7 +1167,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..a069c76 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -358,6 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +399,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +468,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -825,6 +844,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -835,7 +856,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -874,6 +896,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -892,7 +921,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +967,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1013,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..74787d1 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -427,6 +427,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 15b7c99..f837d71 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2786,6 +2786,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3432,6 +3433,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2bf1295..3a1b404 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default,
+		 * in which case we just clear the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * only when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 358b14a..eebedda 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..41e0d8c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..8d24b2e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,29 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..5c79dbd 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
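
For readers following the pgoutput_startup() hunk in the patch above, the two_phase decision can be modelled outside the server. This is only a sketch, not PostgreSQL code: decide_twophase(), TwoPhaseDecision and PROTO_2PC_VERSION_NUM are hypothetical names (the last stands in for LOGICALREP_PROTO_2PC_VERSION_NUM), and plugin_supports is assumed to correspond to the value of ctx->twophase on entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for LOGICALREP_PROTO_2PC_VERSION_NUM in the patch. */
#define PROTO_2PC_VERSION_NUM 2

/* Possible outcomes; the two error values model the ereport(ERROR) paths. */
typedef enum
{
	TWOPHASE_DISABLED,
	TWOPHASE_ENABLED,
	TWOPHASE_ERR_PROTO_TOO_OLD,
	TWOPHASE_ERR_PLUGIN_UNSUPPORTED
} TwoPhaseDecision;

/*
 * Model of the checks, in the same order as the hunk:
 *   1. option not requested        -> disabled, no error
 *   2. protocol version too low    -> error
 *   3. plugin lacks 2PC callbacks  -> error
 *   4. otherwise                   -> enabled
 */
TwoPhaseDecision
decide_twophase(bool requested, unsigned int protocol_version,
				bool plugin_supports)
{
	if (!requested)
		return TWOPHASE_DISABLED;
	if (protocol_version < PROTO_2PC_VERSION_NUM)
		return TWOPHASE_ERR_PROTO_TOO_OLD;
	if (!plugin_supports)
		return TWOPHASE_ERR_PLUGIN_UNSUPPORTED;
	return TWOPHASE_ENABLED;
}

/* Usage example: a subscriber asking for two_phase with protocol 1 fails. */
void
decide_twophase_example(void)
{
	assert(decide_twophase(true, 1, true) == TWOPHASE_ERR_PROTO_TOO_OLD);
	assert(decide_twophase(true, 2, true) == TWOPHASE_ENABLED);
}
```

The protocol-version check comes before the plugin check, so a client that asks for two_phase with too old a proto_version sees the version error even when the plugin also lacks support.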

v44-0004-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From de5bcca6160635c2fd42ec1ee7bc31875bf19007 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 11:46:32 +1100
Subject: [PATCH v44] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are rolled back on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction.
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v44-0006-Tablesync-early-exit.patch (application/octet-stream)
From 555cbf8dbf18837df803154d9b2bb22d339b4f85 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 13:23:40 +1100
Subject: [PATCH v44] Tablesync early exit.

Give the tablesync worker an opportunity to see if it can exit immediately
(because it has already caught-up) without it needing to process a message
first before discovering that.
---
 src/backend/replication/logical/worker.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f837d71..5a0b806 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2407,6 +2407,16 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	bool		ping_sent = false;
 	TimeLineID	tli;
 
+	if (am_tablesync_worker())
+	{
+		/*
+		 * Give the tablesync worker an opportunity to see if it can immediately
+		 * exit, instead of always handling a message (which maybe the apply
+		 * worker could have handled).
+		 */
+		process_syncing_tables(last_received);
+	}
+
 	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
-- 
1.8.3.1

v44-0007-Fix-apply-worker-empty-prepare.patch (application/octet-stream)
From 91074c1dd9570524caab91b15262b45a2216f9b2 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 14:52:44 +1100
Subject: [PATCH v44] Fix apply worker empty prepare.

With unlucky timing of the apply/tablesync workers it is possible to have a "consistent snapshot" that spans prepare/commit in such a way that the tablesync worker did not do the prepare (because the snapshot was not yet consistent), while the apply worker does the prepare ('b') but skips all the prepared operations [e.g. inserts] because the tablesync was still busy (see the condition in should_apply_changes_for_rel). Later, when the apply worker does the commit prepared ('K'), there is nothing to commit (because the inserts were skipped earlier).

This patch implements a two-part fix as suggested [1] on hackers.

Part 1 - The begin_prepare handler of apply will always wait for any busy tablesync workers to achieve SYNCDONE/READY state.

Part 2 - If (after Part 1) the apply-worker's prepare is found to be lagging behind any of the sync-workers then the subsequent prepared operations will be spooled to a file to be replayed at commit_prepared time.

Discussion:
[1] https://www.postgresql.org/message-id/CAA4eK1L%3DdhuCRvyDvrXX5wZgc7s1hLRD29CKCK6oaHtVCPgiFA%40mail.gmail.com
---
 src/backend/replication/logical/tablesync.c | 182 +++++++--
 src/backend/replication/logical/worker.c    | 606 +++++++++++++++++++++++++++-
 src/include/replication/worker_internal.h   |   3 +
 src/tools/pgindent/typedefs.list            |   1 +
 4 files changed, 754 insertions(+), 38 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..42ecc53 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1137,3 +1111,145 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * XXX - Is there a potential timing problem here - e.g. if signal arrives
+ * while executing this then maybe we will set table_states_valid without
+ * refetching them?
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have still not yet reached SYNCDONE/READY state?
+ */
+bool
+BusyTablesyncs()
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "BusyTablesyncs: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * XXX - When the process_syncing_tables_for_sync changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "BusyTablesyncs: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN()
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 5a0b806..333e988 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -171,7 +171,7 @@ bool		in_streamed_transaction = false;
 static TransactionId stream_xid = InvalidTransactionId;
 
 /* for skipping prepared transaction */
-bool        skip_prepared_txn = false;
+bool		skip_prepared_txn = false;
 
 /*
  * Hash table for storing the streaming xid information along with shared file
@@ -212,6 +212,45 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A context for the prepare spooling
+ */
+static MemoryContext PsfContext = NULL;
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and the
+ * corresponding fileset handles of same.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	SharedFileSet *fileset;		/* shared file set for prepare spoolfile */
+}			PsfHashEntry;
+
+/*
+ * Hash table for storing the Prepared spoolfile info along with shared fileset.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * The spoolfile handle is only valid between begin_prepare and prepare.
+ */
+static BufFile *psf_fd = NULL;
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_cleanup(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -759,6 +798,79 @@ apply_handle_begin_prepare(StringInfo s)
 		return;
 	}
 
+	/*
+	 * A Problem:
+	 *
+	 * By sad timing of apply/tablesync workers it is possible to have a
+	 * "consistent snapshot" that spans prepare/commit in such a way that
+	 * the tablesync did not do the prepare (because snapshot not consistent)
+	 * and the apply worker does the begin prepare ('b') but it skips all
+	 * the prepared operations [e.g. inserts] while the tablesync was still
+	 * busy (see the condition of should_apply_changes_for_rel). Later at the
+	 * commit prepared time when the apply worker does the commit prepare
+	 * ('K'), there is nothing in it (because the inserts were skipped
+	 * earlier).
+	 *
+	 * The following code has a 2-part workaround for that scenario.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Workaround Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, begin_data.end_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (BusyTablesyncs())
+		{
+			elog(DEBUG1, "apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
+			process_syncing_tables(begin_data.end_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Workaround Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN then all messages (until
+		 * prepare) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.end_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
+		{
+			char		psfpath[MAXPGPATH];
+
+			/* The begin_prepare's LSN has been overtaken. */
+
+			/*
+			 * We need a transaction for handling the buffile, used for
+			 * serializing prepared messages. This transaction lasts until the
+			 * commit_prepared/ rollback_prepared.
+			 */
+			ensure_transaction();
+
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+		}
+	}
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	in_remote_transaction = true;
@@ -788,6 +900,27 @@ apply_handle_prepare(StringInfo s)
 		return;
 	}
 
+	if (psf_fd)
+	{
+		/*
+		 * The psf_fd is meaningful only between begin_prepare and prepare.
+		 * So close it now. If we had been writing any messages to the psf_fd
+		 * (the spoolfile) then those will be applied later during
+		 * handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		/*
+		 * And end the transaction that was created by begin_prepare for
+		 * working with the psf buffiles.
+		 */
+		Assert(IsTransactionState());
+		CommitTransactionCommand();
+
+		in_remote_transaction = false;
+		return;
+	}
+
 	Assert(prepare_data.prepare_lsn == remote_final_lsn);
 
 	if (IsTransactionState())
@@ -835,6 +968,7 @@ static void
 apply_handle_commit_prepared(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
 
 	/*
 	 * We don't expect any other transaction data while skipping a prepared
@@ -844,6 +978,55 @@ apply_handle_commit_prepared(StringInfo s)
 
 	logicalrep_read_commit_prepared(s, &prepare_data);
 
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now, and afterwards cleanup the spoolfile.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+
+		/*
+		 * 1. replay the spooled messages
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.end_lsn);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		prepare_spoolfile_cleanup(psfpath);
+
+		/*
+		 * 2. mark as PREPARED
+		 */
+
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+
 	/* there is no transaction when COMMIT PREPARED is called */
 	ensure_transaction();
 
@@ -874,6 +1057,8 @@ static void
 apply_handle_rollback_prepared(StringInfo s)
 {
 	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
 
 	/*
 	 * We don't expect any other transaction data while skipping a prepared
@@ -884,16 +1069,52 @@ apply_handle_rollback_prepared(StringInfo s)
 	logicalrep_read_rollback_prepared(s, &rollback_data);
 
 	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		if (psf_fd)
+		{
+			/*
+			 * XXX - For some reason (possibly a bug) it is currently
+			 * possible to get here, after a restart, when there was a
+			 * begin_prepare but there was NO prepare. Since there was no
+			 * prepare, the psf_fd and the transaction are still lingering so
+			 * they need to be cleaned up now.
+			 */
+			prepare_spoolfile_close();
+
+			/*
+			 * And end the transaction that was created by the begin_prepare
+			 * for working with psf buffiles.
+			 */
+			Assert(IsTransactionState());
+			AbortCurrentTransaction();
+		}
+
+		prepare_spoolfile_cleanup(psfpath);
+	}
+
+	/*
 	 * It is possible that we haven't received prepare because it occurred
 	 * before walsender reached a consistent point in which case we need to
 	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
 	 */
-	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
 	{
 		/*
-		 * Update origin state so we can restart streaming from correct position
-		 * in case of crash.
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
 		 */
 		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
 		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
@@ -941,6 +1162,26 @@ apply_handle_stream_prepare(StringInfo s)
 	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
 
 	/*
+	 * Wait for all the sync workers to reach the SYNCDONE/READY state.
+	 *
+	 * This is the same waiting logic as in apply_handle_begin_prepare
+	 * (see that function for more details about this).
+	 */
+	if (!am_tablesync_worker())
+	{
+		while (BusyTablesyncs())
+		{
+			process_syncing_tables(prepare_data.end_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+	}
+
+	/*
 	 *
 	 * --------------------------------------------------------------------------
 	 * 1. Replay all the spooled operations - Similar code as for
@@ -1007,6 +1248,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		psf_fd == NULL &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1468,6 +1710,9 @@ apply_handle_insert(StringInfo s)
 	if (skip_prepared_txn)
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1592,6 +1837,9 @@ apply_handle_update(StringInfo s)
 	if (skip_prepared_txn)
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1753,6 +2001,9 @@ apply_handle_delete(StringInfo s)
 	if (skip_prepared_txn)
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -2125,6 +2376,9 @@ apply_handle_truncate(StringInfo s)
 	if (skip_prepared_txn)
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -2418,6 +2672,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	}
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL		hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2433,6 +2704,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 													"LogicalStreamingContext",
 													ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used when the prepare spooling is used. It is
+	 * reset at prepare commit/rollback time.
+	 */
+	PsfContext = AllocSetContextCreate(ApplyContext,
+									   "PsfContext",
+									   ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -2537,7 +2816,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && psf_fd == NULL)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -3462,3 +3741,320 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	if (psf_fd == NULL)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	MemoryContext oldctx;
+	bool		found;
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(psf_fd == NULL);
+
+	/* create or find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  &found);
+
+	/*
+	 * Create/open the bufFiles under the Prepare Spoolfile Context so that we
+	 * have those files until prepare commit/rollback.
+	 */
+	oldctx = MemoryContextSwitchTo(PsfContext);
+
+	if (!found)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		elog(DEBUG1, "Not found file \"%s\"", path);
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		psf_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember this path's fileset for the next time */
+		memcpy(hentry->name, path, MAXPGPATH);
+		hentry->fileset = fileset;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the beginning because we always want to
+		 * create/overwrite this file.
+		 */
+		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		psf_fd = BufFileOpenShared(hentry->fileset, path, O_RDWR);
+		BufFileSeek(psf_fd, 0, 0L, SEEK_SET);
+	}
+
+	MemoryContextSwitchTo(oldctx);
+
+	/* Sanity check */
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close()
+{
+	if (psf_fd)
+		BufFileClose(psf_fd);
+	psf_fd = NULL;
+}
+
+/*
+ * Delete the specified psf spoolfile.
+ */
+static void
+prepare_spoolfile_cleanup(char *path)
+{
+	PsfHashEntry *hentry;
+
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* And remove the path entry from the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_REMOVE,
+										  NULL);
+
+	/* By this time we must have created the entry */
+	Assert(hentry != NULL);
+
+	/* Delete the file and release the fileset memory */
+	SharedFileSetDeleteAll(hentry->fileset);
+	pfree(hentry->fileset);
+	hentry->fileset = NULL;
+
+	/* Reset the memory context used during the file creation */
+	MemoryContextReset(PsfContext);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(psf_fd != NULL);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(psf_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(psf_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(psf_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Is there a prepare spoolfile for the specified gid?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool		found;
+
+	/* Find the prepare spoolfile entry in the psf_hash */
+	hash_search(psf_hash,
+				path,
+				HASH_FIND,
+				&found);
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ *
+ * [Note: this is mostly copied code from apply_spooled_messages function]
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	bool		found = false;
+	PsfHashEntry *hentry;
+	BufFile    *fd;
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* Open the spool file */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_FIND,
+										  &found);
+	Assert(found);
+	fd = BufFileOpenShared(hentry->fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	remote_final_lsn = lsn;
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldctx2);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	BufFileClose(fd);
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly the same length and padded
+	 * with '\0' so garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "%u-%s.prep_changes", subid, gid);
+}
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..602b4f7 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool BusyTablesyncs(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 048681c..b7da03b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1958,6 +1958,7 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfContext
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
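
The two-part workaround in apply_handle_begin_prepare comes down to two scans over the cached table states: "is any tablesync still busy?" and "did any tablesync get past this begin_prepare LSN?". The following is only a minimal standalone sketch of that decision, not the patch code. The DemoRelState type, the demo_* function names and the state characters are stand-ins invented for illustration, and plain arrays replace the List caches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for XLogRecPtr and the SUBREL_STATE_* codes. */
typedef uint64_t DemoLsn;
enum
{
	DEMO_STATE_DATASYNC = 'd',
	DEMO_STATE_SYNCDONE = 's',
	DEMO_STATE_READY = 'r'
};

typedef struct DemoRelState
{
	unsigned	relid;
	char		state;
	DemoLsn		lsn;
} DemoRelState;

/* Part 1: true while some table is neither SYNCDONE nor READY. */
static bool
demo_busy_tablesyncs(const DemoRelState *rels, size_t n)
{
	assert(rels != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (rels[i].state != DEMO_STATE_SYNCDONE &&
			rels[i].state != DEMO_STATE_READY)
			return true;
	}
	return false;
}

/* Largest LSN recorded by any tablesync; 0 plays the role of "invalid". */
static DemoLsn
demo_biggest_tablesync_lsn(const DemoRelState *rels, size_t n)
{
	DemoLsn		biggest = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (rels[i].lsn > biggest)
			biggest = rels[i].lsn;
	}
	return biggest;
}

/* Part 2: spool only if some tablesync overtook the begin_prepare LSN. */
static bool
demo_must_spool(DemoLsn begin_end_lsn, const DemoRelState *rels, size_t n)
{
	return begin_end_lsn < demo_biggest_tablesync_lsn(rels, n);
}
```

In this sketch the caller would loop (with a wait) while demo_busy_tablesyncs() returns true, and only then ask demo_must_spool() whether the messages up to the prepare need to go to the spoolfile.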

Attachment: v44-0008-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From 2d0138a4514093d54371f40c237de556727022eb Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 15:31:47 +1100
Subject: [PATCH v44] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch only adds some developer logging which may help when debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 +++++++++++++----
 src/backend/replication/logical/worker.c    | 50 ++++++++++++++++++++---------
 2 files changed, 58 insertions(+), 21 deletions(-)
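
For anyone reading the A/B/C write logs added below: each spooled change is stored as a length, then the action byte, then the message payload, where the length counts the action byte plus the payload but not the length field itself. Here is a minimal standalone sketch of that record layout and the matching read loop. It assumes plain stdio in place of the BufFile API, and the demo_* names are made up for illustration, so treat it as a sketch of the format rather than the patch implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write one record: [int len][char action][payload bytes]. */
static int
demo_spool_write(FILE *fp, char action, const char *payload, int payload_len)
{
	/* on-disk size: action byte plus payload, excluding the length field */
	int			len = payload_len + (int) sizeof(char);

	assert(fp != NULL && payload_len >= 0);

	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Read the next record into *buf (grown as needed). Returns the record
 * length, 0 at a clean end of file, or -1 on a short read or other error.
 */
static int
demo_spool_read(FILE *fp, char **buf)
{
	int			len;
	char	   *tmp;
	size_t		nbytes = fread(&len, 1, sizeof(len), fp);

	if (nbytes == 0)
		return ferror(fp) ? -1 : 0;
	if (nbytes != sizeof(len) || len <= 0)
		return -1;

	tmp = realloc(*buf, (size_t) len);
	if (tmp == NULL)
		return -1;
	*buf = tmp;

	if (fread(*buf, 1, (size_t) len, fp) != (size_t) len)
		return -1;
	return len;
}

/* Replay from the start; record each action byte and return the count. */
static int
demo_spool_replay(FILE *fp, char *actions_out, int max_actions)
{
	char	   *buf = NULL;
	int			nchanges = 0;
	int			len;

	rewind(fp);
	while ((len = demo_spool_read(fp, &buf)) > 0)
	{
		if (nchanges < max_actions)
			actions_out[nchanges] = buf[0];	/* first byte is the action */
		nchanges++;
	}
	free(buf);
	return (len < 0) ? -1 : nchanges;
}
```

The first byte of every record read back is the action code, which is what lets the replay side hand the buffer straight to a dispatcher.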

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 42ecc53..ac11a2d 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1131,6 +1137,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1153,6 +1161,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1160,12 +1169,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1179,6 +1193,8 @@ BusyTablesyncs()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> BusyTablesyncs");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1190,8 +1206,8 @@ BusyTablesyncs()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "BusyTablesyncs: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> BusyTablesyncs: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1208,6 +1224,7 @@ BusyTablesyncs()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> BusyTablesyncs: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1219,8 +1236,8 @@ BusyTablesyncs()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "BusyTablesyncs: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> BusyTablesyncs: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1246,8 +1263,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 333e988..0ce6ca2 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -821,14 +821,14 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, begin_data.end_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, begin_data.end_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (BusyTablesyncs())
 		{
-			elog(DEBUG1, "apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
 
 			process_syncing_tables(begin_data.end_lsn);
 
@@ -988,6 +988,8 @@ apply_handle_commit_prepared(StringInfo s)
 	{
 		int			nchanges;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * 1. replay the spooled messages
 		 */
@@ -995,8 +997,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.end_lsn);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		prepare_spoolfile_cleanup(psfpath);
@@ -1108,6 +1110,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf=%d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2537,18 +2540,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3755,6 +3762,11 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_fd ? "Do" : "Don't");
+
 	if (psf_fd == NULL)
 		return false;
 
@@ -3776,7 +3788,7 @@ prepare_spoolfile_create(char *path)
 	bool		found;
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(psf_fd == NULL);
 
@@ -3797,7 +3809,7 @@ prepare_spoolfile_create(char *path)
 		MemoryContext savectx;
 		SharedFileSet *fileset;
 
-		elog(DEBUG1, "Not found file \"%s\"", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		savectx = MemoryContextSwitchTo(ApplyContext);
 		fileset = palloc(sizeof(SharedFileSet));
 
@@ -3816,7 +3828,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the beginning because we always want to
 		 * create/overwrite this file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		elog(LOG, "!!> Found file \"%s\". Overwrite it.", path);
 		psf_fd = BufFileOpenShared(hentry->fileset, path, O_RDWR);
 		BufFileSeek(psf_fd, 0, 0L, SEEK_SET);
 	}
@@ -3833,6 +3845,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_fd)
 		BufFileClose(psf_fd);
 	psf_fd = NULL;
@@ -3846,6 +3859,8 @@ prepare_spoolfile_cleanup(char *path)
 {
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_cleanup: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3881,20 +3896,23 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_fd != NULL);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	BufFileWrite(psf_fd, &len, sizeof(len));
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	BufFileWrite(psf_fd, &action, sizeof(action));
 
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	BufFileWrite(psf_fd, &s->data[s->cursor], len);
 }
 
@@ -3912,8 +3930,8 @@ prepare_spoolfile_exists(char *path)
 				HASH_FIND,
 				&found);
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
 		 path,
 		 found ? "found" : "not found");
 
@@ -3937,8 +3955,8 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)
 	PsfHashEntry *hentry;
 	BufFile    *fd;
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3983,6 +4001,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)
 
 		/* read length of the on-disk record */
 		nbytes = BufFileRead(fd, &len, sizeof(len));
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -4001,6 +4020,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		if (BufFileRead(fd, buffer, len) != len)
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -4023,7 +4043,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(LOG, "!!> replayed %d changes from file '%s'",
 				 nchanges, path);
 	}
 
@@ -4032,7 +4052,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr lsn)
 	pfree(buffer);
 	pfree(s2.data);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
-- 
1.8.3.1
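
For readers following the debug output above: the spool file records appear to be laid out as a length, then the action character, then the rest of the message, and the replay loop reads the length first and then that many bytes. The following is only a minimal sketch of that framing using plain stdio instead of the BufFile API. The function names spool_write and spool_read are hypothetical and are not part of the patch.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for the write side: one record is
 * [int len][char action][payload], where len counts action + payload.
 * Returns 0 on success, -1 on a short write.
 */
static int
spool_write(FILE *fp, char action, const char *payload, int payload_len)
{
	/* total on-disk size, including the action type character */
	int			len = payload_len + (int) sizeof(char);

	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Hypothetical stand-in for the replay side: reads one record into *buf
 * (grown with realloc) and stores its size in *len.
 * Returns 1 if a record was read, 0 on clean end of file, -1 on a
 * truncated or implausible record.
 */
static int
spool_read(FILE *fp, char **buf, int *len)
{
	size_t		nbytes = fread(len, 1, sizeof(*len), fp);
	char	   *tmp;

	/* have we reached end of the file? */
	if (nbytes == 0)
		return 0;
	if (nbytes != sizeof(*len) || *len < 1)
		return -1;

	tmp = realloc(*buf, (size_t) *len);
	if (tmp == NULL)
		return -1;
	*buf = tmp;

	/* the record body: action character followed by the payload */
	if (fread(*buf, 1, (size_t) *len, fp) != (size_t) *len)
		return -1;
	return 1;
}
```

A caller would write records with spool_write, rewind the stream, and then call spool_read in a loop until it returns 0. The length is stored in native byte order, so such a file is only readable on the machine that wrote it.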

#203Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#201)

On Fri, Feb 26, 2021 at 9:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Feb 25, 2021 at 12:32 PM Peter Smith <smithpb2250@gmail.com> wrote:

5. You need to write/sync the spool file at prepare time because after
restart between prepare and commit prepared the changes can be lost
and won't be resent by the publisher assuming there are commits of
other transactions between prepare and commit prepared. For the same
reason, I am not sure if we can just rely on the in-memory hash table
for it (prepare_spoolfile_exists). Sure, if it exists and there is no
restart then it would be cheap to check in the hash table but I don't
think it is guaranteed.

As we can't rely on the hash table, I think we can get rid of it and
always check if the corresponding file exists.
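
To illustrate the idea of asking the filesystem instead of an in-memory hash table, here is a minimal sketch. It assumes the spool file is an ordinary file reachable by a path, which is a simplification: the patch uses BufFile/SharedFileSet, and the helper name spoolfile_exists_on_disk is hypothetical.

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: report whether a regular file exists at "path".
 * Unlike a lookup in a backend-local hash table, this answer survives a
 * restart, because it depends only on what is on disk.
 */
static bool
spoolfile_exists_on_disk(const char *path)
{
	struct stat st;

	if (stat(path, &st) == 0)
		return S_ISREG(st.st_mode);

	/* In this sketch any stat() failure is treated as "not found". */
	return false;
}
```

The check costs a system call per lookup, which is the price of not depending on state that is lost on restart.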

Few more comments on v43-0007-Fix-apply-worker-empty-prepare
====================================================
1.
+ * So the "table_states_not_ready" list might end up having a READY
+ * state it it even though

The above sentence doesn't sound correct to me.

2.
@@ -759,6 +798,79 @@ apply_handle_begin_prepare(StringInfo s)
{
..
+ */
+ if (!am_tablesync_worker())
+ {

I think here we should have an Assert for tablesync worker because it
should never receive prepare.

3.
+ while (BusyTablesyncs())
+ {
+ elog(DEBUG1, "apply_handle_begin_prepare - waiting for all sync
workers to be DONE/READY");
+
+ process_syncing_tables(begin_data.end_lsn);

..
+ if (begin_data.end_lsn < BiggestTablesyncLSN()

In both of the above places, you need to use begin_data.final_lsn because
the prepare has not yet been replayed, so we can't use its end_lsn for
the sync-up.

4.
+/*
+ * Are there any tablesyncs which have still not yet reached
SYNCDONE/READY state?
+ */
+bool
+BusyTablesyncs()

The function name is not clear enough. Can we change it to something
like AnyTableSyncInProgress?

5.
+/*
+ * Are there any tablesyncs which have still not yet reached
SYNCDONE/READY state?
+ */
+bool
+BusyTablesyncs()
{
..
+ /*
+ * XXX - When the process_syncing_tables_for_sync changes the state
+ * from SYNCDONE to READY, that change is actually written directly

In the above comment, do you mean process_syncing_tables_for_apply,
because that is where we change the state to READY? And, I don't think we
need to mark this comment as XXX.

6.
+ * XXX - Is there a potential timing problem here - e.g. if signal arrives
+ * while executing this then maybe we will set table_states_valid without
+ * refetching them?
+ */
+static void
+FetchTableStates(bool *started_tx)
..

Can you explain which race condition you are worried about here which
is not possible earlier but can happen after this patch?

7.
@@ -941,6 +1162,26 @@ apply_handle_stream_prepare(StringInfo s)
elog(DEBUG1, "received prepare for streamed transaction %u", xid);

  /*
+ * Wait for all the sync workers to reach the SYNCDONE/READY state.
+ *
+ * This is same waiting logic as in appy_handle_begin_prepare function
+ * (see that function for more details about this).
+ */
+ if (!am_tablesync_worker())
+ {
+ while (BusyTablesyncs())
+ {
+ process_syncing_tables(prepare_data.end_lsn);
+
+ /* This latch is to prevent 100% CPU looping. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+ ResetLatch(MyLatch);
+ }
+ }

I think we need similar handling in stream_prepare as in begin_prepare
for writing to the spool file because this has the same danger. But here
we need to write it to the xid spool file in StreamXidHash. Another thing
we need to ensure is that the file is synced in stream prepare so that it
can survive restarts. Then in apply_handle_commit_prepared, after checking
for the prepared spool file, we need to check for the existence of the xid
spool file, and if it exists, apply the messages from that file.

Again, like begin_prepare, in apply_handle_stream_prepare also we
should have an Assert for table sync worker.

I feel that the 2PC and streaming case is a bit complicated to deal with.
How about, for now, we don't allow users to enable streaming if the 2PC
option is enabled for the Subscription? This requires some change (error
out if both streaming and 2PC options are enabled) in both
createsubscription and altersubscription but that change should be
fairly small. If we follow this, then in apply_dispatch (for case
LOGICAL_REP_MSG_STREAM_PREPARE), we should report an ERROR "invalid
logical replication message type".
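
The proposed restriction amounts to a small mutual-exclusion check on the subscription options. A minimal sketch follows. The struct SubOptsSketch, the function check_subscription_options, and the message text are all hypothetical, and the real change would sit in the CREATE/ALTER SUBSCRIPTION option parsing and use ereport.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the two options being discussed. */
typedef struct SubOptsSketch
{
	bool		streaming;		/* streaming of in-progress transactions */
	bool		twophase;		/* decoding of two-phase transactions */
} SubOptsSketch;

/*
 * Hypothetical validation step: returns NULL if the combination is
 * acceptable, otherwise a message describing the conflict.
 */
static const char *
check_subscription_options(const SubOptsSketch *opts)
{
	if (opts->streaming && opts->twophase)
		return "streaming and two-phase options cannot both be enabled";

	return NULL;
}
```

Because the same rule has to hold after ALTER SUBSCRIPTION, the check would need to run against the merged old and new option values, not just the ones named in the command.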

--
With Regards,
Amit Kapila.

#204osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#199)
RE: [HACKERS] logical decoding of two-phase transactions

Hi

On Thursday, February 25, 2021 4:02 PM Peter Smith <smithpb2250@gmail.com>

Please find attached the latest patch set v43*

- Added new patch 0007 "Fix apply worker prepare" as discussed here [2]

[2]
/messages/by-id/CAA4eK1L=dhuCRvyDvrXX5wZgc7s1hLRD29CKCK6oaHtVCPgiFA%40mail.gmail.com

I tested the scenario described in [2], in which
prepared transaction data was skipped and
the replica went out of sync.
And, as you said, the problem is addressed in v43.

I used twophase_snapshot.spec as a reference
for the flow (e.g. how to make a consistent snapshot
between prepare and commit prepared). This time,
as an alternative to the SQL API (pg_create_logical_replication_slot),
I issued CREATE SUBSCRIPTION; other than that,
I mainly followed the flow in the spec file.

I checked that the replica has the same data at the end of this test,
which means the spoolfile mechanism works.

Best Regards,
Takamichi Osumi

#205Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#203)

On Fri, Feb 26, 2021 at 9:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

6.
+ * XXX - Is there a potential timing problem here - e.g. if signal arrives
+ * while executing this then maybe we will set table_states_valid without
+ * refetching them?
+ */
+static void
+FetchTableStates(bool *started_tx)
..

Can you explain which race condition you are worried about here which
is not possible earlier but can happen after this patch?

Yes, my question (in that XXX comment) was not about anything new for
the current patch, because this FetchTableStates function has exactly
the same logic as the HEAD code.

I was only wondering if there is any possibility that one of the
function calls (inside the if block) can end up calling
CHECK_FOR_INTERRUPTS. If that could happen, then perhaps the
table_states_valid flag could be assigned false (by the
invalidate_syncing_table_states signal handler) only to be
immediately/wrongly overwritten as table_states_valid = true in this
FetchTableStates code.
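
The pattern being asked about can be shown in isolation. The sketch below is a standalone simulation, not PostgreSQL code: invalidate_cb, fetch_naive and fetch_guarded are hypothetical names, and the "callback fires mid-rebuild" step is injected by hand. Whether such an interleaving can actually occur in FetchTableStates is a separate question that the sketch does not answer.

```c
#include <stdbool.h>
#include <stddef.h>

static bool table_states_valid = false;
static int	invalidation_count = 0;	/* bumped on every invalidation */

/* Hypothetical stand-in for the invalidation callback. */
static void
invalidate_cb(void)
{
	table_states_valid = false;
	invalidation_count++;
}

/*
 * Naive rebuild: marks the cache valid unconditionally, so an
 * invalidation delivered while rebuilding is silently overwritten.
 */
static void
fetch_naive(void (*during_fetch) (void))
{
	if (!table_states_valid)
	{
		if (during_fetch != NULL)
			during_fetch();		/* simulated mid-rebuild invalidation */
		table_states_valid = true;
	}
}

/*
 * Guarded rebuild: marks the cache valid only if no invalidation was
 * seen since the rebuild started, so the next call rebuilds again.
 */
static void
fetch_guarded(void (*during_fetch) (void))
{
	if (!table_states_valid)
	{
		int			seen = invalidation_count;

		if (during_fetch != NULL)
			during_fetch();		/* simulated mid-rebuild invalidation */
		if (seen == invalidation_count)
			table_states_valid = true;
	}
}
```

With the naive version the flag ends up true even though the data is stale. With the guarded version it stays false until a rebuild completes without interference.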

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#206Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#205)

On Sat, Feb 27, 2021 at 7:31 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Feb 26, 2021 at 9:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

6.
+ * XXX - Is there a potential timing problem here - e.g. if signal arrives
+ * while executing this then maybe we will set table_states_valid without
+ * refetching them?
+ */
+static void
+FetchTableStates(bool *started_tx)
..

Can you explain which race condition you are worried about here which
is not possible earlier but can happen after this patch?

Yes, my question (in that XXX comment) was not about anything new for
the current patch, because this FetchTableStates function has exactly
the same logic as the HEAD code.

I was only wondering if there is any possibility that one of the
function calls (inside the if block) can end up calling
CHECK_FOR_INTERRUPTS. If that could happen, then perhaps the
table_states_valid flag could be assigned false (by the
invalidate_syncing_table_states signal handler) only to be
immediately/wrongly overwritten as table_states_valid = true in this
FetchTableStates code.

This is not related to CHECK_FOR_INTERRUPTS. The
invalidate_syncing_table_states() can be called only when we process
invalidation messages, which we do while locking the relation via
GetSubscriptionRelations->table_open->relation_open->LockRelationOid.
After that, it won't be done in that part of the code. So, I think we
don't need this comment.

--
With Regards,
Amit Kapila.

#207Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#202)
8 attachment(s)

Please find attached the latest patch set v45*

Differences from v44*:

* Rebased to HEAD

* Addressed some feedback comments for the 0007 ("empty prepare") patch.

[ak1] #1 - TODO
[ak1] #2 - Fixed. Removed #if 0 debugging
[ak1] #3 - TODO
[ak1] #4 - Fixed. Now BEGIN_PREPARE and PREPARE msgs are spooled. The
lsns are obtained from them.
[ak1] #5 - TODO

[ak2] #1 - Fixed. Bad comment text
[ak2] #2 - Fixed. Added Assert that tablesync should never receive prepares
[ak2] #3 - Fixed. Use correct lsns for sync wait loop, and BiggestLSN checks
[ak2] #4 - Fixed. Rename Busytablesyncs to AnyTablesyncInProgress
[ak2] #5 - Fixed. Typo in comment. Removed XXX
[ak2] #6 - Fixed. Remove unwarranted XXX comment for FetchTableStates
[ak2] #7 - TODO

-----
[ak1] /messages/by-id/CAA4eK1JWNitcTrcD51vLrh2GxKxVau0EU-5UCg6K9ZNQzPcz+Q@mail.gmail.com
[ak2] /messages/by-id/CAA4eK1LodEqax+xYOYdqgY5oEM54TdjagA0zT7QjKiC0NRNv=g@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v45-0004-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 4a886e0dbeba5ec10e3f3908a3c7823efc934f02 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 11:46:32 +1100
Subject: [PATCH v45] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and not streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction.
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Set up a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Set up a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
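
For anyone checking the expected counts in the TAP test above: the 3334 figure follows from the test's own statements. Rows 3..5000 are inserted, the UPDATE does not change the row count, and the DELETE removes every row with mod(a,3) = 0. A minimal sketch of that arithmetic, assuming (it is not shown in this hunk) that the two initial rows have a = 1 and a = 2; the function name is made up for illustration:

```c
#include <assert.h>

/*
 * Count the integers in [lo, hi] that are NOT divisible by k.
 * Models "rows with a in lo..hi, then DELETE ... WHERE mod(a, k) = 0".
 * Hypothetical helper, not part of the patch.
 */
int
rows_left_after_delete(int lo, int hi, int k)
{
	int			multiples;

	assert(lo >= 1 && hi >= lo && k >= 1);

	/* multiples of k in [1, hi] minus multiples of k in [1, lo - 1] */
	multiples = hi / k - (lo - 1) / k;

	return (hi - lo + 1) - multiples;
}
```

With lo = 1, hi = 5000 and k = 3 there are 1666 multiples of 3, so 5000 - 1666 = 3334 rows remain, which is what the qq(3334|3334|3334) checks expect.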

v45-0001-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From 1f6a00eb0fcfb2ea39abd2c4e85aed68f54b96b0 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 09:32:46 +1100
Subject: [PATCH v45] Refactor spool-file logic in worker.c.

This patch is a pure refactoring: it isolates the streaming spool-file
processing into a separate function. A later patch that adds support for
applying prepared transactions will need to call this common processing
logic from another place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..45ac498 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v45-0002-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From cf86d6386bfcf6d0d77c39219e344fcc59bb59f2 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 09:46:37 +1100
Subject: [PATCH v45] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replica origin replay progress for
2PC, but it was incomplete. It did not properly track the progress for
rollback prepared; in particular, it did not update the recovery code.
Additionally, we need to allow tracking it on subscriber nodes where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20..6023e7c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2276,6 +2276,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2298,6 +2306,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df..acdb28d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5716,8 +5716,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5923,7 +5922,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5972,6 +5972,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6013,7 +6020,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6021,7 +6029,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1

v45-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From df5eb944155f4885f5834fd11ddbf262c2a95e8d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 11:31:59 +1100
Subject: [PATCH v45] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

* We allow skipping prepared transactions if they are already prepared.
We skip only when the GID, origin_lsn, and origin_timestamp of a prepared
xact all match, to avoid the possibility of matching a prepared xact from
two different nodes. A prepare can be received again when the server or
apply worker restarts after a prepared transaction.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  68 ++++++
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 258 ++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 329 ++++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 177 ++++++++++++---
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 ++++++-
 src/include/replication/reorderbuffer.h     |  12 +
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 892 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid the possibility of matching
+ * a prepared xact from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts to be common, so we perform all I/O while
+			 * holding TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..1585754 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,264 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 45ac498..15b7c99 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -169,6 +170,9 @@ bool		in_streamed_transaction = false;
 
 static TransactionId stream_xid = InvalidTransactionId;
 
+/* for skipping prepared transaction */
+bool		skip_prepared_txn = false;
+
 /*
  * Hash table for storing the streaming xid information along with shared file
  * set for streaming and subxact files.
@@ -690,6 +694,12 @@ apply_handle_begin(StringInfo s)
 {
 	LogicalRepBeginData begin_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_begin(s, &begin_data);
 
 	remote_final_lsn = begin_data.final_lsn;
@@ -709,6 +719,12 @@ apply_handle_commit(StringInfo s)
 {
 	LogicalRepCommitData commit_data;
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_commit(s, &commit_data);
 
 	Assert(commit_data.commit_lsn == remote_final_lsn);
@@ -722,6 +738,263 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+	{
+		/*
+		 * If this gid has already been prepared then we don't want to apply
+		 * this txn again. This can happen after restart where upstream can
+		 * send the prepared transaction again. See
+		 * ReorderBufferFinishPrepared. Don't update remote_final_lsn.
+		 */
+		skip_prepared_txn = true;
+		return;
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (skip_prepared_txn)
+	{
+		/*
+		 * If we are skipping this transaction because it was previously
+		 * prepared, ignore it and reset the flag.
+		 */
+		Assert(LookupGXact(prepare_data.gid, prepare_data.end_lsn,
+						   prepare_data.preparetime));
+		skip_prepared_txn = false;
+		return;
+	}
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received the prepare because it occurred
+	 * before the walsender reached a consistent point, in which case we need
+	 * to skip the rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -753,6 +1026,12 @@ apply_handle_stream_start(StringInfo s)
 	Assert(!in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Start a transaction on stream start, this transaction will be committed
 	 * on the stream stop unless it is a tablesync worker in which case it
 	 * will be committed after processing all the messages. We need the
@@ -800,6 +1079,12 @@ apply_handle_stream_stop(StringInfo s)
 	Assert(in_streamed_transaction);
 
 	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
+	/*
 	 * Close the file with serialized changes, and serialize information about
 	 * subxacts for the toplevel transaction.
 	 */
@@ -831,6 +1116,12 @@ apply_handle_stream_abort(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	logicalrep_read_stream_abort(s, &xid, &subxid);
 
 	/*
@@ -1046,6 +1337,12 @@ apply_handle_stream_commit(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/*
+	 * We don't expect any other transaction data while skipping a prepared
+	 * xact.
+	 */
+	Assert(!skip_prepared_txn);
+
 	xid = logicalrep_read_stream_commit(s, &commit_data);
 
 	elog(DEBUG1, "received commit for streamed transaction %u", xid);
@@ -1168,6 +1465,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1289,6 +1589,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1447,6 +1750,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1816,6 +2122,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (skip_prepared_txn)
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1972,6 +2281,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2bf1295 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +845,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -870,6 +927,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1195,3 +1270,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..358b14a 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure is used to
+ * hold the information for both prepare and commit prepared. For commit
+ * prepared, prepare_lsn and preparetime store the commit LSN and the
+ * commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bab31bf..6bb162e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bab4f3a..048681c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
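
For readers following the worker.c changes above, the skip logic can be sketched as a small standalone state machine. This is only an illustration under stated assumptions: ApplyState, gid_is_prepared(), handle_begin_prepare(), handle_change() and handle_prepare() are hypothetical names, and the in-memory gid array stands in for whatever LookupGXact() consults on the subscriber. It is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the subscriber's set of prepared transactions. */
#define MAX_PREPARED 8
#define GID_LEN 200

typedef struct ApplyState
{
	char		prepared[MAX_PREPARED][GID_LEN];	/* gids prepared locally */
	int			nprepared;
	bool		skip_prepared_txn;	/* plays the role of the worker.c flag */
	int			changes_applied;	/* changes that were not skipped */
} ApplyState;

/* Assumption: a plain gid comparison replaces the real lookup. */
static bool
gid_is_prepared(const ApplyState *st, const char *gid)
{
	for (int i = 0; i < st->nprepared; i++)
		if (strcmp(st->prepared[i], gid) == 0)
			return true;
	return false;
}

/* BEGIN PREPARE: decide whether the whole transaction must be skipped. */
void
handle_begin_prepare(ApplyState *st, const char *gid)
{
	/* No other transaction may start while a skip is in progress. */
	assert(!st->skip_prepared_txn);
	if (gid_is_prepared(st, gid))
		st->skip_prepared_txn = true;
}

/* INSERT/UPDATE/DELETE/TRUNCATE: ignored while skipping. */
void
handle_change(ApplyState *st)
{
	if (st->skip_prepared_txn)
		return;
	st->changes_applied++;
}

/* PREPARE: either end the skip or remember the gid as prepared. */
void
handle_prepare(ApplyState *st, const char *gid)
{
	if (st->skip_prepared_txn)
	{
		assert(gid_is_prepared(st, gid));
		st->skip_prepared_txn = false;
		return;
	}
	assert(st->nprepared < MAX_PREPARED);
	snprintf(st->prepared[st->nprepared], GID_LEN, "%s", gid);
	st->nprepared++;
}
```

Feeding the same BEGIN PREPARE / changes / PREPARE sequence twice into a zero-initialized ApplyState leaves changes_applied and nprepared unchanged after the second pass, which is the behaviour the asserts in the apply handlers are meant to protect.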

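The comment on LogicalRepRollbackPreparedTxnData above says the gid alone is not enough to identify a replicated prepared transaction. A minimal sketch of that idea follows, assuming the lookup compares the gid together with the origin LSN and origin timestamp recorded at prepare time. PreparedEntry and lookup_prepared() are hypothetical names, the integer types merely stand in for XLogRecPtr and TimestampTz, and the real LookupGXact() in twophase.c (not shown in this part of the patch) may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of one prepared transaction on the subscriber. */
typedef struct PreparedEntry
{
	const char *gid;
	uint64_t	origin_lsn;			/* stand-in for XLogRecPtr */
	int64_t		origin_timestamp;	/* stand-in for TimestampTz */
} PreparedEntry;

/*
 * Return true only if an entry matches the gid AND the prepare position and
 * time sent by the publisher. A local transaction that merely reuses the
 * same gid does not match.
 */
bool
lookup_prepared(const PreparedEntry *entries, size_t n, const char *gid,
				uint64_t prepare_end_lsn, int64_t prepare_time)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		if (strcmp(entries[i].gid, gid) == 0 &&
			entries[i].origin_lsn == prepare_end_lsn &&
			entries[i].origin_timestamp == prepare_time)
			return true;
	}
	return false;
}
```

With this shape, a ROLLBACK PREPARED for a gid that exists locally but was prepared by an unrelated local session is reported as "not found" and can be skipped.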
Attachment: v45-0005-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 5dc06ffe75214bb27a855fa421101751babba315 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 12:29:03 +1100
Subject: [PATCH v45] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)
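
The option handling added to parse_subscription_options() follows the usual "given" flag pattern: the value defaults to off, and naming the option twice is an error. A standalone sketch of that pattern is shown here, assuming a simplified option list. Option, ParseResult and parse_two_phase() are hypothetical names used only for illustration, and the return codes replace the ereport(ERROR, ...) calls of the real function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified form of one WITH (...) option. */
typedef struct Option
{
	const char *name;
	bool		value;
} Option;

typedef enum ParseResult
{
	PARSE_OK,
	PARSE_CONFLICT,				/* "conflicting or redundant options" */
	PARSE_UNRECOGNIZED			/* unknown option name */
} ParseResult;

/*
 * Scan the option list for "two_phase". The output flags are reset first,
 * so an absent option yields the default (off).
 */
ParseResult
parse_two_phase(const Option *opts, size_t nopts,
				bool *twophase_given, bool *twophase)
{
	assert(twophase_given != NULL && twophase != NULL);

	*twophase_given = false;
	*twophase = false;

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "two_phase") == 0)
		{
			if (*twophase_given)
				return PARSE_CONFLICT;
			*twophase_given = true;
			*twophase = opts[i].value;
		}
		else
			return PARSE_UNRECOGNIZED;
	}
	return PARSE_OK;
}
```

Keeping a separate "given" flag is what lets ALTER SUBSCRIPTION update the catalog column only when the option was actually specified.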

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..9c23497 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -184,8 +184,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd..55dd8da 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1167,7 +1167,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..a069c76 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -358,6 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +399,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +468,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -825,6 +844,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -835,7 +856,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -874,6 +896,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -892,7 +921,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +967,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1013,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..74787d1 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -427,6 +427,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 15b7c99..f837d71 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2786,6 +2786,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3432,6 +3433,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2bf1295..3a1b404 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with a sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 358b14a..eebedda 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..41e0d8c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..8d24b2e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,29 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..5c79dbd 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
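
For anyone reading the pgoutput_startup() hunk in the patch above, the decision it makes about two-phase commit can be sketched in isolation as below. This is only an illustrative model, not code from the patch: the enum, the function name decide_twophase() and the PROTO_2PC_VERSION_NUM macro are hypothetical stand-ins, and the real code raises ereport(ERROR, ...) instead of returning a status. The order of the checks follows the hunk: a disabled option wins first, then the protocol version is checked, then the plugin capability.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for LOGICALREP_PROTO_2PC_VERSION_NUM; the value 2
 * is the one this patch version adds to logicalproto.h.
 */
#define PROTO_2PC_VERSION_NUM 2

/* Hypothetical result codes; the patch itself reports errors via ereport(). */
typedef enum
{
	TWOPHASE_DISABLED,				/* option not requested, flag cleared */
	TWOPHASE_ENABLED,				/* requested and supported */
	TWOPHASE_ERR_PROTO_TOO_OLD,		/* proto_version below the 2PC minimum */
	TWOPHASE_ERR_PLUGIN_UNSUPPORTED	/* decoding context lacks 2PC support */
} TwoPhaseDecision;

/*
 * Model of the startup decision: "requested" is the parsed two_phase
 * option, "protocol_version" is the client's proto_version, and
 * "plugin_supports" models the value of ctx->twophase on entry.
 */
static TwoPhaseDecision
decide_twophase(bool requested, unsigned int protocol_version,
				bool plugin_supports)
{
	if (!requested)
		return TWOPHASE_DISABLED;
	if (protocol_version < PROTO_2PC_VERSION_NUM)
		return TWOPHASE_ERR_PROTO_TOO_OLD;
	if (!plugin_supports)
		return TWOPHASE_ERR_PLUGIN_UNSUPPORTED;
	return TWOPHASE_ENABLED;
}
```

In this model, a subscriber that does not ask for two_phase always gets TWOPHASE_DISABLED, whatever the protocol version or plugin capability, which matches the "disabled by default" wording in the patch comment.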

v45-0006-Tablesync-early-exit.patch (application/octet-stream)
From c59348287c4af0884a1aae2d50ac84dee5e28b85 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 26 Feb 2021 13:23:40 +1100
Subject: [PATCH v45] Tablesync early exit.

Give the tablesync worker an opportunity to see if it can exit immediately
(because it has already caught up) without first needing to process a
message to discover that.
---
 src/backend/replication/logical/worker.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f837d71..5a0b806 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2407,6 +2407,16 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	bool		ping_sent = false;
 	TimeLineID	tli;
 
+	if (am_tablesync_worker())
+	{
+		/*
+		 * Give the tablesync worker an opportunity to see if it can immediately
+		 * exit, instead of always handling a message (which maybe the apply
+		 * worker could have handled).
+		 */
+		process_syncing_tables(last_received);
+	}
+
 	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
-- 
1.8.3.1

v45-0007-Fix-apply-worker-empty-prepare.patch (application/octet-stream)
From b7b8bf3d943ad3dda0bda683b6e58979696bf59d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sun, 28 Feb 2021 01:01:42 +1100
Subject: [PATCH v45] Fix apply worker empty prepare.

With unlucky timing of the apply and tablesync workers, it is possible to have a "consistent snapshot" that spans a prepare/commit in such a way that the tablesync worker did not do the prepare (because the snapshot was not consistent) and the apply worker does the prepare ('b') but skips all the prepared operations [e.g. inserts] while the tablesync worker was still busy (see the condition in should_apply_changes_for_rel). Later, at commit prepared time, when the apply worker does the commit prepared ('K'), nothing is committed (because the inserts were skipped earlier).

This patch implements a two-part fix as suggested [1] on hackers.

Part 1 - The begin_prepare handler of the apply worker will always wait for any busy tablesync workers to achieve SYNCDONE/READY state.

Part 2 - If (after Part 1) the apply-worker's prepare is found to be lagging behind any of the sync-workers then the subsequent prepared operations will be spooled to a file to be replayed at commit_prepared time.

Discussion:
[1] https://www.postgresql.org/message-id/CAA4eK1L%3DdhuCRvyDvrXX5wZgc7s1hLRD29CKCK6oaHtVCPgiFA%40mail.gmail.com
---
 src/backend/replication/logical/tablesync.c | 178 ++++++--
 src/backend/replication/logical/worker.c    | 673 +++++++++++++++++++++++++++-
 src/include/replication/worker_internal.h   |   3 +
 src/tools/pgindent/typedefs.list            |   1 +
 4 files changed, 817 insertions(+), 38 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..5f897d3 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have still not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress()
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When the process_syncing_tables_for_apply changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN()
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 5a0b806..c7c98c9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -171,7 +171,7 @@ bool		in_streamed_transaction = false;
 static TransactionId stream_xid = InvalidTransactionId;
 
 /* for skipping prepared transaction */
-bool        skip_prepared_txn = false;
+bool		skip_prepared_txn = false;
 
 /*
  * Hash table for storing the streaming xid information along with shared file
@@ -212,6 +212,45 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A memory context for the prepare spooling
+ */
+static MemoryContext PsfContext = NULL;
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and the
+ * corresponding fileset handles of same.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	SharedFileSet *fileset;		/* shared file set for prepare spoolfile */
+}			PsfHashEntry;
+
+/*
+ * Hash table for storing the Prepared spoolfile info along with shared fileset.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * The spoolfile handle is only valid between begin_prepare and prepare.
+ */
+static BufFile *psf_fd = NULL;
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_cleanup(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -745,6 +784,9 @@ apply_handle_begin_prepare(StringInfo s)
 {
 	LogicalRepBeginPrepareData begin_data;
 
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
 	logicalrep_read_begin_prepare(s, &begin_data);
 
 	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
@@ -759,6 +801,81 @@ apply_handle_begin_prepare(StringInfo s)
 		return;
 	}
 
+	/*
+	 * A Problem:
+	 *
+	 * By sad timing of apply/tablesync workers it is possible to have a
+	 * "consistent snapshot" that spans prepare/commit in such a way that
+	 * the tablesync did not do the prepare (because snapshot not consistent)
+	 * and the apply worker does the begin prepare ('b') but it skips all
+	 * the prepared operations [e.g. inserts] while the tablesync was still
+	 * busy (see the condition of should_apply_changes_for_rel). Later at the
+	 * commit prepared time when the apply worker does the commit prepare
+	 * ('K'), there is nothing in it (because the inserts were skipped
+	 * earlier).
+	 *
+	 * The following code has a 2-part workaround for that scenario.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Workaround Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Workaround Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN then all messages (until
+		 * prepare) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+			StringInfoData sid;
+
+			/* The begin_prepare's LSN has been overtaken. */
+
+			/*
+			 * We need a transaction for handling the buffile, used for
+			 * serializing prepared messages. This transaction lasts until the
+			 * commit_prepared/ rollback_prepared.
+			 */
+			ensure_transaction();
+
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * Write BEGIN_PREPARE as the first message of the psf file.
+			 */
+			initStringInfo(&sid);
+			appendBinaryStringInfo(&sid, (char *)&begin_data, sizeof(begin_data));
+			(void) prepare_spoolfile_handler(LOGICAL_REP_MSG_BEGIN_PREPARE, &sid);	/* not inside Assert(): must also run in non-assert builds */
+		}
+	}
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	in_remote_transaction = true;
@@ -774,6 +891,38 @@ apply_handle_prepare(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
 
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_fd)
+	{
+		/* Write the PREPARE info to the psf file. */
+		(void) prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);	/* not inside Assert(): must also run in non-assert builds */
+
+		/*
+		 * TODO - Flush the spoolfile so changes can survive a restart.
+		 */
+
+		/*
+		 * The psf_fd is meaningful only between begin_prepare and prepared.
+		 * So close it now. If we had been writing any messages to the psf_fd
+		 * (the spoolfile) then those will be applied later during
+		 * handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		/*
+		 * And end the transaction that was created by begin_prepare for
+		 * working with the psf buffiles.
+		 */
+		Assert(IsTransactionState());
+		CommitTransactionCommand();
+
+		in_remote_transaction = false;
+		return;
+	}
+
 	logicalrep_read_prepare(s, &prepare_data);
 
 	if (skip_prepared_txn)
@@ -835,6 +984,7 @@ static void
 apply_handle_commit_prepared(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
 
 	/*
 	 * We don't expect any other transaction data while skipping a prepared
@@ -844,6 +994,56 @@ apply_handle_commit_prepared(StringInfo s)
 
 	logicalrep_read_commit_prepared(s, &prepare_data);
 
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+		LogicalRepPreparedTxnData pdata;
+
+		/*
+		 * 1. replay the spooled messages
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, &pdata);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		prepare_spoolfile_cleanup(psfpath);
+
+		/*
+		 * 2. mark as PREPARED (use prepare_data info from the psf file)
+		 */
+
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = pdata.end_lsn;
+		replorigin_session_origin_timestamp = pdata.preparetime;
+
+		PrepareTransactionBlock(pdata.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(pdata.end_lsn);
+	}
+
 	/* there is no transaction when COMMIT PREPARED is called */
 	ensure_transaction();
 
@@ -874,6 +1074,8 @@ static void
 apply_handle_rollback_prepared(StringInfo s)
 {
 	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
 
 	/*
 	 * We don't expect any other transaction data while skipping a prepared
@@ -884,16 +1086,52 @@ apply_handle_rollback_prepared(StringInfo s)
 	logicalrep_read_rollback_prepared(s, &rollback_data);
 
 	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		if (psf_fd)
+		{
+			/*
+			 * XXX - For some reason (due to a bug?) it is currently
+			 * possible to get here, after a restart, when there was a
+			 * begin_prepare but there was NO prepare. Since there was no
+			 * prepare, the psf_fd and the transaction are still lingering so
+			 * they need to be cleaned up now.
+			 */
+			prepare_spoolfile_close();
+
+			/*
+			 * And end the transaction that was created by the begin_prepare
+			 * for working with psf buffiles.
+			 */
+			Assert(IsTransactionState());
+			AbortCurrentTransaction();
+		}
+
+		prepare_spoolfile_cleanup(psfpath);
+	}
+
+	/*
 	 * It is possible that we haven't received prepare because it occurred
 	 * before walsender reached a consistent point in which case we need to
 	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
 	 */
-	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
 	{
 		/*
-		 * Update origin state so we can restart streaming from correct position
-		 * in case of crash.
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
 		 */
 		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
 		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
@@ -931,6 +1169,9 @@ apply_handle_stream_prepare(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
 	/*
 	 * We don't expect any other transaction data while skipping a prepared
 	 * xact.
@@ -941,6 +1182,26 @@ apply_handle_stream_prepare(StringInfo s)
 	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
 
 	/*
+	 * Wait for all the sync workers to reach the SYNCDONE/READY state.
+	 *
+	 * This is the same waiting logic as in apply_handle_begin_prepare function
+	 * (see that function for more details about this).
+	 */
+	if (!am_tablesync_worker())
+	{
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(prepare_data.end_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+	}
+
+	/*
 	 *
 	 * --------------------------------------------------------------------------
 	 * 1. Replay all the spooled operations - Similar code as for
@@ -1007,6 +1268,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		psf_fd == NULL &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1468,6 +1730,9 @@ apply_handle_insert(StringInfo s)
 	if (skip_prepared_txn)
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1592,6 +1857,9 @@ apply_handle_update(StringInfo s)
 	if (skip_prepared_txn)
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1753,6 +2021,9 @@ apply_handle_delete(StringInfo s)
 	if (skip_prepared_txn)
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -2125,6 +2396,9 @@ apply_handle_truncate(StringInfo s)
 	if (skip_prepared_txn)
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -2418,6 +2692,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	}
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL		hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2433,6 +2724,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 													"LogicalStreamingContext",
 													ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used when the prepare spooling is used. It is
+	 * reset at prepare commit/rollback time.
+	 */
+	PsfContext = AllocSetContextCreate(ApplyContext,
+									   "PsfContext",
+									   ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -2537,7 +2836,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && psf_fd == NULL)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -3462,3 +3761,367 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_fd ? "Do" : "Don't");
+
+	if (psf_fd == NULL)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	MemoryContext oldctx;
+	bool		found;
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(psf_fd == NULL);
+
+	/* create or find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  &found);
+
+	/*
+	 * Create/open the bufFiles under the Prepare Spoolfile Context so that we
+	 * have those files until prepare commit/rollback.
+	 */
+	oldctx = MemoryContextSwitchTo(PsfContext);
+
+	if (!found)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		psf_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember this path's fileset for the next time */
+		memcpy(hentry->name, path, MAXPGPATH);
+		hentry->fileset = fileset;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the end of the file because we always
+		 * append to the changes file.
+		 */
+		elog(DEBUG1, "Found file \"%s\". Open for append.", path);
+		psf_fd = BufFileOpenShared(hentry->fileset, path, O_RDWR);
+		BufFileSeek(psf_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldctx);
+
+	/* Sanity check */
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close()
+{
+	if (psf_fd)
+		BufFileClose(psf_fd);
+	psf_fd = NULL;
+}
+
+/*
+ * Delete the specified psf spoolfile.
+ */
+static void
+prepare_spoolfile_cleanup(char *path)
+{
+	PsfHashEntry *hentry;
+
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* And remove the path entry from the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_REMOVE,
+										  NULL);
+
+	/* By this time we must have created the entry */
+	Assert(hentry != NULL);
+
+	/* Delete the file and release the fileset memory */
+	SharedFileSetDeleteAll(hentry->fileset);
+	pfree(hentry->fileset);
+	hentry->fileset = NULL;
+
+	/* Reset the memory context used during the file creation */
+	MemoryContextReset(PsfContext);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(psf_fd != NULL);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(psf_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(psf_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(psf_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool		found;
+
+	/* Find the prepare spoolfile entry in the psf_hash */
+	hash_search(psf_hash,
+				path,
+				HASH_FIND,
+				&found);
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ *
+ * [Note: this is similar to apply_spooled_messages function]
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	bool		found = false;
+	PsfHashEntry *hentry;
+	BufFile    *fd;
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* Open the spool file */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_FIND,
+										  &found);
+	Assert(found);
+	fd = BufFileOpenShared(hentry->fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/*
+		 * The psf spoolfile contents will have first and last messages as
+		 * BEGIN_PREPARE and PREPARE message respectively. These two are
+		 * processed specially within this function.
+		 *
+		 * BEGIN_PREPARE msg: This will be the first message of the psf file.
+		 * Use this begin_data information to set the remote_final_lsn.
+		 *
+		 * PREPARE msg: The prepare_data information is returned so that the
+		 * prepare lsn values are available to the caller (commit_prepared).
+		 * Unfortunately, just dispatching the PREPARE message is problematic
+		 * because its transaction commits have side effects on this replay
+		 * loop which is still running.
+		 *
+		 * All other message content (between the BEGIN_PREPARE and the PREPARE)
+		 * will be delivered to apply_dispatch as they normally would be.
+		 */
+		if (s2.data[0] == LOGICAL_REP_MSG_BEGIN_PREPARE)
+		{
+			LogicalRepBeginPrepareData bdata;
+
+			/* read/skip the action byte. */
+			(void) pq_getmsgbyte(&s2);	/* checked above; not inside Assert() so it also runs in non-assert builds */
+
+			/* read the begin_data. */
+			logicalrep_read_begin_prepare(&s2, &bdata);
+
+			elog(DEBUG1, "BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
+				 bdata.gid,
+				 LSN_FORMAT_ARGS(bdata.final_lsn),
+				 LSN_FORMAT_ARGS(bdata.end_lsn));
+
+			/*
+			 * Make sure the apply_dispatch handlers are aware we're in a remote
+			 * transaction.
+			 */
+			remote_final_lsn = bdata.final_lsn;
+			in_remote_transaction = true;
+			pgstat_report_activity(STATE_RUNNING, NULL);
+		}
+		else if (s2.data[0] == LOGICAL_REP_MSG_PREPARE)
+		{
+			/* read/skip the action byte. */
+			(void) pq_getmsgbyte(&s2);	/* checked above; not inside Assert() so it also runs in non-assert builds */
+
+			/* read and return the prepare_data info to the caller */
+			logicalrep_read_prepare(&s2, pdata);
+
+			elog(DEBUG1, "PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
+				 pdata->gid,
+				 LSN_FORMAT_ARGS(pdata->prepare_lsn),
+				 LSN_FORMAT_ARGS(pdata->end_lsn));
+		}
+		else
+		{
+			/* Ensure we are reading the data into our memory context. */
+			oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+			apply_dispatch(&s2);
+
+			MemoryContextReset(ApplyMessageContext);
+
+			MemoryContextSwitchTo(oldctx2);
+
+			nchanges++;
+
+			if (nchanges % 1000 == 0)
+				elog(DEBUG1, "replayed %d changes from file '%s'",
+					 nchanges, path);
+		}
+	}
+
+	BufFileClose(fd);
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly the same length and padded
+	 * with '\0' so garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "%u-%s.prep_changes", subid, gid);
+}
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 048681c..b7da03b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1958,6 +1958,7 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfContext
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1

v45-0008-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From 7a7270db5d9e9f51a1ab902c13585ec90a0f4669 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sun, 28 Feb 2021 01:25:26 +1100
Subject: [PATCH v45] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch only adds some developer logging which may help with debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 ++++++++++---
 src/backend/replication/logical/worker.c    | 65 +++++++++++++++++++++--------
 2 files changed, 71 insertions(+), 23 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 5f897d3..c77730a 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c7c98c9..a78d470 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -824,14 +824,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			process_syncing_tables(begin_data.final_lsn);
 
 			/* This latch is to prevent 100% CPU looping. */
@@ -849,7 +851,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 			StringInfoData sid;
@@ -1005,6 +1012,8 @@ apply_handle_commit_prepared(StringInfo s)
 		int			nchanges;
 		LogicalRepPreparedTxnData pdata;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * 1. replay the spooled messages
 		 */
@@ -1012,8 +1021,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, &pdata);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		prepare_spoolfile_cleanup(psfpath);
@@ -1125,6 +1134,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2557,18 +2567,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3775,8 +3789,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_fd ? "Do" : "Don't");
 
@@ -3801,7 +3815,7 @@ prepare_spoolfile_create(char *path)
 	bool		found;
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(psf_fd == NULL);
 
@@ -3822,7 +3836,7 @@ prepare_spoolfile_create(char *path)
 		MemoryContext savectx;
 		SharedFileSet *fileset;
 
-		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		savectx = MemoryContextSwitchTo(ApplyContext);
 		fileset = palloc(sizeof(SharedFileSet));
 
@@ -3841,7 +3855,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the end of the file because we always
 		 * append the changes file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Open for append.", path);
+		elog(LOG, "!!> Found file \"%s\". Open for append.", path);
 		psf_fd = BufFileOpenShared(hentry->fileset, path, O_RDWR);
 		BufFileSeek(psf_fd, 0, 0, SEEK_END);
 	}
@@ -3858,6 +3872,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_fd)
 		BufFileClose(psf_fd);
 	psf_fd = NULL;
@@ -3871,6 +3886,8 @@ prepare_spoolfile_cleanup(char *path)
 {
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_cleanup: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3906,20 +3923,23 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_fd != NULL);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	BufFileWrite(psf_fd, &len, sizeof(len));
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	BufFileWrite(psf_fd, &action, sizeof(action));
 
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	BufFileWrite(psf_fd, &s->data[s->cursor], len);
 }
 
@@ -3937,6 +3957,11 @@ prepare_spoolfile_exists(char *path)
 				HASH_FIND,
 				&found);
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
 	return found;
 }
 
@@ -3957,8 +3982,8 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 	PsfHashEntry *hentry;
 	BufFile    *fd;
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3994,6 +4019,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 
 		/* read length of the on-disk record */
 		nbytes = BufFileRead(fd, &len, sizeof(len));
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -4012,6 +4038,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		if (BufFileRead(fd, buffer, len) != len)
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -4043,13 +4070,15 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		{
 			LogicalRepBeginPrepareData bdata;
 
+			elog(LOG, "!!> prepare_spoolfile_replay_messages: Found the BEGIN_PREPARE info");
+
 			/* read/skip the action byte. */
 			Assert(pq_getmsgbyte(&s2) == LOGICAL_REP_MSG_BEGIN_PREPARE);
 
 			/* read the begin_data. */
 			logicalrep_read_begin_prepare(&s2, &bdata);
 
-			elog(DEBUG1, "BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
+			elog(LOG, "!!> BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
 				 bdata.gid,
 				 LSN_FORMAT_ARGS(bdata.final_lsn),
 				 LSN_FORMAT_ARGS(bdata.end_lsn));
@@ -4064,13 +4093,15 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		}
 		else if (s2.data[0] == LOGICAL_REP_MSG_PREPARE)
 		{
+			elog(LOG, "!!> prepare_spoolfile_replay_messages: Found the PREPARE info");
+
 			/* read/skip the action byte. */
 			Assert(pq_getmsgbyte(&s2) == LOGICAL_REP_MSG_PREPARE);
 
 			/* read and return the prepare_data info to the caller */
 			logicalrep_read_prepare(&s2, pdata);
 
-			elog(DEBUG1, "PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
+			elog(LOG, "!!> PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
 				 pdata->gid,
 				 LSN_FORMAT_ARGS(pdata->prepare_lsn),
 				 LSN_FORMAT_ARGS(pdata->end_lsn));
@@ -4089,7 +4120,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 			nchanges++;
 
 			if (nchanges % 1000 == 0)
-				elog(DEBUG1, "replayed %d changes from file '%s'",
+				elog(LOG, "!!> replayed %d changes from file '%s'",
 					 nchanges, path);
 		}
 	}
@@ -4099,7 +4130,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 	pfree(buffer);
 	pfree(s2.data);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
-- 
1.8.3.1

#208Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#207)
8 attachment(s)

Please find attached the latest patch set v46*

Differences from v45*

* Rebased to HEAD

* Patch v46-0003 is modified to be compatible with a recent push for
"avoiding repeated decoding of prepare" [1].

-----
[1]: https://github.com/postgres/postgres/commit/8bdb1332eb51837c15a10a972c179b84f654279e

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v46-0001-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From 5dbec6845dbe00db55a3f793c04aab72f21c7b85 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 1 Mar 2021 10:26:06 +1100
Subject: [PATCH v46] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
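
Reader's note (not part of the patch): the shape of the refactoring is a
shared helper that takes the xid and the final LSN, replays the spooled
changes, and returns how many it applied, so that more than one message
handler can call it. A minimal sketch of that shape; the types, the
n_spooled parameter, and handle_stream_prepare() are hypothetical stand-ins,
and the loop body only counts where the real code dispatches each message:

```c
#include <assert.h>

typedef unsigned int xid_t;			/* stand-in for TransactionId */
typedef unsigned long long lsn_t;	/* stand-in for XLogRecPtr */

static lsn_t remote_final_lsn;		/* mirrors the global the helper sets */

/* Shared helper: "replay" n_spooled changes and return the count. */
static int
apply_spooled(xid_t xid, lsn_t lsn, int n_spooled)
{
	int			nchanges = 0;

	(void) xid;					/* the real code uses it to find the file */
	remote_final_lsn = lsn;

	for (int i = 0; i < n_spooled; i++)
		nchanges++;				/* the real code dispatches a message here */

	return nchanges;
}

/* Existing caller: passes the commit LSN. */
static int
handle_stream_commit(xid_t xid, lsn_t commit_lsn, int n_spooled)
{
	return apply_spooled(xid, commit_lsn, n_spooled);
}

/* Hypothetical later caller: same helper, different LSN source. */
static int
handle_stream_prepare(xid_t xid, lsn_t prepare_lsn, int n_spooled)
{
	return apply_spooled(xid, prepare_lsn, n_spooled);
}
```
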
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..45ac498 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v46-0005-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 5e915b0e79ebb3272fa1e4a3eb4c4416b06a1cae Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 1 Mar 2021 12:34:02 +1100
Subject: [PATCH v46] Support 2PC txn - Subscription option.

This patch implements new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
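
Reader's note (not part of the patch): the option handling below follows two
steps. First, each option is parsed with a "given" flag so that a repeated
option is rejected as conflicting or redundant. Second, a requested
two_phase = on is refused when the protocol version is too old. A minimal
standalone sketch of that flow; the Option struct, the status codes,
parse_bool(), and the value of PROTO_2PC_VERSION are assumptions made for
illustration, not the PostgreSQL definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct
{
	const char *name;
	const char *value;
} Option;

typedef enum
{
	OPT_OK,
	OPT_ERR_DUPLICATE,			/* "conflicting or redundant options" */
	OPT_ERR_UNKNOWN,
	OPT_ERR_BAD_VALUE,
	OPT_ERR_PROTO_TOO_OLD
} OptStatus;

#define PROTO_2PC_VERSION 3		/* placeholder value for this sketch */

/* Accept a few common boolean spellings; returns false if unrecognized. */
static bool
parse_bool(const char *s, bool *out)
{
	if (strcmp(s, "on") == 0 || strcmp(s, "true") == 0 || strcmp(s, "1") == 0)
	{
		*out = true;
		return true;
	}
	if (strcmp(s, "off") == 0 || strcmp(s, "false") == 0 || strcmp(s, "0") == 0)
	{
		*out = false;
		return true;
	}
	return false;
}

static OptStatus
parse_options(const Option *opts, int nopts, int proto_version,
			  bool *two_phase)
{
	bool		two_phase_given = false;

	*two_phase = false;			/* default is off */

	for (int i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "two_phase") == 0)
		{
			if (two_phase_given)
				return OPT_ERR_DUPLICATE;
			two_phase_given = true;
			if (!parse_bool(opts[i].value, two_phase))
				return OPT_ERR_BAD_VALUE;
		}
		else
			return OPT_ERR_UNKNOWN;
	}

	/* Only an explicit "on" needs the newer protocol. */
	if (*two_phase && proto_version < PROTO_2PC_VERSION)
		return OPT_ERR_PROTO_TOO_OLD;

	return OPT_OK;
}
```

With no options supplied the sketch returns OPT_OK and leaves two_phase off
regardless of protocol version, which matches the "Default is off" behaviour
described above.
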
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 44 ++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 202 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..9c23497 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -184,8 +184,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd..55dd8da 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1167,7 +1167,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..a069c76 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -358,6 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +399,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +468,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -825,6 +844,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -835,7 +856,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -874,6 +896,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -892,7 +921,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +967,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1013,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..74787d1 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -427,6 +427,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4f6406e..9e0c2dc 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2702,6 +2702,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3348,6 +3349,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2bf1295..3a1b404 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 358b14a..eebedda 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..41e0d8c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..8d24b2e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,29 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..5c79dbd 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
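
For readers following the option validation at the top of this patch's pgoutput hunk, the two error paths can be sketched in standalone C. This is only an illustration under stated assumptions: TwoPhaseCheck and check_two_phase_request are made-up names, the constant mirrors the LOGICALREP_PROTO_2PC_VERSION_NUM define added above, the outer "two_phase was requested" condition is inferred from the error messages, and the real code reports failures through ereport() rather than returning a status.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the define this patch adds to logicalproto.h. */
#define LOGICALREP_PROTO_2PC_VERSION_NUM 2

/* Hypothetical result codes; pgoutput raises an ERROR instead. */
typedef enum TwoPhaseCheck
{
	TWO_PHASE_OK,
	TWO_PHASE_PROTO_TOO_OLD,
	TWO_PHASE_PLUGIN_UNSUPPORTED
} TwoPhaseCheck;

/*
 * Hypothetical helper: decide whether a two_phase request can be honored.
 * The protocol version is checked first, then the output plugin's support,
 * matching the order of the two ereport() calls in the patch.
 */
TwoPhaseCheck
check_two_phase_request(bool two_phase_requested, int proto_version,
						bool plugin_supports_twophase)
{
	assert(proto_version > 0);

	if (!two_phase_requested)
		return TWO_PHASE_OK;
	if (proto_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
		return TWO_PHASE_PROTO_TOO_OLD;
	if (!plugin_supports_twophase)
		return TWO_PHASE_PLUGIN_UNSUPPORTED;
	return TWO_PHASE_OK;
}
```

A subscriber that never asks for two_phase passes regardless of protocol version, which is why existing subscriptions are unaffected by the new check.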

Attachment: v46-0002-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From 7d1827e093ba838dc37a791cf760591ca7b83f70 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 1 Mar 2021 10:41:13 +1100
Subject: [PATCH v46] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replication origin replay progress for
2PC, but it was incomplete. It did not properly track progress for ROLLBACK
PREPARED; in particular, it did not update the recovery code. Additionally,
we need to allow tracking on subscriber nodes, where wal_level might not be
logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20..6023e7c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2276,6 +2276,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2298,6 +2306,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df..acdb28d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5716,8 +5716,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5923,7 +5922,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5972,6 +5972,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6013,7 +6020,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6021,7 +6029,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1
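
The redo change above passes "false /* backward */" to replorigin_advance(), so replay can only move an origin's recorded progress forward. A minimal standalone sketch of that rule follows. OriginProgress and origin_advance are hypothetical stand-ins, not the actual structures in origin.c, and LSNs are modeled as plain 64-bit integers with 0 meaning "invalid".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* simplified stand-in for XLogRecPtr */

/* Hypothetical per-origin progress record. */
typedef struct OriginProgress
{
	Lsn			remote_lsn;		/* how far we have replayed the remote node */
	Lsn			local_lsn;		/* local WAL position of that replay */
} OriginProgress;

/*
 * Advance the recorded progress.  Unless go_backward is set, older values
 * are ignored so that progress never moves backwards.
 */
void
origin_advance(OriginProgress *state, Lsn remote_commit, Lsn local_commit,
			   bool go_backward)
{
	assert(state != NULL);

	if (go_backward || state->remote_lsn < remote_commit)
		state->remote_lsn = remote_commit;
	if (local_commit != 0 &&
		(go_backward || state->local_lsn < local_commit))
		state->local_lsn = local_commit;
}
```

With go_backward = false, an abort-prepared record that carries an older origin LSN leaves the stored progress untouched instead of rewinding it.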

Attachment: v46-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 4594c098dc9c32f0ffad7a00f124325d34ba8136 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 1 Mar 2021 12:08:50 +1100
Subject: [PATCH v46] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time to built-in
logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  68 ++++++++
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 257 ++++++++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 245 ++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 177 +++++++++++++++----
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 +++++++-
 src/include/replication/reorderbuffer.h     |  12 ++
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 807 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and LSN is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match origin_lsn and origin_timestamp
+ * of the prepared xact to avoid matching a prepared xact that came from a
+ * different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same GID) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and other
+	 * transactions commit between the prepare and the commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..46c52e5 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,263 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 45ac498..4f6406e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -722,6 +723,230 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/*
+	 * The gid must not already be prepared.
+	 */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				errmsg("transaction identifier \"%s\" is already in use",
+					   begin_data.gid)));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received the prepare because it occurred
+	 * before the walsender reached a consistent point, in which case we need
+	 * to skip the rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1972,6 +2197,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2bf1295 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +845,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -870,6 +927,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1195,3 +1270,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..358b14a 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for both prepare and commit prepared. For commit
+ * prepared, prepare_lsn and preparetime store the commit LSN and the commit
+ * time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * (prepare_end_lsn and preparetime) is used to check whether the downstream
+ * has received this prepared transaction: if so, it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bab4f3a..048681c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
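
For anyone reading the protocol part of the patch above, here is a minimal standalone sketch of the PREPARE message layout as written by logicalrep_write_prepare(): a message type byte ('P'), a flags byte (currently 0), prepare_lsn, end_lsn and the prepare timestamp as 64-bit integers in network byte order, then the NUL-terminated gid. Everything prefixed demo_/DEMO_ is hypothetical and not part of the patch; DEMO_GIDSIZE is assumed to mirror GIDSIZE, and plain byte buffers stand in for StringInfo and the pq_* helpers. Unlike the real read function, the reader here also checks the type byte itself, and it rejects a gid that does not fit in the destination buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_GIDSIZE 200		/* assumption: mirrors GIDSIZE */
#define DEMO_HDRLEN  26			/* type + flags + 3 * 8-byte fields */

/* Hypothetical stand-in for LogicalRepPreparedTxnData. */
typedef struct DemoPrepareData
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		preparetime;
	char		gid[DEMO_GIDSIZE];
} DemoPrepareData;

/* Store a 64-bit value in network (big-endian) byte order. */
static void
demo_put_u64(uint8_t *buf, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t) (v >> (56 - 8 * i));
}

/* Load a 64-bit value stored in network byte order. */
static uint64_t
demo_get_u64(const uint8_t *buf)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Encode a PREPARE message; returns bytes written, or 0 if out is too small. */
size_t
demo_write_prepare(uint8_t *out, size_t cap, const DemoPrepareData *d)
{
	size_t		gidlen;

	assert(out != NULL && d != NULL);
	gidlen = strlen(d->gid) + 1;	/* include the terminator */
	if (DEMO_HDRLEN + gidlen > cap)
		return 0;

	out[0] = 'P';				/* message type */
	out[1] = 0;					/* flags, currently always zero */
	demo_put_u64(out + 2, d->prepare_lsn);
	demo_put_u64(out + 10, d->end_lsn);
	demo_put_u64(out + 18, (uint64_t) d->preparetime);
	memcpy(out + DEMO_HDRLEN, d->gid, gidlen);
	return DEMO_HDRLEN + gidlen;
}

/* Decode a PREPARE message; returns 0 on success, -1 on a malformed message. */
int
demo_read_prepare(const uint8_t *in, size_t len, DemoPrepareData *d)
{
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	assert(in != NULL && d != NULL);
	if (len < DEMO_HDRLEN + 1)
		return -1;				/* too short to hold even an empty gid */
	if (in[0] != 'P' || in[1] != 0)
		return -1;				/* wrong type or unrecognized flags */

	d->prepare_lsn = demo_get_u64(in + 2);
	d->end_lsn = demo_get_u64(in + 10);
	d->preparetime = (int64_t) demo_get_u64(in + 18);

	gid = in + DEMO_HDRLEN;
	nul = memchr(gid, '\0', len - DEMO_HDRLEN);
	if (nul == NULL)
		return -1;				/* gid is not terminated inside the message */
	gidlen = (size_t) (nul - gid);
	if (gidlen >= DEMO_GIDSIZE)
		return -1;				/* gid would overflow the destination */
	memcpy(d->gid, gid, gidlen + 1);
	return 0;
}
```

A caller would fill a DemoPrepareData, pass it to demo_write_prepare() with a sufficiently large buffer, and hand the resulting bytes to demo_read_prepare(). The COMMIT PREPARED message uses the same field order with a different type byte, so the same sketch would apply with that one change.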

v46-0004-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 5b8446141ea3f3056b6b9fdea37e5d954ad76690 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 1 Mar 2021 12:19:24 +1100
Subject: [PATCH v46] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and not streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After the servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After the servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
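
A side note on the expected counts in 023_twophase_cascade_stream.pl, since the value 3334 is not obvious at a glance: the table starts with rows 1 and 2, the prepared transaction inserts 3..5000, and the DELETE removes every row where mod(a,3) = 0 (the UPDATE does not change the row count). The standalone C helper below is only an illustrative check of that arithmetic; it is not part of the patch and the function name is made up here.

```c
#include <stdio.h>

/*
 * Count the keys in [lo, hi] that survive
 * "DELETE FROM test_tab WHERE mod(a,3) = 0".
 * Hypothetical helper, used only to sanity-check the expected test result.
 */
static int
rows_after_delete(int lo, int hi)
{
	int			remaining = 0;

	for (int a = lo; a <= hi; a++)
	{
		if (a % 3 != 0)
			remaining++;
	}
	return remaining;
}

/* Print the count for keys 1..5000 (2 pre-existing rows plus 3..5000). */
static void
print_expected_count(void)
{
	printf("%d\n", rows_after_delete(1, 5000));
}
```

Calling print_expected_count() should print 3334, matching the qq(3334|3334|3334) checks after COMMIT PREPARED. In the savepoint test, the row with a = 9999 survives even though 9999 is divisible by 3, because the DELETE was issued after the SAVEPOINT and is rolled back with it.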

v46-0006-Tablesync-early-exit.patch (application/octet-stream)
From 68d00873f08f9745b6b53c2d12a150e71d0c42ee Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 1 Mar 2021 12:45:09 +1100
Subject: [PATCH v46] Tablesync early exit.

Give the tablesync worker an opportunity to see if it can exit immediately
(because it has already caught-up) without it needing to process a message
first before discovering that.
---
 src/backend/replication/logical/worker.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9e0c2dc..ad537f4 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2323,6 +2323,16 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	bool		ping_sent = false;
 	TimeLineID	tli;
 
+	if (am_tablesync_worker())
+	{
+		/*
+		 * Give the tablesync worker an opportunity to see if it can exit
+		 * immediately, instead of always handling a message (which maybe the
+		 * apply worker could have handled).
+		 */
+		process_syncing_tables(last_received);
+	}
+
 	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
-- 
1.8.3.1
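
To illustrate the effect of the early-exit check above, here is a small standalone model in C. It is only a sketch under simplifying assumptions: SyncWorker, caught_up() and apply_loop() are hypothetical names invented for this note, Lsn stands in for XLogRecPtr, and the real process_syncing_tables() does considerably more than compare two LSNs.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */

/* Hypothetical model: a sync worker may stop once it reaches target_lsn. */
typedef struct SyncWorker
{
	Lsn			target_lsn;
} SyncWorker;

static bool
caught_up(const SyncWorker *w, Lsn last_received)
{
	return last_received >= w->target_lsn;
}

/*
 * Model of the apply loop. Returns how many messages were handled before
 * the worker decided it could exit. With check_first = true the worker
 * tests the exit condition once before touching any message, which is the
 * behaviour the patch adds for tablesync workers.
 */
static int
apply_loop(const SyncWorker *w, Lsn last_received,
		   const Lsn *msgs, int nmsgs, bool check_first)
{
	int			handled = 0;

	if (check_first && caught_up(w, last_received))
		return 0;

	for (int i = 0; i < nmsgs; i++)
	{
		last_received = msgs[i];	/* "handle" the message */
		handled++;
		if (caught_up(w, last_received))
			break;
	}
	return handled;
}
```

In this model, a worker that starts with last_received already at its target handles zero messages when check_first is true, and one message when it is false. That one extra message is what the patch tries to leave for the apply worker.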

v46-0007-Fix-apply-worker-empty-prepare.patch (application/octet-stream)
From 842f12a3293ebfb33b961dede5505d827b32e1a9 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 1 Mar 2021 15:04:18 +1100
Subject: [PATCH v46] Fix apply worker empty prepare.

With unlucky timing of the apply and tablesync workers, it is possible for the "consistent snapshot" point to fall between a prepare and its commit in such a way that the tablesync worker does not do the prepare (because the snapshot was not yet consistent), while the apply worker does the prepare ('b') but skips all the prepared operations [e.g. inserts] because the tablesync was still busy (see the condition in should_apply_changes_for_rel). Later, when the apply worker does the commit prepared ('K'), nothing is committed (because the inserts were skipped earlier).

This patch implements a two-part fix as suggested [1] on hackers.

Part 1 - The begin_prepare handler of apply will always wait for any busy tablesync workers to achieve SYNCDONE/READY state.

Part 2 - If (after Part 1) the apply-worker's prepare is found to be lagging behind any of the sync-workers then the subsequent prepared operations will be spooled to a file to be replayed at commit_prepared time.

Discussion:
[1] https://www.postgresql.org/message-id/CAA4eK1L%3DdhuCRvyDvrXX5wZgc7s1hLRD29CKCK6oaHtVCPgiFA%40mail.gmail.com
---
 src/backend/replication/logical/tablesync.c | 178 ++++++--
 src/backend/replication/logical/worker.c    | 668 +++++++++++++++++++++++++++-
 src/include/replication/worker_internal.h   |   3 +
 src/tools/pgindent/typedefs.list            |   1 +
 4 files changed, 815 insertions(+), 35 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..5f897d3 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have still not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When the process_syncing_tables_for_apply changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ad537f4..fcf7bc2 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -209,6 +209,45 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A context for the prepare spooling
+ */
+static MemoryContext PsfContext = NULL;
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and the
+ * corresponding fileset handles of same.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	SharedFileSet *fileset;		/* shared file set for prepare spoolfile */
+}			PsfHashEntry;
+
+/*
+ * Hash table for storing the Prepared spoolfile info along with shared fileset.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * The spoolfile handle is only valid between begin_prepare and prepare.
+ */
+static BufFile *psf_fd = NULL;
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_cleanup(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -730,6 +769,9 @@ apply_handle_begin_prepare(StringInfo s)
 {
 	LogicalRepBeginPrepareData begin_data;
 
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
 	logicalrep_read_begin_prepare(s, &begin_data);
 
 	/*
@@ -741,6 +783,81 @@ apply_handle_begin_prepare(StringInfo s)
 				errmsg("transaction identifier \"%s\" is already in use",
 					   begin_data.gid)));
 
+	/*
+	 * A Problem:
+	 *
+	 * With unlucky timing of the apply/tablesync workers it is possible to
+	 * have a "consistent snapshot" that spans prepare/commit in such a way
+	 * that the tablesync did not do the prepare (because the snapshot was not
+	 * yet consistent) and the apply worker does the begin prepare ('b') but
+	 * skips all the prepared operations [e.g. inserts] while the tablesync
+	 * was still busy (see the condition of should_apply_changes_for_rel).
+	 * Later, at commit prepared time, when the apply worker does the commit
+	 * prepared ('K'), there is nothing in it (because the inserts were
+	 * skipped earlier).
+	 *
+	 * The following code has a 2-part workaround for that scenario.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Workaround Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Workaround Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN then all messages (until
+		 * prepared) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+			StringInfoData sid;
+
+			/* The begin_prepare's LSN has been overtaken. */
+
+			/*
+			 * We need a transaction for handling the buffile, used for
+			 * serializing prepared messages. This transaction lasts until
+			 * apply_handle_prepare (or rollback_prepared) ends it.
+			 */
+			ensure_transaction();
+
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * Write BEGIN_PREPARE as the first message of the psf file.
+			 */
+			initStringInfo(&sid);
+			appendBinaryStringInfo(&sid, (char *) &begin_data, sizeof(begin_data));
+			(void) prepare_spoolfile_handler(LOGICAL_REP_MSG_BEGIN_PREPARE, &sid);
+		}
+	}
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	in_remote_transaction = true;
@@ -756,6 +873,38 @@ apply_handle_prepare(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
 
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_fd)
+	{
+		/* Write the PREPARE info to the psf file. */
+		(void) prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * TODO - Flush the spoolfile so changes can survive a restart.
+		 */
+
+		/*
+		 * The psf_fd is meaningful only between begin_prepare and prepare.
+		 * So close it now. If we had been writing any messages to the psf_fd
+		 * (the spoolfile) then those will be applied later during
+		 * apply_handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		/*
+		 * And end the transaction that was created by begin_prepare for
+		 * working with the psf buffiles.
+		 */
+		Assert(IsTransactionState());
+		CommitTransactionCommand();
+
+		in_remote_transaction = false;
+		return;
+	}
+
 	logicalrep_read_prepare(s, &prepare_data);
 
 	Assert(prepare_data.prepare_lsn == remote_final_lsn);
@@ -805,9 +954,61 @@ static void
 apply_handle_commit_prepared(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
+
 
 	logicalrep_read_commit_prepared(s, &prepare_data);
 
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+		LogicalRepPreparedTxnData pdata;
+
+		/*
+		 * 1. replay the spooled messages
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, &pdata);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		prepare_spoolfile_cleanup(psfpath);
+
+		/*
+		 * 2. mark as PREPARED (use prepare_data info from the psf file)
+		 */
+
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = pdata.end_lsn;
+		replorigin_session_origin_timestamp = pdata.preparetime;
+
+		PrepareTransactionBlock(pdata.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(pdata.end_lsn);
+	}
+
 	/* there is no transaction when COMMIT PREPARED is called */
 	ensure_transaction();
 
@@ -838,15 +1039,53 @@ static void
 apply_handle_rollback_prepared(StringInfo s)
 {
 	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
 
 	logicalrep_read_rollback_prepared(s, &rollback_data);
 
 	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		if (psf_fd)
+		{
+			/*
+			 * XXX - For some reason (due to a bug?) it is currently possible
+			 * to get here, after a restart, when there was a begin_prepare
+			 * but there was NO prepare. Since there was no prepare, the
+			 * psf_fd and the transaction are still lingering so they need to
+			 * be cleaned up now.
+			 */
+			prepare_spoolfile_close();
+
+			/*
+			 * And end the transaction that was created by the begin_prepare
+			 * for working with psf buffiles.
+			 */
+			Assert(IsTransactionState());
+			AbortCurrentTransaction();
+		}
+
+		prepare_spoolfile_cleanup(psfpath);
+	}
+
+	/*
 	 * It is possible that we haven't received prepare because it occurred
 	 * before walsender reached a consistent point in which case we need to
 	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
 	 */
-	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
 	{
 		/*
@@ -889,10 +1128,33 @@ apply_handle_stream_prepare(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
 	xid = logicalrep_read_stream_prepare(s, &prepare_data);
 	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
 
 	/*
+	 * Wait for all the sync workers to reach the SYNCDONE/READY state.
+	 *
+	 * This is the same waiting logic as in apply_handle_begin_prepare function
+	 * (see that function for more details about this).
+	 */
+	if (!am_tablesync_worker())
+	{
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(prepare_data.end_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+	}
+
+	/*
 	 *
 	 * --------------------------------------------------------------------------
 	 * 1. Replay all the spooled operations - Similar code as for
@@ -959,6 +1221,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		psf_fd == NULL &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1393,6 +1656,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1514,6 +1780,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1672,6 +1941,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -2041,6 +2313,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -2334,6 +2609,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	}
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL		hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2349,6 +2641,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 													"LogicalStreamingContext",
 													ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used when the prepare spooling is used. It is
+	 * reset at prepare commit/rollback time.
+	 */
+	PsfContext = AllocSetContextCreate(ApplyContext,
+									   "PsfContext",
+									   ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -2453,7 +2753,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && psf_fd == NULL)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -3378,3 +3678,367 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_fd ? "Do" : "Don't");
+
+	if (psf_fd == NULL)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	MemoryContext oldctx;
+	bool		found;
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(psf_fd == NULL);
+
+	/* create or find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  &found);
+
+	/*
+	 * Create/open the bufFiles under the Prepare Spoolfile Context so that we
+	 * have those files until prepare commit/rollback.
+	 */
+	oldctx = MemoryContextSwitchTo(PsfContext);
+
+	if (!found)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		psf_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember this path's fileset for the next time */
+		memcpy(hentry->name, path, MAXPGPATH);
+		hentry->fileset = fileset;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the end of the file because we always
+		 * append to the changes file.
+		 */
+		elog(DEBUG1, "Found file \"%s\". Open for append.", path);
+		psf_fd = BufFileOpenShared(hentry->fileset, path, O_RDWR);
+		BufFileSeek(psf_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldctx);
+
+	/* Sanity check */
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close(void)
+{
+	if (psf_fd)
+		BufFileClose(psf_fd);
+	psf_fd = NULL;
+}
+
+/*
+ * Delete the specified psf spoolfile.
+ */
+static void
+prepare_spoolfile_cleanup(char *path)
+{
+	PsfHashEntry *hentry;
+
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Find the path entry; it is removed only after its fileset is freed */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_FIND,
+										  NULL);
+
+	/* By this time we must have created the entry */
+	Assert(hentry != NULL);
+
+	/* Delete the file and release the fileset memory */
+	SharedFileSetDeleteAll(hentry->fileset);
+	pfree(hentry->fileset);
+	(void) hash_search(psf_hash, path, HASH_REMOVE, NULL);
+
+	/* Reset the memory context used during the file creation */
+	MemoryContextReset(PsfContext);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents.
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(psf_fd != NULL);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(psf_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(psf_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(psf_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool		found;
+
+	/* Find the prepare spoolfile entry in the psf_hash */
+	hash_search(psf_hash,
+				path,
+				HASH_FIND,
+				&found);
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ *
+ * [Note: this is similar to apply_spooled_messages function]
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	bool		found = false;
+	PsfHashEntry *hentry;
+	BufFile    *fd;
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* Open the spool file */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_FIND,
+										  &found);
+	Assert(found);
+	fd = BufFileOpenShared(hentry->fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have a sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/*
+		 * The psf spoolfile contents will have first and last messages as
+		 * BEGIN_PREPARE and PREPARE message respectively. These two are
+		 * processed specially within this function.
+		 *
+		 * BEGIN_PREPARE msg: This will be the first message of the psf file.
+		 * Use this begin_data information to set the remote_final_lsn.
+		 *
+		 * PREPARE msg: The prepare_data information is returned so that the
+		 * prepare lsn values are available to the caller (commit_prepared).
+		 * Unfortunately, just dispatching the PREPARE message is problematic
+		 * because its transaction commits have side effects on this replay
+		 * loop which is still running.
+		 *
+		 * All other message content (between the BEGIN_PREPARE and the
+		 * PREPARE) will be delivered to apply_dispatch as it normally would be.
+		 */
+		if (s2.data[0] == LOGICAL_REP_MSG_BEGIN_PREPARE)
+		{
+			LogicalRepBeginPrepareData bdata;
+
+			/* read/skip the action byte. */
+			(void) pq_getmsgbyte(&s2);	/* action byte was checked above */
+
+			/* read the begin_data. */
+			logicalrep_read_begin_prepare(&s2, &bdata);
+
+			elog(DEBUG1, "BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
+				 bdata.gid,
+				 LSN_FORMAT_ARGS(bdata.final_lsn),
+				 LSN_FORMAT_ARGS(bdata.end_lsn));
+
+			/*
+			 * Make sure the apply_dispatch handler methods are aware we're in
+			 * a remote transaction.
+			 */
+			remote_final_lsn = bdata.final_lsn;
+			in_remote_transaction = true;
+			pgstat_report_activity(STATE_RUNNING, NULL);
+		}
+		else if (s2.data[0] == LOGICAL_REP_MSG_PREPARE)
+		{
+			/* read/skip the action byte. */
+			(void) pq_getmsgbyte(&s2);	/* action byte was checked above */
+
+			/* read and return the prepare_data info to the caller */
+			logicalrep_read_prepare(&s2, pdata);
+
+			elog(DEBUG1, "PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
+				 pdata->gid,
+				 LSN_FORMAT_ARGS(pdata->prepare_lsn),
+				 LSN_FORMAT_ARGS(pdata->end_lsn));
+		}
+		else
+		{
+			/* Ensure we are reading the data into our memory context. */
+			oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+			apply_dispatch(&s2);
+
+			MemoryContextReset(ApplyMessageContext);
+
+			MemoryContextSwitchTo(oldctx2);
+
+			nchanges++;
+
+			if (nchanges % 1000 == 0)
+				elog(DEBUG1, "replayed %d changes from file '%s'",
+					 nchanges, path);
+		}
+	}
+
+	BufFileClose(fd);
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly same lengths and padded
+	 * with '\0' so garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "%u-%s.prep_changes", subid, gid);
+}
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 048681c..b7da03b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1958,6 +1958,7 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
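
For readers following the spoolfile format used by prepare_spoolfile_write and prepare_spoolfile_replay_messages above, here is a minimal standalone sketch of the same record layout: an int length (which counts the action byte but not itself), then the action byte, then the message payload. It is only an illustration under stated assumptions: it uses plain stdio instead of the BufFile API, and spool_write, spool_read and spool_roundtrip are hypothetical names that do not exist in the patch. The action codes 'b', 'I' and 'P' are used here purely as sample bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for prepare_spoolfile_write: length, action, payload. */
static void
spool_write(FILE *fp, char action, const char *payload, int payload_len)
{
	/* on-disk length covers the action byte plus the payload */
	int			len = payload_len + (int) sizeof(char);

	fwrite(&len, sizeof(len), 1, fp);
	fwrite(&action, sizeof(action), 1, fp);
	fwrite(payload, 1, (size_t) payload_len, fp);
}

/*
 * Hypothetical stand-in for one iteration of the replay loop.
 * Returns 1 when a record was read, 0 on clean end of file, -1 on a short
 * read. On success *buf holds the action byte followed by the payload.
 */
static int
spool_read(FILE *fp, char **buf, int *len)
{
	size_t		nbytes = fread(len, 1, sizeof(*len), fp);
	char	   *tmp;

	if (nbytes == 0)
		return 0;				/* end of file */
	if (nbytes != sizeof(*len) || *len <= 0)
		return -1;				/* truncated or corrupt length */

	tmp = realloc(*buf, (size_t) *len);
	if (tmp == NULL)
		return -1;
	*buf = tmp;

	if (fread(*buf, 1, (size_t) *len, fp) != (size_t) *len)
		return -1;				/* truncated record */
	return 1;
}

/* Write three sample records, then read back their action bytes in order. */
static int
spool_roundtrip(char *actions_out, int max)
{
	FILE	   *fp = tmpfile();
	char	   *buf = NULL;
	int			len;
	int			n = 0;

	if (fp == NULL)
		return -1;

	spool_write(fp, 'b', "gid-1", 5);
	spool_write(fp, 'I', "row", 3);
	spool_write(fp, 'P', "gid-1", 5);
	rewind(fp);

	while (n < max && spool_read(fp, &buf, &len) == 1)
		actions_out[n++] = buf[0];	/* first byte of a record is the action */

	free(buf);
	fclose(fp);
	return n;
}
```

Calling spool_roundtrip with a small char array should hand back the three action bytes in the order they were written, which is the property the replay loop relies on when it treats the first and last records specially.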

Attachment: v46-0008-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From df47a9fac637ee6067371a2deef111e656c1e8ff Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 1 Mar 2021 15:16:54 +1100
Subject: [PATCH v46] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch only adds some developer logging which may help with debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 ++++++++++---
 src/backend/replication/logical/worker.c    | 65 +++++++++++++++++++++--------
 2 files changed, 71 insertions(+), 23 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 5f897d3..c77730a 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fcf7bc2..348a54d 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -806,14 +806,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			process_syncing_tables(begin_data.final_lsn);
 
 			/* This latch is to prevent 100% CPU looping. */
@@ -831,7 +833,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 			StringInfoData sid;
@@ -970,6 +977,8 @@ apply_handle_commit_prepared(StringInfo s)
 		int			nchanges;
 		LogicalRepPreparedTxnData pdata;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * 1. replay the spooled messages
 		 */
@@ -977,8 +986,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, &pdata);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		prepare_spoolfile_cleanup(psfpath);
@@ -1084,6 +1093,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2474,18 +2484,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3692,8 +3706,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_fd ? "Do" : "Don't");
 
@@ -3718,7 +3732,7 @@ prepare_spoolfile_create(char *path)
 	bool		found;
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(psf_fd == NULL);
 
@@ -3739,7 +3753,7 @@ prepare_spoolfile_create(char *path)
 		MemoryContext savectx;
 		SharedFileSet *fileset;
 
-		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		savectx = MemoryContextSwitchTo(ApplyContext);
 		fileset = palloc(sizeof(SharedFileSet));
 
@@ -3758,7 +3772,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the end of the file because we always
 		 * append the changes file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Open for append.", path);
+		elog(LOG, "!!> Found file \"%s\". Open for append.", path);
 		psf_fd = BufFileOpenShared(hentry->fileset, path, O_RDWR);
 		BufFileSeek(psf_fd, 0, 0, SEEK_END);
 	}
@@ -3775,6 +3789,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_fd)
 		BufFileClose(psf_fd);
 	psf_fd = NULL;
@@ -3788,6 +3803,8 @@ prepare_spoolfile_cleanup(char *path)
 {
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_cleanup: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3823,20 +3840,23 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_fd != NULL);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	BufFileWrite(psf_fd, &len, sizeof(len));
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	BufFileWrite(psf_fd, &action, sizeof(action));
 
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	BufFileWrite(psf_fd, &s->data[s->cursor], len);
 }
 
@@ -3854,6 +3874,11 @@ prepare_spoolfile_exists(char *path)
 				HASH_FIND,
 				&found);
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
 	return found;
 }
 
@@ -3874,8 +3899,8 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 	PsfHashEntry *hentry;
 	BufFile    *fd;
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3911,6 +3936,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 
 		/* read length of the on-disk record */
 		nbytes = BufFileRead(fd, &len, sizeof(len));
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3929,6 +3955,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		if (BufFileRead(fd, buffer, len) != len)
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -3960,13 +3987,15 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		{
 			LogicalRepBeginPrepareData bdata;
 
+			elog(LOG, "!!> prepare_spoolfile_replay_messages: Found the BEGIN_PREPARE info");
+
 			/* read/skip the action byte. */
 			Assert(pq_getmsgbyte(&s2) == LOGICAL_REP_MSG_BEGIN_PREPARE);
 
 			/* read the begin_data. */
 			logicalrep_read_begin_prepare(&s2, &bdata);
 
-			elog(DEBUG1, "BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
+			elog(LOG, "!!> BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
 				 bdata.gid,
 				 LSN_FORMAT_ARGS(bdata.final_lsn),
 				 LSN_FORMAT_ARGS(bdata.end_lsn));
@@ -3981,13 +4010,15 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		}
 		else if (s2.data[0] == LOGICAL_REP_MSG_PREPARE)
 		{
+			elog(LOG, "!!> prepare_spoolfile_replay_messages: Found the PREPARE info");
+
 			/* read/skip the action byte. */
 			Assert(pq_getmsgbyte(&s2) == LOGICAL_REP_MSG_PREPARE);
 
 			/* read and return the prepare_data info to the caller */
 			logicalrep_read_prepare(&s2, pdata);
 
-			elog(DEBUG1, "PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
+			elog(LOG, "!!> PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
 				 pdata->gid,
 				 LSN_FORMAT_ARGS(pdata->prepare_lsn),
 				 LSN_FORMAT_ARGS(pdata->end_lsn));
@@ -4006,7 +4037,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 			nchanges++;
 
 			if (nchanges % 1000 == 0)
-				elog(DEBUG1, "replayed %d changes from file '%s'",
+				elog(LOG, "!!> replayed %d changes from file '%s'",
 					 nchanges, path);
 		}
 	}
@@ -4016,7 +4047,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 	pfree(buffer);
 	pfree(s2.data);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
-- 
1.8.3.1

#209Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#203)

On Fri, Feb 26, 2021 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Feb 26, 2021 at 9:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Feb 25, 2021 at 12:32 PM Peter Smith <smithpb2250@gmail.com> wrote:

5. You need to write/sync the spool file at prepare time because, after a
restart between prepare and commit prepared, the changes can be lost
and won't be resent by the publisher, assuming there are commits of
other transactions between prepare and commit prepared. For the same
reason, I am not sure we can rely on just the in-memory hash table
for it (prepare_spoolfile_exists). Sure, if it exists and there is no
restart then it would be cheap to check in the hash table, but I don't
think it is guaranteed.

As we can't rely on the hash table, I think we can get rid of it and
always check if the corresponding file exists.
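
To illustrate the idea of asking the filesystem rather than the hash
table, here is a minimal sketch in plain POSIX C. The helper name
spoolfile_exists_on_disk is hypothetical and is not part of the patch;
the real code would presumably go through the BufFile/SharedFileSet
layer instead of calling stat() directly.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not from the patch): report whether a spool file
 * is present on disk. Unlike an in-memory hash lookup, the answer stays
 * valid across a restart of the process.
 */
static bool
spoolfile_exists_on_disk(const char *path)
{
    struct stat st;

    assert(path != NULL);

    if (stat(path, &st) == 0)
        return S_ISREG(st.st_mode);     /* present and a regular file */

    /* ENOENT just means "no such file"; report anything else. */
    if (errno != ENOENT)
        perror("stat");

    return false;
}
```

The cost is one system call per check, which should be small compared
with replaying the spooled changes.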

A few more related points:
====================
1. Currently, the patch will always clean up the files if there is an
error because SharedFileSetInit registers the cleanup function.
However, we want the files to be removed only if any error happens
before flushing prepare. Once prepare is flushed, we expect the file
will be cleaned up by commit prepared. So, we probably need to call
SharedFileSetUnregister after prepare has been flushed to the file.

2. The other point is that I think we need to drop these files (if
any) on Drop Subscription. Investigate if any variant of Alter needs
similar handling.
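
The lifecycle in point 1 could be sketched roughly as below, in plain
POSIX C. Everything here (the SpoolFile struct, spool_open,
spool_flush_at_prepare, spool_error_cleanup) is a hypothetical stand-in,
not the SharedFileSet API: the remove_on_error flag plays the role of the
registered cleanup, and clearing it plays the role of the unregister
call.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical spool-file handle, for illustration only. */
typedef struct SpoolFile
{
    const char *path;
    FILE       *fp;
    bool        remove_on_error;    /* stands in for a registered cleanup */
} SpoolFile;

/* Create the file; until prepare is flushed, an error removes it. */
static bool
spool_open(SpoolFile *sf, const char *path)
{
    sf->path = path;
    sf->fp = fopen(path, "w");
    sf->remove_on_error = true;
    return sf->fp != NULL;
}

/* At prepare time: force the data to disk, then disarm the cleanup. */
static bool
spool_flush_at_prepare(SpoolFile *sf)
{
    if (fflush(sf->fp) != 0)
        return false;
    if (fsync(fileno(sf->fp)) != 0)
        return false;
    sf->remove_on_error = false;    /* commit prepared owns the file now */
    return true;
}

/* Error path: close the file, and drop it only if prepare was not flushed. */
static void
spool_error_cleanup(SpoolFile *sf)
{
    if (sf->fp != NULL)
    {
        fclose(sf->fp);
        sf->fp = NULL;
    }
    if (sf->remove_on_error)
        remove(sf->path);
}
```

With this shape, an error before the flush leaves nothing behind, while
an error after it leaves the file in place for commit prepared (or a
subscription drop) to remove.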

--
With Regards,
Amit Kapila.

#210Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#208)
9 attachment(s)

Please find attached the latest patch set v47

Differences from v46

* Rebased to HEAD

* New patch v47-0004 changes the CREATE_REPLICATION_SLOT command to
accept an option specifying whether two-phase is to be enabled.
With this patch, two-phase is enabled by default when logical
replication slots are created.

* patch v47-0006 (prev. v46-0005) modified to enable two-phase only
when the subscription is created using that option.
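
For readers following along, here is a rough sketch of how the command
string for the new option is put together, mirroring the keyword order
that v47-0004 uses (slot name, optional TEMPORARY, optional TWO_PHASE,
then LOGICAL and the plugin). The function build_create_slot_cmd is a
hypothetical simplification using snprintf; the patch itself appends to
a StringInfo and also adds the snapshot options, which are left out
here, as is any quoting of the slot name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a CREATE_REPLICATION_SLOT command for a
 * logical slot. Returns the snprintf result, i.e. the length the full
 * command needs (excluding the terminating NUL).
 */
static int
build_create_slot_cmd(char *buf, size_t buflen, const char *slotname,
                      bool temporary, bool two_phase, const char *plugin)
{
    assert(buf != NULL && slotname != NULL && plugin != NULL);

    return snprintf(buf, buflen,
                    "CREATE_REPLICATION_SLOT \"%s\"%s%s LOGICAL %s",
                    slotname,
                    temporary ? " TEMPORARY" : "",
                    two_phase ? " TWO_PHASE" : "",
                    plugin);
}
```

Calling it with slotname "sub1", temporary = false, two_phase = true and
plugin "pgoutput" yields
CREATE_REPLICATION_SLOT "sub1" TWO_PHASE LOGICAL pgoutput.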

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v47-0001-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From 92eec7fc06961ddb0a6b49ad17029ce4e22612f6 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 3 Mar 2021 02:58:37 -0500
Subject: [PATCH v47] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..45ac498 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v47-0002-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From 0d6231e2b4ae5c721c7beed1a29da9869742eba8 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 3 Mar 2021 03:00:15 -0500
Subject: [PATCH v47] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking replica origin replay progress for 2PC,
but it was not complete. It fails to properly track the progress for
rollback prepared; in particular, it missed updating the code for recovery.
Additionally, we need to allow tracking it on subscriber nodes where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20..6023e7c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2276,6 +2276,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2298,6 +2306,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df..acdb28d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5716,8 +5716,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5923,7 +5922,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5972,6 +5972,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6013,7 +6020,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6021,7 +6029,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1

v47-0004-Add-two_phase-option-to-CREATE-REPLICATION-SLOT.patch (application/octet-stream)
From 6d8c10d48ab2ac7dd9fea9a34f348202bac12e21 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 3 Mar 2021 04:58:18 -0500
Subject: [PATCH v47] Add two_phase option to CREATE REPLICATION SLOT.

This patch adds a new option to enable two_phase while creating a slot.
---
 src/backend/commands/subscriptioncmds.c                    |  2 +-
 .../replication/libpqwalreceiver/libpqwalreceiver.c        |  6 +++++-
 src/backend/replication/logical/tablesync.c                |  2 +-
 src/backend/replication/repl_gram.y                        | 14 +++++++++++---
 src/backend/replication/repl_scanner.l                     |  1 +
 src/backend/replication/walreceiver.c                      |  2 +-
 src/include/replication/walreceiver.h                      |  5 +++--
 7 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..f6793f0 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..50c3ea7 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1052,7 +1052,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..c5154ae 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -242,15 +244,16 @@ create_replication_slot:
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
-- 
1.8.3.1

v47-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From c05a1ad49ebf3b2078faeb421b060fd40ca77272 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 3 Mar 2021 04:52:35 -0500
Subject: [PATCH v47] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time to the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  68 ++++++++
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 257 ++++++++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 245 ++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 177 +++++++++++++++----
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 +++++++-
 src/include/replication/reorderbuffer.h     |  12 ++
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 807 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char* buf;
+			TwoPhaseFileHeader* hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transactions commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..46c52e5 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,263 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 45ac498..4f6406e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -722,6 +723,230 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/*
+	 * The gid must not already be prepared.
+	 */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				errmsg("transaction identifier \"%s\" is already in use",
+					   begin_data.gid)));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received the prepare because it occurred
+	 * before the walsender reached a consistent point, in which case we need
+	 * to skip the rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1972,6 +2197,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2bf1295 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +845,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -870,6 +927,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1195,3 +1270,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..358b14a 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure holds the
+ * information for both PREPARE and COMMIT PREPARED messages. For COMMIT
+ * PREPARED, prepare_lsn and preparetime store the commit LSN and the
+ * commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime is used to check whether the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95ae..745b51d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
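
For anyone following the protocol changes in proto.c above, here is a minimal standalone sketch of the PREPARE message layout that logicalrep_write_prepare() produces: a message type byte 'P', a flags byte (currently always 0), three 64-bit fields (prepare LSN, end LSN, timestamp), then the NUL-terminated gid. This is not PostgreSQL code. The demo_* functions, DemoPrepareMsg and DEMO_GIDSIZE are hypothetical names made up for illustration, the big-endian encoding is my understanding of what pq_sendint64() does, and unlike the real reader the sketch checks the type byte itself rather than leaving that to apply_dispatch().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_GIDSIZE 200	/* assumption: stands in for GIDSIZE */

/* Hypothetical mirror of the fields carried by a PREPARE message. */
typedef struct DemoPrepareMsg
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[DEMO_GIDSIZE];
} DemoPrepareMsg;

/* Append a 64-bit value in network (big-endian) byte order. */
static size_t
demo_put_u64(unsigned char *buf, size_t off, uint64_t v)
{
	for (int i = 7; i >= 0; i--)
		buf[off++] = (unsigned char) (v >> (i * 8));
	return off;
}

/* Read a 64-bit big-endian value and advance the offset. */
static uint64_t
demo_get_u64(const unsigned char *buf, size_t *off)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[(*off)++];
	return v;
}

/* Encode a PREPARE message; returns bytes written, or 0 if it does not fit. */
static size_t
demo_write_prepare(unsigned char *buf, size_t cap, const DemoPrepareMsg *msg)
{
	size_t		gidlen = strlen(msg->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (1 + 1 + 3 * 8 + gidlen > cap)
		return 0;

	buf[off++] = 'P';			/* message type */
	buf[off++] = 0;				/* flags, none defined */
	off = demo_put_u64(buf, off, msg->prepare_lsn);
	off = demo_put_u64(buf, off, msg->end_lsn);
	off = demo_put_u64(buf, off, (uint64_t) msg->prepare_time);
	memcpy(buf + off, msg->gid, gidlen);
	return off + gidlen;
}

/* Decode a PREPARE message; returns 0 on success, -1 on malformed input. */
static int
demo_read_prepare(const unsigned char *buf, size_t len, DemoPrepareMsg *msg)
{
	size_t		off = 0;
	size_t		gidlen;
	const unsigned char *nul;

	if (len < 1 + 1 + 3 * 8 + 1 || buf[off++] != 'P')
		return -1;
	if (buf[off++] != 0)		/* unrecognized flags */
		return -1;

	msg->prepare_lsn = demo_get_u64(buf, &off);
	msg->end_lsn = demo_get_u64(buf, &off);
	msg->prepare_time = (int64_t) demo_get_u64(buf, &off);

	/* The gid must be terminated inside the message and fit the buffer. */
	nul = memchr(buf + off, '\0', len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off)) + 1;
	if (gidlen > sizeof(msg->gid))
		return -1;
	memcpy(msg->gid, buf + off, gidlen);
	return 0;
}
```

The reader rejects a gid that is unterminated or longer than the destination buffer instead of copying it blindly, which is the same concern as with the fixed-size gid arrays in the LogicalRep*Data structs.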

Attachment: v47-0005-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From b3e59b2de9cd4f503910db24700ef38804b73b4a Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 3 Mar 2021 05:06:38 -0500
Subject: [PATCH v47] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC with streaming enabled.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
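
A note on the expected count of 3334 in 023_twophase_cascade_stream.pl above: after the streamed transaction commits, test_tab holds keys 1..5000 (the 2 initial rows plus generate_series(3, 5000)), the UPDATE does not change the row count, and the DELETE removes every key divisible by 3. The sketch below just redoes that arithmetic; the function name expected_rows_after_commit is made up for illustration and is not part of the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): count the rows that remain
 * in test_tab when keys 1..max_key are present and every row with
 * mod(a, 3) = 0 has been deleted, as the streamed transaction does.
 */
static int
expected_rows_after_commit(int max_key)
{
	int			rows = 0;
	int			a;

	assert(max_key >= 0);

	for (a = 1; a <= max_key; a++)
	{
		/* DELETE FROM test_tab WHERE mod(a,3) = 0 removes multiples of 3 */
		if (a % 3 != 0)
			rows++;
	}

	return rows;
}
```

With max_key = 5000 there are 1666 multiples of 3, so 5000 - 1666 = 3334 rows survive, which matches the value the test checks on both subscriber B and subscriber C.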

Attachment: v47-0007-Tablesync-early-exit.patch (application/octet-stream)
From 9913c9c37af252ff062baaecab73830800d3f5a0 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 3 Mar 2021 05:33:01 -0500
Subject: [PATCH v47] Tablesync early exit.

Give the tablesync worker an opportunity to exit immediately (because it
has already caught up), instead of requiring it to process a message
before discovering that.
---
 src/backend/replication/logical/worker.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9e0c2dc..ad537f4 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2323,6 +2323,16 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	bool		ping_sent = false;
 	TimeLineID	tli;
 
+	if (am_tablesync_worker())
+	{
+		/*
+		 * Give the tablesync worker an opportunity to see if it can exit
+		 * immediately, instead of always handling a message first (one which
+		 * the apply worker might have been able to handle).
+		 */
+		process_syncing_tables(last_received);
+	}
+
 	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
-- 
1.8.3.1

Attachment: v47-0006-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 6e340df8292d4a83c2be531a90df615f1a6dbb4c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 3 Mar 2021 05:26:56 -0500
Subject: [PATCH v47] Support 2PC txn - Subscription option.

This patch implements new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 46 +++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 203 insertions(+), 52 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..9c23497 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -184,8 +184,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73..060fab4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1168,7 +1168,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f6793f0..dacc890 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -358,6 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +399,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +468,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +547,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -825,6 +844,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -835,7 +856,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -874,6 +896,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -892,7 +921,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +967,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1013,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4f6406e..9e0c2dc 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2702,6 +2702,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3348,6 +3349,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2bf1295..3a1b404 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by
+		 * default, in which case we just clear the flag in the decoding
+		 * context. Otherwise we only allow it with a sufficient protocol
+		 * version, and when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 358b14a..eebedda 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..8d24b2e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,29 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..5c79dbd 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
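
For anyone skimming the pgoutput.c hunk in the patch above: the startup check boils down to a three-way decision. Below is only a standalone sketch, not code from the patch. decide_twophase(), the TwoPhaseDecision enum, and PROTO_2PC_VERSION_NUM are hypothetical names standing in for the ereport() paths and for LOGICALREP_PROTO_2PC_VERSION_NUM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for LOGICALREP_PROTO_2PC_VERSION_NUM in the patch. */
#define PROTO_2PC_VERSION_NUM 2

/* Hypothetical result codes; the patch raises ereport(ERROR) instead. */
typedef enum
{
	TWOPHASE_DISABLED,				/* option not requested */
	TWOPHASE_ENABLED,				/* requested and allowed */
	TWOPHASE_ERR_PROTO_TOO_OLD,		/* proto_version below the 2PC minimum */
	TWOPHASE_ERR_PLUGIN_UNSUPPORTED	/* plugin lacks two-phase callbacks */
} TwoPhaseDecision;

/*
 * Same order of checks as the hunk: not requested, then protocol version,
 * then plugin support.
 */
TwoPhaseDecision
decide_twophase(bool requested, uint32_t proto_version, bool plugin_supports)
{
	if (!requested)
		return TWOPHASE_DISABLED;
	if (proto_version < PROTO_2PC_VERSION_NUM)
		return TWOPHASE_ERR_PROTO_TOO_OLD;
	if (!plugin_supports)
		return TWOPHASE_ERR_PLUGIN_UNSUPPORTED;
	return TWOPHASE_ENABLED;
}
```

Note that the protocol version is checked before plugin support, so a too-old proto_version is reported even when the plugin could not have supported two-phase anyway.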

Attachment: v47-0008-Fix-apply-worker-empty-prepare.patch (application/octet-stream)
From b8239158c47f5632805b3655540846ffb054e033 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 3 Mar 2021 05:44:59 -0500
Subject: [PATCH v47] Fix apply worker empty prepare.

Due to unlucky timing between the apply and tablesync workers, a "consistent snapshot" can span a prepare/commit pair in such a way that the tablesync worker does not do the prepare (because its snapshot was not yet consistent), while the apply worker does the prepare ('b') but skips all of the prepared operations (e.g. inserts) because the tablesync worker was still busy (see the condition in should_apply_changes_for_rel). Later, when the apply worker handles the commit prepared ('K'), nothing is actually committed, because the inserts were skipped earlier.

This patch implements a two-part fix as suggested [1] on hackers.

Part 1 - The begin_prepare handler of apply will always wait for any busy tablesync workers to achieve SYNCDONE/READY state.

Part 2 - If (after Part 1) the apply-worker's prepare is found to be lagging behind any of the sync-workers then the subsequent prepared operations will be spooled to a file to be replayed at commit_prepared time.

Discussion:
[1] https://www.postgresql.org/message-id/CAA4eK1L%3DdhuCRvyDvrXX5wZgc7s1hLRD29CKCK6oaHtVCPgiFA%40mail.gmail.com
---
 src/backend/replication/logical/tablesync.c | 178 ++++++--
 src/backend/replication/logical/worker.c    | 668 +++++++++++++++++++++++++++-
 src/include/replication/worker_internal.h   |   3 +
 src/tools/pgindent/typedefs.list            |   1 +
 4 files changed, 815 insertions(+), 35 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 50c3ea7..9e2b48c 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have still not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When the process_syncing_tables_for_apply changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN among all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ad537f4..fcf7bc2 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -209,6 +209,45 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A memory context for the prepare spooling
+ */
+static MemoryContext PsfContext = NULL;
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and the
+ * corresponding fileset handles of same.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	SharedFileSet *fileset;		/* shared file set for prepare spoolfile */
+}			PsfHashEntry;
+
+/*
+ * Hash table for storing the Prepared spoolfile info along with shared fileset.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * The spoolfile handle is only valid between begin_prepare and prepare.
+ */
+static BufFile *psf_fd = NULL;
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_cleanup(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -730,6 +769,9 @@ apply_handle_begin_prepare(StringInfo s)
 {
 	LogicalRepBeginPrepareData begin_data;
 
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
 	logicalrep_read_begin_prepare(s, &begin_data);
 
 	/*
@@ -741,6 +783,81 @@ apply_handle_begin_prepare(StringInfo s)
 				errmsg("transaction identifier \"%s\" is already in use",
 					   begin_data.gid)));
 
+	/*
+	 * A Problem:
+	 *
+	 * Due to unlucky timing of the apply/tablesync workers it is possible to
+	 * have a "consistent snapshot" that spans prepare/commit in such a way
+	 * that the tablesync did not do the prepare (because its snapshot was not
+	 * consistent) and the apply worker does the begin prepare ('b') but
+	 * skips all the prepared operations [e.g. inserts] while the tablesync
+	 * is still busy (see the condition of should_apply_changes_for_rel).
+	 * Later, at commit prepared time, when the apply worker does the commit
+	 * prepared ('K'), there is nothing in it (because the inserts were
+	 * skipped earlier).
+	 *
+	 * The following code has a 2-part workaround for that scenario.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Workaround Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Workaround Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN, then all messages (until the
+		 * prepare) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+			StringInfoData sid;
+
+			/* The begin_prepare's LSN has been overtaken. */
+
+			/*
+			 * We need a transaction for handling the buffile, used for
+			 * serializing prepared messages. This transaction lasts until the
+			 * commit_prepared/ rollback_prepared.
+			 */
+			ensure_transaction();
+
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * Write BEGIN_PREPARE as the first message of the psf file.
+			 */
+			initStringInfo(&sid);
+			appendBinaryStringInfo(&sid, (char *)&begin_data, sizeof(begin_data));
+			prepare_spoolfile_write(LOGICAL_REP_MSG_BEGIN_PREPARE, &sid);
+		}
+	}
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	in_remote_transaction = true;
@@ -756,6 +873,38 @@ apply_handle_prepare(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
 
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_fd)
+	{
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_write(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * TODO - Flush the spoolfile so changes can survive a restart.
+		 */
+
+		/*
+		 * The psf_fd is meaningful only between begin_prepare and prepared.
+		 * So close it now. If we had been writing any messages to the psf_fd
+		 * (the spoolfile) then those will be applied later during
+		 * handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		/*
+		 * And end the transaction that was created by begin_prepare for
+		 * working with the psf buffiles.
+		 */
+		Assert(IsTransactionState());
+		CommitTransactionCommand();
+
+		in_remote_transaction = false;
+		return;
+	}
+
 	logicalrep_read_prepare(s, &prepare_data);
 
 	Assert(prepare_data.prepare_lsn == remote_final_lsn);
@@ -805,9 +954,61 @@ static void
 apply_handle_commit_prepared(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
+
 
 	logicalrep_read_commit_prepared(s, &prepare_data);
 
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+		LogicalRepPreparedTxnData pdata;
+
+		/*
+		 * 1. replay the spooled messages
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, &pdata);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		prepare_spoolfile_cleanup(psfpath);
+
+		/*
+		 * 2. mark as PREPARED (use prepare_data info from the psf file)
+		 */
+
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = pdata.end_lsn;
+		replorigin_session_origin_timestamp = pdata.preparetime;
+
+		PrepareTransactionBlock(pdata.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(pdata.end_lsn);
+	}
+
 	/* there is no transaction when COMMIT PREPARED is called */
 	ensure_transaction();
 
@@ -838,15 +1039,53 @@ static void
 apply_handle_rollback_prepared(StringInfo s)
 {
 	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
 
 	logicalrep_read_rollback_prepared(s, &rollback_data);
 
 	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		if (psf_fd)
+		{
+			/*
+			 * XXX - For some reason (possibly a bug) it is currently
+			 * possible to get here, after a restart, when there was a
+			 * begin_prepare but there was NO prepare. Since there was no
+			 * prepare, the psf_fd and the transaction are still lingering so
+			 * they need to be cleaned up now.
+			 */
+			prepare_spoolfile_close();
+
+			/*
+			 * And end the transaction that was created by the begin_prepare
+			 * for working with psf buffiles.
+			 */
+			Assert(IsTransactionState());
+			AbortCurrentTransaction();
+		}
+
+		prepare_spoolfile_cleanup(psfpath);
+	}
+
+	/*
 	 * It is possible that we haven't received prepare because it occurred
 	 * before walsender reached a consistent point in which case we need to
 	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
 	 */
-	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
 	{
 		/*
@@ -889,10 +1128,33 @@ apply_handle_stream_prepare(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
 	xid = logicalrep_read_stream_prepare(s, &prepare_data);
 	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
 
 	/*
+	 * Wait for all the sync workers to reach the SYNCDONE/READY state.
+	 *
+	 * This is the same waiting logic as in apply_handle_begin_prepare function
+	 * (see that function for more details about this).
+	 */
+	if (!am_tablesync_worker())
+	{
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(prepare_data.end_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+	}
+
+	/*
 	 *
 	 * --------------------------------------------------------------------------
 	 * 1. Replay all the spooled operations - Similar code as for
@@ -959,6 +1221,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		psf_fd == NULL &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1393,6 +1656,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1514,6 +1780,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1672,6 +1941,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -2041,6 +2313,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -2334,6 +2609,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	}
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL		hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2349,6 +2641,14 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 													"LogicalStreamingContext",
 													ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * This memory context is used when the prepare spooling is used. It is
+	 * reset at prepare commit/rollback time.
+	 */
+	PsfContext = AllocSetContextCreate(ApplyContext,
+									   "PsfContext",
+									   ALLOCSET_DEFAULT_SIZES);
+
 	/* mark as idle, before starting to loop */
 	pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -2453,7 +2753,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && psf_fd == NULL)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -3378,3 +3678,367 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_fd ? "Do" : "Don't");
+
+	if (psf_fd == NULL)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	MemoryContext oldctx;
+	bool		found;
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(psf_fd == NULL);
+
+	/* create or find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  &found);
+
+	/*
+	 * Create/open the bufFiles under the Prepare Spoolfile Context so that we
+	 * have those files until prepare commit/rollback.
+	 */
+	oldctx = MemoryContextSwitchTo(PsfContext);
+
+	if (!found)
+	{
+		MemoryContext savectx;
+		SharedFileSet *fileset;
+
+		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		savectx = MemoryContextSwitchTo(ApplyContext);
+		fileset = palloc(sizeof(SharedFileSet));
+
+		SharedFileSetInit(fileset, NULL);
+		MemoryContextSwitchTo(savectx);
+
+		psf_fd = BufFileCreateShared(fileset, path);
+
+		/* Remember this path's fileset for the next time */
+		memcpy(hentry->name, path, MAXPGPATH);
+		hentry->fileset = fileset;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the end of the file because we always
+		 * append the changes file.
+		 */
+		elog(DEBUG1, "Found file \"%s\". Open for append.", path);
+		psf_fd = BufFileOpenShared(hentry->fileset, path, O_RDWR);
+		BufFileSeek(psf_fd, 0, 0, SEEK_END);
+	}
+
+	MemoryContextSwitchTo(oldctx);
+
+	/* Sanity check */
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close(void)
+{
+	if (psf_fd)
+		BufFileClose(psf_fd);
+	psf_fd = NULL;
+}
+
+/*
+ * Delete the specified psf spoolfile.
+ */
+static void
+prepare_spoolfile_cleanup(char *path)
+{
+	PsfHashEntry *hentry;
+
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* And remove the path entry from the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_REMOVE,
+										  NULL);
+
+	/* By this time we must have created the entry */
+	Assert(hentry != NULL);
+
+	/* Delete the file and release the fileset memory */
+	SharedFileSetDeleteAll(hentry->fileset);
+	pfree(hentry->fileset);
+	hentry->fileset = NULL;
+
+	/* Reset the memory context used during the file creation */
+	MemoryContextReset(PsfContext);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+
+	Assert(psf_fd != NULL);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	BufFileWrite(psf_fd, &len, sizeof(len));
+
+	/* then the action */
+	BufFileWrite(psf_fd, &action, sizeof(action));
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	BufFileWrite(psf_fd, &s->data[s->cursor], len);
+}
+
+/*
+ * Is there a prepare spoolfile for the specified gid?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool		found;
+
+	/* Find the prepare spoolfile entry in the psf_hash */
+	hash_search(psf_hash,
+				path,
+				HASH_FIND,
+				&found);
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ *
+ * [Note: this is similar to apply_spooled_messages function]
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	bool		found = false;
+	PsfHashEntry *hentry;
+	BufFile    *fd;
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	/* Open the spool file */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_FIND,
+										  &found);
+	Assert(found);
+	fd = BufFileOpenShared(hentry->fileset, path, O_RDONLY);
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = BufFileRead(fd, &len, sizeof(len));
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		if (BufFileRead(fd, buffer, len) != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/*
+		 * The psf spoolfile contents will have first and last messages as
+		 * BEGIN_PREPARE and PREPARE message respectively. These two are
+		 * processed specially within this function.
+		 *
+		 * BEGIN_PREPARE msg: This will be the first message of the psf file.
+		 * Use this begin_data information to set the remote_final_lsn.
+		 *
+		 * PREPARE msg: The prepare_data information is returned so that the
+		 * prepare lsn values are available to the caller (commit_prepared).
+		 * Unfortunately, just dispatching the PREPARE message is problematic
+		 * because its transaction commits have side effects on this replay
+		 * loop which is still running.
+		 *
+		 * All other message content (between the BEGIN_PREPARE and the PREPARE)
+		 * will be delivered to apply_dispatch as they normally would be.
+		 */
+		if (s2.data[0] == LOGICAL_REP_MSG_BEGIN_PREPARE)
+		{
+			LogicalRepBeginPrepareData bdata;
+
+			/* read/skip the action byte (not inside Assert, so it always runs). */
+			(void) pq_getmsgbyte(&s2);
+
+			/* read the begin_data. */
+			logicalrep_read_begin_prepare(&s2, &bdata);
+
+			elog(DEBUG1, "BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
+				 bdata.gid,
+				 LSN_FORMAT_ARGS(bdata.final_lsn),
+				 LSN_FORMAT_ARGS(bdata.end_lsn));
+
+			/*
+			 * Make sure the handle apply_dispatch methods are aware we're in a remote
+			 * transaction.
+			 */
+			remote_final_lsn = bdata.final_lsn;
+			in_remote_transaction = true;
+			pgstat_report_activity(STATE_RUNNING, NULL);
+		}
+		else if (s2.data[0] == LOGICAL_REP_MSG_PREPARE)
+		{
+			/* read/skip the action byte (not inside Assert, so it always runs). */
+			(void) pq_getmsgbyte(&s2);
+
+			/* read and return the prepare_data info to the caller */
+			logicalrep_read_prepare(&s2, pdata);
+
+			elog(DEBUG1, "PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
+				 pdata->gid,
+				 LSN_FORMAT_ARGS(pdata->prepare_lsn),
+				 LSN_FORMAT_ARGS(pdata->end_lsn));
+		}
+		else
+		{
+			/* Ensure we are reading the data into our memory context. */
+			oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+			apply_dispatch(&s2);
+
+			MemoryContextReset(ApplyMessageContext);
+
+			MemoryContextSwitchTo(oldctx2);
+
+			nchanges++;
+
+			if (nchanges % 1000 == 0)
+				elog(DEBUG1, "replayed %d changes from file '%s'",
+					 nchanges, path);
+		}
+	}
+
+	BufFileClose(fd);
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly the same length and padded
+	 * with '\0' so garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "%u-%s.prep_changes", subid, gid);
+}
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 745b51d..1b0c25b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1958,6 +1958,7 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
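
For readers following the two-part fix in the 0008 patch above, here is a minimal standalone sketch of the decision it makes at begin_prepare time. It is not the patch code: the `RelState` struct, the `Lsn` typedef, the `ST_*` constants and the three function names are hypothetical stand-ins for SubscriptionRelState, XLogRecPtr, the SUBREL_STATE_* codes and the AnyTablesyncInProgress/BiggestTablesyncLSN helpers, and it leaves out the catalog access and the latch wait entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real code uses XLogRecPtr and SUBREL_STATE_*. */
typedef uint64_t Lsn;
enum { ST_INIT = 'i', ST_DATASYNC = 'd', ST_SYNCDONE = 's', ST_READY = 'r' };

typedef struct
{
	char state;   /* sync state of one subscribed table */
	Lsn  lsn;     /* LSN recorded for that table's sync */
} RelState;

/* Part 1: is any table still short of SYNCDONE/READY? */
bool
any_tablesync_in_progress(const RelState *rels, size_t n)
{
	assert(rels != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (rels[i].state != ST_SYNCDONE && rels[i].state != ST_READY)
			return true;
	return false;
}

/* Largest LSN over all known tables (0 plays the role of "invalid"). */
Lsn
biggest_tablesync_lsn(const RelState *rels, size_t n)
{
	Lsn biggest = 0;

	for (size_t i = 0; i < n; i++)
		if (rels[i].lsn > biggest)
			biggest = rels[i].lsn;
	return biggest;
}

/* Part 2: spool only if some tablesync already went past this prepare. */
bool
must_spool_prepare(Lsn begin_prepare_final_lsn, const RelState *rels, size_t n)
{
	return begin_prepare_final_lsn < biggest_tablesync_lsn(rels, n);
}
```

In the patch itself the apply worker loops on the Part 1 check (with a 1 second latch timeout) before it evaluates the Part 2 comparison, so the comparison only ever sees tables that are already SYNCDONE or READY.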

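The spoolfile record layout described in prepare_spoolfile_write (a length that counts the action byte plus the payload, then the action byte, then the payload) can be tried out in isolation with plain stdio. This is only a sketch under assumptions: `spool_write` and `spool_read` are hypothetical names, the patch uses BufFile and StringInfo rather than `FILE *`, and the return-code convention here is invented for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Append one record: an int length (action byte + payload bytes), the
 * action byte, then the payload. Returns 0 on success, -1 on write error.
 */
int
spool_write(FILE *fp, char action, const char *payload, int payload_len)
{
	int len = payload_len + (int) sizeof(char);

	assert(fp != NULL && payload_len >= 0);
	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Read the next record into buf. Returns the payload length, -1 when
 * nothing more can be read (end of file), or -2 for a short read, a bad
 * length, or a buffer that is too small.
 */
int
spool_read(FILE *fp, char *action, char *buf, int bufsz)
{
	int    len;
	size_t got = fread(&len, 1, sizeof(len), fp);

	if (got == 0)
		return -1;
	if (got != sizeof(len) || len < 1 || len - 1 > bufsz)
		return -2;
	if (fread(action, 1, 1, fp) != 1)
		return -2;
	if (len > 1 && fread(buf, 1, (size_t) (len - 1), fp) != (size_t) (len - 1))
		return -2;
	return len - 1;
}
```

Writing a few records to a `tmpfile()`, calling `rewind()`, and reading until `spool_read` returns -1 mirrors the replay loop in prepare_spoolfile_replay_messages, which likewise treats a zero-byte read of the length field as end of file and anything else short as an error.
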
Attachment: v47-0009-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From 44986461ef243a7449eb26c28370cd77504c9f2f Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 3 Mar 2021 05:59:22 -0500
Subject: [PATCH v47] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 ++++++++++---
 src/backend/replication/logical/worker.c    | 65 +++++++++++++++++++++--------
 2 files changed, 71 insertions(+), 23 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 9e2b48c..a867dba 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fcf7bc2..348a54d 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -806,14 +806,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			process_syncing_tables(begin_data.final_lsn);
 
 			/* This latch is to prevent 100% CPU looping. */
@@ -831,7 +833,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 			StringInfoData sid;
@@ -970,6 +977,8 @@ apply_handle_commit_prepared(StringInfo s)
 		int			nchanges;
 		LogicalRepPreparedTxnData pdata;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * 1. replay the spooled messages
 		 */
@@ -977,8 +986,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, &pdata);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		prepare_spoolfile_cleanup(psfpath);
@@ -1084,6 +1093,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2474,18 +2484,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3692,8 +3706,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_fd ? "Do" : "Don't");
 
@@ -3718,7 +3732,7 @@ prepare_spoolfile_create(char *path)
 	bool		found;
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(psf_fd == NULL);
 
@@ -3739,7 +3753,7 @@ prepare_spoolfile_create(char *path)
 		MemoryContext savectx;
 		SharedFileSet *fileset;
 
-		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		savectx = MemoryContextSwitchTo(ApplyContext);
 		fileset = palloc(sizeof(SharedFileSet));
 
@@ -3758,7 +3772,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the end of the file because we always
 		 * append the changes file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Open for append.", path);
+		elog(LOG, "!!> Found file \"%s\". Open for append.", path);
 		psf_fd = BufFileOpenShared(hentry->fileset, path, O_RDWR);
 		BufFileSeek(psf_fd, 0, 0, SEEK_END);
 	}
@@ -3775,6 +3789,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_fd)
 		BufFileClose(psf_fd);
 	psf_fd = NULL;
@@ -3788,6 +3803,8 @@ prepare_spoolfile_cleanup(char *path)
 {
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_cleanup: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3823,20 +3840,23 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_fd != NULL);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	BufFileWrite(psf_fd, &len, sizeof(len));
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	BufFileWrite(psf_fd, &action, sizeof(action));
 
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	BufFileWrite(psf_fd, &s->data[s->cursor], len);
 }
 
@@ -3854,6 +3874,11 @@ prepare_spoolfile_exists(char *path)
 				HASH_FIND,
 				&found);
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
 	return found;
 }
 
@@ -3874,8 +3899,8 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 	PsfHashEntry *hentry;
 	BufFile    *fd;
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3911,6 +3936,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 
 		/* read length of the on-disk record */
 		nbytes = BufFileRead(fd, &len, sizeof(len));
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3929,6 +3955,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		if (BufFileRead(fd, buffer, len) != len)
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -3960,13 +3987,15 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		{
 			LogicalRepBeginPrepareData bdata;
 
+			elog(LOG, "!!> prepare_spoolfile_replay_messages: Found the BEGIN_PREPARE info");
+
 			/* read/skip the action byte. */
 			Assert(pq_getmsgbyte(&s2) == LOGICAL_REP_MSG_BEGIN_PREPARE);
 
 			/* read the begin_data. */
 			logicalrep_read_begin_prepare(&s2, &bdata);
 
-			elog(DEBUG1, "BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
+			elog(LOG, "!!> BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
 				 bdata.gid,
 				 LSN_FORMAT_ARGS(bdata.final_lsn),
 				 LSN_FORMAT_ARGS(bdata.end_lsn));
@@ -3981,13 +4010,15 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		}
 		else if (s2.data[0] == LOGICAL_REP_MSG_PREPARE)
 		{
+			elog(LOG, "!!> prepare_spoolfile_replay_messages: Found the PREPARE info");
+
 			/* read/skip the action byte. */
 			Assert(pq_getmsgbyte(&s2) == LOGICAL_REP_MSG_PREPARE);
 
 			/* read and return the prepare_data info to the caller */
 			logicalrep_read_prepare(&s2, pdata);
 
-			elog(DEBUG1, "PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
+			elog(LOG, "!!> PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
 				 pdata->gid,
 				 LSN_FORMAT_ARGS(pdata->prepare_lsn),
 				 LSN_FORMAT_ARGS(pdata->end_lsn));
@@ -4006,7 +4037,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 			nchanges++;
 
 			if (nchanges % 1000 == 0)
-				elog(DEBUG1, "replayed %d changes from file '%s'",
+				elog(LOG, "!!> replayed %d changes from file '%s'",
 					 nchanges, path);
 		}
 	}
@@ -4016,7 +4047,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 	pfree(buffer);
 	pfree(s2.data);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
-- 
1.8.3.1
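
The logging added to prepare_spoolfile_write() and prepare_spoolfile_replay_messages() in the patch above traces a simple length-prefixed record layout: the size first, then the action byte, then the rest of the message. The following is only a sketch of that layout using plain stdio rather than the BufFile API. All demo_* names are hypothetical, and the error handling is simplified compared with what the apply worker would need.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Write one record as [int len][char action][payload].
 * len counts the action byte plus the payload, as in the patch.
 * Returns 0 on success, -1 on a short write.
 */
static int
demo_psf_write(FILE *fp, char action, const char *payload, int payload_len)
{
	int			len = payload_len + (int) sizeof(char);

	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Read the next record into *buf (grown with realloc as needed).
 * Returns the record length, 0 at a clean end of file, or -1 if the
 * file ends in the middle of a record or the length is not positive.
 */
static int
demo_psf_read(FILE *fp, char **buf)
{
	int			len;
	char	   *tmp;
	size_t		n = fread(&len, 1, sizeof(len), fp);

	if (n == 0)
		return 0;				/* end of file between records */
	if (n != sizeof(len) || len <= 0)
		return -1;
	tmp = realloc(*buf, (size_t) len);
	if (tmp == NULL)
		return -1;
	*buf = tmp;
	if (fread(*buf, 1, (size_t) len, fp) != (size_t) len)
		return -1;
	return len;
}

/* Spool two records to a temporary file, then count them on replay. */
static int
demo_psf_roundtrip(void)
{
	FILE	   *fp = tmpfile();
	char	   *buf = NULL;
	int			nchanges = 0;
	int			len;

	if (fp == NULL)
		return -1;
	if (demo_psf_write(fp, 'I', "row-1", 5) != 0 ||
		demo_psf_write(fp, 'U', "row-2", 5) != 0)
	{
		fclose(fp);
		return -1;
	}
	rewind(fp);
	while ((len = demo_psf_read(fp, &buf)) > 0)
		nchanges++;				/* buf[0] holds the action byte */
	free(buf);
	fclose(fp);
	return (len == 0) ? nchanges : -1;
}
```

Writing the length first lets the reader size its buffer before reading the body, and a zero-byte read of the length field is what distinguishes a clean end of file from a truncated record.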

#211Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#207)

On Sat, Feb 27, 2021 at 8:10 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v45*

Differences from v44*:

* Rebased to HEAD

* Addressed some feedback comments for the 0007 ("empty prepare") patch.

[ak1] #1 - TODO
[ak1] #2 - Fixed. Removed #if 0 debugging
[ak1] #3 - TODO
[ak1] #4 - Fixed. Now BEGIN_PREPARE and PREPARE msgs are spooled. The
lsns are obtained from them.

@@ -774,6 +891,38 @@ apply_handle_prepare(StringInfo s)
{
LogicalRepPreparedTxnData prepare_data;

+ /*
+ * If we were using a psf spoolfile, then write the PREPARE as the final
+ * message. This prepare information will be used at commit_prepared time.
+ */
+ if (psf_fd)
+ {
+ /* Write the PREPARE info to the psf file. */
+ Assert(prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s));

Why is the write of the PREPARE under an Assert?
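
To illustrate the concern (an editorial sketch, not code from the patch): PostgreSQL's Assert() is only compiled in for assert-enabled builds, so an expression with side effects placed inside it is skipped entirely in other builds. DEMO_ASSERT and demo_spool_write below are hypothetical stand-ins for Assert() and prepare_spoolfile_handler().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for Assert(): it expands to nothing unless
 * DEMO_ASSERT_ENABLED is defined, similar to a build without assertions.
 */
#ifdef DEMO_ASSERT_ENABLED
#define DEMO_ASSERT(cond) do { if (!(cond)) abort(); } while (0)
#else
#define DEMO_ASSERT(cond) ((void) 0)
#endif

static int	demo_records_written = 0;

/* Hypothetical stand-in for a spoolfile write that reports success. */
static bool
demo_spool_write(void)
{
	demo_records_written++;
	return true;
}

/* Problematic: the write disappears when assertions are compiled out. */
static void
demo_write_inside_assert(void)
{
	DEMO_ASSERT(demo_spool_write());
}

/* Safer: always perform the write, then check only the result. */
static void
demo_write_then_assert(void)
{
	bool		ok = demo_spool_write();

	DEMO_ASSERT(ok);
	(void) ok;					/* avoid an unused-variable warning */
}
```

With DEMO_ASSERT_ENABLED left undefined, calling demo_write_inside_assert() leaves demo_records_written unchanged, while demo_write_then_assert() increments it.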

Similarly, the commit_prepared code below still does the prepare:
+ /*
+ * 2. mark as PREPARED (use prepare_data info from the psf file)
+ */
+
+ /*
+ * BeginTransactionBlock is necessary to balance the
+ * EndTransactionBlock called within the PrepareTransactionBlock
+ * below.
+ */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+
+ /*
+ * Update origin state so we can restart streaming from correct
+ * position in case of crash.
+ */
+ replorigin_session_origin_lsn = pdata.end_lsn;
+ replorigin_session_origin_timestamp = pdata.preparetime;
+
+ PrepareTransactionBlock(pdata.gid);
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(pdata.end_lsn);

This should automatically happen via apply_handle_prepare if we write
it to the spool file.

* prepare_spoolfile_replay_messages() shouldn't handle special cases
for BEGIN_PREPARE and PREPARE messages. Those should be handled by
their corresponding apply_handle_* functions. Before processing the
messages, remote_final_lsn needs to be set to commit_prepared's
commit_lsn (aka prepare_data.prepare_lsn).

--
With Regards,
Amit Kapila.

#212Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#210)
9 attachment(s)

Please find attached the latest patch set v48*

Differences from v47* are:

* Rebased to HEAD @ today

* Patch v46-0008 "empty prepare" updated
Modified code to use File API instead of BufFile API for prepare spoolfile (psf)
Various other feedback items also addressed:
[05a] Now syncing the psf file at prepare time
[05e] Now psf files still being spooled are deleted on error; if already
prepared, they are deleted only when committed or rolled back
[06]: Now checking existence of psf file on disk if not in memory (in
case HTAB lost after restart)
[16]: Fixed. Remove unnecessary Assert with spooled PREPARE message
[20]: Fixed. Typo "it it" in comment.
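
As an illustration of item [05a] (syncing the psf file at prepare time), here is a minimal sketch of the write-then-sync pattern using plain POSIX calls. It is not the patch's code, which goes through PostgreSQL's own File API; demo_write_and_sync and its arguments are hypothetical names, and a real implementation would also need to handle interrupted writes and directory syncing.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Append len bytes of data to path and force them to stable storage
 * before returning. Returns 0 on success, -1 on any failure.
 */
static int
demo_write_and_sync(const char *path, const char *data, size_t len)
{
	int			fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
	size_t		done = 0;

	if (fd < 0)
		return -1;

	/* write() may be partial, so loop until everything is written */
	while (done < len)
	{
		ssize_t		n = write(fd, data + done, len - done);

		if (n < 0)
		{
			close(fd);
			return -1;
		}
		done += (size_t) n;
	}

	/* the data is only durable once fsync() has succeeded */
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	return close(fd);
}
```

The point of syncing at prepare time is that once the prepare has been acknowledged, the spooled changes have to survive a crash so that they can still be replayed at commit prepared.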

KNOWN ISSUES
* Patch 0008 has more feedback comments to be addressed

-----

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v48-0001-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From 76c99759727e070111fc96575cf7a5505e44f118 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 4 Mar 2021 15:58:35 +1100
Subject: [PATCH v48] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..45ac498 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v48-0002-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From a23e7056277f6e90c0c5356de572756dbb3b068e Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 4 Mar 2021 16:01:05 +1100
Subject: [PATCH v48] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replica origin replay progress for
2PC, but it was not complete. It did not properly track the progress for
rollback prepared; in particular, it did not update the code for recovery.
Additionally, we need to allow tracking it on subscriber nodes where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20..6023e7c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2276,6 +2276,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2298,6 +2306,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df..acdb28d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5716,8 +5716,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5923,7 +5922,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5972,6 +5972,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6013,7 +6020,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6021,7 +6029,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1

v48-0004-Add-two_phase-option-to-CREATE-REPLICATION-SLOT.patch (application/octet-stream)
From 4d6c38bba0bd8f3891411762d3b0acda08dbc1f1 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 4 Mar 2021 16:09:51 +1100
Subject: [PATCH v48] Add two_phase option to CREATE REPLICATION SLOT.

This patch adds a new option to enable two_phase while creating a slot.
---
 src/backend/commands/subscriptioncmds.c                    |  2 +-
 .../replication/libpqwalreceiver/libpqwalreceiver.c        |  6 +++++-
 src/backend/replication/logical/tablesync.c                |  2 +-
 src/backend/replication/repl_gram.y                        | 14 +++++++++++---
 src/backend/replication/repl_scanner.l                     |  1 +
 src/backend/replication/walreceiver.c                      |  2 +-
 src/include/replication/walreceiver.h                      |  5 +++--
 7 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..f6793f0 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..50c3ea7 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1052,7 +1052,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..c5154ae 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -242,15 +244,16 @@ create_replication_slot:
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
-- 
1.8.3.1

v48-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 4cac1dd34ff10620c698f6fab5ca2ed07cac8866 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 4 Mar 2021 16:02:47 +1100
Subject: [PATCH v48] Add support for apply at prepare time to built-in logical
  replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the
changes accumulated in spool-file at prepare time.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c       |  68 ++++++++
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 257 ++++++++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 245 ++++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 177 +++++++++++++++----
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  75 +++++++-
 src/include/replication/reorderbuffer.h     |  12 ++
 src/tools/pgindent/typedefs.list            |   3 +
 9 files changed, 807 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid confusing prepared xacts
+ * that came from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither GXACT collisions (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..46c52e5 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,263 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 45ac498..4f6406e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -722,6 +723,230 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/*
+	 * The gid must not already be prepared.
+	 */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				errmsg("transaction identifier \"%s\" is already in use",
+					   begin_data.gid)));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 * --------------------------------------------------------------------------
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 *
+	 * --------------------------------------------------------------------------
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 * --------------------------------------------------------------------------
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1972,6 +2197,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2bf1295 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -57,6 +67,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -66,6 +78,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +173,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -322,8 +344,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +364,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +385,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +845,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -870,6 +927,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1195,3 +1270,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..358b14a 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for prepare, and commit prepared transaction.
+ * prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -171,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95ae..745b51d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
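
The rbtxn_commit_prepared() and rbtxn_rollback_prepared() macros added to
reorderbuffer.h in the patch above are plain bit tests on txn_flags. The
following is only a standalone sketch of that pattern: the flag values and the
DemoTXN struct are assumptions made up for illustration (the real definitions
live in reorderbuffer.h and are not shown in this hunk), and demo_selfcheck()
is a hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values; the real ones are defined in reorderbuffer.h. */
#define RBTXN_COMMIT_PREPARED	0x0001
#define RBTXN_ROLLBACK_PREPARED	0x0002

/* Stand-in for ReorderBufferTXN, reduced to the one field the macros read. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
} DemoTXN;

/* Same shape as the macros in the patch: test one bit of txn_flags. */
#define rbtxn_commit_prepared(txn) \
( \
	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
)

#define rbtxn_rollback_prepared(txn) \
( \
	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
)

/* Hypothetical helper: set one flag and check that only that test fires. */
static void
demo_selfcheck(void)
{
	DemoTXN		txn = {0};

	/* A fresh transaction has neither flag set. */
	assert(!rbtxn_commit_prepared(&txn));
	assert(!rbtxn_rollback_prepared(&txn));

	/* Setting one bit must not make the other test true. */
	txn.txn_flags |= RBTXN_ROLLBACK_PREPARED;
	assert(rbtxn_rollback_prepared(&txn));
	assert(!rbtxn_commit_prepared(&txn));
}
```

Because each macro compares the masked value against 0, it yields exactly 0 or
1 regardless of which bit position the flag occupies.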

v48-0005-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From c1f03b5192078658864e4f0f251a130c6db56345 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 4 Mar 2021 16:15:24 +1100
Subject: [PATCH v48] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code (streaming and non-streaming).
---
 src/test/subscription/t/020_twophase.pl            | 338 ++++++++++++++
 src/test/subscription/t/021_twophase_stream.pl     | 517 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 282 +++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 319 +++++++++++++
 4 files changed, 1456 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_stream.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl
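
A note on the expected row counts in 021_twophase_stream.pl: the value 3334
follows from the statements inside the prepared transaction. The table starts
with a = 1 and 2, the INSERT adds a = 3..5000, the UPDATE does not change the
row count, and the DELETE removes every row where mod(a,3) = 0. A minimal
sketch of that arithmetic, where expected_rows() is a hypothetical helper
written only for this note:

```c
#include <assert.h>

/*
 * Rows left in test_tab after the prepared transaction commits, assuming
 * the table holds a = 1..max_a before the DELETE and every multiple of 3
 * is then removed.
 */
static int
expected_rows(int max_a)
{
	int			rows = 0;

	for (int a = 1; a <= max_a; a++)
	{
		/* DELETE FROM test_tab WHERE mod(a,3) = 0 drops these rows. */
		if (a % 3 != 0)
			rows++;
	}
	return rows;
}
```

With max_a = 5000 there are 1666 multiples of 3, so 5000 - 1666 = 3334 rows
remain. The tests that insert a = 99999 after the PREPARE expect 3335, because
that row is added by a separate transaction and is not touched by the DELETE.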

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
new file mode 100644
index 0000000..9ec1e31
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -0,0 +1,517 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Test setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+	or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED '';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Test cases involving DDL
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3c6470d
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,319 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', q{
+    DELETE FROM test_tab WHERE a > 2;});
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v48-0007-Tablesync-early-exit.patch (application/octet-stream)
From 49825f929630201ddccfba0181f735f1171cdc54 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 4 Mar 2021 16:39:10 +1100
Subject: [PATCH v48] Tablesync early exit.

Give the tablesync worker an opportunity to see if it can exit immediately
(because it has already caught up) without first needing to process a
message to discover that.
---
 src/backend/replication/logical/worker.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9e0c2dc..ad537f4 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2323,6 +2323,16 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	bool		ping_sent = false;
 	TimeLineID	tli;
 
+	if (am_tablesync_worker())
+	{
+		/*
+		 * Give the tablesync worker an opportunity to see if it can immediately
+		 * exit, instead of always handling a message (which perhaps the apply
+		 * worker could have handled).
+		 */
+		process_syncing_tables(last_received);
+	}
+
 	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
-- 
1.8.3.1
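
A note for readers of the 0006 patch that follows: its pgoutput_startup() hunk gates two-phase decoding on three checks, in order: whether the subscriber asked for it, whether the requested protocol version is high enough, and whether the decoding context reports two-phase support. The snippet below is only a standalone sketch of that ordering, not PostgreSQL code. The enum, the function name decide_twophase(), and the SKETCH_PROTO_2PC_VERSION macro are hypothetical names made up for illustration; the real code raises ereport(ERROR, ...) where the sketch returns an error value.

```c
#include <stdbool.h>

/* Hypothetical stand-in for LOGICALREP_PROTO_2PC_VERSION_NUM (2 in the patch). */
#define SKETCH_PROTO_2PC_VERSION 2

/* Possible outcomes of the check; the names are illustrative only. */
typedef enum
{
	TWOPHASE_DISABLED,				/* option not requested: flag is cleared */
	TWOPHASE_ENABLED,				/* requested and every precondition holds */
	TWOPHASE_ERR_PROTO_TOO_OLD,		/* patch raises ERROR: proto_version too low */
	TWOPHASE_ERR_PLUGIN_UNSUPPORTED	/* patch raises ERROR: no plugin support */
} TwoPhaseDecision;

/*
 * Mirror the order of the checks in the patch: an unrequested option wins
 * first, then the protocol version, then the capability flag carried by the
 * decoding context (ctx->twophase in the patch).
 */
TwoPhaseDecision
decide_twophase(bool requested, unsigned int proto_version, bool ctx_twophase)
{
	if (!requested)
		return TWOPHASE_DISABLED;
	if (proto_version < SKETCH_PROTO_2PC_VERSION)
		return TWOPHASE_ERR_PROTO_TOO_OLD;
	if (!ctx_twophase)
		return TWOPHASE_ERR_PLUGIN_UNSUPPORTED;
	return TWOPHASE_ENABLED;
}
```

The ordering matters for the error a user sees: a subscriber that requests two_phase with proto_version 1 gets the protocol-version message even if the plugin also lacks two-phase support.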

Attachment: v48-0006-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 68b7c9526be03649cfbf33c05541fed5e55308fd Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 4 Mar 2021 16:19:34 +1100
Subject: [PATCH v48] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option, "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/alter_subscription.sgml           |  5 +-
 doc/src/sgml/ref/create_subscription.sgml          | 15 ++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 46 +++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 ++
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 +++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 ++++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 +--
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 ++
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 79 ++++++++++++++--------
 src/test/regress/sql/subscription.sql              | 15 ++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/021_twophase_stream.pl     |  2 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 .../subscription/t/023_twophase_cascade_stream.pl  |  4 +-
 20 files changed, 203 insertions(+), 52 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..9c23497 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -184,8 +184,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,
+      <literal>streaming</literal>, and
+      <literal>two_phase</literal>.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1332a83 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,21 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73..060fab4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1168,7 +1168,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f6793f0..dacc890 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+		{
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -358,6 +373,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +399,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +468,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +547,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -825,6 +844,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				bool		binary;
 				bool		streaming_given;
 				bool		streaming;
+				bool		twophase_given;
+				bool		twophase;
 
 				parse_subscription_options(stmt->options,
 										   NULL,	/* no "connect" */
@@ -835,7 +856,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   &twophase_given, &twophase);
 
 				if (slotname_given)
 				{
@@ -874,6 +896,13 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 					replaces[Anum_pg_subscription_substream - 1] = true;
 				}
 
+				if (twophase_given)
+				{
+					values[Anum_pg_subscription_subtwophase - 1] =
+						BoolGetDatum(twophase);
+					replaces[Anum_pg_subscription_subtwophase - 1] = true;
+				}
+
 				update_tuple = true;
 				break;
 			}
@@ -892,7 +921,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +967,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1013,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4f6406e..9e0c2dc 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2702,6 +2702,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3348,6 +3349,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2bf1295..3a1b404 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -180,13 +180,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -254,6 +256,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -267,6 +279,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -291,7 +304,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -332,6 +346,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient protocol version, and when the
+		 * output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 358b14a..eebedda 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..8d24b2e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,29 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..5c79dbd 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,21 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/021_twophase_stream.pl b/src/test/subscription/t/021_twophase_stream.pl
index 9ec1e31..a2d4824 100644
--- a/src/test/subscription/t/021_twophase_stream.pl
+++ b/src/test/subscription/t/021_twophase_stream.pl
@@ -41,7 +41,7 @@ $node_subscriber->safe_psql('postgres', "
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
 	PUBLICATION tap_pub
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
index 3c6470d..ffba03f 100644
--- a/src/test/subscription/t/023_twophase_cascade_stream.pl
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -56,7 +56,7 @@ $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
 	PUBLICATION tap_pub_A
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -69,7 +69,7 @@ $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
 	PUBLICATION tap_pub_B
-	WITH (streaming = on)");
+	WITH (streaming = on, two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

Attachment: v48-0008-Fix-apply-worker-empty-prepare.patch (application/octet-stream)
From 56963d06eff84220a560b2bdb362e9d94a5e012a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 4 Mar 2021 20:01:59 +1100
Subject: [PATCH v48] Fix apply worker empty prepare.

With unlucky timing of the apply and tablesync workers, a "consistent snapshot" can span prepare/commit in such a way that the tablesync worker does not do the prepare (because its snapshot was not yet consistent), while the apply worker does the begin prepare ('b') but skips all the prepared operations (e.g. inserts) because the tablesync was still busy (see the condition in should_apply_changes_for_rel). Later, when the apply worker handles the commit prepared ('K'), nothing is committed, because the inserts were skipped earlier.

This patch implements a two-part fix as suggested [1] on hackers.

Part 1 - The begin_prepare handler of apply will always wait for any busy tablesync workers to achieve the SYNCDONE/READY state.

Part 2 - If (after Part 1) the apply-worker's prepare is found to be lagging behind any of the sync-workers then the subsequent prepared operations will be spooled to a file to be replayed at commit_prepared time.
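
The two parts can be sketched as a small standalone C fragment. This is only an illustration under stated assumptions, not the code in this patch: SyncState, TableSync, any_tablesync_in_progress, biggest_tablesync_lsn and must_spool are hypothetical stand-ins for the subscription relation states and the helpers added in tablesync.c, and LSNs are modelled as plain 64-bit integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the subscription relation sync states. */
typedef enum
{
	SYNC_INIT,
	SYNC_DATASYNC,
	SYNC_SYNCDONE,
	SYNC_READY
} SyncState;

typedef struct
{
	SyncState	state;
	uint64_t	lsn;			/* LSN recorded for this tablesync */
} TableSync;

/* Part 1: is any tablesync still short of SYNCDONE/READY? */
static bool
any_tablesync_in_progress(const TableSync *tables, size_t ntables)
{
	assert(tables != NULL || ntables == 0);

	for (size_t i = 0; i < ntables; i++)
	{
		if (tables[i].state != SYNC_SYNCDONE &&
			tables[i].state != SYNC_READY)
			return true;
	}
	return false;
}

/* Largest LSN over all known tablesyncs (0 if there are none). */
static uint64_t
biggest_tablesync_lsn(const TableSync *tables, size_t ntables)
{
	uint64_t	biggest = 0;

	for (size_t i = 0; i < ntables; i++)
	{
		if (tables[i].lsn > biggest)
			biggest = tables[i].lsn;
	}
	return biggest;
}

/*
 * Part 2: once nothing is busy, spool the prepared operations only if the
 * begin_prepare LSN lags behind some tablesync.
 */
static bool
must_spool(uint64_t begin_prepare_lsn,
		   const TableSync *tables, size_t ntables)
{
	return begin_prepare_lsn < biggest_tablesync_lsn(tables, ntables);
}
```

In this sketch a caller would loop while any_tablesync_in_progress() returns true (Part 1) and only then consult must_spool() (Part 2), which mirrors the order described above.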

Discussion:
[1] https://www.postgresql.org/message-id/CAA4eK1L%3DdhuCRvyDvrXX5wZgc7s1hLRD29CKCK6oaHtVCPgiFA%40mail.gmail.com
---
 src/backend/replication/logical/tablesync.c | 178 +++++--
 src/backend/replication/logical/worker.c    | 731 +++++++++++++++++++++++++++-
 src/include/replication/worker_internal.h   |   3 +
 src/tools/pgindent/typedefs.list            |   2 +
 4 files changed, 879 insertions(+), 35 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 50c3ea7..97fc399 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When process_syncing_tables_for_apply changes the state from
+		 * SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ad537f4..f753e33 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -209,6 +209,54 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. It is
+ * keyed by the name of the prepared spoolfile and records whether that file
+ * may be deleted.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	bool		allow_delete;	/* ok to delete? */
+} PsfHashEntry;
+
+/*
+ * Information about the "current" psf spoolfile.
+ */
+typedef struct PsfFile
+{
+	char	name[MAXPGPATH];/* psf name - same as the HTAB key. */
+	bool	is_spooling;	/* are we currently spooling to this file? */
+	File 	vfd;			/* -1 when the file is closed. */
+	off_t	cur_offset;		/* offset for the next write or read. Reset to 0
+							 * when file is opened. */
+} PsfFile;
+
+/*
+ * Hash table for storing the Prepared spoolfile info.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * Information about the 'current' open spoolfile is only valid when spooling.
+ * This is flagged as 'is_spooling' only between begin_prepare and prepare.
+ */
+static PsfFile psf_cur = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_delete(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+static void prepare_spoolfile_on_proc_exit(int status, Datum arg);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -730,6 +778,9 @@ apply_handle_begin_prepare(StringInfo s)
 {
 	LogicalRepBeginPrepareData begin_data;
 
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
 	logicalrep_read_begin_prepare(s, &begin_data);
 
 	/*
@@ -741,6 +792,81 @@ apply_handle_begin_prepare(StringInfo s)
 				errmsg("transaction identifier \"%s\" is already in use",
 					   begin_data.gid)));
 
+	/*
+	 * A Problem:
+	 *
+	 * By sad timing of apply/tablesync workers it is possible to have a
+	 * "consistent snapshot" that spans prepare/commit in such a way that
+	 * the tablesync did not do the prepare (because snapshot not consistent)
+	 * and the apply worker does the begin prepare ('b') but it skips all
+	 * the prepared operations [e.g. inserts] while the tablesync was still
+	 * busy (see the condition of should_apply_changes_for_rel). Later at the
+	 * commit prepared time when the apply worker does the commit prepare
+	 * ('K'), there is nothing in it (because the inserts were skipped
+	 * earlier).
+	 *
+	 * The following code has a 2-part workaround for that scenario.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Workaround Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Workaround Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN then all messages (until the
+		 * prepare) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+			StringInfoData sid;
+
+			/*
+			 * Create the spoolfile.
+			 */
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * From now, until the handle_prepare we are spooling to the
+			 * current psf.
+			 */
+			psf_cur.is_spooling = true;
+
+			/*
+			 * Write BEGIN_PREPARE as the first message of the psf file.
+			 */
+			initStringInfo(&sid);
+			appendBinaryStringInfo(&sid, (char *)&begin_data, sizeof(begin_data));
+			(void) prepare_spoolfile_handler(LOGICAL_REP_MSG_BEGIN_PREPARE, &sid);
+		}
+	}
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	in_remote_transaction = true;
@@ -756,6 +882,50 @@ apply_handle_prepare(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
 
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_cur.is_spooling)
+	{
+		PsfHashEntry *hentry;
+
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * Flush the spoolfile, so changes can survive a restart.
+		 */
+		FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);
+
+		/*
+		 * We are finished spooling to the current psf.
+		 */
+		psf_cur.is_spooling = false;
+
+		/*
+		 * The commit_prepare will need the spoolfile, so unregister it for
+		 * removal on proc-exit just in case there is an unexpected restart
+		 * between now and when commit_prepared happens.
+		 */
+		hentry = (PsfHashEntry *) hash_search(psf_hash,
+											  psf_cur.name,
+											  HASH_FIND,
+											  NULL);
+		Assert(hentry);
+		hentry->allow_delete = false;
+
+		/*
+		 * The psf_cur.vfd is meaningful only between begin_prepare and prepared.
+		 * So close it now. Any messages written to the psf will be applied
+		 * later during handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		in_remote_transaction = false;
+		return;
+	}
+
 	logicalrep_read_prepare(s, &prepare_data);
 
 	Assert(prepare_data.prepare_lsn == remote_final_lsn);
@@ -805,9 +975,63 @@ static void
 apply_handle_commit_prepared(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
 
 	logicalrep_read_commit_prepared(s, &prepare_data);
 
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+		LogicalRepPreparedTxnData pdata;
+
+		/*
+		 * 1. replay the spooled messages
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, &pdata);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		/*
+		 * 2. mark as PREPARED (use prepare_data info from the psf file)
+		 */
+
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = pdata.end_lsn;
+		replorigin_session_origin_timestamp = pdata.preparetime;
+
+		PrepareTransactionBlock(pdata.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(pdata.end_lsn);
+
+		/*
+		 * Now that we replayed the psf it is no longer needed. Just delete it.
+		 */
+		prepare_spoolfile_delete(psfpath);
+	}
+
 	/* there is no transaction when COMMIT PREPARED is called */
 	ensure_transaction();
 
@@ -838,15 +1062,49 @@ static void
 apply_handle_rollback_prepared(StringInfo s)
 {
 	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
 
 	logicalrep_read_rollback_prepared(s, &rollback_data);
 
 	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		if (psf_cur.is_spooling)
+		{
+			/*
+			 * XXX - For some reason (due to a bug?) it is currently possible
+			 * to get here, after a restart, when there was a begin_prepare
+			 * but there was NO prepare. Since there was no prepare, the
+			 * psf_cur and the transaction are still lingering so they need
+			 * to be cleaned up now.
+			 */
+			prepare_spoolfile_close();
+		}
+
+		/*
+		 * We are finished with this spoolfile. Delete it.
+		 */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/*
 	 * It is possible that we haven't received prepare because it occurred
 	 * before walsender reached a consistent point in which case we need to
 	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
 	 */
-	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
 	{
 		/*
@@ -889,10 +1147,33 @@ apply_handle_stream_prepare(StringInfo s)
 
 	Assert(!in_streamed_transaction);
 
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
 	xid = logicalrep_read_stream_prepare(s, &prepare_data);
 	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
 
 	/*
+	 * Wait for all the sync workers to reach the SYNCDONE/READY state.
+	 *
+	 * This is the same waiting logic as in the apply_handle_begin_prepare
+	 * function (see that function for more details about this).
+	 */
+	if (!am_tablesync_worker())
+	{
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(prepare_data.end_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+	}
+
+	/*
 	 *
 	 * --------------------------------------------------------------------------
 	 * 1. Replay all the spooled operations - Similar code as for
@@ -959,6 +1240,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		!psf_cur.is_spooling &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1393,6 +1675,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1514,6 +1799,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1672,6 +1960,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -2041,6 +2332,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -2334,6 +2628,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	}
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL		hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2453,7 +2764,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && !psf_cur.is_spooling)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -3201,6 +3512,9 @@ ApplyWorkerMain(Datum main_arg)
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
 
+	/* Arrange to delete any unwanted psf file(s) at proc-exit */
+	on_proc_exit(prepare_spoolfile_on_proc_exit, 0);
+
 	/* Setup signal handling */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
@@ -3378,3 +3692,416 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_cur.is_spooling ? "Do" : "Don't");
+
+	if (!psf_cur.is_spooling)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	bool		found;
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(!psf_cur.is_spooling);
+
+	/* create or find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  &found);
+
+	if (!found)
+	{
+		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m", path)));
+		}
+		memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+		psf_cur.cur_offset = 0;
+		hentry->allow_delete = true;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the beginning because we always want to
+		 * create/overwrite this file.
+		 */
+		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", path)));
+		}
+		memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+		psf_cur.cur_offset = 0;
+		hentry->allow_delete = true;
+	}
+
+	/* Sanity checks */
+	Assert(psf_cur.vfd >= 0);
+	Assert(psf_cur.cur_offset == 0);
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close(void)
+{
+	if (psf_cur.vfd >= 0)
+		FileClose(psf_cur.vfd);
+
+	/* Mark this fd as not valid to use anymore. */
+	psf_cur.is_spooling = false;
+	psf_cur.vfd = -1;
+	psf_cur.cur_offset = 0;
+}
+
+/*
+ * Delete the specified psf spoolfile, and any HTAB entry associated with it.
+ */
+static void
+prepare_spoolfile_delete(char *path)
+{
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Delete the file off the disk. */
+	unlink(path);
+
+	/* Remove any entry from the psf_hash, if present */
+	hash_search(psf_hash, path, HASH_REMOVE, NULL);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: the length (not including
+ * the length field itself), the action code (identifying the message type)
+ * and the message contents.
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+	int			bytes_written;
+
+	Assert(psf_cur.is_spooling);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(len));
+	psf_cur.cur_offset += bytes_written;
+
+	/* then the action */
+	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(action));
+	psf_cur.cur_offset += bytes_written;
+
+	/* and finally the remaining part of the buffer */
+	len = (s->len - s->cursor);
+
+	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == len);
+	psf_cur.cur_offset += bytes_written;
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool		found;
+	PsfHashEntry *hentry;
+
+	/* Find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_FIND,
+										  &found);
+
+	if (!found)
+	{
+		/*
+		 * Hash doesn't know about it, but perhaps the Hash was destroyed by a
+		 * restart, so let's check the file existence on disk.
+		 */
+		File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+		found = fd >= 0;
+		if (fd >= 0)
+			FileClose(fd);
+
+		/*
+		 * And if it was found on disk then create the HTAB entry for it.
+		 */
+		if (found)
+		{
+			hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  NULL);
+			hentry->allow_delete = false;
+		}
+	}
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ *
+ * [Note: this is similar to apply_spooled_messages function]
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	psf.vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (psf.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open prepared spoolfile \"%s\": %m",
+						path)));
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		nbytes = FileRead(psf.vfd, buffer, len,
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+		if (nbytes != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/*
+		 * The psf spoolfile contents will have first and last messages as
+		 * BEGIN_PREPARE and PREPARE message respectively. These two are
+		 * processed specially within this function.
+		 *
+		 * BEGIN_PREPARE msg: This will be the first message of the psf file.
+		 * Use this begin_data information to set the remote_final_lsn.
+		 *
+		 * PREPARE msg: The prepare_data information is returned so that the
+		 * prepare lsn values are available to the caller (commit_prepared).
+		 * Unfortunately, just dispatching the PREPARE message is problematic
+		 * because its transaction commits have side effects on this replay
+		 * loop which is still running.
+		 *
+		 * All other message content (between the BEGIN_PREPARE and the PREPARE)
+		 * will be delivered to apply_dispatch as they normally would be.
+		 */
+		if (s2.data[0] == LOGICAL_REP_MSG_BEGIN_PREPARE)
+		{
+			LogicalRepBeginPrepareData bdata;
+
+			/* read/skip the action byte. */
+			(void) pq_getmsgbyte(&s2);	/* must run in non-assert builds too */
+
+			/* read the begin_data. */
+			logicalrep_read_begin_prepare(&s2, &bdata);
+
+			elog(DEBUG1, "BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
+				 bdata.gid,
+				 LSN_FORMAT_ARGS(bdata.final_lsn),
+				 LSN_FORMAT_ARGS(bdata.end_lsn));
+
+			/*
+			 * Make sure the handle apply_dispatch methods are aware we're in a remote
+			 * transaction.
+			 */
+			remote_final_lsn = bdata.final_lsn;
+			in_remote_transaction = true;
+			pgstat_report_activity(STATE_RUNNING, NULL);
+		}
+		else if (s2.data[0] == LOGICAL_REP_MSG_PREPARE)
+		{
+			/* read/skip the action byte. */
+			(void) pq_getmsgbyte(&s2);	/* must run in non-assert builds too */
+
+			/* read and return the prepare_data info to the caller */
+			logicalrep_read_prepare(&s2, pdata);
+
+			elog(DEBUG1, "PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
+				 pdata->gid,
+				 LSN_FORMAT_ARGS(pdata->prepare_lsn),
+				 LSN_FORMAT_ARGS(pdata->end_lsn));
+		}
+		else
+		{
+			/* Ensure we are reading the data into our memory context. */
+			oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+			apply_dispatch(&s2);
+
+			MemoryContextReset(ApplyMessageContext);
+
+			MemoryContextSwitchTo(oldctx2);
+
+			nchanges++;
+
+			if (nchanges % 1000 == 0)
+				elog(DEBUG1, "replayed %d changes from file '%s'",
+					 nchanges, path);
+		}
+	}
+
+	FileClose(psf.vfd);
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly the same length, padded
+	 * with '\0', so that garbage does not affect the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "pg_twophase/%u-%s.prep_changes", subid, gid);
+}
+
+/*
+ * proc_exit callback to remove unwanted psf files.
+ */
+static void
+prepare_spoolfile_on_proc_exit(int status, Datum arg)
+{
+	HASH_SEQ_STATUS seq_status;
+	PsfHashEntry *hentry;
+
+	/* Iterate the HTAB looking for what file can be deleted. */
+	if (psf_hash)
+	{
+		hash_seq_init(&seq_status, psf_hash);
+		while ((hentry = (PsfHashEntry *) hash_seq_search(&seq_status)) != NULL)
+		{
+			char *path = hentry->name;
+
+			if (hentry->allow_delete)
+				prepare_spoolfile_delete(path);
+		}
+	}
+}
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 745b51d..4ffcef5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1958,6 +1958,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
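
The spoolfile written by prepare_spoolfile_write and read back by
prepare_spoolfile_replay_messages above is a sequence of length-prefixed
records: an int length (counting the action byte plus the payload), the
action byte, then the payload. The following is only a minimal standalone
sketch of that layout, assuming plain stdio in place of PostgreSQL's
FileWrite/FileRead. The names spool_write and spool_replay are
hypothetical, and the action characters a caller passes are arbitrary
placeholders, not the real protocol message types.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for prepare_spoolfile_write(): append one record as
 * [int len][char action][payload], where len counts the action byte plus
 * the payload. Returns 0 on success, -1 on a short write.
 */
static int
spool_write(FILE *fp, char action, const char *payload, int payload_len)
{
	int			len;

	assert(payload_len >= 0);
	len = payload_len + (int) sizeof(char);

	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Hypothetical stand-in for the replay loop: rewind, read records until a
 * clean end of file, and copy each record's action byte into actions[].
 * Returns the number of records read, or -1 if a record is truncated or
 * has a non-positive length.
 */
static int
spool_replay(FILE *fp, char *actions, int max_actions)
{
	int			nrecords = 0;
	char	   *buffer = NULL;

	rewind(fp);
	for (;;)
	{
		int			len;
		size_t		nread = fread(&len, 1, sizeof(len), fp);
		char	   *tmp;

		if (nread == 0)
			break;				/* clean end of file */
		if (nread != sizeof(len) || len <= 0)
			goto fail;			/* truncated or corrupt length */

		/* make sure the buffer is large enough for this record */
		tmp = realloc(buffer, (size_t) len);
		if (tmp == NULL)
			goto fail;
		buffer = tmp;

		if (fread(buffer, 1, (size_t) len, fp) != (size_t) len)
			goto fail;			/* truncated record body */

		if (nrecords < max_actions)
			actions[nrecords] = buffer[0];
		nrecords++;
	}
	free(buffer);
	return nrecords;

fail:
	free(buffer);
	return -1;
}
```

Writing a few records to a tmpfile() stream and then calling spool_replay
on it should return the number of records written, with the action bytes
in write order. As in the patch, the length is written in native byte
order, so such a file is only meant to be read back on the machine that
wrote it.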

v48-0009-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From 5d0c4bbe6459e13b6336d6934ba231b6e1bba928 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 4 Mar 2021 20:12:24 +1100
Subject: [PATCH v48] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 ++++++++---
 src/backend/replication/logical/worker.c    | 77 ++++++++++++++++++++++-------
 2 files changed, 83 insertions(+), 23 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 97fc399..f3984d4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f753e33..3c71f54 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -815,14 +815,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			process_syncing_tables(begin_data.final_lsn);
 
 			/* This latch is to prevent 100% CPU looping. */
@@ -840,7 +842,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 			StringInfoData sid;
@@ -908,6 +915,8 @@ apply_handle_prepare(StringInfo s)
 		 * removal on proc-exit just in case there is an unexpected restart
 		 * between now and when commit_prepared happens.
 		 */
+		elog(LOG,
+			"!!> apply_handle_prepare: Make sure the spoolfile is not removed on proc-exit");
 		hentry = (PsfHashEntry *) hash_search(psf_hash,
 											  psf_cur.name,
 											  HASH_FIND,
@@ -990,6 +999,8 @@ apply_handle_commit_prepared(StringInfo s)
 		int			nchanges;
 		LogicalRepPreparedTxnData pdata;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * 1. replay the spooled messages
 		 */
@@ -997,8 +1008,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, &pdata);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		/*
@@ -1103,6 +1114,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2493,18 +2505,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3706,8 +3722,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_cur.is_spooling ? "Do" : "Don't");
 
@@ -3731,7 +3747,7 @@ prepare_spoolfile_create(char *path)
 	bool		found;
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(!psf_cur.is_spooling);
 
@@ -3743,7 +3759,7 @@ prepare_spoolfile_create(char *path)
 
 	if (!found)
 	{
-		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 		{
@@ -3761,7 +3777,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the beginning because we always want to
 		 * create/overwrite this file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		elog(LOG, "!!> Found file \"%s\". Overwrite it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 		{
@@ -3786,6 +3802,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_cur.vfd >= 0)
 		FileClose(psf_cur.vfd);
 
@@ -3801,6 +3818,8 @@ prepare_spoolfile_close()
 static void
 prepare_spoolfile_delete(char *path)
 {
+	elog(LOG, "!!> prepare_spoolfile_delete: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3826,18 +3845,20 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_cur.is_spooling);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(len));
 	psf_cur.cur_offset += bytes_written;
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(action));
@@ -3846,6 +3867,7 @@ prepare_spoolfile_write(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == len);
@@ -3879,6 +3901,12 @@ prepare_spoolfile_exists(char *path)
 		if (fd >= 0)
 			FileClose(fd);
 
+		elog(LOG,
+			 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was "
+			 "not found in the HTAB, but was %s on the disk.",
+			 path,
+			 found ? "found" : "not found");
+
 		/*
 		 * And if it was found on disk then create the HTAB entry for it.
 		 */
@@ -3888,10 +3916,16 @@ prepare_spoolfile_exists(char *path)
 										  path,
 										  HASH_ENTER,
 										  NULL);
+			elog(LOG, "!!> prepare_spoolfile_exists: Created new HTAB entry '%s'", hentry->name);
 			hentry->allow_delete = false;
 		}
 	}
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
 	return found;
 }
 
@@ -3910,8 +3944,8 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 				oldctx2;
 	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3948,6 +3982,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3966,6 +4001,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		nbytes = FileRead(psf.vfd, buffer, len,
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
@@ -4000,13 +4036,15 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		{
 			LogicalRepBeginPrepareData bdata;
 
+			elog(LOG, "!!> prepare_spoolfile_replay_messages: Found the BEGIN_PREPARE info");
+
 			/* read/skip the action byte. */
 			(void) pq_getmsgbyte(&s2);	/* must run in non-assert builds too */
 
 			/* read the begin_data. */
 			logicalrep_read_begin_prepare(&s2, &bdata);
 
-			elog(DEBUG1, "BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
+			elog(LOG, "!!> BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
 				 bdata.gid,
 				 LSN_FORMAT_ARGS(bdata.final_lsn),
 				 LSN_FORMAT_ARGS(bdata.end_lsn));
@@ -4021,13 +4059,15 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		}
 		else if (s2.data[0] == LOGICAL_REP_MSG_PREPARE)
 		{
+			elog(LOG, "!!> prepare_spoolfile_replay_messages: Found the PREPARE info");
+
 			/* read/skip the action byte. */
 			(void) pq_getmsgbyte(&s2);	/* must run in non-assert builds too */
 
 			/* read and return the prepare_data info to the caller */
 			logicalrep_read_prepare(&s2, pdata);
 
-			elog(DEBUG1, "PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
+			elog(LOG, "!!> PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
 				 pdata->gid,
 				 LSN_FORMAT_ARGS(pdata->prepare_lsn),
 				 LSN_FORMAT_ARGS(pdata->end_lsn));
@@ -4046,7 +4086,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 			nchanges++;
 
 			if (nchanges % 1000 == 0)
-				elog(DEBUG1, "replayed %d changes from file '%s'",
+				elog(LOG, "!!> replayed %d changes from file '%s'",
 					 nchanges, path);
 		}
 	}
@@ -4056,7 +4096,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 	pfree(buffer);
 	pfree(s2.data);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
@@ -4092,6 +4132,8 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 	HASH_SEQ_STATUS seq_status;
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_on_proc_exit");
+
 	/* Iterate the HTAB looking for what file can be deleted. */
 	if (psf_hash)
 	{
@@ -4100,6 +4142,7 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 		{
 			char *path = hentry->name;
 
+			elog(LOG, "!!> prepare_spoolfile_on_proc_exit: found '%s'", path);
 			if (hentry->allow_delete)
 				prepare_spoolfile_delete(path);
 		}
-- 
1.8.3.1

#213Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#212)
9 attachment(s)

On Thu, Mar 4, 2021 at 9:53 PM Peter Smith <smithpb2250@gmail.com> wrote:

[05a] Now syncing the psf file at prepare time

The patch v46-0008 does not handle spooling of streaming prepare if
the subscription is configured for both two-phase and streaming.
I feel it would be best not to support two-phase and streaming
together in a subscription in this release; a future release could
probably handle this. So I am changing the patch to not allow
streaming and two-phase together.

This new patch v49 has the following changes:

* Don't support creating a subscription with both streaming and
two-phase enabled.
* Don't support altering a subscription to enable streaming if it was
created with two-phase enabled.
* Remove stream_prepare as a "required" callback, make it an optional
callback, and remove all code related to stream_prepare in the
pgoutput plugin as well as in worker.c.

Also fixed:
* Don't support altering the two-phase setting of a subscription.
Toggling two-phase mode using the alter command on the subscription
can cause transactions to be missed and result in an inconsistent
replica.
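
The restrictions listed above can be sketched as a small standalone
check. This is only an illustration of the rules as described in this
mail, not the code in the patch: the SubOptions struct, the
check_sub_options function and the message strings are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the two subscription options. */
typedef struct SubOptions
{
	bool		streaming;
	bool		two_phase;
} SubOptions;

/*
 * Return NULL if the requested settings are allowed, else a message saying
 * why not. "cur" is NULL when creating a subscription, and points to the
 * existing settings when altering one.
 */
static const char *
check_sub_options(const SubOptions *cur, const SubOptions *req)
{
	assert(req != NULL);

	/* streaming and two-phase may never be enabled together */
	if (req->streaming && req->two_phase)
		return "two_phase and streaming cannot both be enabled";

	/* the two-phase setting is fixed once the subscription exists */
	if (cur != NULL && cur->two_phase != req->two_phase)
		return "two_phase cannot be changed on an existing subscription";

	return NULL;
}
```

With these two rules, enabling streaming on a subscription that was
created with two-phase enabled is rejected by the first check, because
the two-phase setting has to stay as it was.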

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v49-0001-Refactor-spool-file-logic-in-worker.c.patch (application/octet-stream)
From 9201b2759e011254288f3c1413b25fbe36b36c42 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 4 Mar 2021 08:55:27 -0500
Subject: [PATCH v49] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..45ac498 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v49-0005-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From cb411aefbc7a17e0ec8f4932a836552910d06fd6 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 4 Mar 2021 09:29:48 -0500
Subject: [PATCH v49] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 338 ++++++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl | 282 ++++++++++++++++++++
 2 files changed, 620 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back via 2PC are not visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

Attachment: v49-0004-Add-two_phase-option-to-CREATE-REPLICATION-SLOT.patch (application/octet-stream)
From c30070909f39a8560fa94f819a36c8f78e1a9a29 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 4 Mar 2021 09:18:52 -0500
Subject: [PATCH v49] Add two_phase option to CREATE REPLICATION SLOT.

This patch adds a new TWO_PHASE option to the CREATE_REPLICATION_SLOT
replication command, so that two_phase can be enabled while creating a slot.
---
 src/backend/commands/subscriptioncmds.c                    |  2 +-
 .../replication/libpqwalreceiver/libpqwalreceiver.c        |  6 +++++-
 src/backend/replication/logical/tablesync.c                |  2 +-
 src/backend/replication/repl_gram.y                        | 14 +++++++++++---
 src/backend/replication/repl_scanner.l                     |  1 +
 src/backend/replication/walreceiver.c                      |  2 +-
 src/include/replication/walreceiver.h                      |  5 +++--
 7 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..f6793f0 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..50c3ea7 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1052,7 +1052,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..c5154ae 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -242,15 +244,16 @@ create_replication_slot:
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
-- 
1.8.3.1
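
For readers skimming the v49-0004 patch above, here is a rough standalone sketch of how the replication command text is assembled once the two_phase flag is threaded through. It is not the libpqwalreceiver code: it uses snprintf instead of StringInfo, every sketch_/Sketch name is made up for illustration, and the PHYSICAL RESERVE_WAL branch is an assumption about the surrounding code that the hunk does not show. The keyword order (TEMPORARY, then TWO_PHASE, then LOGICAL) follows the grammar rule in repl_gram.y.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's CRSSnapshotAction enum. */
typedef enum
{
	SKETCH_EXPORT_SNAPSHOT,
	SKETCH_NOEXPORT_SNAPSHOT,
	SKETCH_USE_SNAPSHOT
} SketchSnapshotAction;

/*
 * Build the CREATE_REPLICATION_SLOT command into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 */
int
sketch_build_create_slot_cmd(char *buf, size_t buflen, const char *slotname,
							 bool temporary, bool two_phase, bool logical,
							 SketchSnapshotAction action)
{
	const char *snap = "";
	int			n;

	assert(buf != NULL && slotname != NULL);

	/* Snapshot keywords only apply to logical slots. */
	if (logical)
	{
		switch (action)
		{
			case SKETCH_EXPORT_SNAPSHOT:
				snap = " EXPORT_SNAPSHOT";
				break;
			case SKETCH_NOEXPORT_SNAPSHOT:
				snap = " NOEXPORT_SNAPSHOT";
				break;
			case SKETCH_USE_SNAPSHOT:
				snap = " USE_SNAPSHOT";
				break;
		}
	}

	/* TWO_PHASE goes after TEMPORARY and before LOGICAL, as in the grammar. */
	n = snprintf(buf, buflen, "CREATE_REPLICATION_SLOT \"%s\"%s%s%s%s",
				 slotname,
				 temporary ? " TEMPORARY" : "",
				 two_phase ? " TWO_PHASE" : "",
				 logical ? " LOGICAL pgoutput" : " PHYSICAL RESERVE_WAL",
				 snap);

	if (n < 0 || (size_t) n >= buflen)
		return -1;
	return n;
}
```

Note that the grammar change only accepts TWO_PHASE in the logical rule, so a caller of such a helper would want to pass two_phase = false for physical slots, which is what the WalReceiverMain() call site in the patch does.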

Attachment: v49-0002-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From 96de2c57056bc3d68ac9d3404da67584b574a6bd Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 4 Mar 2021 08:57:26 -0500
Subject: [PATCH v49] Track replication origin progress for rollbacks.

Commit 1eb6d6527a added tracking of replication origin replay progress for
2PC, but the work was incomplete. It did not properly track the progress for
ROLLBACK PREPARED and, in particular, did not update the recovery code.
Additionally, we need to allow tracking it on subscriber nodes, where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20..6023e7c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2276,6 +2276,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2298,6 +2306,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df..acdb28d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5716,8 +5716,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5923,7 +5922,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5972,6 +5972,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6013,7 +6020,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6021,7 +6029,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1
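
To make the two conditions in the v49-0002 patch above easier to compare, here is a small standalone sketch of the predicates involved. It is only a model, not backend code: the Sketch/sketch_ names are invented, and the sentinel values (0 for the invalid origin, UINT16_MAX for the "do not replicate" origin) are what I believe origin.h uses, so treat them as assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for RepOriginId and its two sentinel values. */
typedef uint16_t SketchRepOriginId;
#define SKETCH_INVALID_ORIGIN	((SketchRepOriginId) 0)
#define SKETCH_DO_NOT_REPLICATE	((SketchRepOriginId) UINT16_MAX)

/*
 * Mirrors the "replorigin" flag computed in RecordTransactionAbortPrepared():
 * true only when the session is replaying remote actions under a real origin.
 */
bool
sketch_is_replaying_remote(SketchRepOriginId session_origin)
{
	return session_origin != SKETCH_INVALID_ORIGIN &&
		session_origin != SKETCH_DO_NOT_REPLICATE;
}

/*
 * Condition for writing origin info into the abort record before the patch:
 * it also required wal_level = logical (XLogLogicalInfoActive()).
 */
bool
sketch_abort_has_origin_before(SketchRepOriginId session_origin,
							   bool twophase_xid_valid,
							   bool logical_info_active)
{
	return session_origin != SKETCH_INVALID_ORIGIN &&
		twophase_xid_valid && logical_info_active;
}

/* Same condition after the patch: the wal_level requirement is dropped. */
bool
sketch_abort_has_origin_after(SketchRepOriginId session_origin,
							  bool twophase_xid_valid)
{
	return session_origin != SKETCH_INVALID_ORIGIN && twophase_xid_valid;
}
```

The case the commit message is concerned with is a subscriber running with a wal_level below logical: the old condition would skip the origin information for ROLLBACK PREPARED there, while the new one records it, which is what lets xact_redo_abort() recover the apply progress.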

Attachment: v49-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 9556b8e56563462690bfcdc26fc84128ee9970ab Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 4 Mar 2021 09:03:33 -0500
Subject: [PATCH v49] Add support for apply at prepare time to built-in logical
 replication.

To add support for applying transactions at prepare time in built-in
logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle two-phase
transactions by replaying them on prepare.

* Change stream_prepare_cb from a required callback to an optional callback.

The prepare API for streaming transactions is not supported.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/logicaldecoding.sgml           |  16 +--
 src/backend/access/transam/twophase.c       |  68 ++++++++++
 src/backend/replication/logical/logical.c   |   9 +-
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 194 ++++++++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 174 +++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 157 ++++++++++++++++------
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  70 +++++++++-
 src/include/replication/reorderbuffer.h     |  12 ++
 src/tools/pgindent/typedefs.list            |   3 +
 11 files changed, 658 insertions(+), 54 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 80eb96d..702e42d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -468,9 +468,9 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
-     and <function>stream_prepare_cb</function>
-     are required, while <function>stream_message_cb</function> and
+     <function>stream_commit_cb</function>, and <function>stream_change_cb</function>
+     are required, while <function>stream_prepare_cb</function>,
+     <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
 
@@ -478,9 +478,9 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
     An output plugin may also define functions to support two-phase commits,
     which allows actions to be decoded on the <command>PREPARE TRANSACTION</command>.
     The <function>begin_prepare_cb</function>, <function>prepare_cb</function>, 
-    <function>stream_prepare_cb</function>,
     <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
-    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    callbacks are required, while <function>filter_prepare_cb</function> and
+    <function>stream_prepare_cb</function> are optional.
     </para>
    </sect2>
 
@@ -1195,9 +1195,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
     provide additional callbacks. There are multiple two-phase commit callbacks
     that are required, (<function>begin_prepare_cb</function>,
     <function>prepare_cb</function>, <function>commit_prepared_cb</function>, 
-    <function>rollback_prepared_cb</function> and
-    <function>stream_prepare_cb</function>) and an optional callback
-    (<function>filter_prepare_cb</function>).
+    <function>rollback_prepared_cb</function>) and
+    optional callbacks (<function>stream_prepare_cb</function> and
+    <function>filter_prepare_cb</function>).
    </para>
 
    <para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if a prepared transaction with the given GID, lsn and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match origin_lsn and origin_timestamp
+ * of the prepared xact to avoid matching a prepared xact that came from a
+ * different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..5324851 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1326,6 +1326,9 @@ stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	Assert(ctx->streaming);
 	Assert(ctx->twophase);
 
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		return;
+
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_prepare";
@@ -1340,12 +1343,6 @@ stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	ctx->write_xid = txn->xid;
 	ctx->write_location = txn->end_lsn;
 
-	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
-	if (ctx->callbacks.stream_prepare_cb == NULL)
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
-
 	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
 
 	/* Pop the error context stack */
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..e958d28 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,200 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 45ac498..92ac4cb 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -722,6 +723,157 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/*
+	 * The gid must not already be prepared.
+	 */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("transaction identifier \"%s\" is already in use",
+						begin_data.gid)));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1972,6 +2124,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2e4b39f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -322,8 +342,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +362,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +383,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +843,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1250,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..b797e3b 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure is used to
+ * hold the information for both prepare and commit prepared. For commit
+ * prepared, prepare_lsn and preparetime store the commit lsn and the
+ * commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -170,5 +237,4 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 										  TransactionId subxid);
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
-
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95ae..745b51d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
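
The rbtxn_commit_prepared() and rbtxn_rollback_prepared() macros added to
reorderbuffer.h in the patch above are plain bit tests on txn_flags. A minimal
standalone sketch of that pattern follows. The DemoTXN struct, the demo_mark_*
helpers, and the numeric flag values are assumptions made for illustration
only; the real ReorderBufferTXN has many more fields and the patch series
defines its own flag values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define RBTXN_COMMIT_PREPARED	0x0001
#define RBTXN_ROLLBACK_PREPARED	0x0002

/* Stand-in for ReorderBufferTXN; only the flags word is modeled. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
} DemoTXN;

/* Has this prepared transaction been committed? */
#define rbtxn_commit_prepared(txn) \
( \
	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
)

/* Has this prepared transaction been rolled back? */
#define rbtxn_rollback_prepared(txn) \
( \
	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
)

/* Record that COMMIT PREPARED was seen for this transaction. */
void
demo_mark_commit_prepared(DemoTXN *txn)
{
	assert(txn != NULL);
	txn->txn_flags |= RBTXN_COMMIT_PREPARED;
}

/* Record that ROLLBACK PREPARED was seen for this transaction. */
void
demo_mark_rollback_prepared(DemoTXN *txn)
{
	assert(txn != NULL);
	txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
}
```

Because each outcome has its own bit, the two tests are independent: marking a
transaction as committed leaves rbtxn_rollback_prepared() false.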

Attachment: v49-0006-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 4b80b0bb329b96a7c8f884f821d4e0d2175f1442 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 4 Mar 2021 09:50:42 -0500
Subject: [PATCH v49] Support 2PC txn - Subscription option.

This patch implements new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
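
The option rules this patch enforces can be summarized in a small standalone
sketch: two_phase and streaming cannot both be true at CREATE SUBSCRIPTION,
two_phase cannot be changed by ALTER SUBSCRIPTION, and streaming cannot be
turned on for a subscription that already has two_phase enabled. The
DemoOptResult enum and the demo_check_* functions are hypothetical names used
only here; the patch itself reports these cases through ereport() in
parse_subscription_options() and AlterSubscription().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes for this sketch. */
typedef enum DemoOptResult
{
	DEMO_OK,
	DEMO_ERR_MUTUALLY_EXCLUSIVE,	/* two_phase = true with streaming = true */
	DEMO_ERR_CANNOT_ALTER_TWOPHASE, /* ALTER tried to set two_phase */
	DEMO_ERR_STREAMING_WITH_TWOPHASE	/* ALTER streaming = true on 2PC sub */
} DemoOptResult;

/* Checks applied to the options given at CREATE SUBSCRIPTION time. */
DemoOptResult
demo_check_create(bool streaming, bool two_phase)
{
	if (two_phase && streaming)
		return DEMO_ERR_MUTUALLY_EXCLUSIVE;
	return DEMO_OK;
}

/*
 * Checks applied to ALTER SUBSCRIPTION ... SET (...).
 * sub_two_phase is the value already stored for the subscription.
 */
DemoOptResult
demo_check_alter(bool sub_two_phase,
				 bool two_phase_given,
				 bool streaming_given, bool streaming)
{
	/* Toggling two_phase is rejected regardless of the requested value. */
	if (two_phase_given)
		return DEMO_ERR_CANNOT_ALTER_TWOPHASE;

	if (streaming_given && streaming && sub_two_phase)
		return DEMO_ERR_STREAMING_WITH_TWOPHASE;

	return DEMO_OK;
}
```

For example, demo_check_alter(true, false, true, false) returns DEMO_OK,
since setting streaming = false on a two-phase subscription is still allowed.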
---
 doc/src/sgml/ref/create_subscription.sgml          | 29 +++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 72 +++++++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 +
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 ++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 +++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 ++-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 +
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 93 +++++++++++++++-------
 src/test/regress/sql/subscription.sql              | 25 ++++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 17 files changed, 261 insertions(+), 47 deletions(-)

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..e04b8d2 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,35 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          It is not allowed to combine <literal>streaming</literal> set to
+          <literal>true</literal> and <literal>two_phase</literal> set to
+          <literal>true</literal>.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          It is not allowed to combine <literal>two_phase</literal> set to
+          <literal>true</literal> and <literal>streaming</literal> set to
+          <literal>true</literal>.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73..060fab4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1168,7 +1168,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f6793f0..96fcf49 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option; doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * the current implementation has some issues that could cause a streamed
+	 * prepared transaction to be incorrectly missed in the initial syncing
+	 * phase. Hence, this combination is disabled until that issue can be
+	 * addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +576,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +883,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +918,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if (sub->twophase && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +947,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +993,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1039,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 92ac4cb..9cccdef 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2631,6 +2631,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3277,6 +3278,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2e4b39f..91ecc55 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index b797e3b..6c848c2 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..67b3358 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

v49-0007-Tablesync-early-exit.patch (application/octet-stream)
From d468a5a6d3986fb8091b36b68f3411b1fc75f885 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 4 Mar 2021 09:52:49 -0500
Subject: [PATCH v49] Tablesync early exit.

Give the tablesync worker an opportunity to see if it can exit immediately
(because it has already caught-up) without it needing to process a message
first before discovering that.
---
 src/backend/replication/logical/worker.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9cccdef..1f3aa01 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2252,6 +2252,16 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	bool		ping_sent = false;
 	TimeLineID	tli;
 
+	if (am_tablesync_worker())
+	{
+		/*
+		 * Give the tablesync worker an opportunity to see if it can exit
+		 * immediately, instead of always handling a message first (which the
+		 * apply worker could perhaps have handled).
+		 */
+		process_syncing_tables(last_received);
+	}
+
 	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
-- 
1.8.3.1

v49-0008-Fix-apply-worker-empty-prepare.patch (application/octet-stream)
From 0fc93ad78099765e0aa81be48f9a444934626bd0 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 4 Mar 2021 18:40:52 -0500
Subject: [PATCH v49] Fix apply worker empty prepare.

Due to unlucky timing of the apply and tablesync workers, a "consistent
snapshot" can span prepare/commit in such a way that the tablesync worker
did not do the prepare (because the snapshot was not yet consistent), while
the apply worker does the begin prepare ('b') but skips all the prepared
operations [e.g. inserts] because the tablesync was still busy (see the
condition in should_apply_changes_for_rel). Later, when the apply worker
handles the commit prepared ('K'), there is nothing to commit (because the
inserts were skipped earlier).

This patch implements a two-part fix as suggested [1] on hackers.

Part 1 - The begin_prepare handler of apply will always wait for any busy tablesync workers to achieve SYNCDONE/READY state.

Part 2 - If (after Part 1) the apply-worker's prepare is found to be lagging behind any of the sync-workers then the subsequent prepared operations will be spooled to a file to be replayed at commit_prepared time.

Discussion:
[1] https://www.postgresql.org/message-id/CAA4eK1L%3DdhuCRvyDvrXX5wZgc7s1hLRD29CKCK6oaHtVCPgiFA%40mail.gmail.com
---
 src/backend/replication/logical/tablesync.c | 178 +++++--
 src/backend/replication/logical/worker.c    | 708 +++++++++++++++++++++++++++-
 src/include/replication/worker_internal.h   |   3 +
 src/tools/pgindent/typedefs.list            |   2 +
 4 files changed, 856 insertions(+), 35 deletions(-)
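
The two-part fix described in the commit message above, together with the
record layout that the spoolfile writer in this patch uses, can be sketched
in standalone C. This is only an illustration under stated assumptions, not
the patch's code: DemoLsn, DemoRelState and every demo_* function are
hypothetical stand-ins for XLogRecPtr, SubscriptionRelState and the real
worker functions, and the single-character state codes are placeholders
for the SUBREL_STATE_* values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for XLogRecPtr. */
typedef uint64_t DemoLsn;

/* Hypothetical stand-in for SubscriptionRelState. */
typedef struct DemoRelState
{
	char		state;			/* placeholder codes: 's' syncdone, 'r' ready */
	DemoLsn		lsn;			/* LSN reached by this table's sync */
} DemoRelState;

/* Part 1: is any table still short of the SYNCDONE/READY states? */
bool
demo_any_tablesync_in_progress(const DemoRelState *rels, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (rels[i].state != 's' && rels[i].state != 'r')
			return true;
	return false;
}

/* Part 2: spool if some tablesync already went past this prepare's LSN. */
bool
demo_should_spool(DemoLsn final_lsn, const DemoRelState *rels, size_t n)
{
	DemoLsn		biggest = 0;

	for (size_t i = 0; i < n; i++)
		if (rels[i].lsn > biggest)
			biggest = rels[i].lsn;
	return final_lsn < biggest;
}

/*
 * Frame one spooled record into 'out': an int length (action byte plus
 * payload, not counting the length field itself), then the action byte,
 * then the payload. Returns the number of bytes written, or 0 if the
 * record does not fit.
 */
size_t
demo_frame_record(char *out, size_t outsz, char action,
				  const char *payload, int payload_len)
{
	int			len;
	size_t		total;

	assert(out != NULL && payload != NULL);
	if (payload_len < 0)
		return 0;

	len = payload_len + (int) sizeof(char);
	total = sizeof(len) + (size_t) len;
	if (total > outsz)
		return 0;

	memcpy(out, &len, sizeof(len));
	out[sizeof(len)] = action;
	memcpy(out + sizeof(len) + 1, payload, (size_t) payload_len);
	return total;
}
```

In this sketch the apply worker would keep waiting while
demo_any_tablesync_in_progress() returns true, and would then redirect the
prepared operations to a spoolfile only if demo_should_spool() returns true.
A reader of the file can recover each record by reading the int length
first and then that many bytes.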

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 50c3ea7..97fc399 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have still not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When the process_syncing_tables_for_apply changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1f3aa01..903c287 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -209,6 +209,54 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. It
+ * maps the name of a prepare spoolfile to its bookkeeping state (currently
+ * only whether the file may be deleted at proc-exit).
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	bool		allow_delete;	/* ok to delete? */
+} PsfHashEntry;
+
+/*
+ * Information about the "current" psf spoolfile.
+ */
+typedef struct PsfFile
+{
+	char	name[MAXPGPATH];/* psf name - same as the HTAB key. */
+	bool	is_spooling;	/* are we currently spooling to this file? */
+	File 	vfd;			/* -1 when the file is closed. */
+	off_t	cur_offset;		/* offset for the next write or read. Reset to 0
+							 * when file is opened. */
+} PsfFile;
+
+/*
+ * Hash table for storing the prepare spoolfile info.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * Information about the 'current' open spoolfile is only valid when spooling.
+ * This is flagged as 'is_spooling' only between begin_prepare and prepare.
+ */
+static PsfFile psf_cur = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_delete(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+static void prepare_spoolfile_on_proc_exit(int status, Datum arg);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -730,6 +778,9 @@ apply_handle_begin_prepare(StringInfo s)
 {
 	LogicalRepBeginPrepareData begin_data;
 
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
 	logicalrep_read_begin_prepare(s, &begin_data);
 
 	/*
@@ -741,6 +792,81 @@ apply_handle_begin_prepare(StringInfo s)
 				errmsg("transaction identifier \"%s\" is already in use",
 					   begin_data.gid)));
 
+	/*
+	 * A Problem:
+	 *
+	 * By sad timing of apply/tablesync workers it is possible to have a
+	 * "consistent snapshot" that spans prepare/commit in such a way that
+	 * the tablesync did not do the prepare (because snapshot not consistent)
+	 * and the apply worker does the begin prepare ('b') but it skips all
+	 * the prepared operations [e.g. inserts] while the tablesync was still
+	 * busy (see the condition of should_apply_changes_for_rel). Later at the
+	 * commit prepared time when the apply worker does the commit prepare
+	 * ('K'), there is nothing in it (because the inserts were skipped
+	 * earlier).
+	 *
+	 * The following code has a 2-part workaround for that scenario.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Workaround Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Workaround Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN, then all messages (until the
+		 * prepare) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+			StringInfoData sid;
+
+			/*
+			 * Create the spoolfile.
+			 */
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * From now, until the handle_prepare we are spooling to the
+			 * current psf.
+			 */
+			psf_cur.is_spooling = true;
+
+			/*
+			 * Write BEGIN_PREPARE as the first message of the psf file.
+			 */
+			initStringInfo(&sid);
+			appendBinaryStringInfo(&sid, (char *)&begin_data, sizeof(begin_data));
+			(void) prepare_spoolfile_handler(LOGICAL_REP_MSG_BEGIN_PREPARE, &sid);
+		}
+	}
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	in_remote_transaction = true;
@@ -756,6 +882,50 @@ apply_handle_prepare(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
 
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_cur.is_spooling)
+	{
+		PsfHashEntry *hentry;
+
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * Flush the spoolfile, so changes can survive a restart.
+		 */
+		FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);
+
+		/*
+		 * We are finished spooling to the current psf.
+		 */
+		psf_cur.is_spooling = false;
+
+		/*
+		 * The commit_prepare will need the spoolfile, so unregister it for
+		 * removal on proc-exit just in case there is an unexpected restart
+		 * between now and when commit_prepared happens.
+		 */
+		hentry = (PsfHashEntry *) hash_search(psf_hash,
+											  psf_cur.name,
+											  HASH_FIND,
+											  NULL);
+		Assert(hentry);
+		hentry->allow_delete = false;
+
+		/*
+		 * The psf_cur.vfd is meaningful only between begin_prepare and prepared.
+		 * So close it now. Any messages written to the psf will be applied
+		 * later during handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		in_remote_transaction = false;
+		return;
+	}
+
 	logicalrep_read_prepare(s, &prepare_data);
 
 	Assert(prepare_data.prepare_lsn == remote_final_lsn);
@@ -805,9 +975,63 @@ static void
 apply_handle_commit_prepared(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
 
 	logicalrep_read_commit_prepared(s, &prepare_data);
 
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+		LogicalRepPreparedTxnData pdata;
+
+		/*
+		 * 1. replay the spooled messages
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, &pdata);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		/*
+		 * 2. mark as PREPARED (use prepare_data info from the psf file)
+		 */
+
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = pdata.end_lsn;
+		replorigin_session_origin_timestamp = pdata.preparetime;
+
+		PrepareTransactionBlock(pdata.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(pdata.end_lsn);
+
+		/*
+		 * Now that we replayed the psf it is no longer needed. Just delete it.
+		 */
+		prepare_spoolfile_delete(psfpath);
+	}
+
 	/* there is no transaction when COMMIT PREPARED is called */
 	ensure_transaction();
 
@@ -838,15 +1062,49 @@ static void
 apply_handle_rollback_prepared(StringInfo s)
 {
 	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
 
 	logicalrep_read_rollback_prepared(s, &rollback_data);
 
 	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		if (psf_cur.is_spooling)
+		{
+			/*
+			 * XXX - For some reason (a bug?) it is currently possible to get
+			 * here, after a restart, when there was a begin_prepare but NO
+			 * prepare. Since there was no prepare, the psf_cur and the
+			 * transaction are still lingering, so they need to be cleaned up
+			 * now.
+			 */
+			prepare_spoolfile_close();
+		}
+
+		/*
+		 * We are finished with this spoolfile. Delete it.
+		 */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/*
 	 * It is possible that we haven't received prepare because it occurred
 	 * before walsender reached a consistent point in which case we need to
 	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
 	 */
-	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
 	{
 		/*
@@ -886,6 +1144,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		!psf_cur.is_spooling &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1320,6 +1579,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1441,6 +1703,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1599,6 +1864,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1968,6 +2236,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -2263,6 +2534,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	}
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL		hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2382,7 +2670,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && !psf_cur.is_spooling)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -3130,6 +3418,9 @@ ApplyWorkerMain(Datum main_arg)
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
 
+	/* Arrange to delete any unwanted psf file(s) at proc-exit */
+	on_proc_exit(prepare_spoolfile_on_proc_exit, 0);
+
 	/* Setup signal handling */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
@@ -3307,3 +3598,416 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_cur.is_spooling ? "Do" : "Don't");
+
+	if (!psf_cur.is_spooling)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	bool		found;
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(!psf_cur.is_spooling);
+
+	/* create or find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  &found);
+
+	if (!found)
+	{
+		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m", path)));
+		}
+		memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+		psf_cur.cur_offset = 0;
+		hentry->allow_delete = true;
+	}
+	else
+	{
+		/*
+		 * Reopen the file with O_TRUNC because we always want to
+		 * create/overwrite this file.
+		 */
+		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", path)));
+		}
+		memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+		psf_cur.cur_offset = 0;
+		hentry->allow_delete = true;
+	}
+
+	/* Sanity checks */
+	Assert(psf_cur.vfd >= 0);
+	Assert(psf_cur.cur_offset == 0);
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close(void)
+{
+	if (psf_cur.vfd >= 0)
+		FileClose(psf_cur.vfd);
+
+	/* Mark this fd as not valid to use anymore. */
+	psf_cur.is_spooling = false;
+	psf_cur.vfd = -1;
+	psf_cur.cur_offset = 0;
+}
+
+/*
+ * Delete the specified psf spoolfile, and any HTAB entry associated with it.
+ */
+static void
+prepare_spoolfile_delete(char *path)
+{
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Delete the file off the disk. */
+	unlink(path);
+
+	/* Remove any entry from the psf_hash, if present */
+	hash_search(psf_hash, path, HASH_REMOVE, NULL);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: the length (not counting the
+ * length field itself), the action code (identifying the message type) and
+ * the message contents (without the subxact TransactionId value).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+	int			bytes_written;
+
+	Assert(psf_cur.is_spooling);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(len));
+	psf_cur.cur_offset += bytes_written;
+
+	/* then the action */
+	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(action));
+	psf_cur.cur_offset += bytes_written;
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == len);
+	psf_cur.cur_offset += bytes_written;
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool		found;
+	PsfHashEntry *hentry;
+
+	/* Find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_FIND,
+										  &found);
+
+	if (!found)
+	{
+		/*
+		 * Hash doesn't know about it, but perhaps the Hash was destroyed by a
+		 * restart, so let's check the file existence on disk.
+		 */
+		File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+		found = fd >= 0;
+		if (fd >= 0)
+			FileClose(fd);
+
+		/*
+		 * And if it was found on disk then create the HTAB entry for it.
+		 */
+		if (found)
+		{
+			hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  NULL);
+			hentry->allow_delete = false;
+		}
+	}
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ *
+ * [Note: this is similar to apply_spooled_messages function]
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	psf.vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (psf.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from prepared spoolfile \"%s\": %m",
+						path)));
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		nbytes = FileRead(psf.vfd, buffer, len,
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+		if (nbytes != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/*
+		 * The psf spoolfile contents will have first and last messages as
+		 * BEGIN_PREPARE and PREPARE message respectively. These two are
+		 * processed specially within this function.
+		 *
+		 * BEGIN_PREPARE msg: This will be the first message of the psf file.
+		 * Use this begin_data information to set the remote_final_lsn.
+		 *
+		 * PREPARE msg: The prepare_data information is returned so that the
+		 * prepare lsn values are available to the caller (commit_prepared).
+		 * Unfortunately, just dispatching the PREPARE message is problematic
+		 * because its transaction commits have side effects on this replay
+		 * loop which is still running.
+		 *
+		 * All other message content (between the BEGIN_PREPARE and the PREPARE)
+		 * will be delivered to apply_dispatch as they normally would be.
+		 */
+		if (s2.data[0] == LOGICAL_REP_MSG_BEGIN_PREPARE)
+		{
+			LogicalRepBeginPrepareData bdata;
+
+			/* read/skip the action byte. */
+			Assert(pq_getmsgbyte(&s2) == LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+			/* read the begin_data. */
+			logicalrep_read_begin_prepare(&s2, &bdata);
+
+			elog(DEBUG1, "BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
+				 bdata.gid,
+				 LSN_FORMAT_ARGS(bdata.final_lsn),
+				 LSN_FORMAT_ARGS(bdata.end_lsn));
+
+			/*
+			 * Make sure the handle apply_dispatch methods are aware we're in a remote
+			 * transaction.
+			 */
+			remote_final_lsn = bdata.final_lsn;
+			in_remote_transaction = true;
+			pgstat_report_activity(STATE_RUNNING, NULL);
+		}
+		else if (s2.data[0] == LOGICAL_REP_MSG_PREPARE)
+		{
+			/* read/skip the action byte. */
+			Assert(pq_getmsgbyte(&s2) == LOGICAL_REP_MSG_PREPARE);
+
+			/* read and return the prepare_data info to the caller */
+			logicalrep_read_prepare(&s2, pdata);
+
+			elog(DEBUG1, "PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
+				 pdata->gid,
+				 LSN_FORMAT_ARGS(pdata->prepare_lsn),
+				 LSN_FORMAT_ARGS(pdata->end_lsn));
+		}
+		else
+		{
+			/* Ensure we are reading the data into our memory context. */
+			oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+			apply_dispatch(&s2);
+
+			MemoryContextReset(ApplyMessageContext);
+
+			MemoryContextSwitchTo(oldctx2);
+
+			nchanges++;
+
+			if (nchanges % 1000 == 0)
+				elog(DEBUG1, "replayed %d changes from file '%s'",
+					 nchanges, path);
+		}
+	}
+
+	FileClose(psf.vfd);
+
+	pfree(buffer);
+	pfree(s2.data);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly same lengths and padded
+	 * with '\0' so garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "pg_twophase/%u-%s.prep_changes", subid, gid);
+}
+
+/*
+ * proc_exit callback to remove unwanted psf files.
+ */
+static void
+prepare_spoolfile_on_proc_exit(int status, Datum arg)
+{
+	HASH_SEQ_STATUS seq_status;
+	PsfHashEntry *hentry;
+
+	/* Iterate the HTAB looking for what file can be deleted. */
+	if (psf_hash)
+	{
+		hash_seq_init(&seq_status, psf_hash);
+		while ((hentry = (PsfHashEntry *) hash_seq_search(&seq_status)) != NULL)
+		{
+			char *path = hentry->name;
+
+			if (hentry->allow_delete)
+				prepare_spoolfile_delete(path);
+		}
+	}
+}
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 745b51d..4ffcef5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1958,6 +1958,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1

v49-0009-Fix-apply-worker-empty-prepare-dev-logs.patch
From 008c3bce9b59ce5909260dfe830c0d22e553a945 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 4 Mar 2021 19:15:50 -0500
Subject: [PATCH v49] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 ++++++++---
 src/backend/replication/logical/worker.c    | 77 ++++++++++++++++++++++-------
 2 files changed, 83 insertions(+), 23 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 97fc399..f3984d4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 903c287..940f1b1 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -815,14 +815,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			process_syncing_tables(begin_data.final_lsn);
 
 			/* This latch is to prevent 100% CPU looping. */
@@ -840,7 +842,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 			StringInfoData sid;
@@ -908,6 +915,8 @@ apply_handle_prepare(StringInfo s)
 		 * removal on proc-exit just in case there is an unexpected restart
 		 * between now and when commit_prepared happens.
 		 */
+		elog(LOG,
+			"!!> apply_handle_prepare: Make sure the spoolfile is not removed on proc-exit");
 		hentry = (PsfHashEntry *) hash_search(psf_hash,
 											  psf_cur.name,
 											  HASH_FIND,
@@ -990,6 +999,8 @@ apply_handle_commit_prepared(StringInfo s)
 		int			nchanges;
 		LogicalRepPreparedTxnData pdata;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * 1. replay the spooled messages
 		 */
@@ -997,8 +1008,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, &pdata);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		/*
@@ -1103,6 +1114,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2397,18 +2409,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3612,8 +3628,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_cur.is_spooling ? "Do" : "Don't");
 
@@ -3637,7 +3653,7 @@ prepare_spoolfile_create(char *path)
 	bool		found;
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(!psf_cur.is_spooling);
 
@@ -3649,7 +3665,7 @@ prepare_spoolfile_create(char *path)
 
 	if (!found)
 	{
-		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 		{
@@ -3667,7 +3683,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the beginning because we always want to
 		 * create/overwrite this file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		elog(LOG, "!!> Found file \"%s\". Overwrite it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 		{
@@ -3692,6 +3708,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_cur.vfd >= 0)
 		FileClose(psf_cur.vfd);
 
@@ -3707,6 +3724,8 @@ prepare_spoolfile_close()
 static void
 prepare_spoolfile_delete(char *path)
 {
+	elog(LOG, "!!> prepare_spoolfile_delete: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3732,18 +3751,20 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_cur.is_spooling);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(len));
 	psf_cur.cur_offset += bytes_written;
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(action));
@@ -3752,6 +3773,7 @@ prepare_spoolfile_write(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == len);
@@ -3785,6 +3807,12 @@ prepare_spoolfile_exists(char *path)
 		if (fd >= 0)
 			FileClose(fd);
 
+		elog(LOG,
+			 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was "
+			 "not found in the HTAB, but was %s on the disk.",
+			 path,
+			 found ? "found" : "not found");
+
 		/*
 		 * And if it was found on disk then create the HTAB entry for it.
 		 */
@@ -3794,10 +3822,16 @@ prepare_spoolfile_exists(char *path)
 										  path,
 										  HASH_ENTER,
 										  NULL);
+			elog(LOG, "!!> prepare_spoolfile_exists: Created new HTAB entry '%s'", hentry->name);
 			hentry->allow_delete = false;
 		}
 	}
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
 	return found;
 }
 
@@ -3816,8 +3850,8 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 				oldctx2;
 	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3854,6 +3888,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3872,6 +3907,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		nbytes = FileRead(psf.vfd, buffer, len,
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
@@ -3906,13 +3942,15 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		{
 			LogicalRepBeginPrepareData bdata;
 
+			elog(LOG, "!!> prepare_spoolfile_replay_messages: Found the BEGIN_PREPARE info");
+
 			/* read/skip the action byte. */
 			Assert(pq_getmsgbyte(&s2) == LOGICAL_REP_MSG_BEGIN_PREPARE);
 
 			/* read the begin_data. */
 			logicalrep_read_begin_prepare(&s2, &bdata);
 
-			elog(DEBUG1, "BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
+			elog(LOG, "!!> BEGIN_PREPARE info: gid = '%s', final_lsn = %X/%X, end_lsn = %X/%X",
 				 bdata.gid,
 				 LSN_FORMAT_ARGS(bdata.final_lsn),
 				 LSN_FORMAT_ARGS(bdata.end_lsn));
@@ -3927,13 +3965,15 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 		}
 		else if (s2.data[0] == LOGICAL_REP_MSG_PREPARE)
 		{
+			elog(LOG, "!!> prepare_spoolfile_replay_messages: Found the PREPARE info");
+
 			/* read/skip the action byte. */
 			Assert(pq_getmsgbyte(&s2) == LOGICAL_REP_MSG_PREPARE);
 
 			/* read and return the prepare_data info to the caller */
 			logicalrep_read_prepare(&s2, pdata);
 
-			elog(DEBUG1, "PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
+			elog(LOG, "!!> PREPARE info: gid = '%s', prepare_lsn = %X/%X, end_lsn = %X/%X",
 				 pdata->gid,
 				 LSN_FORMAT_ARGS(pdata->prepare_lsn),
 				 LSN_FORMAT_ARGS(pdata->end_lsn));
@@ -3952,7 +3992,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 			nchanges++;
 
 			if (nchanges % 1000 == 0)
-				elog(DEBUG1, "replayed %d changes from file '%s'",
+				elog(LOG, "!!> replayed %d changes from file '%s'",
 					 nchanges, path);
 		}
 	}
@@ -3962,7 +4002,7 @@ prepare_spoolfile_replay_messages(char *path, LogicalRepPreparedTxnData *pdata)
 	pfree(buffer);
 	pfree(s2.data);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
@@ -3998,6 +4038,8 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 	HASH_SEQ_STATUS seq_status;
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_on_proc_exit");
+
 	/* Iterate the HTAB looking for what file can be deleted. */
 	if (psf_hash)
 	{
@@ -4006,6 +4048,7 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 		{
 			char *path = hentry->name;
 
+			elog(LOG, "!!> prepare_spoolfile_proc_exit: found '%s'", path);
 			if (hentry->allow_delete)
 				prepare_spoolfile_delete(path);
 		}
-- 
1.8.3.1

#214vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#213)

On Fri, Mar 5, 2021 at 12:21 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Mar 4, 2021 at 9:53 PM Peter Smith <smithpb2250@gmail.com> wrote:

[05a] Now syncing the psf file at prepare time

The patch v46-0008 does not handle spooling of streaming prepare if
the Subscription is configured for both two-phase and streaming.
I feel that it would be best if we don't support both two-phase and
streaming together in a subscription in this release.
Probably a future release could handle this. So, changing the patch to
not allow streaming and two-phase together.
This new patch v49 has the following changes.

* Don't support creating a subscription with both streaming and
two-phase enabled.
* Don't support altering a subscription enabling streaming if it was
created with two-phase enabled.
* Remove stream_prepare callback as a "required" callback, make it an
optional callback and remove all code related to stream_prepare in the
pgoutput plugin as well as in worker.c

Also fixed
* Don't support the alter of subscription setting two-phase. Toggling
of two-phase mode using the alter command on the subscription can
cause transactions to be missed and result in an inconsistent replica.
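
The restrictions listed above can be sketched as a pair of checks. This is only an illustration in standalone C, not the code from the patch: the function names, parameters, and message strings are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check for CREATE SUBSCRIPTION: returns NULL when the option
 * combination is allowed, otherwise an (illustrative) rejection message.
 */
static const char *
check_create_subscription(bool streaming, bool two_phase)
{
	if (streaming && two_phase)
		return "streaming and two_phase cannot both be enabled";
	return NULL;
}

/*
 * Hypothetical check for ALTER SUBSCRIPTION.
 *
 * cur_two_phase:  was the subscription created with two_phase enabled?
 * sets_two_phase: does the ALTER command mention two_phase at all?
 * sets_streaming: does the ALTER command mention streaming?
 * new_streaming:  the requested streaming value, if mentioned.
 */
static const char *
check_alter_subscription(bool cur_two_phase, bool sets_two_phase,
						 bool sets_streaming, bool new_streaming)
{
	/* Toggling two_phase could cause transactions to be missed. */
	if (sets_two_phase)
		return "two_phase cannot be altered";

	/* Streaming may not be enabled on a two-phase subscription. */
	if (sets_streaming && new_streaming && cur_two_phase)
		return "streaming cannot be enabled for a two_phase subscription";

	return NULL;
}
```

Returning a message instead of raising an error keeps the sketch self-contained; in the server this would presumably be an ereport(ERROR, ...) call.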

Thanks for the updated patch.
A few minor comments:

I'm not sure whether we plan to change this workaround. If we are not
planning to change it, we can reword the comment suitably. We generally
don't use the word "workaround" in our comments.
+               /*
+                * Workaround Part 1 of 2:
+                *
+                * Make sure every tablesync has reached at least SYNCDONE state
+                * before letting the apply worker proceed.
+                */
+               elog(DEBUG1,
+                        "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+                        LSN_FORMAT_ARGS(begin_data.end_lsn),
+                        LSN_FORMAT_ARGS(begin_data.final_lsn),
+                        LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
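
For reference, the condition this comment describes ("every tablesync has reached at least SYNCDONE") can be sketched in standalone C as below. This is a simplified illustration, not the patch's AnyTablesyncInProgress(): the state letters are the ones defined in pg_subscription_rel.h ('s' for SYNCDONE, 'r' for READY), while the function name and the array-based interface are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State codes as defined in catalog/pg_subscription_rel.h. */
#define SUBREL_STATE_SYNCDONE	's'
#define SUBREL_STATE_READY		'r'

/*
 * Hypothetical stand-in for AnyTablesyncInProgress(): returns true if any
 * table in the array has not yet reached SYNCDONE or READY state.
 */
static bool
any_tablesync_busy(const char *states, size_t nstates)
{
	for (size_t i = 0; i < nstates; i++)
	{
		if (states[i] != SUBREL_STATE_SYNCDONE &&
			states[i] != SUBREL_STATE_READY)
			return true;
	}

	return false;
}
```

The apply worker would keep waiting while such a check returns true, which is what the loop around this comment does with a latch wait between iterations.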

We should include two_phase in tab completion (the
psql_completion(const char *text, int start, int end) function in
tab-complete.c):
postgres=# create subscription sub1 connection 'port=5441
dbname=postgres' publication pub1 with (
CONNECT COPY_DATA CREATE_SLOT ENABLED
SLOT_NAME SYNCHRONOUS_COMMIT

+
+         <para>
+          It is not allowed to combine <literal>streaming</literal> set to
+          <literal>true</literal> and <literal>two_phase</literal> set to
+          <literal>true</literal>.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          preapred on publisher is decoded as normal transaction at commit.
+         </para>
+
+         <para>
+          It is not allowed to combine <literal>two_phase</literal> set to
+          <literal>true</literal> and <literal>streaming</literal> set to
+          <literal>true</literal>.
+         </para>

It is not allowed to combine streaming set to true and two_phase set to true.
Should this be:
streaming option is not supported along with two_phase option.

Similarly here too:
It is not allowed to combine two_phase set to true and streaming set to true.
Should this be:
two_phase option is not supported along with streaming option.

A few indentation issues are present; we can run pgindent:
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+
  XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+
 LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out,
ReorderBufferTXN* txn,
+
                  XLogRecPtr commit_lsn);

ReorderBufferTXN* txn should be ReorderBufferTXN *txn

Line exceeds 80 chars:
+               /*
+                * Now that we replayed the psf it is no longer needed. Just delete it.
+                */
+               prepare_spoolfile_delete(psfpath);

There is a typo: "preapred" should be "prepared".
+         <para>
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          preapred on publisher is decoded as normal transaction at commit.
+         </para>

Regards,
Vignesh

#215Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#214)
9 attachment(s)

Please find attached the latest patch set v50*

Differences from v49* are:

* Rebased to HEAD @ today

* Patch 0008 "empty prepare" is updated to address the following
feedback comments:

From Amit @ 2021-03-03 [ak]
- (18) Fixed. Removed the special cases in
prepare_spoolfile_replay_messages; it now just dispatches all messages.
- (19) Fixed. Before replaying the psf, remote_final_lsn needs to be set
to commit_prepared's commit_lsn.

From Vignesh @ 2021-03-05 [vc]
- (21) Fixed. Reworded comment to not refer to the fix as a "workaround".
- (25) Fixed. A comment line exceeds 80 chars.
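
To make items (18) and (19) easier to follow: the psf file is a sequence of length-prefixed records (a length, then the action byte, then the message body), and replay hands every record to the dispatcher without special-casing any message type. Below is a minimal standalone sketch of that framing. It uses an in-memory buffer instead of the server's File API, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Callback invoked once per replayed record (action byte + body). */
typedef void (*spool_dispatch_fn) (const char *msg, int len, void *arg);

/*
 * Append one record at offset "off": an int length that counts the action
 * byte plus the body, then the action byte, then the body. Returns the new
 * offset. The caller must ensure the buffer is large enough.
 */
static size_t
spool_append(char *buf, size_t off, char action,
			 const char *body, int body_len)
{
	int			len = body_len + (int) sizeof(char);

	memcpy(buf + off, &len, sizeof(len));
	off += sizeof(len);
	buf[off++] = action;
	memcpy(buf + off, body, (size_t) body_len);

	return off + (size_t) body_len;
}

/*
 * Walk all records and dispatch each one. Returns the number of records
 * replayed; stops early on a truncated or corrupt record.
 */
static int
spool_replay(const char *buf, size_t size,
			 spool_dispatch_fn dispatch, void *arg)
{
	size_t		off = 0;
	int			nchanges = 0;

	while (off + sizeof(int) <= size)
	{
		int			len;

		memcpy(&len, buf + off, sizeof(len));
		off += sizeof(len);

		if (len <= 0 || (size_t) len > size - off)
			break;

		dispatch(buf + off, len, arg);
		off += (size_t) len;
		nchanges++;
	}

	return nchanges;
}

/* Example dispatcher: appends each record's action byte to a string. */
static void
record_action(const char *msg, int len, void *arg)
{
	char	   *seen = arg;
	size_t		n = strlen(seen);

	(void) len;
	seen[n] = msg[0];
	seen[n + 1] = '\0';
}
```

In the real worker the dispatcher would be apply_dispatch, and remote_final_lsn would be set from commit_prepared's commit_lsn before the loop starts.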

-----
[ak] /messages/by-id/CAA4eK1KhfzCYDmv17beC6wOX_5pL-MBNYBpMiLgxrdgF1yBYng@mail.gmail.com
[vc] /messages/by-id/CALDaNm1rRG2EUus+mFrqRzEshZwJZtxja0rn_n3qXGAygODfOA@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v50-0001-Refactor-spool-file-logic-in-worker.c.patch
From f6a9fd6700efa88f368523ee49bf2a16543ef361 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 6 Mar 2021 09:31:57 +1100
Subject: [PATCH v50] Refactor spool-file logic in worker.c.

This patch only refactors to isolate the streaming spool-file processing
to a separate function. A later patch to support prepared transaction
apply will require this common processing logic to be called from another
place.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/replication/logical/worker.c | 48 ++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..45ac498 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -246,6 +246,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -917,30 +919,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +941,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +956,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1031,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1
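
For readers skimming the refactoring above: the patch splits "replay the spool file" from "finish the transaction", so the replay step returns a change count and can be called from more than one handler. A rough stand-alone sketch of that shape follows. It is not code from the patch; `SpoolFile`, `replay_spooled` and `handle_stream_commit` are hypothetical stand-ins, and a "change" is modelled as a plain int.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a spool file of serialized changes. */
typedef struct SpoolFile
{
	const int  *changes;		/* each "change" is just an int here */
	size_t		nchanges;
} SpoolFile;

/* Stand-ins for state touched while applying (assumptions, not real globals). */
static long applied_sum = 0;
static unsigned long long final_lsn = 0;

/*
 * Common spool processing: remember the final LSN, apply every change,
 * and return how many changes were applied.
 */
int
replay_spooled(const SpoolFile *spool, unsigned long long lsn)
{
	int			n = 0;

	final_lsn = lsn;
	for (size_t i = 0; i < spool->nchanges; i++)
	{
		applied_sum += spool->changes[i];
		n++;
	}
	return n;
}

/* Commit handler: replay first, then do the commit-specific work. */
int
handle_stream_commit(const SpoolFile *spool, unsigned long long commit_lsn)
{
	int			nchanges = replay_spooled(spool, commit_lsn);

	/* commit-specific steps (not modelled here) would follow */
	return nchanges;
}
```

A second handler that needs the same replay with a different LSN would call `replay_spooled()` the same way, which is the point of passing the LSN in as a parameter.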

Attachment: v50-0002-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From a6a147a5b1fd70fa69d9b696c863370ebe173d3b Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 6 Mar 2021 09:35:36 +1100
Subject: [PATCH v50] Track replication origin progress for rollbacks.

Commit 1eb6d6527a added tracking of replication origin replay progress for
2PC, but it was incomplete. It did not properly track the progress for
ROLLBACK PREPARED; in particular, it did not update the recovery code.
Additionally, we need to allow tracking it on subscriber nodes, where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20..6023e7c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2276,6 +2276,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2298,6 +2306,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df..acdb28d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5716,8 +5716,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5923,7 +5922,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5972,6 +5972,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6013,7 +6020,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6021,7 +6029,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1
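
To illustrate the recovery-side change above: the abort redo path advances the origin only when the record carries origin information, and it passes "backward = false", so a stale value never overwrites newer progress. The sketch below models just that rule in isolation. It is an assumption-laden toy, not the real `replorigin_advance()`; `OriginProgress`, `origin_advance`, `redo_abort_origin` and `XINFO_HAS_ORIGIN` are hypothetical names, and only the remote LSN is tracked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */

#define XINFO_HAS_ORIGIN 0x01	/* hypothetical flag bit */

typedef struct OriginProgress
{
	Lsn			remote_lsn;		/* how far the remote node has been replayed */
} OriginProgress;

/* Move progress forward; only go backward when explicitly asked to. */
void
origin_advance(OriginProgress *p, Lsn remote, bool go_backward)
{
	if (go_backward || p->remote_lsn < remote)
		p->remote_lsn = remote;
}

/* Redo of an abort record: recover apply progress if the record has an origin. */
void
redo_abort_origin(OriginProgress *p, unsigned xinfo, Lsn origin_lsn)
{
	if (xinfo & XINFO_HAS_ORIGIN)
		origin_advance(p, origin_lsn, false /* backward */ );
}
```

With this rule, replaying an older abort record after a newer one leaves the stored progress untouched.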

Attachment: v50-0004-Add-two_phase-option-to-CREATE-REPLICATION-SLOT.patch (application/octet-stream)
From d0f202cbab319446c99aa39da723a7a9c6645efa Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 6 Mar 2021 09:43:04 +1100
Subject: [PATCH v50] Add two_phase option to CREATE REPLICATION SLOT.

This patch adds a new option to enable two_phase while creating a slot.
---
 src/backend/commands/subscriptioncmds.c                    |  2 +-
 .../replication/libpqwalreceiver/libpqwalreceiver.c        |  6 +++++-
 src/backend/replication/logical/tablesync.c                |  2 +-
 src/backend/replication/repl_gram.y                        | 14 +++++++++++---
 src/backend/replication/repl_scanner.l                     |  1 +
 src/backend/replication/walreceiver.c                      |  2 +-
 src/include/replication/walreceiver.h                      |  5 +++--
 7 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..f6793f0 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..50c3ea7 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1052,7 +1052,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..c5154ae 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -242,15 +244,16 @@ create_replication_slot:
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
-- 
1.8.3.1
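
As a quick illustration of the keyword order this patch sets up (TEMPORARY, then TWO_PHASE, then LOGICAL, matching the opt_temporary opt_two_phase production in repl_gram.y), here is a small stand-alone sketch that builds the leading part of the command string. It is not the libpqwalreceiver code: `build_create_slot_cmd` is a hypothetical helper, the double-quoting of the slot name is an assumption, and the physical-slot form and snapshot options are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the leading part of a logical CREATE_REPLICATION_SLOT command.
 * Returns the number of characters written, or -1 if buf is too small.
 */
int
build_create_slot_cmd(char *buf, size_t buflen, const char *slotname,
					  bool temporary, bool two_phase)
{
	int			n;

	/* Keyword order mirrors the grammar: TEMPORARY, TWO_PHASE, LOGICAL. */
	n = snprintf(buf, buflen, "CREATE_REPLICATION_SLOT \"%s\"%s%s%s",
				 slotname,
				 temporary ? " TEMPORARY" : "",
				 two_phase ? " TWO_PHASE" : "",
				 " LOGICAL pgoutput");
	if (n < 0 || (size_t) n >= buflen)
		return -1;
	return n;
}
```

For a permanent slot named sub1 with two_phase enabled, this produces `CREATE_REPLICATION_SLOT "sub1" TWO_PHASE LOGICAL pgoutput`.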

Attachment: v50-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 37a7e66486f013e10583b9c82c602c0df21073ce Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 6 Mar 2021 09:39:30 +1100
Subject: [PATCH v50] Add support for apply at prepare time to built-in logical
  replication.

To add support for streaming transactions at prepare time to built-in
logical replication, this patch makes the following changes:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle two-phase
transactions by replaying them on prepare.

* The prepare API for streaming transactions is not supported.

* Change stream_prepare_cb from a required callback to an optional callback.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.
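
For reference, the body of the PREPARE message written by logicalrep_write_prepare in this patch is: one flags byte (currently always zero), then prepare_lsn, end_lsn and the prepare time as 64-bit integers, then the NUL-terminated GID. The pq_send* routines use network byte order. The sketch below encodes and decodes that body in plain C so the layout is easy to check. It is only an illustration: `PrepareMsg`, `encode_prepare` and `decode_prepare` are hypothetical names, the 200-byte GID buffer is an assumption, and the leading message-type byte is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of the PREPARE message body. */
typedef struct PrepareMsg
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[200];		/* buffer size is an assumption */
} PrepareMsg;

/* Store a 64-bit value in network (big-endian) byte order. */
static void
put_u64(unsigned char *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--)
	{
		p[i] = (unsigned char) (v & 0xFF);
		v >>= 8;
	}
}

/* Load a 64-bit value stored in network (big-endian) byte order. */
static uint64_t
get_u64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Returns the encoded length, or 0 if the buffer is too small. */
size_t
encode_prepare(unsigned char *buf, size_t buflen, const PrepareMsg *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		need = 1 + 24 + gidlen;

	if (buflen < need)
		return 0;
	buf[0] = 0;					/* flags: must be zero */
	put_u64(buf + 1, m->prepare_lsn);
	put_u64(buf + 9, m->end_lsn);
	put_u64(buf + 17, (uint64_t) m->prepare_time);
	memcpy(buf + 25, m->gid, gidlen);
	return need;
}

/* Returns 0 on success, -1 on bad flags, truncation, or oversized GID. */
int
decode_prepare(const unsigned char *buf, size_t len, PrepareMsg *m)
{
	const unsigned char *nul;
	size_t		gidlen;

	if (len < 26 || buf[0] != 0)
		return -1;
	nul = memchr(buf + 25, '\0', len - 25);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + 25)) + 1;
	if (gidlen > sizeof(m->gid))
		return -1;
	m->prepare_lsn = get_u64(buf + 1);
	m->end_lsn = get_u64(buf + 9);
	m->prepare_time = (int64_t) get_u64(buf + 17);
	memcpy(m->gid, buf + 25, gidlen);
	return 0;
}
```

Unlike the strcpy() calls in the read functions, the decoder here checks the GID length against the destination buffer before copying; that is a choice made for the sketch, not something the patch does.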

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/logicaldecoding.sgml           |  16 +--
 src/backend/access/transam/twophase.c       |  68 ++++++++++
 src/backend/replication/logical/logical.c   |   9 +-
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 194 ++++++++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 174 +++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 157 ++++++++++++++++------
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  70 +++++++++-
 src/include/replication/reorderbuffer.h     |  12 ++
 src/tools/pgindent/typedefs.list            |   3 +
 11 files changed, 658 insertions(+), 54 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 80eb96d..702e42d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -468,9 +468,9 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
-     and <function>stream_prepare_cb</function>
-     are required, while <function>stream_message_cb</function> and
+     <function>stream_commit_cb</function>, and <function>stream_change_cb</function>
+     are required, while <function>stream_prepare_cb</function>,
+     <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
 
@@ -478,9 +478,9 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
     An output plugin may also define functions to support two-phase commits,
     which allows actions to be decoded on the <command>PREPARE TRANSACTION</command>.
     The <function>begin_prepare_cb</function>, <function>prepare_cb</function>, 
-    <function>stream_prepare_cb</function>,
     <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
-    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    callbacks are required, while <function>filter_prepare_cb</function> and
+    <function>stream_prepare_cb</function> are optional.
     </para>
    </sect2>
 
@@ -1195,9 +1195,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
     provide additional callbacks. There are multiple two-phase commit callbacks
     that are required, (<function>begin_prepare_cb</function>,
     <function>prepare_cb</function>, <function>commit_prepared_cb</function>, 
-    <function>rollback_prepared_cb</function> and
-    <function>stream_prepare_cb</function>) and an optional callback
-    (<function>filter_prepare_cb</function>).
+    <function>rollback_prepared_cb</function>) and
+    optional callbacks, (<function>stream_prepare_cb</function> and
+    <function>filter_prepare_cb</function>).
    </para>
 
    <para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char* buf;
+			TwoPhaseFileHeader* hdr;
+
+			/*
+			 * We expect neither GID collisions between publisher and
+			 * subscribers nor apply worker restarts after prepared xacts,
+			 * so we perform all I/O while holding TwoPhaseStateLock for
+			 * simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..5324851 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1326,6 +1326,9 @@ stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	Assert(ctx->streaming);
 	Assert(ctx->twophase);
 
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		return;
+
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_prepare";
@@ -1340,12 +1343,6 @@ stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	ctx->write_xid = txn->xid;
 	ctx->write_location = txn->end_lsn;
 
-	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
-	if (ctx->callbacks.stream_prepare_cb == NULL)
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
-
 	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
 
 	/* Pop the error context stack */
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent later, along with the commit prepared, and other transactions
+	 * have committed between the prepare and the commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..e958d28 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,200 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepare message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 45ac498..92ac4cb 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -722,6 +723,157 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/*
+	 * The gid must not already be prepared.
+	 */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				errmsg("transaction identifier \"%s\" is already in use",
+					   begin_data.gid)));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1972,6 +2124,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2e4b39f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -322,8 +342,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +362,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +383,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +843,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1250,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..b797e3b 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure is used to
+ * hold the information for both prepare and commit prepared; for commit
+ * prepared, prepare_lsn and preparetime store the commit LSN and the commit
+ * time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime is used to check whether the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -170,5 +237,4 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 										  TransactionId subxid);
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
-
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95ae..745b51d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
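
A note on the rollback-prepared check in the patch above. The comment on
LogicalRepRollbackPreparedTxnData says the gid alone is not sufficient, because
the downstream node can hold its own prepared transaction with the same
identifier. The following is only a minimal standalone sketch of that idea, not
the LookupGXact() implementation (the patch excerpt shows only its declaration).
All names prefixed with Toy/toy_ are hypothetical stand-ins, and the LSN and
timestamp types are simplified to plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for XLogRecPtr and TimestampTz; not the real types. */
typedef uint64_t ToyLsn;
typedef int64_t ToyTimestamp;

/* Hypothetical record of one prepared transaction known to the subscriber. */
typedef struct ToyPreparedXact
{
	const char *gid;
	ToyLsn		prepare_end_lsn;
	ToyTimestamp prepare_time;
} ToyPreparedXact;

/*
 * Return true only if some entry matches on all three keys: the gid, the end
 * LSN of the prepare, and the prepare timestamp.  A local transaction that
 * merely reuses the gid does not match.
 */
static bool
toy_lookup_gxact(const ToyPreparedXact *xacts, size_t n, const char *gid,
				 ToyLsn prepare_end_lsn, ToyTimestamp prepare_time)
{
	for (size_t i = 0; i < n; i++)
	{
		if (strcmp(xacts[i].gid, gid) == 0 &&
			xacts[i].prepare_end_lsn == prepare_end_lsn &&
			xacts[i].prepare_time == prepare_time)
			return true;
	}
	return false;
}
```

With such a check, a ROLLBACK PREPARED message whose prepare LSN and timestamp
do not match any known prepared transaction would be skipped rather than
rolling back an unrelated local transaction that happens to share the gid.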

Attachment: v50-0005-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 35273871f92b717d5c89da620013a357ee017a9d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 6 Mar 2021 09:52:50 +1100
Subject: [PATCH v50] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 338 ++++++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl | 282 ++++++++++++++++++++
 2 files changed, 620 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
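
For readers skimming the tests: the subscriber-side states they assert on
(present in pg_prepared_xacts after PREPARE, row visible only after COMMIT
PREPARED, nothing left after ROLLBACK PREPARED) can be modelled with a toy state
machine. This is only an illustrative sketch, not apply-worker code. The message
bytes are the ones added to LogicalRepMsgType in the earlier patch; ToySubscriber
and toy_apply are hypothetical names, and ignoring a COMMIT PREPARED for an
unknown transaction is a simplification of this toy, not a claim about the
real behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Message type bytes as listed in LogicalRepMsgType by the earlier patch. */
enum
{
	TOY_MSG_PREPARE = 'P',
	TOY_MSG_COMMIT_PREPARED = 'K',
	TOY_MSG_ROLLBACK_PREPARED = 'r'
};

/* Hypothetical subscriber state for a single gid; not a PostgreSQL struct. */
typedef struct ToySubscriber
{
	bool		prepared;		/* would be listed in pg_prepared_xacts */
	bool		row_visible;	/* the inserted row is visible to queries */
} ToySubscriber;

/* Apply one message; returns false for a type this toy does not model. */
static bool
toy_apply(ToySubscriber *sub, char action)
{
	switch (action)
	{
		case TOY_MSG_PREPARE:
			sub->prepared = true;
			return true;

		case TOY_MSG_COMMIT_PREPARED:
			/* Toy simplification: only a known prepared xact can commit. */
			if (sub->prepared)
			{
				sub->prepared = false;
				sub->row_visible = true;
			}
			return true;

		case TOY_MSG_ROLLBACK_PREPARED:
			/* The prepared xact disappears and its row never becomes visible. */
			sub->prepared = false;
			return true;

		default:
			return false;
	}
}
```

Feeding 'P' then 'K' leaves the row visible with no prepared transaction, while
'P' then 'r' leaves neither, which mirrors the count(*) checks in the tests.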

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9c1d681
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,338 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL changes to subscriber.
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

Attachment: v50-0006-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 2f15ca04d2937ff84c9f00121173ef189e4cd21f Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 6 Mar 2021 09:56:02 +1100
Subject: [PATCH v50] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option, "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
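Note (not part of the commit message): the option rules this patch enforces can be sketched as the standalone C fragment below. It is only an illustration under stated assumptions, not the code in subscriptioncmds.c. The names OptResult, BoolOption and check_subscription_options are hypothetical, and return codes stand in for the ereport(ERROR, ...) calls used by the real parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes; the patch itself raises errors via ereport(). */
typedef enum
{
	OPT_OK,
	OPT_DUPLICATE,		/* "conflicting or redundant options" */
	OPT_EXCLUSIVE,		/* two_phase = true combined with streaming = true */
	OPT_UNKNOWN			/* unrecognized option name */
} OptResult;

/* Hypothetical stand-in for a DefElem that carries a boolean value. */
typedef struct
{
	const char *name;
	bool		value;
} BoolOption;

/*
 * Walk the option list once, rejecting repeated options, then reject the
 * combination of two_phase = true and streaming = true.
 */
OptResult
check_subscription_options(const BoolOption *opts, size_t nopts,
						   bool *streaming, bool *twophase)
{
	bool		streaming_given = false;
	bool		twophase_given = false;

	assert(streaming != NULL && twophase != NULL);

	/* Both options default to off, as in the patch. */
	*streaming = false;
	*twophase = false;

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "streaming") == 0)
		{
			if (streaming_given)
				return OPT_DUPLICATE;
			streaming_given = true;
			*streaming = opts[i].value;
		}
		else if (strcmp(opts[i].name, "two_phase") == 0)
		{
			if (twophase_given)
				return OPT_DUPLICATE;
			twophase_given = true;
			*twophase = opts[i].value;
		}
		else
			return OPT_UNKNOWN;
	}

	/* The two options may each be given, but not both set to true. */
	if (*twophase && *streaming)
		return OPT_EXCLUSIVE;

	return OPT_OK;
}
```

In this sketch a list holding only {"two_phase", true} yields OPT_OK, while adding {"streaming", true} to the same list yields OPT_EXCLUSIVE. The sketch does not model the separate rule that ALTER SUBSCRIPTION cannot change two_phase at all.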
 doc/src/sgml/ref/create_subscription.sgml          | 29 +++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 72 +++++++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 +
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 ++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 +++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 ++-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 +
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 93 +++++++++++++++-------
 src/test/regress/sql/subscription.sql              | 25 ++++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 17 files changed, 261 insertions(+), 47 deletions(-)

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..e04b8d2 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,35 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          It is not allowed to combine <literal>streaming</literal> set to
+          <literal>true</literal> and <literal>two_phase</literal> set to
+          <literal>true</literal>.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          It is not allowed to combine <literal>two_phase</literal> set to
+          <literal>true</literal> and <literal>streaming</literal> set to
+          <literal>true</literal>.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73..060fab4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1168,7 +1168,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f6793f0..96fcf49 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option, because doing
+			 * so could cause transactions to be missed and lead to an
+			 * inconsistent replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * the current implementation has some issues that could cause a
+	 * streaming prepared transaction to be incorrectly missed in the initial
+	 * syncing phase. Hence, this combination is disabled until that issue can
+	 * be addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +576,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +883,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +918,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if (sub->twophase && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +947,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +993,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1039,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 92ac4cb..9cccdef 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2631,6 +2631,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3277,6 +3278,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2e4b39f..91ecc55 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..4ac4924 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two-phase commit are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index b797e3b..6c848c2 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..67b3358 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9c1d681..a680c1a 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

Attachment: v50-0007-Tablesync-early-exit.patch (application/octet-stream)
From 03076070ff5911beb7b477a905b205dc20d531a3 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 6 Mar 2021 10:12:35 +1100
Subject: [PATCH v50] Tablesync early exit.

Give the tablesync worker an opportunity to see if it can exit immediately
(because it has already caught-up) without it needing to process a message
first before discovering that.
---
 src/backend/replication/logical/worker.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9cccdef..1f3aa01 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2252,6 +2252,16 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	bool		ping_sent = false;
 	TimeLineID	tli;
 
+	if (am_tablesync_worker())
+	{
+		/*
+		 * Give the tablesync worker an opportunity to see if it can exit
+		 * immediately, instead of always handling a message (which maybe
+		 * the apply worker could have handled).
+		 */
+		process_syncing_tables(last_received);
+	}
+
 	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
-- 
1.8.3.1

Attachment: v50-0008-Fix-apply-worker-empty-prepare.patch (application/octet-stream)
From 00930eeb9b56934a0c4fd04e99b8071cbd737960 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 6 Mar 2021 11:56:11 +1100
Subject: [PATCH v50] Fix apply worker empty prepare.

With unlucky timing between the apply and tablesync workers, it is possible to have a "consistent snapshot" that spans prepare/commit in such a way that the tablesync did not do the prepare (because the snapshot was not consistent), while the apply worker does the begin prepare ('b') but skips all the prepared operations [e.g. inserts] because the tablesync was still busy (see the condition in should_apply_changes_for_rel). Later, at commit prepared time, when the apply worker does the commit prepared ('K'), there is nothing to commit (because the inserts were skipped earlier).

This patch implements a two-part fix as suggested [1] on hackers.

Part 1 - The begin_prepare handler of apply will always wait for any busy tablesync workers to achieve SYNCDONE/READY state.

Part 2 - If (after Part 1) the apply-worker's prepare is found to be lagging behind any of the sync-workers then the subsequent prepared operations will be spooled to a file to be replayed at commit_prepared time.

Discussion:
[1] https://www.postgresql.org/message-id/CAA4eK1L%3DdhuCRvyDvrXX5wZgc7s1hLRD29CKCK6oaHtVCPgiFA%40mail.gmail.com
---
 src/backend/replication/logical/tablesync.c | 178 ++++++--
 src/backend/replication/logical/worker.c    | 634 +++++++++++++++++++++++++++-
 src/include/replication/worker_internal.h   |   3 +
 src/tools/pgindent/typedefs.list            |   2 +
 4 files changed, 781 insertions(+), 36 deletions(-)
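
Note (not part of the commit): a minimal sketch of the two-part decision
described in the commit message above, written as standalone C so it can be
read without the PostgreSQL headers. Everything named Sketch*/sketch_* is a
hypothetical stand-in, not an identifier from the patch; the patch itself uses
SubscriptionRelState, XLogRecPtr, AnyTablesyncInProgress() and
BiggestTablesyncLSN(), and it waits on a latch in a loop where this sketch
only asserts that the wait has already finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr; 0 plays the role of "invalid". */
typedef uint64_t SketchLsn;

/* Hypothetical stand-in for the per-table sync states. */
typedef enum
{
	SKETCH_INIT,
	SKETCH_DATASYNC,
	SKETCH_SYNCDONE,
	SKETCH_READY
} SketchRelState;

typedef struct
{
	SketchRelState state;
	SketchLsn	lsn;
} SketchRel;

/* Part 1 condition: is any table still short of SYNCDONE/READY? */
static bool
sketch_any_tablesync_in_progress(const SketchRel *rels, size_t nrels)
{
	for (size_t i = 0; i < nrels; i++)
	{
		if (rels[i].state != SKETCH_SYNCDONE &&
			rels[i].state != SKETCH_READY)
			return true;
	}
	return false;
}

/* Largest LSN recorded by any known tablesync (0 if there are none). */
static SketchLsn
sketch_biggest_tablesync_lsn(const SketchRel *rels, size_t nrels)
{
	SketchLsn	biggest = 0;

	for (size_t i = 0; i < nrels; i++)
	{
		if (rels[i].lsn > biggest)
			biggest = rels[i].lsn;
	}
	return biggest;
}

/*
 * Part 2 decision: spool the prepared messages only if some tablesync
 * went beyond the LSN carried by this begin_prepare. The caller is
 * assumed to have completed the Part 1 wait already.
 */
static bool
sketch_must_spool(SketchLsn begin_final_lsn,
				  const SketchRel *rels, size_t nrels)
{
	assert(!sketch_any_tablesync_in_progress(rels, nrels));
	return begin_final_lsn < sketch_biggest_tablesync_lsn(rels, nrels);
}
```

With one table READY at LSN 100 and another SYNCDONE at LSN 250, a
begin_prepare whose final LSN is 200 would be spooled, while one at 250 or
later would be applied directly, because the comparison is strict.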

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 50c3ea7..97fc399 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have still not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When the process_syncing_tables_for_apply changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1f3aa01..4093824 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -209,6 +209,54 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and the
+ * corresponding fileset handles of same.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	bool		allow_delete;	/* ok to delete? */
+} PsfHashEntry;
+
+/*
+ * Information about the "current" psf spoolfile.
+ */
+typedef struct PsfFile
+{
+	char	name[MAXPGPATH];/* psf name - same as the HTAB key. */
+	bool	is_spooling;	/* are we currently spooling to this file? */
+	File 	vfd;			/* -1 when the file is closed. */
+	off_t	cur_offset;		/* offset for the next write or read. Reset to 0
+							 * when file is opened. */
+} PsfFile;
+
+/*
+ * Hash table for storing the Prepared spoolfile info along with shared fileset.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * Information about the 'current' open spoolfile is only valid when spooling.
+ * This is flagged as 'is_spooling' only between begin_prepare and prepare.
+ */
+static PsfFile psf_cur = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_delete(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+static void prepare_spoolfile_on_proc_exit(int status, Datum arg);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -730,6 +778,9 @@ apply_handle_begin_prepare(StringInfo s)
 {
 	LogicalRepBeginPrepareData begin_data;
 
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
 	logicalrep_read_begin_prepare(s, &begin_data);
 
 	/*
@@ -741,6 +792,88 @@ apply_handle_begin_prepare(StringInfo s)
 				errmsg("transaction identifier \"%s\" is already in use",
 					   begin_data.gid)));
 
+	/*
+	 * A Problem:
+	 *
+	 * By sad timing of apply/tablesync workers it is possible to have a
+	 * "consistent snapshot" that spans prepare/commit in such a way that
+	 * the tablesync did not do the prepare (because snapshot not consistent)
+	 * and the apply worker does the begin prepare ('b') but it skips all
+	 * the prepared operations [e.g. inserts] while the tablesync was still
+	 * busy (see the condition of should_apply_changes_for_rel). Later, at
+	 * commit prepared time, when the apply worker does the commit prepared
+	 * ('K'), there is nothing in it (because the inserts were skipped
+	 * earlier). Let's call this the "empty prepare" problem.
+	 *
+	 * The following code has a 2-part fix for that scenario.
+	 *
+	 * -----
+	 *
+	 * XXX - The 2PC protocol needs the publisher to be aware when the PREPARE
+	 * has been successfully acted on. But because of this "empty prepare"
+	 * problem, the prepared messages may now be spooled to a file and, when
+	 * that happens, the PREPARE would not happen at the usual time, but would
+	 * be deferred until COMMIT PREPARED time. This quirk could only happen
+	 * immediately after the initial table synchronization phase; once all
+	 * tables have achieved READY state the 2PC protocol will behave normally.
+	 *
+	 * A future release may be able to detect when all tables are READY and set
+	 * a flag to indicate this subscription/slot is ready for two_phase
+	 * decoding. Then at the publisher-side, we could enable wait-for-prepares
+	 * only when all the slots of WALSender have that flag set.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Empty prepare fix. Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Empty prepare fix. Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN, then all messages (until the
+		 * prepare) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+
+			/*
+			 * Create the spoolfile.
+			 */
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * From now until apply_handle_prepare, we are spooling to the
+			 * current psf.
+			 */
+			psf_cur.is_spooling = true;
+		}
+	}
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	in_remote_transaction = true;
@@ -756,9 +889,58 @@ apply_handle_prepare(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
 
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_cur.is_spooling)
+	{
+		PsfHashEntry *hentry;
+
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * Flush the spoolfile, so changes can survive a restart.
+		 */
+		FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);
+
+		/*
+		 * We are finished spooling to the current psf.
+		 */
+		psf_cur.is_spooling = false;
+
+		/*
+		 * The commit_prepare will need the spoolfile, so unregister it for
+		 * removal on proc-exit just in case there is an unexpected restart
+		 * between now and when commit_prepared happens.
+		 */
+		hentry = (PsfHashEntry *) hash_search(psf_hash,
+											  psf_cur.name,
+											  HASH_FIND,
+											  NULL);
+		Assert(hentry);
+		hentry->allow_delete = false;
+
+		/*
+		 * The psf_cur.vfd is meaningful only between begin_prepare and prepared.
+		 * So close it now. Any messages written to the psf will be applied
+		 * later during handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		in_remote_transaction = false;
+		return;
+	}
+
 	logicalrep_read_prepare(s, &prepare_data);
 
-	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+	/*
+	 * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+	 * was dispatched via the psf spoolfile replay then the remote_final_lsn
+	 * is set to commit lsn instead. Hence the <= instead of == check below.
+	 */
+	Assert(prepare_data.prepare_lsn <= remote_final_lsn);
 
 	if (IsTransactionState())
 	{
@@ -805,9 +987,38 @@ static void
 apply_handle_commit_prepared(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
 
 	logicalrep_read_commit_prepared(s, &prepare_data);
 
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+
+		/*
+		 * Replay/dispatch the spooled messages (including lastly, the PREPARE
+		 * message).
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		/*
+		 * After replaying the psf it is no longer needed. Just delete it.
+		 */
+		prepare_spoolfile_delete(psfpath);
+	}
+
 	/* there is no transaction when COMMIT PREPARED is called */
 	ensure_transaction();
 
@@ -838,15 +1049,37 @@ static void
 apply_handle_rollback_prepared(StringInfo s)
 {
 	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
 
 	logicalrep_read_rollback_prepared(s, &rollback_data);
 
 	/*
+	 * If this prepare's messages were being spooled to a file, then clean
+	 * up the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		/*
+		 * We are finished with this spoolfile. Delete it.
+		 */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/*
 	 * It is possible that we haven't received prepare because it occurred
 	 * before walsender reached a consistent point in which case we need to
 	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
 	 */
-	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
 	{
 		/*
@@ -886,6 +1119,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		!psf_cur.is_spooling &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1320,6 +1554,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1441,6 +1678,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1599,6 +1839,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1968,6 +2211,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -2263,6 +2509,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	}
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL		hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2382,7 +2645,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && !psf_cur.is_spooling)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -3130,6 +3393,9 @@ ApplyWorkerMain(Datum main_arg)
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
 
+	/* Arrange to delete any unwanted psf file(s) at proc-exit */
+	on_proc_exit(prepare_spoolfile_on_proc_exit, 0);
+
 	/* Setup signal handling */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
@@ -3307,3 +3573,365 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_cur.is_spooling ? "Do" : "Don't");
+
+	if (!psf_cur.is_spooling)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	bool		found;
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(!psf_cur.is_spooling);
+
+	/* create or find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER | HASH_FIND,
+										  &found);
+
+	if (!found)
+	{
+		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m", path)));
+		}
+		memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+		psf_cur.cur_offset = 0;
+		hentry->allow_delete = true;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the beginning because we always want to
+		 * create/overwrite this file.
+		 */
+		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", path)));
+		}
+		memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+		psf_cur.cur_offset = 0;
+		hentry->allow_delete = true;
+	}
+
+	/* Sanity checks */
+	Assert(psf_cur.vfd >= 0);
+	Assert(psf_cur.cur_offset == 0);
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close()
+{
+	if (psf_cur.vfd >= 0)
+		FileClose(psf_cur.vfd);
+
+	/* Mark this fd as not valid to use anymore. */
+	psf_cur.is_spooling = false;
+	psf_cur.vfd = -1;
+	psf_cur.cur_offset = 0;
+}
+
+/*
+ * Delete the specified psf spoolfile, and any HTAB associated with it.
+ */
+static void
+prepare_spoolfile_delete(char *path)
+{
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Delete the file off the disk. */
+	unlink(path);
+
+	/* Remove any entry from the psf_hash, if present */
+	hash_search(psf_hash, path, HASH_REMOVE, NULL);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+	int			bytes_written;
+
+	Assert(psf_cur.is_spooling);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(len));
+	psf_cur.cur_offset += bytes_written;
+
+	/* then the action */
+	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(action));
+	psf_cur.cur_offset += bytes_written;
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == len);
+	psf_cur.cur_offset += bytes_written;
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool		found;
+	PsfHashEntry *hentry;
+
+	/* Find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_FIND,
+										  &found);
+
+	if (!found)
+	{
+		/*
+		 * Hash doesn't know about it, but perhaps the Hash was destroyed by a
+		 * restart, so let's check the file existence on disk.
+		 */
+		File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+		found = fd >= 0;
+		if (fd >= 0)
+			FileClose(fd);
+
+		/*
+		 * And if it was found on disk then create the HTAB entry for it.
+		 */
+		if (found)
+		{
+			hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  NULL);
+			hentry->allow_delete = false;
+		}
+	}
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ *
+ * [Note: this is similar to apply_spooled_messages function]
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	psf.vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (psf.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from prepared spoolfile \"%s\": %m",
+						path)));
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	remote_final_lsn = final_lsn;
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		nbytes = FileRead(psf.vfd, buffer, len,
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+		if (nbytes != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldctx2);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	FileClose(psf.vfd);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly same lengths and padded
+	 * with '\0' so garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "pg_twophase/%u-%s.prep_changes", subid, gid);
+}
+
+/*
+ * proc_exit callback to remove unwanted psf files.
+ */
+static void
+prepare_spoolfile_on_proc_exit(int status, Datum arg)
+{
+	HASH_SEQ_STATUS seq_status;
+	PsfHashEntry *hentry;
+
+	/* Iterate the HTAB looking for what file can be deleted. */
+	if (psf_hash)
+	{
+		hash_seq_init(&seq_status, psf_hash);
+		while ((hentry = (PsfHashEntry *) hash_seq_search(&seq_status)) != NULL)
+		{
+			char *path = hentry->name;
+
+			if (hentry->allow_delete)
+				prepare_spoolfile_delete(path);
+		}
+	}
+}
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 745b51d..4ffcef5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1958,6 +1958,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
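
The record layout that prepare_spoolfile_write and prepare_spoolfile_replay_messages rely on in the patch above (an int length that excludes itself, then the action byte, then the message contents) can be illustrated outside PostgreSQL. This is only a minimal sketch under assumptions: it works on an in-memory buffer instead of FileWrite/FileRead, and the names psf_sketch_write and psf_sketch_read are hypothetical, not part of the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: append one record as [int len][char action][payload].
 * As in the patch, len counts the action byte plus the payload, but not the
 * length field itself. Returns the new offset, or 0 if the record won't fit.
 */
static size_t
psf_sketch_write(char *file, size_t cap, size_t off,
				 char action, const char *payload, int payload_len)
{
	int			len = payload_len + (int) sizeof(char);

	if (off + sizeof(len) + (size_t) len > cap)
		return 0;

	memcpy(file + off, &len, sizeof(len));
	off += sizeof(len);
	file[off++] = action;
	memcpy(file + off, payload, (size_t) payload_len);

	return off + (size_t) payload_len;
}

/*
 * Hypothetical helper: read the record starting at off. On success, stores
 * the action byte and a pointer/length for the payload, and returns the
 * offset of the next record. Returns 0 at end of data or on a short record.
 */
static size_t
psf_sketch_read(const char *file, size_t size, size_t off,
				char *action, const char **payload, int *payload_len)
{
	int			len;

	if (off + sizeof(len) > size)
		return 0;

	memcpy(&len, file + off, sizeof(len));
	off += sizeof(len);

	if (len < 1 || off + (size_t) len > size)
		return 0;

	*action = file[off];
	*payload = file + off + 1;
	*payload_len = len - 1;

	return off + (size_t) len;
}
```

Unlike the patch, the sketch rejects a short or non-positive length at run time rather than relying on an Assert; whether the real code should do the same is a separate question.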

v50-0009-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From c6bd2ed9a82f7d3e5cf4b0df6acd59b51209f5ab Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 6 Mar 2021 12:12:27 +1100
Subject: [PATCH v50] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 +++++++++---
 src/backend/replication/logical/worker.c    | 73 +++++++++++++++++++++++------
 2 files changed, 81 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 97fc399..f3984d4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4093824..163a1ca 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -830,14 +830,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			process_syncing_tables(begin_data.final_lsn);
 
 			/* This latch is to prevent 100% CPU looping. */
@@ -855,7 +857,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 
@@ -897,6 +904,8 @@ apply_handle_prepare(StringInfo s)
 	{
 		PsfHashEntry *hentry;
 
+		elog(LOG, "!!> apply_handle_prepare: SPOOLING");
+
 		/* Write the PREPARE info to the psf file. */
 		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
 
@@ -915,6 +924,8 @@ apply_handle_prepare(StringInfo s)
 		 * removal on proc-exit just in case there is an unexpected restart
 		 * between now and when commit_prepared happens.
 		 */
+		elog(LOG,
+			"!!> apply_handle_prepare: Make sure the spoolfile is not removed on proc-exit");
 		hentry = (PsfHashEntry *) hash_search(psf_hash,
 											  psf_cur.name,
 											  HASH_FIND,
@@ -1001,6 +1012,8 @@ apply_handle_commit_prepared(StringInfo s)
 	{
 		int			nchanges;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * Replay/dispatch the spooled messages (including lastly, the PREPARE
 		 * message).
@@ -1009,8 +1022,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		/*
@@ -1078,6 +1091,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2372,18 +2386,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3587,8 +3605,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_cur.is_spooling ? "Do" : "Don't");
 
@@ -3612,7 +3630,7 @@ prepare_spoolfile_create(char *path)
 	bool		found;
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(!psf_cur.is_spooling);
 
@@ -3624,7 +3642,7 @@ prepare_spoolfile_create(char *path)
 
 	if (!found)
 	{
-		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 		{
@@ -3642,7 +3660,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the beginning because we always want to
 		 * create/overwrite this file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		elog(LOG, "!!> Found file \"%s\". Overwrite it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 		{
@@ -3667,6 +3685,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_cur.vfd >= 0)
 		FileClose(psf_cur.vfd);
 
@@ -3682,6 +3701,8 @@ prepare_spoolfile_close()
 static void
 prepare_spoolfile_delete(char *path)
 {
+	elog(LOG, "!!> prepare_spoolfile_delete: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3707,18 +3728,20 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_cur.is_spooling);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(len));
 	psf_cur.cur_offset += bytes_written;
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(action));
@@ -3727,6 +3750,7 @@ prepare_spoolfile_write(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == len);
@@ -3760,6 +3784,12 @@ prepare_spoolfile_exists(char *path)
 		if (fd >= 0)
 			FileClose(fd);
 
+		elog(LOG,
+			 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was "
+			 "not found in the HTAB, but was %s on the disk.",
+			 path,
+			 found ? "found" : "not found");
+
 		/*
 		 * And if it was found on disk then create the HTAB entry for it.
 		 */
@@ -3769,10 +3799,16 @@ prepare_spoolfile_exists(char *path)
 										  path,
 										  HASH_ENTER,
 										  NULL);
+			elog(LOG, "!!> prepare_spoolfile_exists: Created new HTAB entry '%s'", hentry->name);
 			hentry->allow_delete = false;
 		}
 	}
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
 	return found;
 }
 
@@ -3791,8 +3827,8 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 				oldctx2;
 	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3837,6 +3873,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3855,6 +3892,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		nbytes = FileRead(psf.vfd, buffer, len,
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
@@ -3871,7 +3909,9 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		/* Ensure we are reading the data into our memory context. */
 		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
 
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: Before dispatch");
 		apply_dispatch(&s2);
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: After dispatch");
 
 		MemoryContextReset(ApplyMessageContext);
 
@@ -3880,13 +3920,13 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(LOG, "!!> replayed %d changes from file '%s'",
 				 nchanges, path);
 	}
 
 	FileClose(psf.vfd);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
@@ -3922,6 +3962,8 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 	HASH_SEQ_STATUS seq_status;
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_on_proc_exit");
+
 	/* Iterate the HTAB looking for what file can be deleted. */
 	if (psf_hash)
 	{
@@ -3930,6 +3972,7 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 		{
 			char *path = hentry->name;
 
+			elog(LOG, "!!> prepare_spoolfile_proc_exit: found '%s'", path);
 			if (hentry->allow_delete)
 				prepare_spoolfile_delete(path);
 		}
-- 
1.8.3.1

#216osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#215)
RE: [HACKERS] logical decoding of two-phase transactions

Hi

On Saturday, March 6, 2021 10:49 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v50*

When I read through the patch set, I found a
weird errmsg in apply_handle_begin_prepare(), which seems to be a mistake.

File : v50-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch

+        * The gid must not already be prepared.
+        */
+       if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+               ereport(ERROR,
+                               (errcode(ERRCODE_DUPLICATE_OBJECT),
+                               errmsg("transaction?identifier?\"%s\"?is?already?in?use",
+                                          begin_data.gid)));

Please fix this in the next update.
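
For reference, the message was presumably meant to contain ordinary ASCII spaces where the '?' characters appear (that the originals were some non-ASCII whitespace is a guess). A minimal standalone sketch of the intended text, assuming snprintf in place of PostgreSQL's ereport/errmsg and a hypothetical helper name format_gid_in_use:

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): builds the message text the
 * errmsg() call quoted above appears to intend, using plain spaces.
 * Returns the number of characters snprintf would have written.
 */
static int
format_gid_in_use(char *buf, size_t buflen, const char *gid)
{
	return snprintf(buf, buflen,
					"transaction identifier \"%s\" is already in use",
					gid);
}
```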

Best Regards,
Takamichi Osumi

#217Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#215)

On Sat, Mar 6, 2021 at 7:19 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v50*

Few comments on the latest patch series:
=================================
1. I think we can extract the changes to make streaming optional with
2PC, and in fact you can start a separate thread for it.

2. I think we can get rid of the table-sync early exit patch
(v50-0007-Tablesync-early-exit) as we have kept two_phase off for the
tablesync worker. I agree that it has its own independent value but it is
not required for this patch series.

3. Now that we are not supporting streaming with the two_pc option, do we
really need the first patch
(v50-0001-Refactor-spool-file-logic-in-worker.c)? I suggest getting rid
of it unless it is really required. If we decide to remove this
patch, then remove the reference to apply_spooled_messages from the 0008
patch.

v50-0005-Support-2PC-txn-subscriber-tests
4.
+###############################
+# Test cases involving DDL.
+###############################
+
+# TODO This can be added after we add functionality to replicate DDL
changes to subscriber.

We can remove this from the patch.

v50-0006-Support-2PC-txn-Subscription-option
5.
- /* Binary mode and streaming are only supported in v14 and higher */
+ /* Binary mode and streaming and Two phase commit are only supported
in v14 and higher */

It looks odd that only one of the options starts with a capital letter
/Two/two. I suggest using two_phase.

v50-0008-Fix-apply-worker-empty-prepare
6. In 0008, the commit message lines are too long, which makes them
difficult to read. Try to keep them to about 75 characters; that is
generally what I use. You can use a different limit if you want, but
not lines as long as the ones in this patch.

7.
+ /*
+ * A Problem:
+ *
..
Let's call this the "empty prepare" problem.
+ *
+ * The following code has a 2-part fix for that scenario.

No need to describe it in terms of problem and fix. You can say
something like: "This can lead to "empty prepare". We avoid this by
...."

--
With Regards,
Amit Kapila.

#218Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#217)
7 attachment(s)

Please find attached the latest patch set v51*

Differences from v50* are:

* Rebased to HEAD @ today

* Addresses following feedback comments:

From Osumi-san @ 2021-03-06 [ot]
- (27) Fixed. Patch 0003. Remove weird chars from the error message.

From Amit @ 2021-03-06 [ak]
- (29) Removed patch 0007 "tablesync early exit" from this patch set.
I started a new thread [early-exit] for this.
- (30) Removed patch 0001 "refactor spool file logic" from this patch set.
- (31) Fixed. Patch 0005 removed TODO from test code.
- (32) Fixed. Patch 0006 comment typo.
- (33) Fixed. Patch 0008 commit message lines were too long
- (34) Fixed. Patch 0008 comment reworded avoiding words like
"problem" and "fix"

-----
[ot] /messages/by-id/OSBPR01MB4888636EB9421C930FB39A19ED959@OSBPR01MB4888.jpnprd01.prod.outlook.com
[ak] /messages/by-id/CAA4eK1Jxu-3qxtkfA_dKoquQgGZVcB+k9_-yT5=9GDEW84TF+A@mail.gmail.com
[early-exit] /messages/by-id/CAHut+Pt39PbQs0SxT9RMM89aYiZoQ0Kw46YZSkKZwK8z5HOr3g@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v51-0002-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From ce4d859e8ed9fcad8cba0ee4c1e136a262d66036 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sun, 7 Mar 2021 09:46:30 +1100
Subject: [PATCH v51] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle two-phase
transactions by replaying them on prepare.

* Leave the prepare API for streaming transactions unsupported for now.

* Change stream_prepare_cb from a required callback to an optional callback.

We must, however, explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/logicaldecoding.sgml           |  16 +--
 src/backend/access/transam/twophase.c       |  68 ++++++++++
 src/backend/replication/logical/logical.c   |   9 +-
 src/backend/replication/logical/origin.c    |   7 +-
 src/backend/replication/logical/proto.c     | 194 ++++++++++++++++++++++++++++
 src/backend/replication/logical/worker.c    | 174 +++++++++++++++++++++++++
 src/backend/replication/pgoutput/pgoutput.c | 157 ++++++++++++++++------
 src/include/access/twophase.h               |   2 +
 src/include/replication/logicalproto.h      |  70 +++++++++-
 src/include/replication/reorderbuffer.h     |  12 ++
 src/tools/pgindent/typedefs.list            |   3 +
 11 files changed, 658 insertions(+), 54 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 80eb96d..702e42d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -468,9 +468,9 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
-     and <function>stream_prepare_cb</function>
-     are required, while <function>stream_message_cb</function> and
+     <function>stream_commit_cb</function>, and <function>stream_change_cb</function>
+     are required, while <function>stream_prepare_cb</function>,
+     <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
 
@@ -478,9 +478,9 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
     An output plugin may also define functions to support two-phase commits,
     which allows actions to be decoded on the <command>PREPARE TRANSACTION</command>.
     The <function>begin_prepare_cb</function>, <function>prepare_cb</function>, 
-    <function>stream_prepare_cb</function>,
     <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
-    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    callbacks are required, while <function>filter_prepare_cb</function> and
+    <function>stream_prepare_cb</function> are optional.
     </para>
    </sect2>
 
@@ -1195,9 +1195,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
     provide additional callbacks. There are multiple two-phase commit callbacks
     that are required, (<function>begin_prepare_cb</function>,
     <function>prepare_cb</function>, <function>commit_prepared_cb</function>, 
-    <function>rollback_prepared_cb</function> and
-    <function>stream_prepare_cb</function>) and an optional callback
-    (<function>filter_prepare_cb</function>).
+    <function>rollback_prepared_cb</function>) and
+    optional callbacks (<function>stream_prepare_cb</function> and
+    <function>filter_prepare_cb</function>).
    </para>
 
    <para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if a prepared transaction with the given GID, LSN and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching prepared xacts
+ * from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..5324851 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1326,6 +1326,9 @@ stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	Assert(ctx->streaming);
 	Assert(ctx->twophase);
 
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		return;
+
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_prepare";
@@ -1340,12 +1343,6 @@ stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	ctx->write_xid = txn->xid;
 	ctx->write_location = txn->end_lsn;
 
-	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
-	if (ctx->callbacks.stream_prepare_cb == NULL)
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
-
 	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
 
 	/* Pop the error context stack */
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..e958d28 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,200 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepare message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..dc8f9ad 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -720,6 +721,157 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/*
+	 * The gid must not already be prepared.
+	 */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				errmsg("transaction identifier \"%s\" is already in use",
+					   begin_data.gid)));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1954,6 +2106,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2e4b39f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -322,8 +342,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +362,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +383,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool        send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +843,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1250,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..b797e3b 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure is used to
+ * hold the information for both prepare and commit prepared. For commit
+ * prepared, prepare_lsn and preparetime store the commit LSN and commit
+ * time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -170,5 +237,4 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 										  TransactionId subxid);
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
-
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95ae..745b51d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

v51-0001-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From 2e5e2177b27d433d55e3d9e610406c8e28ab3964 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sun, 7 Mar 2021 09:23:13 +1100
Subject: [PATCH v51] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replication origin replay progress
for 2PC, but it was incomplete. It did not properly track progress for
ROLLBACK PREPARED; in particular, it did not update the recovery code.
Additionally, tracking must be allowed on subscriber nodes, where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20..6023e7c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2276,6 +2276,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2298,6 +2306,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df..acdb28d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5716,8 +5716,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5923,7 +5922,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5972,6 +5972,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6013,7 +6020,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6021,7 +6029,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1
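
The two pieces of logic this patch relies on can be modelled in isolation. The following is a minimal sketch, not backend code: the typedefs and constants are local stand-ins for the PostgreSQL definitions, and origin_is_remote(), OriginProgress and origin_advance() are hypothetical names used only for illustration. It shows the "are we replaying remote actions?" predicate added to RecordTransactionAbortPrepared(), and a forward-only progress update of the kind requested by passing false for the "backward" argument of replorigin_advance() in xact_redo_abort().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the backend types and constants (assumptions). */
typedef uint16_t RepOriginId;
typedef uint64_t XLogRecPtr;

#define InvalidRepOriginId	0
#define DoNotReplicateId	UINT16_MAX
#define InvalidXLogRecPtr	0

/*
 * Hypothetical helper mirroring the predicate in the patch: a session is
 * replaying remote actions only if it has a real origin, i.e. neither the
 * invalid origin nor the "do not replicate" marker.
 */
bool
origin_is_remote(RepOriginId origin)
{
	return origin != InvalidRepOriginId && origin != DoNotReplicateId;
}

/* Simplified model of the progress tracked per replication origin. */
typedef struct OriginProgress
{
	XLogRecPtr	remote_lsn;		/* how far the remote stream was replayed */
	XLogRecPtr	local_lsn;		/* local WAL position of that replay */
} OriginProgress;

/*
 * Hypothetical model of an origin advance.  With go_backward == false the
 * recorded positions only ever move forward, so replaying an older record
 * cannot lose progress that was already recorded.
 */
void
origin_advance(OriginProgress *p, XLogRecPtr remote, XLogRecPtr local,
			   bool go_backward)
{
	assert(p != NULL);

	if (go_backward || p->remote_lsn < remote)
		p->remote_lsn = remote;
	if (local != InvalidXLogRecPtr &&
		(go_backward || p->local_lsn < local))
		p->local_lsn = local;
}
```

In this model, calling origin_advance() with older positions and go_backward set to false leaves the stored progress unchanged, which is the property recovery needs when it replays an abort-prepared record that carries origin information.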

v51-0003-Add-two_phase-option-to-CREATE-REPLICATION-SLOT.patch (application/octet-stream)
From b3efa6c5b05a4c6878541fbb368db66f590b3c21 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sun, 7 Mar 2021 10:15:36 +1100
Subject: [PATCH v51] Add two_phase option to CREATE REPLICATION SLOT.

This patch adds a new option to enable two_phase when creating a slot.
---
 src/backend/commands/subscriptioncmds.c                    |  2 +-
 .../replication/libpqwalreceiver/libpqwalreceiver.c        |  6 +++++-
 src/backend/replication/logical/tablesync.c                |  2 +-
 src/backend/replication/repl_gram.y                        | 14 +++++++++++---
 src/backend/replication/repl_scanner.l                     |  1 +
 src/backend/replication/walreceiver.c                      |  2 +-
 src/include/replication/walreceiver.h                      |  5 +++--
 7 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..f6793f0 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..50c3ea7 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1052,7 +1052,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..c5154ae 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -242,15 +244,16 @@ create_replication_slot:
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
-- 
1.8.3.1
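
To make the keyword order accepted by the new grammar rule concrete (slot name, then optional TEMPORARY, then optional TWO_PHASE, then LOGICAL and the plugin), here is a small standalone sketch. It uses snprintf() instead of the backend's StringInfo API, and build_create_slot_cmd() is a hypothetical name used only for illustration; the snapshot-action suffix that libpqrcv_create_slot() appends for logical slots is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a logical CREATE_REPLICATION_SLOT command in
 * the order the patched grammar expects.  Returns what snprintf() returns,
 * so the caller can detect truncation by comparing against buflen.
 */
int
build_create_slot_cmd(char *buf, size_t buflen, const char *slotname,
					  bool temporary, bool two_phase)
{
	assert(buf != NULL && slotname != NULL);

	return snprintf(buf, buflen,
					"CREATE_REPLICATION_SLOT \"%s\"%s%s LOGICAL pgoutput",
					slotname,
					temporary ? " TEMPORARY" : "",
					two_phase ? " TWO_PHASE" : "");
}
```

Because opt_two_phase sits between opt_temporary and K_LOGICAL in the rule, a command that put TWO_PHASE before TEMPORARY would presumably be rejected by the replication parser, so the order of the two appends in libpqrcv_create_slot() matters.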

v51-0005-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 49fe77c635218754d2a0868a5ee1e6da7499e231 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sun, 7 Mar 2021 10:47:16 +1100
Subject: [PATCH v51] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option, "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/create_subscription.sgml          | 29 +++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 72 +++++++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 +
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 ++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 +++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 ++-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 +
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 93 +++++++++++++++-------
 src/test/regress/sql/subscription.sql              | 25 ++++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 17 files changed, 261 insertions(+), 47 deletions(-)

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..e04b8d2 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,35 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          It is not allowed to combine <literal>streaming</literal> set to
+          <literal>true</literal> and <literal>two_phase</literal> set to
+          <literal>true</literal>.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          It is not allowed to combine <literal>two_phase</literal> set to
+          <literal>true</literal> and <literal>streaming</literal> set to
+          <literal>true</literal>.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73..060fab4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1168,7 +1168,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f6793f0..96fcf49 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option; doing so
+			 * could cause transactions to be missed and lead to an
+			 * inconsistent replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * the current implementation has some issues that could cause a streamed
+	 * prepared transaction to be incorrectly missed in the initial syncing
+	 * phase. Hence, this combination is disabled until that issue can be
+	 * addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +576,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +883,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +918,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if (sub->twophase && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +947,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +993,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1039,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index dc8f9ad..e5c7afd 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2613,6 +2613,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3259,6 +3260,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2e4b39f..91ecc55 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise,
+		 * we only allow it with a sufficient protocol version, and when the
+		 * output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..96c878b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two_phase are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index b797e3b..6c848c2 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..67b3358 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9aa483c..d56789d 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
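
For anyone skimming the regression output in the patch above, the option rules that subscription.sql exercises can be summarised roughly as in the sketch below. This is only a standalone illustration, not the code from the patch: the struct and both function names (SubOptsSketch, sketch_check_create, sketch_check_alter) are made up for the example, and the order of the two ALTER checks is an assumption, since the tests never pass both options at once. Only the error strings are taken from the expected output.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the parsed WITH (...) / SET (...) options.
 * This is NOT a PostgreSQL struct; it exists only for this sketch.
 */
typedef struct SubOptsSketch
{
	bool		streaming_given;	/* was "streaming" specified? */
	bool		streaming;			/* its value, if specified */
	bool		twophase_given;		/* was "two_phase" specified? */
	bool		twophase;			/* its value, if specified */
} SubOptsSketch;

/*
 * CREATE SUBSCRIPTION: return NULL if the combination is acceptable,
 * otherwise the error text seen in subscription.out.
 */
const char *
sketch_check_create(const SubOptsSketch *opts)
{
	if (opts->streaming_given && opts->streaming &&
		opts->twophase_given && opts->twophase)
		return "two_phase = true and streaming = true are mutually exclusive options";

	return NULL;
}

/*
 * ALTER SUBSCRIPTION ... SET (...): sub_twophase is the value already
 * stored for the subscription. The check order here is an assumption.
 */
const char *
sketch_check_alter(bool sub_twophase, const SubOptsSketch *opts)
{
	/* two_phase can only be chosen at CREATE time */
	if (opts->twophase_given)
		return "cannot alter two_phase option";

	/* streaming cannot be switched on for a two-phase subscription */
	if (sub_twophase && opts->streaming_given && opts->streaming)
		return "cannot set streaming = true for two-phase enabled subscription";

	return NULL;
}
```

With this sketch, sketch_check_alter(true, ...) rejects SET (streaming = true) but accepts SET (streaming = false), which is consistent with the \dRs+ output after the failed ALTERs still showing Streaming = f and Two phase commit = t.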

Attachment: v51-0004-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 32d5d4199b1a2567309cbca85499921201b9c08f Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sun, 7 Mar 2021 10:26:03 +1100
Subject: [PATCH v51] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 332 ++++++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl | 282 ++++++++++++++++++++
 2 files changed, 614 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9aa483c
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,332 @@
+# Test logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v51-0006-Fix-apply-worker-empty-prepare.patch (application/octet-stream)
From c6624619c861b723952d025a519b7f3094ca4aa1 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sun, 7 Mar 2021 11:47:29 +1100
Subject: [PATCH v51] Fix apply worker empty prepare.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With unlucky timing of the apply/tablesync workers it is possible to
have a "consistent snapshot" that spans prepare/commit in such a way
that the tablesync did not do the prepare (because the snapshot was not
yet consistent) and the apply worker does the begin prepare (‘b’) but
skips all the prepared operations [e.g. inserts] because the tablesync
was still busy (see the condition of should_apply_changes_for_rel).

This can lead to an "empty prepare", because later when the apply
worker does the commit prepare (‘K’), there is nothing in it (the
inserts were skipped earlier).

This patch implements a two-part fix as suggested [1] on hackers.

Part 1 - The begin_prepare handler of apply will always wait for any
busy tablesync workers to achieve SYNCDONE/READY state.

Part 2 - If (after Part 1) the apply-worker's prepare is found to be
lagging behind any of the sync-workers then the subsequent prepared
operations will be spooled to a file to be replayed at commit_prepared
time.

Discussion:
[1] https://www.postgresql.org/message-id/CAA4eK1L%3DdhuCRvyDvrXX5wZgc7s1hLRD29CKCK6oaHtVCPgiFA%40mail.gmail.com
---
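
The two-part fix described above can be pictured as one small decision
function. This is only a sketch under stated assumptions, not code from the
patch: SyncState, RelSync, PrepareAction and decide_begin_prepare are
hypothetical names, and LSNs are modelled as plain 64-bit integers rather
than XLogRecPtr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the per-table sync state kept by the patch. */
typedef enum { SYNC_BUSY, SYNC_DONE, SYNC_READY } SyncState;

typedef struct
{
	SyncState	state;		/* BUSY = not yet SYNCDONE/READY */
	uint64_t	lsn;		/* LSN the tablesync reached; 0 if unknown */
} RelSync;

typedef enum { ACTION_WAIT, ACTION_SPOOL, ACTION_APPLY } PrepareAction;

/*
 * Decide what the apply worker should do on begin_prepare.
 * final_lsn is the LSN carried by the begin_prepare message.
 */
static PrepareAction
decide_begin_prepare(const RelSync *rels, size_t nrels, uint64_t final_lsn)
{
	uint64_t	biggest_lsn = 0;
	size_t		i;

	assert(rels != NULL || nrels == 0);

	for (i = 0; i < nrels; i++)
	{
		/* Part 1: any tablesync not yet SYNCDONE/READY means keep waiting. */
		if (rels[i].state == SYNC_BUSY)
			return ACTION_WAIT;

		if (rels[i].lsn > biggest_lsn)
			biggest_lsn = rels[i].lsn;
	}

	/* Part 2: spool only if some tablesync got ahead of this begin_prepare. */
	return (final_lsn < biggest_lsn) ? ACTION_SPOOL : ACTION_APPLY;
}
```

In the patch itself the waiting is a loop around AnyTablesyncInProgress()
and the comparison uses BiggestTablesyncLSN(); the sketch folds both into a
single pass to make the ordering of the two checks explicit.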
 src/backend/replication/logical/tablesync.c | 178 ++++++--
 src/backend/replication/logical/worker.c    | 634 +++++++++++++++++++++++++++-
 src/include/replication/worker_internal.h   |   3 +
 src/tools/pgindent/typedefs.list            |   2 +
 4 files changed, 781 insertions(+), 36 deletions(-)
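
Part 2 spools each message to a file. The header comment of
prepare_spoolfile_write describes a record as a length (not counting the
length field itself), an action code, and the message contents. A minimal
sketch of that framing follows, assuming the fields are laid out in exactly
that order; psf_frame_record and psf_parse_record are hypothetical helpers
working on a memory buffer, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Frame one message as [int len][char action][payload], where len counts
 * the action byte plus the payload but not the length field itself.
 * payload must be non-NULL. Returns bytes written, or 0 if out is too small.
 */
static size_t
psf_frame_record(char *out, size_t outsz, char action,
				 const char *payload, int payload_len)
{
	int			len = payload_len + (int) sizeof(char);
	size_t		total = sizeof(len) + (size_t) len;

	assert(payload != NULL && payload_len >= 0);

	if (total > outsz)
		return 0;

	memcpy(out, &len, sizeof(len));
	out[sizeof(len)] = action;
	memcpy(out + sizeof(len) + 1, payload, (size_t) payload_len);
	return total;
}

/*
 * Parse one record from in. On success stores the action code and a pointer
 * to the payload, and returns the payload length; returns -1 on a short or
 * malformed record.
 */
static int
psf_parse_record(const char *in, size_t insz, char *action,
				 const char **payload)
{
	int			len;

	if (insz < sizeof(len))
		return -1;

	memcpy(&len, in, sizeof(len));
	if (len < 1 || (size_t) len > insz - sizeof(len))
		return -1;

	*action = in[sizeof(len)];
	*payload = in + sizeof(len) + 1;
	return len - 1;
}
```

Keeping the length first lets a replay loop read a fixed-size header, then
read exactly that many bytes and dispatch on the action code.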

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 50c3ea7..97fc399 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have still not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When the process_syncing_tables_for_apply changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e5c7afd..89988b8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -209,6 +209,54 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A prepare spoolfile hash entry. We create this entry in the psf_hash. It
+ * maps the name of the prepare spoolfile to a flag saying whether that file
+ * may be deleted at proc-exit.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	bool		allow_delete;	/* ok to delete? */
+} PsfHashEntry;
+
+/*
+ * Information about the "current" psf spoolfile.
+ */
+typedef struct PsfFile
+{
+	char	name[MAXPGPATH];/* psf name - same as the HTAB key. */
+	bool	is_spooling;	/* are we currently spooling to this file? */
+	File 	vfd;			/* -1 when the file is closed. */
+	off_t	cur_offset;		/* offset for the next write or read. Reset to 0
+							 * when file is opened. */
+} PsfFile;
+
+/*
+ * Hash table for storing the prepare spoolfile info.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * Information about the 'current' open spoolfile is only valid when spooling.
+ * This is flagged as 'is_spooling' only between begin_prepare and prepare.
+ */
+static PsfFile psf_cur = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_delete(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+static void prepare_spoolfile_on_proc_exit(int status, Datum arg);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -728,6 +776,9 @@ apply_handle_begin_prepare(StringInfo s)
 {
 	LogicalRepBeginPrepareData begin_data;
 
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
 	logicalrep_read_begin_prepare(s, &begin_data);
 
 	/*
@@ -739,6 +790,90 @@ apply_handle_begin_prepare(StringInfo s)
 				errmsg("transaction identifier \"%s\" is already in use",
 					   begin_data.gid)));
 
+	/*
+	 * With unlucky timing of the apply/tablesync workers it is possible to
+	 * have a "consistent snapshot" that spans prepare/commit in such a way
+	 * that the tablesync did not do the prepare (snapshot not yet consistent)
+	 * and the apply worker does the begin prepare ('b') but skips all the
+	 * prepared operations [e.g. inserts] because the tablesync was still
+	 * busy (see the condition of should_apply_changes_for_rel).
+	 *
+	 * This can lead to an "empty prepare", because later when the apply
+	 * worker does the commit prepare ('K'), there is nothing in it (the
+	 * inserts were skipped earlier).
+	 *
+	 * We avoid this using two-part logic: (1) Wait for all tablesync workers
+	 * to reach SYNCDONE/READY state; (2) If the begin_prepare lsn is now
+	 * behind any tablesync lsn then spool the prepared messages to a file
+	 * to be replayed later at commit_prepared time.
+	 *
+	 * -----
+	 *
+	 * XXX - The 2PC protocol needs the publisher to be aware when the PREPARE
+	 * has been successfully acted on. But because of this "empty prepare"
+	 * case now the prepared messages may be spooled to a file and, when
+	 * that happens the PREPARE would not happen at the usual time, but would
+	 * be deferred until COMMIT PREPARED time. This quirk could only happen
+	 * immediately after the initial table synchronization phase; once all
+	 * tables have achieved READY state the 2PC protocol will behave normally.
+	 *
+	 * A future release may be able to detect when all tables are READY and set
+	 * a flag to indicate this subscription/slot is ready for two_phase
+	 * decoding. Then at the publisher-side, we could enable wait-for-prepares
+	 * only when all the slots of WALSender have that flag set.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, relstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN then all messages (until the
+		 * prepare) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+
+			/*
+			 * Create the spoolfile.
+			 */
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * From now until apply_handle_prepare, we are spooling to the
+			 * current psf.
+			 */
+			psf_cur.is_spooling = true;
+		}
+	}
+
 	remote_final_lsn = begin_data.final_lsn;
 
 	in_remote_transaction = true;
@@ -754,9 +889,58 @@ apply_handle_prepare(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
 
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_cur.is_spooling)
+	{
+		PsfHashEntry *hentry;
+
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * Flush the spoolfile, so changes can survive a restart.
+		 */
+		FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);
+
+		/*
+		 * We are finished spooling to the current psf.
+		 */
+		psf_cur.is_spooling = false;
+
+		/*
+		 * The commit_prepare will need the spoolfile, so unregister it for
+		 * removal on proc-exit just in case there is an unexpected restart
+		 * between now and when commit_prepared happens.
+		 */
+		hentry = (PsfHashEntry *) hash_search(psf_hash,
+											  psf_cur.name,
+											  HASH_FIND,
+											  NULL);
+		Assert(hentry);
+		hentry->allow_delete = false;
+
+		/*
+		 * The psf_cur.vfd is meaningful only between begin_prepare and prepared.
+		 * So close it now. Any messages written to the psf will be applied
+		 * later during handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		in_remote_transaction = false;
+		return;
+	}
+
 	logicalrep_read_prepare(s, &prepare_data);
 
-	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+	/*
+	 * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+	 * was dispatched via the psf spoolfile replay then the remote_final_lsn
+	 * is set to commit lsn instead. Hence the <= instead of == check below.
+	 */
+	Assert(prepare_data.prepare_lsn <= remote_final_lsn);
 
 	if (IsTransactionState())
 	{
@@ -803,9 +987,38 @@ static void
 apply_handle_commit_prepared(StringInfo s)
 {
 	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
 
 	logicalrep_read_commit_prepared(s, &prepare_data);
 
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+
+		/*
+		 * Replay/dispatch the spooled messages (including lastly, the PREPARE
+		 * message).
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		/*
+		 * After replaying the psf it is no longer needed. Just delete it.
+		 */
+		prepare_spoolfile_delete(psfpath);
+	}
+
 	/* there is no transaction when COMMIT PREPARED is called */
 	ensure_transaction();
 
@@ -836,15 +1049,37 @@ static void
 apply_handle_rollback_prepared(StringInfo s)
 {
 	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
 
 	logicalrep_read_rollback_prepared(s, &rollback_data);
 
 	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		/*
+		 * We are finished with this spoolfile. Delete it.
+		 */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/*
 	 * It is possible that we haven't received prepare because it occurred
 	 * before walsender reached a consistent point in which case we need to
 	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
 	 */
-	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
 	{
 		/*
@@ -884,6 +1119,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		!psf_cur.is_spooling &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1302,6 +1538,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1423,6 +1662,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1581,6 +1823,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1950,6 +2195,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -2235,6 +2483,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	TimeLineID	tli;
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL     hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2354,7 +2619,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && !psf_cur.is_spooling)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -3102,6 +3367,9 @@ ApplyWorkerMain(Datum main_arg)
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
 
+	/* Arrange to delete any unwanted psf file(s) at proc-exit */
+	on_proc_exit(prepare_spoolfile_on_proc_exit, 0);
+
 	/* Setup signal handling */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
@@ -3279,3 +3547,363 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_cur.is_spooling ? "Do" : "Don't");
+
+	if (!psf_cur.is_spooling)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	bool		found;
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(!psf_cur.is_spooling);
+
+	/* create or find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  &found);
+
+	if (!found)
+	{
+		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m", path)));
+		}
+		memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+		psf_cur.cur_offset = 0;
+		hentry->allow_delete = true;
+	}
+	else
+	{
+		/*
+		 * Open the file and seek to the beginning because we always want to
+		 * create/overwrite this file.
+		 */
+		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", path)));
+		}
+		memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+		psf_cur.cur_offset = 0;
+		hentry->allow_delete = true;
+	}
+
+	/* Sanity checks */
+	Assert(psf_cur.vfd >= 0);
+	Assert(psf_cur.cur_offset == 0);
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close()
+{
+	if (psf_cur.vfd >= 0)
+		FileClose(psf_cur.vfd);
+
+	/* Mark this fd as not valid to use anymore. */
+	psf_cur.is_spooling = false;
+	psf_cur.vfd = -1;
+	psf_cur.cur_offset = 0;
+}
+
+/*
+ * Delete the specified psf spoolfile, and any HTAB associated with it.
+ */
+static void
+prepare_spoolfile_delete(char *path)
+{
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Delete the file off the disk. */
+	unlink(path);
+
+	/* Remove any entry from the psf_hash, if present */
+	hash_search(psf_hash, path, HASH_REMOVE, NULL);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+	int			bytes_written;
+
+	Assert(psf_cur.is_spooling);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(len));
+	psf_cur.cur_offset += bytes_written;
+
+	/* then the action */
+	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(action));
+	psf_cur.cur_offset += bytes_written;
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == len);
+	psf_cur.cur_offset += bytes_written;
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool		found;
+	PsfHashEntry *hentry;
+
+	/* Find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_FIND,
+										  &found);
+
+	if (!found)
+	{
+		/*
+		 * Hash doesn't know about it, but perhaps the Hash was destroyed by a
+		 * restart, so let's check the file existence on disk.
+		 */
+		File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+		found = fd >= 0;
+		if (fd >= 0)
+			FileClose(fd);
+
+		/*
+		 * And if it was found on disk then create the HTAB entry for it.
+		 */
+		if (found)
+		{
+			hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  NULL);
+			hentry->allow_delete = false;
+		}
+	}
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	psf.vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (psf.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from prepared spoolfile \"%s\": %m",
+						path)));
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	remote_final_lsn = final_lsn;
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		nbytes = FileRead(psf.vfd, buffer, len,
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+		if (nbytes != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldctx2);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	FileClose(psf.vfd);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly same lengths and padded
+	 * with '\0' so garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "pg_twophase/%u-%s.prep_changes", subid, gid);
+}
+
+/*
+ * proc_exit callback to remove unwanted psf files.
+ */
+static void
+prepare_spoolfile_on_proc_exit(int status, Datum arg)
+{
+	HASH_SEQ_STATUS seq_status;
+	PsfHashEntry *hentry;
+
+	/* Iterate the HTAB looking for what file can be deleted. */
+	if (psf_hash)
+	{
+		hash_seq_init(&seq_status, psf_hash);
+		while ((hentry = (PsfHashEntry *) hash_seq_search(&seq_status)) != NULL)
+		{
+			char *path = hentry->name;
+
+			if (hentry->allow_delete)
+				prepare_spoolfile_delete(path);
+		}
+	}
+}
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 745b51d..4ffcef5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1958,6 +1958,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1

v51-0007-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From 32d994f744dd1a08440ceaa197a8f938091b694d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sun, 7 Mar 2021 12:00:16 +1100
Subject: [PATCH v51] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for
debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 +++++++++---
 src/backend/replication/logical/worker.c    | 73 +++++++++++++++++++++++------
 2 files changed, 81 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 97fc399..f3984d4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 89988b8..430cf57 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -830,14 +830,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			process_syncing_tables(begin_data.final_lsn);
 
 			/* This latch is to prevent 100% CPU looping. */
@@ -855,7 +857,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 
@@ -897,6 +904,8 @@ apply_handle_prepare(StringInfo s)
 	{
 		PsfHashEntry *hentry;
 
+		elog(LOG, "!!> apply_handle_prepare: SPOOLING");
+
 		/* Write the PREPARE info to the psf file. */
 		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
 
@@ -915,6 +924,8 @@ apply_handle_prepare(StringInfo s)
 		 * removal on proc-exit just in case there is an unexpected restart
 		 * between now and when commit_prepared happens.
 		 */
+		elog(LOG,
+			"!!> apply_handle_prepare: Make sure the spoolfile is not removed on proc-exit");
 		hentry = (PsfHashEntry *) hash_search(psf_hash,
 											  psf_cur.name,
 											  HASH_FIND,
@@ -1001,6 +1012,8 @@ apply_handle_commit_prepared(StringInfo s)
 	{
 		int			nchanges;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * Replay/dispatch the spooled messages (including lastly, the PREPARE
 		 * message).
@@ -1009,8 +1022,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		/*
@@ -1078,6 +1091,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2356,18 +2370,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3561,8 +3579,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_cur.is_spooling ? "Do" : "Don't");
 
@@ -3586,7 +3604,7 @@ prepare_spoolfile_create(char *path)
 	bool		found;
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(!psf_cur.is_spooling);
 
@@ -3598,7 +3616,7 @@ prepare_spoolfile_create(char *path)
 
 	if (!found)
 	{
-		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 		{
@@ -3616,7 +3634,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the beginning because we always want to
 		 * create/overwrite this file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		elog(LOG, "!!> Found file \"%s\". Overwrite it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 		{
@@ -3641,6 +3659,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_cur.vfd >= 0)
 		FileClose(psf_cur.vfd);
 
@@ -3656,6 +3675,8 @@ prepare_spoolfile_close()
 static void
 prepare_spoolfile_delete(char *path)
 {
+	elog(LOG, "!!> prepare_spoolfile_delete: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3681,18 +3702,20 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_cur.is_spooling);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(len));
 	psf_cur.cur_offset += bytes_written;
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(action));
@@ -3701,6 +3724,7 @@ prepare_spoolfile_write(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == len);
@@ -3734,6 +3758,12 @@ prepare_spoolfile_exists(char *path)
 		if (fd >= 0)
 			FileClose(fd);
 
+		elog(LOG,
+			 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was "
+			 "not found in the HTAB, but was %s on the disk.",
+			 path,
+			 found ? "found" : "not found");
+
 		/*
 		 * And if it was found on disk then create the HTAB entry for it.
 		 */
@@ -3743,10 +3773,16 @@ prepare_spoolfile_exists(char *path)
 										  path,
 										  HASH_ENTER,
 										  NULL);
+			elog(LOG, "!!> prepare_spoolfile_exists: Created new HTAB entry '%s'", hentry->name);
 			hentry->allow_delete = false;
 		}
 	}
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
 	return found;
 }
 
@@ -3763,8 +3799,8 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 				oldctx2;
 	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3809,6 +3845,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3827,6 +3864,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		nbytes = FileRead(psf.vfd, buffer, len,
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
@@ -3843,7 +3881,9 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		/* Ensure we are reading the data into our memory context. */
 		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
 
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: Before dispatch");
 		apply_dispatch(&s2);
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: After dispatch");
 
 		MemoryContextReset(ApplyMessageContext);
 
@@ -3852,13 +3892,13 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(LOG, "!!> replayed %d changes from file '%s'",
 				 nchanges, path);
 	}
 
 	FileClose(psf.vfd);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
@@ -3894,6 +3934,8 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 	HASH_SEQ_STATUS seq_status;
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_on_proc_exit");
+
 	/* Iterate the HTAB looking for what file can be deleted. */
 	if (psf_hash)
 	{
@@ -3902,6 +3944,7 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 		{
 			char *path = hentry->name;
 
+			elog(LOG, "!!> prepare_spoolfile_proc_exit: found '%s'", path);
 			if (hentry->allow_delete)
 				prepare_spoolfile_delete(path);
 		}
-- 
1.8.3.1

#219Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#218)

On Sun, Mar 7, 2021 at 7:35 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v51*

A few more comments on v51-0006-Fix-apply-worker-empty-prepare:
======================================================
1.
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the
psf_hash. This is
+ * for maintaining a mapping between the name of the prepared
spoolfile, and the
+ * corresponding fileset handles of same.
+ */
+typedef struct PsfHashEntry
+{
+ char name[MAXPGPATH]; /* Hash key --- must be first */
+ bool allow_delete; /* ok to delete? */
+} PsfHashEntry;
+

IIUC, this hash table is used for two purposes in the patch: (a) to
check for the existence of the prepare spool file, where we anyway have
to check on disk if it is not found in the hash table; (b) to allow the
prepare spool file to be removed on proc_exit.

I think we don't need the optimization provided by (a) because the case
is too rare to deserve any optimization; at most we might write a
comment in prepare_spoolfile_exists noting that such an optimization is
possible. For (b), we can use a simple list to track the files to be
removed on proc_exit, something like what we do in CreateLockFile. I
think avoiding hash table usage will reduce the code and the chances of
bugs in this area. It won't be easy to write a lot of automated tests
for this functionality, so it is better to avoid minor optimizations at
this stage.
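
To illustrate the list-based alternative suggested in (b), here is a minimal
standalone sketch. It uses plain libc rather than PostgreSQL's List and
on_proc_exit machinery, and every name in it (PendingFile, pending_file_add,
pending_file_forget, pending_files_cleanup) is hypothetical and not taken
from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical list node: one spoolfile path that is still removable. */
typedef struct PendingFile
{
	struct PendingFile *next;
	char	   *path;
} PendingFile;

static PendingFile *pending_files = NULL;

/* Remember a file that should be removed when the process exits. */
void
pending_file_add(const char *path)
{
	PendingFile *pf = malloc(sizeof(PendingFile));

	assert(pf != NULL);
	pf->path = malloc(strlen(path) + 1);
	assert(pf->path != NULL);
	strcpy(pf->path, path);
	pf->next = pending_files;
	pending_files = pf;
}

/*
 * Stop tracking a file, e.g. once its PREPARE has been spooled and the file
 * must survive a restart.  Returns true if the path was being tracked.
 */
bool
pending_file_forget(const char *path)
{
	PendingFile **link = &pending_files;

	while (*link != NULL)
	{
		PendingFile *pf = *link;

		if (strcmp(pf->path, path) == 0)
		{
			*link = pf->next;
			free(pf->path);
			free(pf);
			return true;
		}
		link = &pf->next;
	}
	return false;
}

/* Remove every file still tracked; returns how many were actually removed. */
int
pending_files_cleanup(void)
{
	int			nremoved = 0;

	while (pending_files != NULL)
	{
		PendingFile *pf = pending_files;

		pending_files = pf->next;
		if (remove(pf->path) == 0)
			nremoved++;
		free(pf->path);
		free(pf);
	}
	return nremoved;
}

/* Wrapper with the signature atexit() expects. */
void
pending_files_atexit(void)
{
	(void) pending_files_cleanup();
}
```

A caller would register the cleanup once with atexit(pending_files_atexit),
call pending_file_add() when a spoolfile is created, and call
pending_file_forget() at the point where the patch currently clears
allow_delete.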

2.
+ /*
+ * Replay/dispatch the spooled messages (including lastly, the PREPARE
+ * message).
+ */
+
+ ensure_transaction();

The part of the comment "including lastly, the PREPARE message"
doesn't seem to fit here because this part of the code does nothing
special for the PREPARE message. Nor do we verify in any way that the
PREPARE message is replayed.
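
If such a verification were wanted, one possible shape is sketched below. It
assumes the record layout described in prepare_spoolfile_write (a length that
excludes itself, then the action byte, then the payload) and works on an
in-memory buffer rather than a file. The helper names and the MSG_PREPARE
constant are hypothetical stand-ins, not the patch's code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed stand-in for the PREPARE action code (LOGICAL_REP_MSG_PREPARE). */
#define MSG_PREPARE 'P'

/*
 * Append one record at offset "off": a length field covering the action byte
 * plus payload, then the action byte, then the payload.  Returns the new
 * offset, or 0 if the record does not fit in "cap" bytes.
 */
size_t
spool_append(char *buf, size_t cap, size_t off, char action,
			 const void *payload, int32_t paylen)
{
	int32_t		len = paylen + (int32_t) sizeof(char);

	if (paylen < 0 || off > cap || cap - off < sizeof(len) + (size_t) len)
		return 0;

	memcpy(buf + off, &len, sizeof(len));
	off += sizeof(len);
	buf[off++] = action;
	if (paylen > 0)
		memcpy(buf + off, payload, (size_t) paylen);
	return off + (size_t) paylen;
}

/*
 * Walk all records in buf[0..size).  Returns the record count and stores the
 * action byte of the final record in *last_action, or returns -1 if a record
 * is truncated or has a non-positive length.
 */
int
spool_scan(const char *buf, size_t size, char *last_action)
{
	size_t		off = 0;
	int			nrecords = 0;

	*last_action = '\0';
	while (off < size)
	{
		int32_t		len;

		if (size - off < sizeof(len))
			return -1;
		memcpy(&len, buf + off, sizeof(len));
		off += sizeof(len);

		if (len <= 0 || (size_t) len > size - off)
			return -1;

		*last_action = buf[off];
		off += (size_t) len;
		nrecords++;
	}
	return nrecords;
}
```

With something like this, the replay path could assert after the loop that
the last action seen was the PREPARE code, which would make the "including
lastly" wording an enforced property rather than an expectation.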

3.
+ /* create or find the prepare spoolfile entry in the psf_hash */
+ hentry = (PsfHashEntry *) hash_search(psf_hash,
+   path,
+   HASH_ENTER | HASH_FIND,
+   &found);
+
+ if (!found)
+ {
+ elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+ psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+ if (psf_cur.vfd < 0)
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", path)));
+ }
+ memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+ psf_cur.cur_offset = 0;
+ hentry->allow_delete = true;
+ }
+ else
+ {
+ /*
+ * Open the file and seek to the beginning because we always want to
+ * create/overwrite this file.
+ */
+ elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+ psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+ if (psf_cur.vfd < 0)
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+ }
+ memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+ psf_cur.cur_offset = 0;
+ hentry->allow_delete = true;
+ }

Is it sufficient to check whether the prepared file exists in the hash
table? Isn't it possible that after a restart the file exists on disk
but is not in the hash table? I guess this will change if you address
the first comment.
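
For reference, a check that relies only on the disk state (and therefore
still works after a restart has emptied any in-memory table) could look
roughly like the sketch below. It uses POSIX stat() instead of the backend's
PathNameOpenFile, and spoolfile_exists_on_disk is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: report whether a regular file exists at "path",
 * consulting only the filesystem.
 */
bool
spoolfile_exists_on_disk(const char *path)
{
	struct stat st;

	if (stat(path, &st) == 0)
		return S_ISREG(st.st_mode);

	/* A missing file is the normal "no" answer; report anything else. */
	if (errno != ENOENT)
		fprintf(stderr, "could not stat \"%s\": %s\n", path, strerror(errno));

	return false;
}
```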

4.
@@ -754,9 +889,58 @@ apply_handle_prepare(StringInfo s)
{
LogicalRepPreparedTxnData prepare_data;

+ /*
+ * If we were using a psf spoolfile, then write the PREPARE as the final
+ * message. This prepare information will be used at commit_prepared time.
+ */
+ if (psf_cur.is_spooling)
+ {
+ PsfHashEntry *hentry;
+
+ /* Write the PREPARE info to the psf file. */
+ prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+ /*
+ * Flush the spoolfile, so changes can survive a restart.
+ */
+ FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);

I think in an ideal world we would only need to flush the spool file(s)
when the replication origin is advanced, because from that point on we
won't get this data again after a restart. Conversely, if the publisher
sends the data again after a restart because the origin on the
subscriber had not moved past this prepare, you need to overwrite the
existing file. The patch already does that, but I think it is better to
add some comments explaining this.
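
The overwrite-then-flush behaviour can be shown in isolation with plain POSIX
calls. This is only a sketch of the idea, not the patch's code, which goes
through PathNameOpenFile, FileWrite and FileSync. The function name
spool_rewrite_and_sync is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Replace the whole content of "path" with "data" and force it to disk.
 * O_TRUNC discards whatever a previous, interrupted attempt left behind, so
 * a resend from the publisher simply starts the file over.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int
spool_rewrite_and_sync(const char *path, const void *data, size_t len)
{
	const char *p = data;
	size_t		left = len;
	int			fd;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	while (left > 0)
	{
		ssize_t		n = write(fd, p, left);

		if (n < 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted before writing; retry */
			close(fd);
			return -1;
		}
		p += n;
		left -= (size_t) n;
	}

	/* Make the content durable before reporting success. */
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}

	return close(fd);
}
```

Calling it twice with different payloads leaves only the second payload on
disk, which is the property the requested comment would document.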

5. Can you please test some subtransaction cases (by having savepoints
for the prepared transaction) which pass through the spool file logic?
Something like below with maybe more savepoints.
postgres=# begin;
BEGIN
postgres=*# insert into t1 values(1);
INSERT 0 1
postgres=*# savepoint s1;
SAVEPOINT
postgres=*# insert into t1 values(2);
INSERT 0 1
postgres=*# prepare transaction 'foo';
PREPARE TRANSACTION

I don't see any obvious problem in such cases but it is better to test.

6. Patches 0003 and 0006 can be merged into patch 0002, as that will
enable the complete functionality for 0002. I understand that you have
kept them separate for easier review, but I guess at this stage it is
better to merge them so that the complete functionality can be reviewed.

--
With Regards,
Amit Kapila.

#220Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#218)
5 attachment(s)

Please find attached the latest patch set v52*

Differences from v51* are:

* Rebased to HEAD @ today

* No code changes; only a merging of the v51 patches as requested [ak].

v52-0001 <== v51-0001 "track replication origin"
v52-0002 <== v51-0002 "add support for apply at prepare time" +
v51-0003 "add two phase option for create slot" + v51-0006 "fix apply
worker empty prepare"
v52-0003 <== v51-0004 "2pc tests"
v52-0004 <== v51-0005 "Subscription option"
v52-0005 <== v51-0007 "empty prepare extra logging"

-----
[ak] /messages/by-id/CAA4eK1+dO07RrQwfHAK5jDP9qiXik4-MVzy+coEG09shWTJFGg@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

On Sun, Mar 7, 2021 at 1:04 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v51*

Differences from v50* are:

* Rebased to HEAD @ today

* Addresses following feedback comments:

From Osunmi-san @ 2021-03-06 [ot]
- (27) Fixed. Patch 0003. Remove weird chars from the error message.

From Amit @ 2021-03-06 [ak]
- (29) Removed patch 0007 "tablesync early exit" from this patch set.
I started a new thread [early-exit] for this.
- (30) Removed patch 0001 "refactor spool file logic" from this patch set.
- (31) Fixed. Patch 0005 removed TODO from test code.
- (32) Fixed. Patch 0006 comment typo.
- (33) Fixed. Patch 0008 commit message lines were too long
- (34) Fixed. Patch 0008 comment reworded avoiding words like
"problem" and "fix"

-----
[ot] /messages/by-id/OSBPR01MB4888636EB9421C930FB39A19ED959@OSBPR01MB4888.jpnprd01.prod.outlook.com
[ak] /messages/by-id/CAA4eK1Jxu-3qxtkfA_dKoquQgGZVcB+k9_-yT5=9GDEW84TF+A@mail.gmail.com
[early-exit] /messages/by-id/CAHut+Pt39PbQs0SxT9RMM89aYiZoQ0Kw46YZSkKZwK8z5HOr3g@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v52-0001-Track-replication-origin-progress-for-rollbacks.patch (application/octet-stream)
From d70940144bac9dfe0cf52a7339afea46a51c4572 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 8 Mar 2021 11:42:34 +1100
Subject: [PATCH v52] Track replication origin progress for rollbacks.

Commit 1eb6d6527a allowed tracking of replica origin replay progress for
2PC, but it was not complete. It fails to properly track the progress for
rollback prepared; in particular, it missed updating the code for recovery.
Additionally, we need to allow tracking it on subscriber nodes where
wal_level might not be logical.

Author: Amit Kapila
---
 src/backend/access/transam/twophase.c | 13 +++++++++++++
 src/backend/access/transam/xact.c     | 19 ++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20..6023e7c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2276,6 +2276,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
 							   const char *gid)
 {
 	XLogRecPtr	recptr;
+	bool		replorigin;
+
+	/*
+	 * Are we using the replication origins feature?  Or, in other words, are
+	 * we replaying remote actions?
+	 */
+	replorigin = (replorigin_session_origin != InvalidRepOriginId &&
+				  replorigin_session_origin != DoNotReplicateId);
 
 	/*
 	 * Catch the scenario where we aborted partway through
@@ -2298,6 +2306,11 @@ RecordTransactionAbortPrepared(TransactionId xid,
 								MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
 								xid, gid);
 
+	if (replorigin)
+		/* Move LSNs forward for this replication origin */
+		replorigin_session_advance(replorigin_session_origin_lsn,
+								   XactLastRecEnd);
+
 	/* Always flush, since we're about to remove the 2PC state file */
 	XLogFlush(recptr);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4e6a3df..acdb28d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5716,8 +5716,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* dump transaction origin information only for abort prepared */
 	if ((replorigin_session_origin != InvalidRepOriginId) &&
-		TransactionIdIsValid(twophase_xid) &&
-		XLogLogicalInfoActive())
+		TransactionIdIsValid(twophase_xid))
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_ORIGIN;
 
@@ -5923,7 +5922,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
  * because subtransaction commit is never WAL logged.
  */
 static void
-xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+				XLogRecPtr lsn, RepOriginId origin_id)
 {
 	TransactionId max_xid;
 
@@ -5972,6 +5972,13 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 			StandbyReleaseLockTree(xid, parsed->nsubxacts, parsed->subxacts);
 	}
 
+	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+	{
+		/* recover apply progress */
+		replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+						   false /* backward */, false /* WAL */);
+	}
+
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
 }
@@ -6013,7 +6020,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, XLogRecGetXid(record));
+		xact_redo_abort(&parsed, XLogRecGetXid(record),
+						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT_PREPARED)
 	{
@@ -6021,7 +6029,8 @@ xact_redo(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
-		xact_redo_abort(&parsed, parsed.twophase_xid);
+		xact_redo_abort(&parsed, parsed.twophase_xid,
+						record->EndRecPtr, XLogRecGetOrigin(record));
 
 		/* Delete TwoPhaseState gxact entry and/or 2PC file. */
 		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-- 
1.8.3.1

v52-0003-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From bf021e660fa35ca3e9b148940365dfce64c19c36 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 8 Mar 2021 12:15:15 +1100
Subject: [PATCH v52] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 332 ++++++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl | 282 ++++++++++++++++++++
 2 files changed, 614 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9aa483c
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,332 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v52-0002-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From d93d77e95fdca9a60d18f02f64dc452fa1e10507 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 8 Mar 2021 12:00:26 +1100
Subject: [PATCH v52] Add support for apply at prepare time to built-in logical
  replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle two-phase
transactions by replaying them on prepare.

* The prepare API for streaming transactions is not supported.

* Change stream_prepare_cb from a required callback to an optional callback.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

* This patch also adds a new option to enable two_phase while creating a slot.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/logicaldecoding.sgml                  |  16 +-
 src/backend/access/transam/twophase.c              |  68 ++
 src/backend/commands/subscriptioncmds.c            |   2 +-
 .../libpqwalreceiver/libpqwalreceiver.c            |   6 +-
 src/backend/replication/logical/logical.c          |   9 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 194 +++++
 src/backend/replication/logical/tablesync.c        | 180 ++++-
 src/backend/replication/logical/worker.c           | 804 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 157 +++-
 src/backend/replication/repl_gram.y                |  14 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/replication/logicalproto.h             |  70 +-
 src/include/replication/reorderbuffer.h            |  12 +
 src/include/replication/walreceiver.h              |   5 +-
 src/include/replication/worker_internal.h          |   3 +
 src/tools/pgindent/typedefs.list                   |   5 +
 19 files changed, 1460 insertions(+), 97 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 80eb96d..702e42d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -468,9 +468,9 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
      An output plugin may also define functions to support streaming of large,
      in-progress transactions. The <function>stream_start_cb</function>,
      <function>stream_stop_cb</function>, <function>stream_abort_cb</function>,
-     <function>stream_commit_cb</function>, <function>stream_change_cb</function>,
-     and <function>stream_prepare_cb</function>
-     are required, while <function>stream_message_cb</function> and
+     <function>stream_commit_cb</function>, and <function>stream_change_cb</function>
+     are required, while <function>stream_prepare_cb</function>,
+     <function>stream_message_cb</function> and
      <function>stream_truncate_cb</function> are optional.
     </para>
 
@@ -478,9 +478,9 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb);
     An output plugin may also define functions to support two-phase commits,
     which allows actions to be decoded on the <command>PREPARE TRANSACTION</command>.
     The <function>begin_prepare_cb</function>, <function>prepare_cb</function>, 
-    <function>stream_prepare_cb</function>,
     <function>commit_prepared_cb</function> and <function>rollback_prepared_cb</function>
-    callbacks are required, while <function>filter_prepare_cb</function> is optional.
+    callbacks are required, while <function>filter_prepare_cb</function> and
+    <function>stream_prepare_cb</function> are optional.
     </para>
    </sect2>
 
@@ -1195,9 +1195,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
     provide additional callbacks. There are multiple two-phase commit callbacks
     that are required, (<function>begin_prepare_cb</function>,
     <function>prepare_cb</function>, <function>commit_prepared_cb</function>, 
-    <function>rollback_prepared_cb</function> and
-    <function>stream_prepare_cb</function>) and an optional callback
-    (<function>filter_prepare_cb</function>).
+    <function>rollback_prepared_cb</function>) and
+    optional callbacks (<function>stream_prepare_cb</function> and
+    <function>filter_prepare_cb</function>).
    </para>
 
    <para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid matching prepared xacts
+ * from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..f6793f0 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..5324851 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1326,6 +1326,9 @@ stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	Assert(ctx->streaming);
 	Assert(ctx->twophase);
 
+	if (ctx->callbacks.stream_prepare_cb == NULL)
+		return;
+
 	/* Push callback + info on the error context stack */
 	state.ctx = ctx;
 	state.callback_name = "stream_prepare";
@@ -1340,12 +1343,6 @@ stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 	ctx->write_xid = txn->xid;
 	ctx->write_location = txn->end_lsn;
 
-	/* in streaming mode with two-phase commits, stream_prepare_cb is required */
-	if (ctx->callbacks.stream_prepare_cb == NULL)
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("logical streaming at prepare time requires a stream_prepare_cb callback")));
-
 	ctx->callbacks.stream_prepare_cb(ctx, txn, prepare_lsn);
 
 	/* Pop the error context stack */
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..e958d28 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,200 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..97fc399 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1052,7 +1026,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When process_syncing_tables_for_apply changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..c9b1dea 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -208,6 +209,54 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. It
+ * maps the name of the prepared spoolfile to a flag saying whether the file
+ * may be deleted.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	bool		allow_delete;	/* ok to delete? */
+} PsfHashEntry;
+
+/*
+ * Information about the "current" psf spoolfile.
+ */
+typedef struct PsfFile
+{
+	char	name[MAXPGPATH];/* psf name - same as the HTAB key. */
+	bool	is_spooling;	/* are we currently spooling to this file? */
+	File 	vfd;			/* -1 when the file is closed. */
+	off_t	cur_offset;		/* offset for the next write or read. Reset to 0
+							 * when file is opened. */
+} PsfFile;
+
+/*
+ * Hash table for storing the Prepared spoolfile info along with shared fileset.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * Information about the 'current' open spoolfile is only valid when spooling.
+ * This is flagged as 'is_spooling' only between begin_prepare and prepare.
+ */
+static PsfFile psf_cur = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_delete(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+static void prepare_spoolfile_on_proc_exit(int status, Datum arg);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -720,6 +769,344 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/*
+	 * The gid must not already be prepared.
+	 */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				errmsg("transaction identifier \"%s\" is already in use",
+					   begin_data.gid)));
+
+	/*
+	 * With unlucky timing of the apply/tablesync workers it is possible to
+	 * have a "consistent snapshot" that spans prepare/commit in such a way
+	 * that the tablesync did not do the prepare (because the snapshot was not
+	 * yet consistent) and the apply worker does the begin prepare ('b') but
+	 * skips all the prepared operations [e.g. inserts] while the tablesync
+	 * was still busy (see the condition of should_apply_changes_for_rel).
+	 *
+	 * This can lead to an "empty prepare", because later when the apply
+	 * worker does the commit prepared ('K'), there is nothing in it (the
+	 * inserts were skipped earlier).
+	 *
+	 * We avoid this using the 2 part logic: (1) Wait for all tablesync workers
+	 * to reach SYNCDONE/READY state; (2) If the begin_prepare lsn is now
+	 * behind any tablesync lsn then spool the prepared messages to a file
+	 * to be replayed later at commit_prepared time.
+	 *
+	 * -----
+	 *
+	 * XXX - The 2PC protocol needs the publisher to be aware when the PREPARE
+	 * has been successfully acted on. But because of this "empty prepare"
+	 * case, the prepared messages may be spooled to a file. When that
+	 * happens, the PREPARE does not happen at the usual time but is
+	 * deferred until COMMIT PREPARED time. This quirk can only happen
+	 * immediately after the initial table synchronization phase; once all
+	 * tables have achieved READY state the 2PC protocol will behave normally.
+	 *
+	 * A future release may be able to detect when all tables are READY and set
+	 * a flag to indicate this subscription/slot is ready for two_phase
+	 * decoding. Then at the publisher-side, we could enable wait-for-prepares
+	 * only when all the slots of WALSender have that flag set.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN, then all messages (until the
+		 * prepare) are saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+
+			/*
+			 * Create the spoolfile.
+			 */
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * From now until apply_handle_prepare, we are spooling to the
+			 * current psf.
+			 */
+			psf_cur.is_spooling = true;
+		}
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_cur.is_spooling)
+	{
+		PsfHashEntry *hentry;
+
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * Flush the spoolfile, so changes can survive a restart.
+		 */
+		FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);
+
+		/*
+		 * We are finished spooling to the current psf.
+		 */
+		psf_cur.is_spooling = false;
+
+		/*
+		 * The commit_prepared will need the spoolfile, so unregister it for
+		 * removal on proc-exit just in case there is an unexpected restart
+		 * between now and when commit_prepared happens.
+		 */
+		hentry = (PsfHashEntry *) hash_search(psf_hash,
+											  psf_cur.name,
+											  HASH_FIND,
+											  NULL);
+		Assert(hentry);
+		hentry->allow_delete = false;
+
+		/*
+		 * The psf_cur.vfd is meaningful only between begin_prepare and prepare.
+		 * So close it now. Any messages written to the psf will be applied
+		 * later during handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		in_remote_transaction = false;
+		return;
+	}
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	/*
+	 * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+	 * was dispatched via the psf spoolfile replay then the remote_final_lsn
+	 * is set to commit lsn instead. Hence the <= instead of == check below.
+	 */
+	Assert(prepare_data.prepare_lsn <= remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+
+		/*
+		 * Replay/dispatch the spooled messages (including lastly, the PREPARE
+		 * message).
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		/*
+		 * After replaying the psf it is no longer needed. Just delete it.
+		 */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then clean up
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		/*
+		 * We are finished with this spoolfile. Delete it.
+		 */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/*
+	 * It is possible that we haven't received the prepare because it occurred
+	 * before the walsender reached a consistent point, in which case we need
+	 * to skip the rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
+	 */
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -732,6 +1119,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		!psf_cur.is_spooling &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1150,6 +1538,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1271,6 +1662,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1429,6 +1823,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1798,6 +2195,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1954,6 +2354,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2061,6 +2483,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	TimeLineID	tli;
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL     hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2180,7 +2619,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && !psf_cur.is_spooling)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2927,6 +3366,9 @@ ApplyWorkerMain(Datum main_arg)
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
 
+	/* Arrange to delete any unwanted psf file(s) at proc-exit */
+	on_proc_exit(prepare_spoolfile_on_proc_exit, 0);
+
 	/* Setup signal handling */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
@@ -3103,3 +3545,363 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_cur.is_spooling ? "Do" : "Don't");
+
+	if (!psf_cur.is_spooling)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	bool		found;
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(!psf_cur.is_spooling);
+
+	/* create or find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  &found);
+
+	if (!found)
+	{
+		elog(DEBUG1, "file \"%s\" not found, creating it", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m", path)));
+		}
+		memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+		psf_cur.cur_offset = 0;
+		hentry->allow_delete = true;
+	}
+	else
+	{
+		/*
+		 * Open the file with O_TRUNC because we always want to
+		 * create/overwrite this file.
+		 */
+		elog(DEBUG1, "file \"%s\" found, overwriting it", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", path)));
+		}
+		memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+		psf_cur.cur_offset = 0;
+		hentry->allow_delete = true;
+	}
+
+	/* Sanity checks */
+	Assert(psf_cur.vfd >= 0);
+	Assert(psf_cur.cur_offset == 0);
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close(void)
+{
+	if (psf_cur.vfd >= 0)
+		FileClose(psf_cur.vfd);
+
+	/* Mark this fd as not valid to use anymore. */
+	psf_cur.is_spooling = false;
+	psf_cur.vfd = -1;
+	psf_cur.cur_offset = 0;
+}
+
+/*
+ * Delete the specified psf spoolfile, and any HTAB entry associated with it.
+ */
+static void
+prepare_spoolfile_delete(char *path)
+{
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Delete the file off the disk. */
+	unlink(path);
+
+	/* Remove any entry from the psf_hash, if present */
+	hash_search(psf_hash, path, HASH_REMOVE, NULL);
+}
+
+/*
+ * Serialize a change to the current prepare spoolfile.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents.
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+	int			bytes_written;
+
+	Assert(psf_cur.is_spooling);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	bytes_written = FileWrite(psf_cur.vfd, (char *) &len, sizeof(len),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(len));
+	psf_cur.cur_offset += bytes_written;
+
+	/* then the action */
+	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(action));
+	psf_cur.cur_offset += bytes_written;
+
+	/* and finally the remaining part of the buffer (after the action byte) */
+	len = (s->len - s->cursor);
+
+	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == len);
+	psf_cur.cur_offset += bytes_written;
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool		found;
+	PsfHashEntry *hentry;
+
+	/* Find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_FIND,
+										  &found);
+
+	if (!found)
+	{
+		/*
+		 * Hash doesn't know about it, but perhaps the Hash was destroyed by a
+		 * restart, so let's check the file existence on disk.
+		 */
+		File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+		found = fd >= 0;
+		if (fd >= 0)
+			FileClose(fd);
+
+		/*
+		 * And if it was found on disk then create the HTAB entry for it.
+		 */
+		if (found)
+		{
+			hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  NULL);
+			hentry->allow_delete = false;
+		}
+	}
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	psf.vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (psf.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open prepared spoolfile \"%s\": %m",
+						path)));
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Make sure the apply_dispatch handler functions are aware we're in a
+	 * remote transaction.
+	 */
+	remote_final_lsn = final_lsn;
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		nbytes = FileRead(psf.vfd, buffer, len,
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+		if (nbytes != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldctx2);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file \"%s\"",
+				 nchanges, path);
+	}
+
+	FileClose(psf.vfd);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly the same length and
+	 * padded with '\0' so garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "pg_twophase/%u-%s.prep_changes", subid, gid);
+}
+
+/*
+ * proc_exit callback to remove unwanted psf files.
+ */
+static void
+prepare_spoolfile_on_proc_exit(int status, Datum arg)
+{
+	HASH_SEQ_STATUS seq_status;
+	PsfHashEntry *hentry;
+
+	/* Iterate over the HTAB looking for files that can be deleted. */
+	if (psf_hash)
+	{
+		hash_seq_init(&seq_status, psf_hash);
+		while ((hentry = (PsfHashEntry *) hash_seq_search(&seq_status)) != NULL)
+		{
+			char *path = hentry->name;
+
+			if (hentry->allow_delete)
+				prepare_spoolfile_delete(path);
+		}
+	}
+}
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2e4b39f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -322,8 +342,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +362,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +383,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool        send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +843,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1250,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..c5154ae 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -242,15 +244,16 @@ create_replication_slot:
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..b797e3b 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for prepare, and commit prepared transaction.
+ * prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -170,5 +237,4 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 										  TransactionId subxid);
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
-
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95ae..4ffcef5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
@@ -1955,6 +1958,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
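
For anyone reviewing the spoolfile format used by prepare_spoolfile_write()
and prepare_spoolfile_replay_messages() above, here is a minimal standalone
sketch of the same record layout: an int length (covering the action byte
plus the payload, but not the length field itself), then the action byte,
then the message contents. It is only an illustration under stated
assumptions: it uses plain stdio on a temporary file rather than the
backend's FileWrite()/FileRead(), and the helper names spool_write(),
spool_replay() and spool_roundtrip_demo() are hypothetical and do not
appear in the patch. The action bytes 'I' and 'U' stand in for insert and
update messages.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: append one record as [int len][char action][payload]. */
static int
spool_write(FILE *fp, char action, const char *payload, int payload_len)
{
	/* The stored length covers the action byte plus payload, not itself. */
	int			len = payload_len + (int) sizeof(char);

	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Hypothetical helper: read the records back in order. The action byte of
 * each record is copied into actions[] (up to max_actions entries). Returns
 * the number of records read, or -1 on a short or malformed read.
 */
static int
spool_replay(FILE *fp, char *actions, int max_actions)
{
	int			nchanges = 0;
	char	   *buffer = NULL;

	rewind(fp);
	for (;;)
	{
		int			len;
		char	   *tmp;
		size_t		n = fread(&len, 1, sizeof(len), fp);

		if (n == 0)
			break;				/* clean end of file */
		if (n != sizeof(len) || len <= 0)
		{
			free(buffer);
			return -1;
		}

		/* Grow (or shrink) the buffer to hold exactly this record. */
		tmp = realloc(buffer, (size_t) len);
		if (tmp == NULL)
		{
			free(buffer);
			return -1;
		}
		buffer = tmp;

		if (fread(buffer, 1, (size_t) len, fp) != (size_t) len)
		{
			free(buffer);
			return -1;
		}

		/* buffer[0] is the action; buffer[1 .. len-1] is the payload. */
		if (nchanges < max_actions)
			actions[nchanges] = buffer[0];
		nchanges++;
	}
	free(buffer);
	return nchanges;
}

/* Write two records to a temporary file and replay them. */
static int
spool_roundtrip_demo(char *actions, int max_actions)
{
	FILE	   *fp = tmpfile();
	int			n;

	if (fp == NULL)
		return -1;
	if (spool_write(fp, 'I', "row-1", 5) != 0 ||
		spool_write(fp, 'U', "row-2", 5) != 0)
	{
		fclose(fp);
		return -1;
	}
	n = spool_replay(fp, actions, max_actions);
	fclose(fp);
	return n;
}
```

Calling spool_roundtrip_demo() with a small char array should give back the
two action bytes in the order they were written. Like the patch, the sketch
treats a zero-byte read of the length field as end of file and anything
else short as an error.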

v52-0005-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From 62964901ebc60b40167b8f11db499bb60e5bcd99 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 8 Mar 2021 12:39:08 +1100
Subject: [PATCH v52] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch only adds some developer logging that may help with
debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 +++++++++---
 src/backend/replication/logical/worker.c    | 73 +++++++++++++++++++++++------
 2 files changed, 81 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 97fc399..f3984d4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 89988b8..430cf57 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -830,14 +830,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			process_syncing_tables(begin_data.final_lsn);
 
 			/* This latch is to prevent 100% CPU looping. */
@@ -855,7 +857,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 
@@ -897,6 +904,8 @@ apply_handle_prepare(StringInfo s)
 	{
 		PsfHashEntry *hentry;
 
+		elog(LOG, "!!> apply_handle_prepare: SPOOLING");
+
 		/* Write the PREPARE info to the psf file. */
 		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
 
@@ -915,6 +924,8 @@ apply_handle_prepare(StringInfo s)
 		 * removal on proc-exit just in case there is an unexpected restart
 		 * between now and when commit_prepared happens.
 		 */
+		elog(LOG,
+			"!!> apply_handle_prepare: Make sure the spoolfile is not removed on proc-exit");
 		hentry = (PsfHashEntry *) hash_search(psf_hash,
 											  psf_cur.name,
 											  HASH_FIND,
@@ -1001,6 +1012,8 @@ apply_handle_commit_prepared(StringInfo s)
 	{
 		int			nchanges;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * Replay/dispatch the spooled messages (including lastly, the PREPARE
 		 * message).
@@ -1009,8 +1022,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		/*
@@ -1078,6 +1091,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2356,18 +2370,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3561,8 +3579,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_cur.is_spooling ? "Do" : "Don't");
 
@@ -3586,7 +3604,7 @@ prepare_spoolfile_create(char *path)
 	bool		found;
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(!psf_cur.is_spooling);
 
@@ -3598,7 +3616,7 @@ prepare_spoolfile_create(char *path)
 
 	if (!found)
 	{
-		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 		{
@@ -3616,7 +3634,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the beginning because we always want to
 		 * create/overwrite this file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		elog(LOG, "!!> Found file \"%s\". Overwrite it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 		{
@@ -3641,6 +3659,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_cur.vfd >= 0)
 		FileClose(psf_cur.vfd);
 
@@ -3656,6 +3675,8 @@ prepare_spoolfile_close()
 static void
 prepare_spoolfile_delete(char *path)
 {
+	elog(LOG, "!!> prepare_spoolfile_delete: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3681,18 +3702,20 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_cur.is_spooling);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(len));
 	psf_cur.cur_offset += bytes_written;
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(action));
@@ -3701,6 +3724,7 @@ prepare_spoolfile_write(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == len);
@@ -3734,6 +3758,12 @@ prepare_spoolfile_exists(char *path)
 		if (fd >= 0)
 			FileClose(fd);
 
+		elog(LOG,
+			 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was "
+			 "not found in the HTAB, but was %s on the disk.",
+			 path,
+			 found ? "found" : "not found");
+
 		/*
 		 * And if it was found on disk then create the HTAB entry for it.
 		 */
@@ -3743,10 +3773,16 @@ prepare_spoolfile_exists(char *path)
 										  path,
 										  HASH_ENTER,
 										  NULL);
+			elog(LOG, "!!> prepare_spoolfile_exists: Created new HTAB entry '%s'", hentry->name);
 			hentry->allow_delete = false;
 		}
 	}
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
 	return found;
 }
 
@@ -3763,8 +3799,8 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 				oldctx2;
 	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3809,6 +3845,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3827,6 +3864,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		nbytes = FileRead(psf.vfd, buffer, len,
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
@@ -3843,7 +3881,9 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		/* Ensure we are reading the data into our memory context. */
 		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
 
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: Before dispatch");
 		apply_dispatch(&s2);
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: After dispatch");
 
 		MemoryContextReset(ApplyMessageContext);
 
@@ -3852,13 +3892,13 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(LOG, "!!> replayed %d changes from file '%s'",
 				 nchanges, path);
 	}
 
 	FileClose(psf.vfd);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
@@ -3894,6 +3934,8 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 	HASH_SEQ_STATUS seq_status;
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_on_proc_exit");
+
 	/* Iterate the HTAB looking for what file can be deleted. */
 	if (psf_hash)
 	{
@@ -3902,6 +3944,7 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 		{
 			char *path = hentry->name;
 
+			elog(LOG, "!!> prepare_spoolfile_on_proc_exit: found '%s'", path);
 			if (hentry->allow_delete)
 				prepare_spoolfile_delete(path);
 		}
-- 
1.8.3.1
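
The spoolfile writes logged in the patch above follow a length-prefixed layout: first the total size (action byte plus payload), then the action character, then the rest of the message buffer. The following is a minimal standalone sketch of that layout, assuming plain stdio in place of PostgreSQL's FileWrite/FileRead. The names SpoolRecord, spool_write and spool_read are hypothetical and not part of the patch, and unlike the patch's replay loop the reader here splits the action byte from the payload instead of handing both to a dispatcher.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one spooled message; not a PostgreSQL type. */
typedef struct SpoolRecord
{
	char		action;			/* message type character */
	int			payload_len;	/* number of payload bytes */
	char	   *payload;		/* malloc'd, NUL-terminated for convenience */
} SpoolRecord;

/*
 * Append one record as [int len][char action][payload bytes], where len
 * counts the action byte plus the payload. Returns 0 on success, -1 on error.
 */
static int
spool_write(FILE *fp, char action, const char *payload, int payload_len)
{
	int			len = payload_len + (int) sizeof(char);

	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Read the next record. Returns 1 if a record was read, 0 on a clean end of
 * file, and -1 if the file is truncated or malformed.
 */
static int
spool_read(FILE *fp, SpoolRecord *rec)
{
	int			len;
	size_t		n = fread(&len, 1, sizeof(len), fp);

	if (n == 0)
		return 0;				/* reached end of the file */
	if (n != sizeof(len) || len < 1)
		return -1;
	if (fread(&rec->action, 1, 1, fp) != 1)
		return -1;

	rec->payload_len = len - 1;
	rec->payload = malloc((size_t) rec->payload_len + 1);
	if (rec->payload == NULL)
		return -1;
	if (rec->payload_len > 0 &&
		fread(rec->payload, 1, (size_t) rec->payload_len, fp) !=
		(size_t) rec->payload_len)
	{
		free(rec->payload);
		return -1;
	}
	rec->payload[rec->payload_len] = '\0';
	return 1;
}
```

A caller would write records in arrival order, rewind the file, and then loop on spool_read until it returns 0, freeing each payload after use.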

Attachment: v52-0004-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 2d7c6c56bd25c085532684958819f407b8e06966 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 8 Mar 2021 12:29:28 +1100
Subject: [PATCH v52] Support 2PC txn - Subscription option.

This patch implements new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
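
The option rules this patch enforces can be summarised in a small standalone sketch. It is an illustration only: check_subscription_options and check_two_phase_request are hypothetical helpers that return a message instead of calling ereport(), and PROTO_2PC_VERSION merely mirrors the value this patch assigns to LOGICALREP_PROTO_2PC_VERSION_NUM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors LOGICALREP_PROTO_2PC_VERSION_NUM as defined in this patch. */
#define PROTO_2PC_VERSION 2

/*
 * Subscriber side: two_phase and streaming may not both be true.
 * Returns NULL when the combination is acceptable, else an error message.
 */
static const char *
check_subscription_options(bool twophase, bool streaming)
{
	if (twophase && streaming)
		return "two_phase = true and streaming = true are mutually exclusive options";
	return NULL;
}

/*
 * Publisher side: a two_phase request needs a new enough protocol version
 * and an output plugin that supports two-phase decoding.
 * Returns NULL when the request is acceptable, else an error message.
 */
static const char *
check_two_phase_request(bool enable_twophase, unsigned protocol_version,
						bool plugin_supports_twophase)
{
	if (!enable_twophase)
		return NULL;			/* disabled by default, nothing to check */
	if (protocol_version < PROTO_2PC_VERSION)
		return "requested proto_version does not support two-phase commit";
	if (!plugin_supports_twophase)
		return "two-phase commit requested, but not supported by output plugin";
	return NULL;
}
```

Both helpers take the already-parsed boolean values, so a caller would apply the defaults (false for both options) before invoking them.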
---
 doc/src/sgml/ref/create_subscription.sgml          | 29 +++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 72 +++++++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 +
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 ++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 +++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 ++-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 +
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 93 +++++++++++++++-------
 src/test/regress/sql/subscription.sql              | 25 ++++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 17 files changed, 261 insertions(+), 47 deletions(-)

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..e04b8d2 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,35 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          It is not allowed to combine <literal>streaming</literal> set to
+          <literal>true</literal> and <literal>two_phase</literal> set to
+          <literal>true</literal>.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          It is not allowed to combine <literal>two_phase</literal> set to
+          <literal>true</literal> and <literal>streaming</literal> set to
+          <literal>true</literal>.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73..060fab4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1168,7 +1168,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f6793f0..96fcf49 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option; this could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * the current implementation has some issues that could cause a
+	 * streaming prepared transaction to be incorrectly missed in the initial
+	 * syncing phase. Hence, this combination is disabled until that issue
+	 * can be addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +576,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +883,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +918,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if (sub->twophase && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +947,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +993,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1039,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c9b1dea..89988b8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2878,6 +2878,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3527,6 +3528,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2e4b39f..91ecc55 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..96c878b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two_phase are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index b797e3b..6c848c2 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..67b3358 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9aa483c..d56789d 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

#221Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#219)

On Sun, Mar 7, 2021 at 3:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Mar 7, 2021 at 7:35 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v51*

Few more comments on v51-0006-Fix-apply-worker-empty-prepare:
======================================================
1.
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and the
+ * corresponding fileset handles of same.
+ */
+typedef struct PsfHashEntry
+{
+ char name[MAXPGPATH]; /* Hash key --- must be first */
+ bool allow_delete; /* ok to delete? */
+} PsfHashEntry;
+

IIUC, this hash table is used for two purposes in the patch: (a) to
check for the existence of the prepare spool file, where we anyway have
to check on disk if it is not found in the hash table; (b) to allow the
prepare spool file to be removed on proc_exit.

I think we don't need the optimization provided by (a) because it will
be too rare a case to deserve any optimization, we might write a
comment in prepare_spoolfile_exists to indicate such an optimization.
For (b), we can use a simple list to track files to be removed on
proc_exit something like we do in CreateLockFile. I think avoiding
hash table usage will reduce the code and chances of bugs in this
area. It won't be easy to write a lot of automated tests to test this
functionality so it is better to avoid minor optimizations at this
stage.

Our data structure psf_hash also needs to be able to discover the
entry for a specific spool file and remove it. e.g. anything marked as
"allow_delete = false" (during prepare) must be able to be re-found
and removed from that structure at commit_prepared or
rollback_prepared time.

Looking at CreateLockFile code, I don't see that it is ever deleting
entries from its "lock_files" list on-the-fly, so it's not really a
fair comparison to say just use a List like CreateLockFile.

So, even though we (currently) only have a single data member
"allow_delete", I think the requirement to do a key lookup/delete
makes a HTAB a more appropriate data structure than a List.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#222Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#221)

On Mon, Mar 8, 2021 at 10:04 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Sun, Mar 7, 2021 at 3:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Mar 7, 2021 at 7:35 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v51*

Few more comments on v51-0006-Fix-apply-worker-empty-prepare:
======================================================
1.
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and the
+ * corresponding fileset handles of same.
+ */
+typedef struct PsfHashEntry
+{
+ char name[MAXPGPATH]; /* Hash key --- must be first */
+ bool allow_delete; /* ok to delete? */
+} PsfHashEntry;
+

IIUC, this hash table is used for two purposes in the patch: (a) to
check for the existence of the prepare spool file, where we anyway have
to check on disk if it is not found in the hash table; (b) to allow the
prepare spool file to be removed on proc_exit.

I think we don't need the optimization provided by (a) because it will
be too rare a case to deserve any optimization, we might write a
comment in prepare_spoolfile_exists to indicate such an optimization.
For (b), we can use a simple list to track files to be removed on
proc_exit something like we do in CreateLockFile. I think avoiding
hash table usage will reduce the code and chances of bugs in this
area. It won't be easy to write a lot of automated tests to test this
functionality so it is better to avoid minor optimizations at this
stage.

Our data structure psf_hash also needs to be able to discover the
entry for a specific spool file and remove it. e.g. anything marked as
"allow_delete = false" (during prepare) must be able to be re-found
and removed from that structure at commit_prepared or
rollback_prepared time.

But I think that is not reliable, because after a restart the entry
might not be present and we anyway need to check for the presence of
the file on disk. Actually, you don't need any manipulation of the list
or hash at commit_prepared or rollback_prepared; we should just remove
the entry at prepare time, and there should be an assert if we find
that entry in the in-memory structure.

Looking at CreateLockFile code, I don't see that it is ever deleting
entries from its "lock_files" list on-the-fly, so it's not really a
fair comparison to say just use a List like CreateLockFile.

Sure, but you can additionally traverse the list and find the required entry.
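
To illustrate the list-based alternative described above, here is a minimal
standalone sketch of a singly linked list that supports removal by path. It
is only an assumption about what such tracking could look like: it uses plain
malloc/free instead of PostgreSQL's List and palloc, and the names
SpoolFileNode, spool_list_add, spool_list_remove and spool_list_length are
hypothetical, not taken from the patch.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type; the patch itself keeps PsfHashEntry in an HTAB. */
typedef struct SpoolFileNode
{
	char	   *path;			/* private copy of the spool file path */
	struct SpoolFileNode *next;
} SpoolFileNode;

/* Push a copy of "path" onto the front of the list; false on out-of-memory. */
bool
spool_list_add(SpoolFileNode **head, const char *path)
{
	size_t		len = strlen(path) + 1;
	SpoolFileNode *node = malloc(sizeof(SpoolFileNode));

	if (node == NULL)
		return false;
	node->path = malloc(len);
	if (node->path == NULL)
	{
		free(node);
		return false;
	}
	memcpy(node->path, path, len);
	node->next = *head;
	*head = node;
	return true;
}

/* Traverse the list and unlink the first node whose path matches. */
bool
spool_list_remove(SpoolFileNode **head, const char *path)
{
	SpoolFileNode **link;

	for (link = head; *link != NULL; link = &(*link)->next)
	{
		if (strcmp((*link)->path, path) == 0)
		{
			SpoolFileNode *victim = *link;

			*link = victim->next;
			free(victim->path);
			free(victim);
			return true;
		}
	}
	return false;				/* no such entry */
}

/* Count the entries currently tracked. */
int
spool_list_length(const SpoolFileNode *head)
{
	int			n = 0;

	for (; head != NULL; head = head->next)
		n++;
	return n;
}
```

The cost is a linear scan per removal, which is probably acceptable if only a
handful of prepared spool files are tracked at any one time.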

So, even though we (currently) only have a single data member
"allow_delete", I think the requirement to do a key lookup/delete
makes a HTAB a more appropriate data structure than a List.

Actually, that member is also not required at all because you just
need it till the time of prepare and then remove it.

--
With Regards,
Amit Kapila.

#223vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#220)

On Mon, Mar 8, 2021 at 7:17 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v52*

Few comments:

+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+       /* read fields */
+       begin_data->final_lsn = pq_getmsgint64(in);
+       if (begin_data->final_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "final_lsn not set in begin message");
+       begin_data->end_lsn = pq_getmsgint64(in);
+       if (begin_data->end_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "end_lsn not set in begin message");
+       begin_data->committime = pq_getmsgint64(in);
+       begin_data->xid = pq_getmsgint(in, 4);
+
+       /* read gid (copy it into a pre-allocated buffer) */
+       strcpy(begin_data->gid, pq_getmsgstring(in));
+}

In logicalrep_read_begin_prepare we validate final_lsn & end_lsn. But
this validation is not done in logicalrep_read_commit_prepared and
logicalrep_read_rollback_prepared. Should we keep it consistent?
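
One way to keep the checks consistent would be a single helper that every
read function calls. The following is a standalone sketch under stated
assumptions: XLogRecPtr and InvalidXLogRecPtr are redefined locally as
stand-ins for the PostgreSQL definitions, check_txn_lsns is a hypothetical
name, and it returns a message instead of calling elog(ERROR) so that it
compiles outside the backend.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins so the sketch compiles without PostgreSQL headers. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical shared check: returns NULL when both LSNs are set, otherwise
 * a short description of the first problem found. A backend version would
 * presumably raise the error with elog(ERROR, ...) instead.
 */
const char *
check_txn_lsns(XLogRecPtr first_lsn, XLogRecPtr end_lsn)
{
	if (first_lsn == InvalidXLogRecPtr)
		return "first LSN not set in message";
	if (end_lsn == InvalidXLogRecPtr)
		return "end_lsn not set in message";
	return NULL;
}
```

Each of the begin_prepare, commit_prepared and rollback_prepared readers
could then call the same helper right after reading its two LSN fields.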

@@ -170,5 +237,4 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
                                           TransactionId subxid);
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
                                          TransactionId *subxid);
-
 #endif /* LOGICAL_PROTO_H */

This change is not required.

@@ -242,15 +244,16 @@ create_replication_slot:
                                        $$ = (Node *) cmd;
                                }
                        /* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-                       | K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+                       | K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
                                {
                                        CreateReplicationSlotCmd *cmd;
                                        cmd = makeNode(CreateReplicationSlotCmd);
                                        cmd->kind = REPLICATION_KIND_LOGICAL;
                                        cmd->slotname = $2;
                                        cmd->temporary = $3;
-                                       cmd->plugin = $5;
-                                       cmd->options = $6;
+                                       cmd->two_phase = $4;
+                                       cmd->plugin = $6;
+                                       cmd->options = $7;
                                        $$ = (Node *) cmd;
                                }

Should we document two_phase in the below section:
CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL [ RESERVE_WAL ] | LOGICAL output_plugin [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT ] }
Create a physical or logical replication slot. See Section 27.2.6 for
more about replication slots.

+               while (AnyTablesyncInProgress())
+               {
+                       process_syncing_tables(begin_data.final_lsn);
+
+                       /* This latch is to prevent 100% CPU looping. */
+                       (void) WaitLatch(MyLatch,
+                                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+                                        1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+                       ResetLatch(MyLatch);
+               }

Should we have CHECK_FOR_INTERRUPTS inside the while loop?

+               if (begin_data.final_lsn < BiggestTablesyncLSN())
+               {
+                       char            psfpath[MAXPGPATH];
+
+                       /*
+                        * Create the spoolfile.
+                        */
+                       prepare_spoolfile_name(psfpath, sizeof(psfpath),
+                                              MyLogicalRepWorker->subid, begin_data.gid);
+                       prepare_spoolfile_create(psfpath);

We can make this a single-line comment.

+       if (!found)
+       {
+               elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+               psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+               if (psf_cur.vfd < 0)
+               {
+                       ereport(ERROR,
+                                       (errcode_for_file_access(),
+                                        errmsg("could not create file \"%s\": %m", path)));
+               }
+               memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+               psf_cur.cur_offset = 0;
+               hentry->allow_delete = true;
+       }
+       else
+       {
+               /*
+                * Open the file and seek to the beginning because we always want to
+                * create/overwrite this file.
+                */
+               elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+               psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+               if (psf_cur.vfd < 0)
+               {
+                       ereport(ERROR,
+                                       (errcode_for_file_access(),
+                                        errmsg("could not open file \"%s\": %m", path)));
+               }
+               memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+               psf_cur.cur_offset = 0;
+               hentry->allow_delete = true;
+       }

Except for the elog message, the rest of the code is the same in both
the if and else branches; we can move the common code outside.
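
The shape of that refactoring can be sketched outside the backend as follows.
This is an illustration only, not the patch's code: it substitutes POSIX
open() for PathNameOpenFile, fprintf/perror for elog/ereport, and the
function name spoolfile_open is hypothetical.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Open (creating or truncating) the spool file at "path". Only the debug
 * message depends on whether an entry was already known ("found"); the open
 * call and the error handling are shared by both cases.
 */
int
spoolfile_open(const char *path, bool found)
{
	int			fd;

	if (found)
		fprintf(stderr, "Found file \"%s\". Overwrite it.\n", path);
	else
		fprintf(stderr, "Not found file \"%s\". Create it.\n", path);

	/* Common code, hoisted out of the if/else. */
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		perror(path);
	return fd;
}
```

With the branches merged this way, a single error message such as "could not
open file" would have to cover both the create and the overwrite case.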

        LOGICAL_REP_MSG_TYPE = 'Y',
+       LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+       LOGICAL_REP_MSG_PREPARE = 'P',
+       LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+       LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
        LOGICAL_REP_MSG_STREAM_START = 'S',
        LOGICAL_REP_MSG_STREAM_END = 'E',
        LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-       LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+       LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+       LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;

As we add more and more features, we will have to keep adding message
types, and choosing meaningful characters might become difficult.
Should we start using numeric values instead for the new features
being added?

Regards,
Vignesh

#224Ajin Cherian
itsajin@gmail.com
In reply to: vignesh C (#214)
4 attachment(s)

On Fri, Mar 5, 2021 at 9:25 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks for the updated patch.
Few minor comments:

We should include two_phase in tab completion (tab-complete.c file
psql_completion(const char *text, int start, int end) function) :
postgres=# create subscription sub1 connection 'port=5441
dbname=postgres' publication pub1 with (
CONNECT COPY_DATA CREATE_SLOT ENABLED
SLOT_NAME SYNCHRONOUS_COMMIT

Updated.

+
+         <para>
+          It is not allowed to combine <literal>streaming</literal> set to
+          <literal>true</literal> and <literal>two_phase</literal> set to
+          <literal>true</literal>.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          preapred on publisher is decoded as normal transaction at commit.
+         </para>
+
+         <para>
+          It is not allowed to combine <literal>two_phase</literal> set to
+          <literal>true</literal> and <literal>streaming</literal> set to
+          <literal>true</literal>.
+         </para>

It is not allowed to combine streaming set to true and two_phase set to true.
Should this be:
streaming option is not supported along with two_phase option.

Similarly here too:
It is not allowed to combine two_phase set to true and streaming set to true.
Should this be:
two_phase option is not supported along with streaming option.

Reworded this with a small change.

Few indentation issues are present, we can run pgindent:
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+                                     XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+                                    LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN* txn,
+                                             XLogRecPtr commit_lsn);

ReorderBufferTXN* should be ReorderBufferTXN *

Changed accordingly.

Created new patch v53:
* Rebased to HEAD (this resulted in removing patch 0001) and reduced
patch-set to 4 patches.
* Removed the changes for making "stream prepare" optional from
required. Will create a new patch and thread for this.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v53-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From c4e7769fda0397421244a49a7e82a81b34f8be3f Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Sun, 7 Mar 2021 23:12:33 -0500
Subject: [PATCH v53] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* prepare API for streaming transactions is not supported.

* change stream_prepare_cb from a required callback to an optional callback.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

* This patch also adds a new option to enable two_phase while creating a slot.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c              |  68 ++
 src/backend/commands/subscriptioncmds.c            |   2 +-
 .../libpqwalreceiver/libpqwalreceiver.c            |   6 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 194 +++++
 src/backend/replication/logical/tablesync.c        | 180 ++++-
 src/backend/replication/logical/worker.c           | 804 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 157 +++-
 src/backend/replication/repl_gram.y                |  14 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/replication/logicalproto.h             |  70 +-
 src/include/replication/reorderbuffer.h            |  12 +
 src/include/replication/walreceiver.h              |   5 +-
 src/include/replication/worker_internal.h          |   3 +
 src/tools/pgindent/typedefs.list                   |   5 +
 17 files changed, 1449 insertions(+), 83 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char* buf;
+			TwoPhaseFileHeader* hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..f6793f0 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transactions commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..e958d28 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,200 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepare message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..97fc399 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1052,7 +1026,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have still not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress()
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When the process_syncing_tables_for_apply changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN()
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..c9b1dea 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -208,6 +209,54 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and the
+ * corresponding fileset handles of same.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	bool allow_delete; /* ok to delete? */
+}			PsfHashEntry;
+
+/*
+ * Information about the "current" psf spoolfile.
+ */
+typedef struct PsfFile
+{
+	char	name[MAXPGPATH];/* psf name - same as the HTAB key. */
+	bool	is_spooling;	/* are we currently spooling to this file? */
+	File 	vfd;			/* -1 when the file is closed. */
+	off_t	cur_offset;		/* offset for the next write or read. Reset to 0
+							 * when file is opened. */
+} PsfFile;
+
+/*
+ * Hash table for storing the Prepared spoolfile info along with shared fileset.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * Information about the 'current' open spoolfile is only valid when spooling.
+ * This is flagged as 'is_spooling' only between begin_prepare and prepare.
+ */
+static PsfFile psf_cur = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_delete(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+static void prepare_spoolfile_on_proc_exit(int status, Datum arg);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -720,6 +769,344 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/*
+	 * The gid must not already be prepared.
+	 */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				errmsg("transaction identifier \"%s\" is already in use",
+					   begin_data.gid)));
+
+	/*
+	 * By sad timing of apply/tablesync workers it is possible to have a
+	 * "consistent snapshot" that spans prepare/commit in such a way that
+	 * the tablesync did not do the prepare (because snapshot not consistent)
+	 * and the apply worker does the begin prepare ('b') but it skips all
+	 * the prepared operations [e.g. inserts] while the tablesync was still
+	 * busy (see the condition of should_apply_changes_for_rel).
+	 *
+	 * This can lead to an "empty prepare", because later when the apply
+	 * worker does the commit prepare ('K'), there is nothing in it (the
+	 * inserts were skipped earlier).
+	 *
+	 * We avoid this using the 2 part logic: (1) Wait for all tablesync workers
+	 * to reach SYNCDONE/READY state; (2) If the begin_prepare lsn is now
+	 * behind any tablesync lsn then spool the prepared messages to a file
+	 * to be replayed later at commit_prepared time.
+	 *
+	 * -----
+	 *
+	 * XXX - The 2PC protocol needs the publisher to be aware when the PREPARE
+	 * has been successfully acted on. But because of this "empty prepare"
+	 * case now the prepared messages may be spooled to a file and, when
+	 * that happens the PREPARE would not happen at the usual time, but would
+	 * be deferred until COMMIT PREPARED time. This quirk could only happen
+	 * immediately after the initial table synchronization phase; once all
+	 * tables have achieved READY state the 2PC protocol will behave normally.
+	 *
+	 * A future release may be able to detect when all tables are READY and set
+	 * a flag to indicate this subscription/slot is ready for two_phase
+	 * decoding. Then at the publisher-side, we could enable wait-for-prepares
+	 * only when all the slots of WALSender have that flag set.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN then all messages (until
+		 * prepared) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+
+			/*
+			 * Create the spoolfile.
+			 */
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * From now, until the handle_prepare we are spooling to the
+			 * current psf.
+			 */
+			psf_cur.is_spooling = true;
+		}
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_cur.is_spooling)
+	{
+		PsfHashEntry *hentry;
+
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * Flush the spoolfile, so changes can survive a restart.
+		 */
+		FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);
+
+		/*
+		 * We are finished spooling to the current psf.
+		 */
+		psf_cur.is_spooling = false;
+
+		/*
+		 * The commit_prepare will need the spoolfile, so unregister it for
+		 * removal on proc-exit just in case there is an unexpected restart
+		 * between now and when commit_prepared happens.
+		 */
+		hentry = (PsfHashEntry *) hash_search(psf_hash,
+											  psf_cur.name,
+											  HASH_FIND,
+											  NULL);
+		Assert(hentry);
+		hentry->allow_delete = false;
+
+		/*
+		 * The psf_cur.vfd is meaningful only between begin_prepare and prepared.
+		 * So close it now. Any messages written to the psf will be applied
+		 * later during handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		in_remote_transaction = false;
+		return;
+	}
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	/*
+	 * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+	 * was dispatched via the psf spoolfile replay then the remote_final_lsn
+	 * is set to commit lsn instead. Hence the <= instead of == check below.
+	 */
+	Assert(prepare_data.prepare_lsn <= remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+
+		/*
+		 * Replay/dispatch the spooled messages (including lastly, the PREPARE
+		 * message).
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		/*
+		 * After replaying the psf it is no longer needed. Just delete it.
+		 */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		/*
+		 * We are finished with this spoolfile. Delete it.
+		 */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
+	 */
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -732,6 +1119,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		!psf_cur.is_spooling &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1150,6 +1538,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1271,6 +1662,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1429,6 +1823,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1798,6 +2195,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1954,6 +2354,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2061,6 +2483,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	TimeLineID	tli;
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL     hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2180,7 +2619,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && !psf_cur.is_spooling)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2927,6 +3366,9 @@ ApplyWorkerMain(Datum main_arg)
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
 
+	/* Arrange to delete any unwanted psf file(s) at proc-exit */
+	on_proc_exit(prepare_spoolfile_on_proc_exit, 0);
+
 	/* Setup signal handling */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
@@ -3103,3 +3545,363 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_cur.is_spooling ? "Do" : "Don't");
+
+	if (!psf_cur.is_spooling)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	bool		found;
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(!psf_cur.is_spooling);
+
+	/* create or find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  &found);
+
+	if (!found)
+	{
+		elog(DEBUG1, "File \"%s\" not found. Creating it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m", path)));
+		}
+		memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+		psf_cur.cur_offset = 0;
+		hentry->allow_delete = true;
+	}
+	else
+	{
+		/*
+		 * Reopen the file with O_TRUNC because we always want to
+		 * create/overwrite this file.
+		 */
+		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+		{
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", path)));
+		}
+		memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+		psf_cur.cur_offset = 0;
+		hentry->allow_delete = true;
+	}
+
+	/* Sanity checks */
+	Assert(psf_cur.vfd >= 0);
+	Assert(psf_cur.cur_offset == 0);
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close(void)
+{
+	if (psf_cur.vfd >= 0)
+		FileClose(psf_cur.vfd);
+
+	/* Mark this fd as not valid to use anymore. */
+	psf_cur.is_spooling = false;
+	psf_cur.vfd = -1;
+	psf_cur.cur_offset = 0;
+}
+
+/*
+ * Delete the specified psf spoolfile, and any psf_hash entry associated with it.
+ */
+static void
+prepare_spoolfile_delete(char *path)
+{
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Delete the file off the disk. */
+	unlink(path);
+
+	/* Remove any entry from the psf_hash, if present */
+	hash_search(psf_hash, path, HASH_REMOVE, NULL);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: a length word (which does not
+ * count itself), an action code (identifying the message type), and the
+ * message contents (without the subxact TransactionId value).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+	int			bytes_written;
+
+	Assert(psf_cur.is_spooling);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(len));
+	psf_cur.cur_offset += bytes_written;
+
+	/* then the action */
+	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(action));
+	psf_cur.cur_offset += bytes_written;
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == len);
+	psf_cur.cur_offset += bytes_written;
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool		found;
+	PsfHashEntry *hentry;
+
+	/* Find the prepare spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_FIND,
+										  &found);
+
+	if (!found)
+	{
+		/*
+		 * Hash doesn't know about it, but perhaps the Hash was destroyed by a
+		 * restart, so let's check the file existence on disk.
+		 */
+		File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+		found = fd >= 0;
+		if (fd >= 0)
+			FileClose(fd);
+
+		/*
+		 * And if it was found on disk then create the HTAB entry for it.
+		 */
+		if (found)
+		{
+			hentry = (PsfHashEntry *) hash_search(psf_hash,
+										  path,
+										  HASH_ENTER,
+										  NULL);
+			hentry->allow_delete = false;
+		}
+	}
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	psf.vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (psf.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open prepared spoolfile \"%s\": %m",
+						path)));
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	remote_final_lsn = final_lsn;
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		nbytes = FileRead(psf.vfd, buffer, len,
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+		if (nbytes != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldctx2);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file \"%s\"",
+				 nchanges, path);
+	}
+
+	FileClose(psf.vfd);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly the same length and
+	 * padded with '\0' so garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "pg_twophase/%u-%s.prep_changes", subid, gid);
+}
+
+/*
+ * proc_exit callback to remove unwanted psf files.
+ */
+static void
+prepare_spoolfile_on_proc_exit(int status, Datum arg)
+{
+	HASH_SEQ_STATUS seq_status;
+	PsfHashEntry *hentry;
+
+	/* Iterate the HTAB looking for files that can be deleted. */
+	if (psf_hash)
+	{
+		hash_seq_init(&seq_status, psf_hash);
+		while ((hentry = (PsfHashEntry *) hash_seq_search(&seq_status)) != NULL)
+		{
+			char *path = hentry->name;
+
+			if (hentry->allow_delete)
+				prepare_spoolfile_delete(path);
+		}
+	}
+}
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2e4b39f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -322,8 +342,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +362,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +383,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +843,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1250,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..c5154ae 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -242,15 +244,16 @@ create_replication_slot:
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..410326d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure holds the
+ * information for both PREPARE and COMMIT PREPARED. For COMMIT PREPARED,
+ * prepare_lsn and preparetime store the commit LSN and the commit time
+ * respectively.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * (prepare_end_lsn and preparetime) is used to check whether the downstream
+ * has received this prepared transaction: if so it can apply the rollback,
+ * otherwise it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -170,5 +237,4 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 										  TransactionId subxid);
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
-
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95ae..4ffcef5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
@@ -1955,6 +1958,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
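
For anyone reviewing the spoolfile format used by prepare_spoolfile_write
and prepare_spoolfile_replay_messages in the patch above, here is a minimal
standalone sketch of the same record layout: an int length (which counts the
action byte plus payload but not itself), then the action byte, then the
payload. It uses plain stdio on a temporary file rather than the fd.c
FileWrite/FileRead API, and the names spool_write, spool_read and
spool_roundtrip_count are made up for illustration; none of this is code from
the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write one record: int length (action byte + payload), action, payload. */
static int
spool_write(FILE *fp, char action, const char *payload, int payload_len)
{
	int			len = payload_len + (int) sizeof(char);

	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Read one record into buf. Returns the record length (action + payload),
 * 0 at a clean end of file, or -1 on a short read or an oversized record.
 */
static int
spool_read(FILE *fp, char *buf, int bufsize)
{
	int			len;
	size_t		n = fread(&len, 1, sizeof(len), fp);

	if (n == 0)
		return 0;
	if (n != sizeof(len) || len <= 0 || len > bufsize)
		return -1;
	if (fread(buf, 1, (size_t) len, fp) != (size_t) len)
		return -1;
	return len;
}

/* Spool two records to a temp file, then count them while replaying. */
static int
spool_roundtrip_count(void)
{
	FILE	   *fp = tmpfile();
	char		buf[64];
	int			nchanges = 0;
	int			len;

	if (fp == NULL)
		return -1;
	if (spool_write(fp, 'I', "row-1", 5) != 0 ||
		spool_write(fp, 'D', "row-2", 5) != 0)
	{
		fclose(fp);
		return -1;
	}
	rewind(fp);
	while ((len = spool_read(fp, buf, (int) sizeof(buf))) > 0)
	{
		/* buf[0] is the action code, the rest is the message body */
		assert(buf[0] == 'I' || buf[0] == 'D');
		nchanges++;
	}
	fclose(fp);
	return (len == 0) ? nchanges : -1;
}
```

Unlike the Assert-only checks on bytes_written in the patch, the sketch
reports short writes and short reads through its return value, so they are
still caught in a non-assert build.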

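The comment in prepare_spoolfile_name about padding the name with '\0' is
worth a small illustration. With HASH_BLOBS, dynahash hashes and compares all
keysize bytes of the key, so two buffers holding the same C string but
different bytes after the terminator are different keys. The sketch below
imitates that with memcmp over a fixed-size buffer; KEYSZ stands in for
MAXPGPATH, and make_key and keys_match are hypothetical helpers, not functions
from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define KEYSZ 64				/* stand-in for MAXPGPATH */

/* Format the spoolfile name, optionally zero-filling the buffer first. */
static void
make_key(char *key, int pad, unsigned subid, const char *gid)
{
	int			n;

	if (pad)
		memset(key, '\0', KEYSZ);
	n = snprintf(key, KEYSZ, "pg_twophase/%u-%s.prep_changes", subid, gid);
	assert(n > 0 && n < KEYSZ);
}

/*
 * Build the same name in two buffers that start with different garbage and
 * compare them the way a blob key is compared: over all KEYSZ bytes.
 */
static int
keys_match(int pad)
{
	char		a[KEYSZ];
	char		b[KEYSZ];

	memset(a, 'x', KEYSZ);
	memset(b, 'y', KEYSZ);
	make_key(a, pad, 16394, "gid1");
	make_key(b, pad, 16394, "gid1");
	return memcmp(a, b, KEYSZ) == 0;
}
```

With padding the two keys compare equal; without it the leftover bytes after
the terminator differ and the lookup would miss, even though strcmp would call
the names identical.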
Attachment: v53-0004-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From cc8fab9d99051be97293e2401b366a0f70f7b659 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 8 Mar 2021 00:18:49 -0500
Subject: [PATCH v53] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch only adds developer logging that may help when debugging or
testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 +++++++++---
 src/backend/replication/logical/worker.c    | 73 +++++++++++++++++++++++------
 2 files changed, 81 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 97fc399..f3984d4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 89988b8..430cf57 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -830,14 +830,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			process_syncing_tables(begin_data.final_lsn);
 
 			/* This latch is to prevent 100% CPU looping. */
@@ -855,7 +857,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 
@@ -897,6 +904,8 @@ apply_handle_prepare(StringInfo s)
 	{
 		PsfHashEntry *hentry;
 
+		elog(LOG, "!!> apply_handle_prepare: SPOOLING");
+
 		/* Write the PREPARE info to the psf file. */
 		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
 
@@ -915,6 +924,8 @@ apply_handle_prepare(StringInfo s)
 		 * removal on proc-exit just in case there is an unexpected restart
 		 * between now and when commit_prepared happens.
 		 */
+		elog(LOG,
+			"!!> apply_handle_prepare: Make sure the spoolfile is not removed on proc-exit");
 		hentry = (PsfHashEntry *) hash_search(psf_hash,
 											  psf_cur.name,
 											  HASH_FIND,
@@ -1001,6 +1012,8 @@ apply_handle_commit_prepared(StringInfo s)
 	{
 		int			nchanges;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * Replay/dispatch the spooled messages (including lastly, the PREPARE
 		 * message).
@@ -1009,8 +1022,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		/*
@@ -1078,6 +1091,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2356,18 +2370,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3561,8 +3579,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_cur.is_spooling ? "Do" : "Don't");
 
@@ -3586,7 +3604,7 @@ prepare_spoolfile_create(char *path)
 	bool		found;
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(!psf_cur.is_spooling);
 
@@ -3598,7 +3616,7 @@ prepare_spoolfile_create(char *path)
 
 	if (!found)
 	{
-		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 		{
@@ -3616,7 +3634,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the beginning because we always want to
 		 * create/overwrite this file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		elog(LOG, "!!> Found file \"%s\". Overwrite it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 		{
@@ -3641,6 +3659,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_cur.vfd >= 0)
 		FileClose(psf_cur.vfd);
 
@@ -3656,6 +3675,8 @@ prepare_spoolfile_close()
 static void
 prepare_spoolfile_delete(char *path)
 {
+	elog(LOG, "!!> prepare_spoolfile_delete: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3681,18 +3702,20 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_cur.is_spooling);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(len));
 	psf_cur.cur_offset += bytes_written;
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(action));
@@ -3701,6 +3724,7 @@ prepare_spoolfile_write(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == len);
@@ -3734,6 +3758,12 @@ prepare_spoolfile_exists(char *path)
 		if (fd >= 0)
 			FileClose(fd);
 
+		elog(LOG,
+			 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was "
+			 "not found in the HTAB, but was %s on the disk.",
+			 path,
+			 found ? "found" : "not found");
+
 		/*
 		 * And if it was found on disk then create the HTAB entry for it.
 		 */
@@ -3743,10 +3773,16 @@ prepare_spoolfile_exists(char *path)
 										  path,
 										  HASH_ENTER,
 										  NULL);
+			elog(LOG, "!!> prepare_spoolfile_exists: Created new HTAB entry '%s'", hentry->name);
 			hentry->allow_delete = false;
 		}
 	}
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
 	return found;
 }
 
@@ -3763,8 +3799,8 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 				oldctx2;
 	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3809,6 +3845,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3827,6 +3864,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		nbytes = FileRead(psf.vfd, buffer, len,
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
@@ -3843,7 +3881,9 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		/* Ensure we are reading the data into our memory context. */
 		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
 
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: Before dispatch");
 		apply_dispatch(&s2);
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: After dispatch");
 
 		MemoryContextReset(ApplyMessageContext);
 
@@ -3852,13 +3892,13 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(LOG, "!!> replayed %d changes from file '%s'",
 				 nchanges, path);
 	}
 
 	FileClose(psf.vfd);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
@@ -3894,6 +3934,8 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 	HASH_SEQ_STATUS seq_status;
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_on_proc_exit");
+
 	/* Iterate the HTAB looking for what file can be deleted. */
 	if (psf_hash)
 	{
@@ -3902,6 +3944,7 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 		{
 			char *path = hentry->name;
 
+			elog(LOG, "!!> prepare_spoolfile_on_proc_exit: found '%s'", path);
 			if (hentry->allow_delete)
 				prepare_spoolfile_delete(path);
 		}
-- 
1.8.3.1
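
For anyone reading the dev-log patch above: the A/B/C log lines added to prepare_spoolfile_write() trace three writes per spooled message (a length that includes the action byte, then the action character, then the rest of the message buffer), and prepare_spoolfile_replay_messages() reads the length first and then that many bytes. Below is a minimal standalone sketch of that record layout, assuming a plain in-memory buffer instead of the backend's virtual-file API. The names psf_sketch_write and psf_sketch_read are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): append one record to buf using
 * the layout [int len][char action][payload], where len counts the action
 * byte plus the payload. Returns the number of bytes written.
 */
size_t
psf_sketch_write(char *buf, char action, const char *payload, int payload_len)
{
	size_t		off = 0;
	int			len;

	assert(payload_len >= 0);

	/* total record size after the length field, including the action byte */
	len = payload_len + (int) sizeof(char);

	/* A: the length */
	memcpy(buf + off, &len, sizeof(len));
	off += sizeof(len);

	/* B: the action character */
	memcpy(buf + off, &action, sizeof(action));
	off += sizeof(action);

	/* C: the remaining message bytes */
	memcpy(buf + off, payload, (size_t) payload_len);
	off += (size_t) payload_len;

	return off;
}

/*
 * Hypothetical helper (not in the patch): read one record back. Returns
 * the number of bytes consumed; the caller's payload buffer must be large
 * enough to hold the record body.
 */
size_t
psf_sketch_read(const char *buf, char *action, char *payload, int *payload_len)
{
	size_t		off = 0;
	int			len;

	/* read the length first */
	memcpy(&len, buf + off, sizeof(len));
	off += sizeof(len);

	/* the first byte of the record body is the action */
	*action = buf[off];
	off += sizeof(char);

	/* the rest of the body is the message payload */
	*payload_len = len - (int) sizeof(char);
	memcpy(payload, buf + off, (size_t) *payload_len);
	off += (size_t) *payload_len;

	return off;
}
```

Writing the length up front is what lets the replay loop size its buffer before reading each message, which matches the "nbytes = ..., len = ..." log line added on the read side.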

Attachment: v53-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From fdbcbd8e9e2fed087a108a080f2d0e164593541e Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Sun, 7 Mar 2021 23:17:21 -0500
Subject: [PATCH v53] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 332 ++++++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl | 282 ++++++++++++++++++++
 2 files changed, 614 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9aa483c
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,332 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v53-0003-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From a06e32159567ae6f51c34b59ef18b096fc021bfa Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 8 Mar 2021 00:13:14 -0500
Subject: [PATCH v53] Support 2PC txn - Subscription option.

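This patch implements a new SUBSCRIPTION option, "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

The default is off. The option cannot be combined with streaming = on,
and it cannot be changed with ALTER SUBSCRIPTION.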
---
 doc/src/sgml/ref/create_subscription.sgml          | 27 +++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 72 +++++++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 +
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 ++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 +++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 ++-
 src/bin/psql/tab-complete.c                        |  2 +-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 +
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 93 +++++++++++++++-------
 src/test/regress/sql/subscription.sql              | 25 ++++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 18 files changed, 260 insertions(+), 48 deletions(-)
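
The option rules this patch adds to parse_subscription_options() can be
modelled in isolation. The following is only a minimal sketch, assuming a
simplified option list: SubOpt, SubOptStatus and check_subscription_options()
are hypothetical names that do not exist in PostgreSQL, and return codes stand
in for the ereport(ERROR, ...) calls of the real code. Unlike the real
function, the sketch requires a non-NULL "streaming" pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one "name = value" entry of a WITH (...) list. */
typedef struct SubOpt
{
	const char *name;
	bool		value;
} SubOpt;

/* Hypothetical result codes; the patch itself raises errors via ereport(). */
typedef enum SubOptStatus
{
	SUBOPT_OK,
	SUBOPT_REDUNDANT,			/* same option given twice */
	SUBOPT_CANNOT_ALTER,		/* two_phase given where it is not accepted */
	SUBOPT_MUTUALLY_EXCLUSIVE,	/* two_phase = true with streaming = true */
	SUBOPT_UNRECOGNIZED			/* option name not known to this sketch */
} SubOptStatus;

/*
 * Validate an option list. Passing twophase == NULL models the ALTER
 * SUBSCRIPTION call sites, which do not accept two_phase at all.
 */
SubOptStatus
check_subscription_options(const SubOpt *opts, size_t nopts,
						   bool *streaming, bool *twophase)
{
	bool		streaming_given = false;
	bool		twophase_given = false;

	assert(opts != NULL || nopts == 0);
	assert(streaming != NULL);

	/* Both options default to off. */
	*streaming = false;
	if (twophase)
		*twophase = false;

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "streaming") == 0)
		{
			if (streaming_given)
				return SUBOPT_REDUNDANT;
			streaming_given = true;
			*streaming = opts[i].value;
		}
		else if (strcmp(opts[i].name, "two_phase") == 0)
		{
			/* Caller did not ask for two_phase: toggling is not allowed. */
			if (!twophase)
				return SUBOPT_CANNOT_ALTER;
			if (twophase_given)
				return SUBOPT_REDUNDANT;
			twophase_given = true;
			*twophase = opts[i].value;
		}
		else
			return SUBOPT_UNRECOGNIZED;
	}

	/* The combination is rejected only when both options are enabled. */
	if (twophase && *twophase && *streaming)
		return SUBOPT_MUTUALLY_EXCLUSIVE;

	return SUBOPT_OK;
}
```

In this model, a list containing both two_phase = true and streaming = true
yields SUBOPT_MUTUALLY_EXCLUSIVE, while the same two_phase entry passed with a
NULL "twophase" pointer yields SUBOPT_CANNOT_ALTER, mirroring the CREATE and
ALTER paths respectively.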

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..eeb7e35 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,33 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73..060fab4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1168,7 +1168,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f6793f0..96fcf49 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so
+			 * could cause transactions to be missed and lead to an
+			 * inconsistent replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * the current implementation has some issues that could cause a streamed
+	 * prepared transaction to be incorrectly missed during the initial sync
+	 * phase. Hence, this combination is disabled until that issue is
+	 * addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +576,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +883,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +918,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if (sub->twophase && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +947,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +993,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1039,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c9b1dea..89988b8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2878,6 +2878,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3527,6 +3528,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2e4b39f..91ecc55 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default,
+		 * in which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..96c878b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two_phase are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 9f0208a..34c70a1 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2763,7 +2763,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 410326d..d4af491 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..67b3358 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9aa483c..d56789d 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

#225Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#222)

On Mon, Mar 8, 2021 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 8, 2021 at 10:04 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Sun, Mar 7, 2021 at 3:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Mar 7, 2021 at 7:35 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v51*

Few more comments on v51-0006-Fix-apply-worker-empty-prepare:
======================================================
1.
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and the
+ * corresponding fileset handles of same.
+ */
+typedef struct PsfHashEntry
+{
+ char name[MAXPGPATH]; /* Hash key --- must be first */
+ bool allow_delete; /* ok to delete? */
+} PsfHashEntry;
+

IIUC, this hash table is used for two purposes in the patch: (a) to
check for the existence of the prepare spool file, where we anyway need to
check on disk if it is not found in the hash table; (b) to allow the
prepare spool file to be removed on proc_exit.

I think we don't need the optimization provided by (a) because it will
be too rare a case to deserve any optimization, we might write a
comment in prepare_spoolfile_exists to indicate such an optimization.
For (b), we can use a simple list to track files to be removed on
proc_exit something like we do in CreateLockFile. I think avoiding
hash table usage will reduce the code and chances of bugs in this
area. It won't be easy to write a lot of automated tests to test this
functionality so it is better to avoid minor optimizations at this
stage.

Our data structure psf_hash also needs to be able to discover the
entry for a specific spool file and remove it. e.g. anything marked as
"allow_delete = false" (during prepare) must be able to be re-found
and removed from that structure at commit_prepared or
rollback_prepared time.

But, I think that is not reliable because after restart the entry
might not be present and we anyway need to check the presence of the
file on disk. Actually, you don't need any manipulation with list or
hash at commit_prepared or rollback_prepared, we should just remove
the entry for it at the prepare time and there should be an assert if
we find that entry in the in-memory structure.

Looking at CreateLockFile code, I don't see that it is ever deleting
entries from its "lock_files" list on-the-fly, so it's not really a
fair comparison to say just use a List like CreateLockFile.

Sure, but you can additionally traverse the list and find the required entry.

So, even though we (currently) only have a single data member
"allow_delete", I think the requirement to do a key lookup/delete
makes a HTAB a more appropriate data structure than a List.

Actually, that member is also not required at all because you just
need it till the time of prepare and then remove it.

OK, I plan to change it like this:
- Now the whole hash simply means "delete-on-exit". If the key (aka
filename) exists, delete that file on exit. If not, don't.
- Remove the "allow_delete" member (as you say, it can be redundant
using the new interpretation above).
- The *only* code that CREATES a key will be when
prepare_spoolfile_create is called from begin_prepare.
- At apply_handle_prepare time the key is REMOVED (so that the file will
not be deleted in case of a restart / error before commit/rollback).
- At apply_handle_commit_prepared, Assert that the key is not found,
because prepare should have removed it; the psf file is always deleted.
- At apply_handle_rollback_prepared, Assert that the key is not found,
because prepare should have removed it; the psf file is always deleted.
- At proc-exit time, iterate and delete all the filenames (aka keys).
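
The lifecycle in the list above can be sketched as follows. This is only a minimal standalone illustration, not the patch's code: the names psf_track, psf_untrack, psf_find, psf_finish and psf_cleanup_on_exit, the fixed-size array and the MAX_TRACKED limit are all hypothetical, and the real patch would use PostgreSQL's own hash or list facilities instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_TRACKED 16      /* hypothetical limit, for this sketch only */
#define PATH_LEN    1024

/* Paths to delete if the worker exits before PREPARE is handled. */
static char tracked[MAX_TRACKED][PATH_LEN];
static int  ntracked = 0;

/* Return the index of path in the tracked set, or -1 if absent. */
int
psf_find(const char *path)
{
    for (int i = 0; i < ntracked; i++)
        if (strcmp(tracked[i], path) == 0)
            return i;
    return -1;
}

/* begin_prepare: the spool file was just created, so remember it. */
bool
psf_track(const char *path)
{
    if (ntracked >= MAX_TRACKED || strlen(path) >= PATH_LEN)
        return false;
    strcpy(tracked[ntracked++], path);
    return true;
}

/* prepare: the file must now survive a restart, so forget it. */
bool
psf_untrack(const char *path)
{
    int i = psf_find(path);

    if (i < 0)
        return false;
    /* Fill the freed slot with the last entry. */
    if (i != ntracked - 1)
        strcpy(tracked[i], tracked[ntracked - 1]);
    ntracked--;
    return true;
}

/* commit_prepared / rollback_prepared: the key must already be gone. */
void
psf_finish(const char *path)
{
    assert(psf_find(path) < 0);
    remove(path);           /* the spool file itself is always deleted */
}

/* proc-exit: delete every file still tracked; returns how many there were. */
int
psf_cleanup_on_exit(void)
{
    int n = ntracked;

    for (int i = 0; i < ntracked; i++)
        remove(tracked[i]);
    ntracked = 0;
    return n;
}
```

With this shape, the in-memory structure only ever holds files whose PREPARE has not yet been handled, so nothing in it needs to be consulted after a restart.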

-----
Kind Regards,
Peter Smith.
Fujitsu Australia

#226Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#225)

On Mon, Mar 8, 2021 at 1:26 PM Peter Smith <smithpb2250@gmail.com> wrote:

OK, I plan to change like this.
- Now the whole hash simply means "delete-on-exit". If the key (aka
filename) exists, delete that file on exit. If not don't
- Remove the "allow_delete" member (as you say it can be redundant
using the new interpretation above)
- the *only* code that CREATES a key will be when
prepare_spoolfile_create is called from begin_prepare.
- at apply_handle_prepare time the key is REMOVED (so that file will
not be deleted in case of a restart / error before commit/rollback)

So, the only real place where you need to perform any search is at the
prepare time and I think it should always be the first element if we
use the list here. Am I missing something? If not, I don't see why you
want to prefer HTAB over a simple list? You can remove the first
element and probably have an assert to confirm it is the correct
element (by checking the path) you are removing.
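
A rough sketch of the list-based alternative described above, assuming a plain singly linked list rather than PostgreSQL's List type. PsfNode, psf_list_push and psf_list_pop_expected are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One tracked spool file path (hypothetical structure). */
typedef struct PsfNode
{
    char           *path;
    struct PsfNode *next;
} PsfNode;

/* begin_prepare: push a copy of path at the head; returns the new head. */
PsfNode *
psf_list_push(PsfNode *head, const char *path)
{
    PsfNode *node = malloc(sizeof(PsfNode));
    size_t   len = strlen(path) + 1;

    assert(node != NULL);
    node->path = malloc(len);
    assert(node->path != NULL);
    memcpy(node->path, path, len);
    node->next = head;
    return node;
}

/*
 * prepare: remove the first element, asserting that it is the expected
 * path; returns the new head.
 */
PsfNode *
psf_list_pop_expected(PsfNode *head, const char *expected)
{
    PsfNode *next;

    assert(head != NULL);
    assert(strcmp(head->path, expected) == 0);
    next = head->next;
    free(head->path);
    free(head);
    return next;
}
```

No search is needed here: the entry being removed at prepare time is always the one most recently added at begin_prepare time, and the assertion catches any case where that assumption does not hold.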

--
With Regards,
Amit Kapila.

#227vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#224)

On Mon, Mar 8, 2021 at 11:30 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Fri, Mar 5, 2021 at 9:25 PM vignesh C <vignesh21@gmail.com> wrote:

Created new patch v53:

Thanks for the updated patch.
I noticed one issue: the publisher does not stop normally in
the following case:
# Publisher steps
psql -d postgres -c "CREATE TABLE do_write(id serial primary key);"
psql -d postgres -c "INSERT INTO do_write VALUES(generate_series(1,10));"
psql -d postgres -c "CREATE PUBLICATION mypub FOR TABLE do_write;"

# Subscriber steps
psql -d postgres -p 9999 -c "CREATE TABLE do_write(id serial primary key);"
psql -d postgres -p 9999 -c "INSERT INTO do_write VALUES(1);" # to cause a PK violation
psql -d postgres -p 9999 -c "CREATE SUBSCRIPTION mysub CONNECTION
'host=localhost port=5432 dbname=postgres' PUBLICATION mypub WITH
(two_phase = true);"

# prepare & commit prepared at publisher
psql -d postgres -c \
"begin; insert into do_write values (100); prepare transaction 'test1';"
psql -d postgres -c "commit prepared 'test1';"

Stop publisher:
./pg_ctl -D publisher stop
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down

This is because the following process does not exit:
postgres: walsender vignesh 127.0.0.1(41550) START_REPLICATION

It continuously loops at the below:
#0 0x00007f1c520d3bca in __libc_pread64 (fd=6, buf=0x555b1b3f7870,
count=8192, offset=0) at ../sysdeps/unix/sysv/linux/pread64.c:29
#1 0x0000555b1a8f6d20 in WALRead (state=0x555b1b3f1ce0,
buf=0x555b1b3f7870 "\n\321\002", startptr=16777216, count=8192, tli=1,
errinfo=0x7ffe693b78c0) at xlogreader.c:1116
#2 0x0000555b1ac8ce10 in logical_read_xlog_page
(state=0x555b1b3f1ce0, targetPagePtr=16777216, reqLen=8192,
targetRecPtr=23049936, cur_page=0x555b1b3f7870 "\n\321\002")
at walsender.c:837
#3 0x0000555b1a8f6040 in ReadPageInternal (state=0x555b1b3f1ce0,
pageptr=23044096, reqLen=5864) at xlogreader.c:608
#4 0x0000555b1a8f5849 in XLogReadRecord (state=0x555b1b3f1ce0,
errormsg=0x7ffe693b79c0) at xlogreader.c:329
#5 0x0000555b1ac8ff4a in XLogSendLogical () at walsender.c:2846
#6 0x0000555b1ac8f1e5 in WalSndLoop (send_data=0x555b1ac8ff0e
<XLogSendLogical>) at walsender.c:2289
#7 0x0000555b1ac8db2a in StartLogicalReplication (cmd=0x555b1b3b78b8)
at walsender.c:1206
#8 0x0000555b1ac8e4dd in exec_replication_command (
cmd_string=0x555b1b331670 "START_REPLICATION SLOT \"mysub\"
LOGICAL 0/0 (proto_version '2', two_phase 'on', publication_names
'\"mypub\"')") at walsender.c:1646
#9 0x0000555b1ad04460 in PostgresMain (argc=1, argv=0x7ffe693b7cc0,
dbname=0x555b1b35cc58 "postgres", username=0x555b1b35cc38 "vignesh")
at postgres.c:4323

I felt the publisher should get stopped in this case.
Thoughts?

Regards,
Vignesh

#228Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#227)

On Mon, Mar 8, 2021 at 4:20 PM vignesh C <vignesh21@gmail.com> wrote:

Stop publisher:
./pg_ctl -D publisher stop
waiting for server to shut
down...............................................................
failed
pg_ctl: server does not shut down

This is because the following process does not exit:
postgres: walsender vignesh 127.0.0.1(41550) START_REPLICATION

It continuously loops at the below:

What happens if you don't set the two_phase option? If that also leads
to the same error then can you please also check this case on the
HEAD?

--
With Regards,
Amit Kapila.

#229Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#220)

On Mon, Mar 8, 2021 at 7:17 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v52*

Few more comments:
==================
1.
/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
- | K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+ | K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list

I think the comment above can have TWO_PHASE option listed.

2.
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
..
/*
+ * From now, until the handle_prepare we are spooling to the
+ * current psf.
+ */
+ psf_cur.is_spooling = true;
+ }
+ }
+
+ remote_final_lsn = begin_data.final_lsn;
+
+ in_remote_transaction = true;
+
+ pgstat_report_activity(STATE_RUNNING, NULL);

In case you are spooling the changes, you don't need to set
remote_final_lsn and in_remote_transaction. You probably only need to
do pgstat_report_activity.

3.
Similarly, you don't need to set in_remote_transaction as false in
apply_handle_prepare for the spooling case, rather there should be an
Assert stating that in_remote_transaction is false.

4.
snprintf(path, MAXPGPATH, "pg_twophase/%u-%s.prep_changes", subid, gid);

I feel it is better to create these in pg_logical/twophase as that is
where we store other logical replication related files.
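
For illustration, building the spool file path under the suggested directory could look like the sketch below. psf_build_path is a hypothetical helper, and the truncation check is an addition of this sketch rather than something taken from the patch; MAXPGPATH is defined locally with the value PostgreSQL uses.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAXPGPATH 1024      /* same value as in PostgreSQL */

/*
 * Build "pg_logical/twophase/<subid>-<gid>.prep_changes" into path.
 * Returns false if the result would not fit in pathlen bytes.
 */
bool
psf_build_path(char *path, size_t pathlen, unsigned int subid,
               const char *gid)
{
    int n = snprintf(path, pathlen,
                     "pg_logical/twophase/%u-%s.prep_changes", subid, gid);

    return n >= 0 && (size_t) n < pathlen;
}
```

A caller would pass a buffer of MAXPGPATH bytes together with the subscription OID and the GID of the prepared transaction.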

--
With Regards,
Amit Kapila.

#230vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#228)

On Mon, Mar 8, 2021 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

What happens if you don't set the two_phase option? If that also leads
to the same error then can you please also check this case on the
HEAD?

It succeeds without the two_phase option.
I had further analyzed this issue, see the details of it below:
We have the below code in WalSndDone function which will handle the
walsender exit:
    if (WalSndCaughtUp && sentPtr == replicatedPtr &&
        !pq_is_send_pending())
    {
        QueryCompletion qc;

        /* Inform the standby that XLOG streaming is done */
        SetQueryCompletion(&qc, CMDTAG_COPY, 0);
        EndCommand(&qc, DestRemote, false);
        pq_flush();

        proc_exit(0);
    }

But with the two_phase option, replicatedPtr and sentPtr never
become the same:
(gdb) p /x replicatedPtr
$8 = 0x15faa70
(gdb) p /x sentPtr
$10 = 0x15fac50

Whereas without the two_phase option, replicatedPtr and sentPtr
become the same and the walsender exits:
(gdb) p /x sentPtr
$7 = 0x15fae10
(gdb) p /x replicatedPtr
$8 = 0x15fae10

I think with the two_phase option, replicatedPtr and sentPtr never
become the same, which causes this process to hang.
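
To make the condition concrete, here is a small standalone restatement of the exit test, fed with the gdb values above. walsender_can_exit is a hypothetical helper that only mirrors the if-condition quoted from WalSndDone; it is not a function in the walsender code.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* WAL position, 64-bit as in PostgreSQL */

/*
 * Mirrors the quoted WalSndDone condition: the walsender may exit only
 * when it has caught up, everything sent has been confirmed as
 * replicated, and no output is still pending.
 */
bool
walsender_can_exit(bool caught_up, XLogRecPtr sent_ptr,
                   XLogRecPtr replicated_ptr, bool send_pending)
{
    return caught_up && sent_ptr == replicated_ptr && !send_pending;
}
```

With two_phase enabled the observed pair was sentPtr = 0x15fac50 and replicatedPtr = 0x15faa70, so the test is false on every pass; without it both were 0x15fae10 and the test can succeed.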

Regards,
Vignesh

#231Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#223)

On Mon, Mar 8, 2021 at 4:58 PM vignesh C <vignesh21@gmail.com> wrote:

LOGICAL_REP_MSG_TYPE = 'Y',
+       LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+       LOGICAL_REP_MSG_PREPARE = 'P',
+       LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+       LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
LOGICAL_REP_MSG_STREAM_START = 'S',
LOGICAL_REP_MSG_STREAM_END = 'E',
LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-       LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+       LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+       LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
} LogicalRepMsgType;
As we start adding more and more features, we will have to start
adding more message types, and finding meaningful characters might
become difficult. Should we start using numeric values instead for the
new features being added?

This may or may not become a problem sometime in the future, but I
think the feedback is not really specific to the current patch set so
I am skipping it at this time.

If you want, maybe raise it in a separate thread. Is that OK?

----
Kind Regards,
Peter Smith.
Fujitsu Australia

#232vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#231)

On Tue, Mar 9, 2021 at 9:14 AM Peter Smith <smithpb2250@gmail.com> wrote:

This may or may not become a problem sometime in the future, but I
think the feedback is not really specific to the current patch set so
I am skipping it at this time.

If you want, maybe raise it in a separate thread. Is that OK?

I was thinking of changing the newly added message types to something
like below:

LOGICAL_REP_MSG_TYPE = 'Y',
+       LOGICAL_REP_MSG_BEGIN_PREPARE = 1,
+       LOGICAL_REP_MSG_PREPARE = 2,
+       LOGICAL_REP_MSG_COMMIT_PREPARED = 3,
+       LOGICAL_REP_MSG_ROLLBACK_PREPARED = 4,
LOGICAL_REP_MSG_STREAM_START = 'S',
LOGICAL_REP_MSG_STREAM_END = 'E',
LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-       LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+       LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+       LOGICAL_REP_MSG_STREAM_PREPARE = 5
} LogicalRepMsgType;

Changing these values at a later time may become difficult as it can
break backward compatibility. But if you feel the existing values are
better, we can keep them as they are and revisit this later when we add
more message types.
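
To make the trade-off above concrete, here is a minimal sketch (not code from the patch). It assumes only what the enum in the patch shows: each message type is a single character value, written as the first byte of a message via pq_sendbyte(). The names DemoMsgType, demo_write_type and demo_type_name are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for LogicalRepMsgType. */
typedef enum DemoMsgType
{
	DEMO_MSG_BEGIN_PREPARE = 'b',
	DEMO_MSG_PREPARE = 'P',
	DEMO_MSG_COMMIT_PREPARED = 'K',
	DEMO_MSG_ROLLBACK_PREPARED = 'r',
	DEMO_MSG_STREAM_PREPARE = 'p'
} DemoMsgType;

/* Store the one-byte tag at the start of buf; returns bytes written. */
static size_t
demo_write_type(uint8_t *buf, DemoMsgType type)
{
	buf[0] = (uint8_t) type;
	return 1;
}

/* Map a received tag back to a name, or NULL if it is not recognized. */
static const char *
demo_type_name(uint8_t tag)
{
	switch (tag)
	{
		case DEMO_MSG_BEGIN_PREPARE:
			return "BEGIN PREPARE";
		case DEMO_MSG_PREPARE:
			return "PREPARE";
		case DEMO_MSG_COMMIT_PREPARED:
			return "COMMIT PREPARED";
		case DEMO_MSG_ROLLBACK_PREPARED:
			return "ROLLBACK PREPARED";
		case DEMO_MSG_STREAM_PREPARE:
			return "STREAM PREPARE";
		default:
			return NULL;
	}
}
```

Either representation occupies one byte on the wire, so the choice is mostly about readability in protocol dumps versus running out of mnemonic letters. Whichever values ship first are the ones a peer will be matching against, which is why changing them later is a compatibility concern.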

Regards,
Vignesh

#233Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#231)

On Tue, Mar 9, 2021 at 9:15 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, Mar 8, 2021 at 4:58 PM vignesh C <vignesh21@gmail.com> wrote:

LOGICAL_REP_MSG_TYPE = 'Y',
+       LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+       LOGICAL_REP_MSG_PREPARE = 'P',
+       LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+       LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
LOGICAL_REP_MSG_STREAM_START = 'S',
LOGICAL_REP_MSG_STREAM_END = 'E',
LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-       LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+       LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+       LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
} LogicalRepMsgType;
As we keep adding features, we will have to keep adding message types,
and finding meaningful characters for them might become difficult.
Should we start using numeric values for the newly added message
types?

This may or may not become a problem sometime in the future, but I
think the feedback is not really specific to the current patch set so
I am skipping it at this time.

+1.

--
With Regards,
Amit Kapila.

#234Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#220)
4 attachment(s)

Please find attached the latest patch set v54*

Differences from v53* are:

* Rebased to HEAD @ today

* Addresses some recent feedback issues for patch 0001

Feedback from Amit @ 7/March [ak]
- (36) Fixed. Comment about the psf replay.
- (37) Fixed. prepare_spoolfile_create, check file already exists (on
disk) instead of just checking HTAB.
- (38) Fixed. Added comment about potential overwrite of existing file.

Feedback from Vignesh @ 8/March [vc]
- (45) Fixed. Changed some comments to be single-line comments (e.g. if
they only apply to a single following stmt)
- (46) Fixed. prepare_spoolfile_create, refactored slightly to make
more use of common code in if/else
- (47) Skipped. This was feedback suggesting using ints instead of
character values for message type enum.
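
For readers following item (37): the idea of checking the disk rather than only the in-memory hash table can be sketched as below. This is an illustration only, not the patch's prepare_spoolfile_exists; it assumes a plain POSIX stat() call is acceptable, and the name demo_spoolfile_exists is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: report whether a regular file exists at path.
 * An in-memory table is lost on restart, whereas the file on disk
 * survives, so the disk is the more reliable thing to ask.
 */
static bool
demo_spoolfile_exists(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0 && S_ISREG(st.st_mode);
}
```

The point of the on-disk check is that a spoolfile written before a worker restart would not be present in a freshly created hash table, so a lookup there alone could miss it.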

-----
[ak] /messages/by-id/CAA4eK1+dO07RrQwfHAK5jDP9qiXik4-MVzy+coEG09shWTJFGg@mail.gmail.com
[vc] /messages/by-id/CALDaNm29gOsCUtNkvHgqbbD1kbM8m67h4AqfmUWG1oTnfuPFxA@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v54-0001-Add-support-for-apply-at-prepare-time-to-built-i.patchapplication/octet-stream; name=v54-0001-Add-support-for-apply-at-prepare-time-to-built-i.patchDownload
From 091eab1a590e5d8ac69981dcad000802d9ac1b98 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 9 Mar 2021 14:51:15 +1100
Subject: [PATCH v54] Add support for apply at prepare time to built-in logical
  replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* prepare API for streaming transactions is not supported.

* change stream_prepare_cb from a required callback to an optional callback.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

* This patch also adds a new option to enable two_phase while creating a slot.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 src/backend/access/transam/twophase.c              |  68 ++
 src/backend/commands/subscriptioncmds.c            |   2 +-
 .../libpqwalreceiver/libpqwalreceiver.c            |   6 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 194 ++++++
 src/backend/replication/logical/tablesync.c        | 180 ++++-
 src/backend/replication/logical/worker.c           | 767 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 157 ++++-
 src/backend/replication/repl_gram.y                |  14 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/replication/logicalproto.h             |  70 +-
 src/include/replication/reorderbuffer.h            |  12 +
 src/include/replication/walreceiver.h              |   5 +-
 src/include/replication/worker_internal.h          |   3 +
 src/tools/pgindent/typedefs.list                   |   5 +
 17 files changed, 1412 insertions(+), 83 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid the possibility of matching
+ * a prepared xact from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..f6793f0 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transactions commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..e958d28 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,200 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepare message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..97fc399 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1052,7 +1026,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have still not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress()
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When the process_syncing_tables_for_apply changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN()
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..61f04ed 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -208,6 +209,54 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and the
+ * corresponding fileset handles of same.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	bool		allow_delete;	/* ok to delete? */
+} PsfHashEntry;
+
+/*
+ * Information about the "current" psf spoolfile.
+ */
+typedef struct PsfFile
+{
+	char	name[MAXPGPATH];/* psf name - same as the HTAB key. */
+	bool	is_spooling;	/* are we currently spooling to this file? */
+	File 	vfd;			/* -1 when the file is closed. */
+	off_t	cur_offset;		/* offset for the next write or read. Reset to 0
+							 * when file is opened. */
+} PsfFile;
+
+/*
+ * Hash table for storing the Prepared spoolfile info along with shared fileset.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * Information about the 'current' open spoolfile is only valid when spooling.
+ * This is flagged as 'is_spooling' only between begin_prepare and prepare.
+ */
+static PsfFile psf_cur = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_delete(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+static void prepare_spoolfile_on_proc_exit(int status, Datum arg);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -720,6 +769,338 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				errmsg("transaction identifier \"%s\" is already in use",
+					   begin_data.gid)));
+
+	/*
+	 * With unlucky timing of the apply/tablesync workers it is possible to
+	 * have a "consistent snapshot" that spans prepare/commit such that the
+	 * tablesync did not do the prepare (because the snapshot was not
+	 * consistent) and the apply worker does the begin prepare ('b') but
+	 * skips all the prepared operations [e.g. inserts] while the tablesync
+	 * was still busy (see the condition of should_apply_changes_for_rel).
+	 *
+	 * This can lead to an "empty prepare", because later when the apply
+	 * worker does the commit prepare ('K'), there is nothing in it (the
+	 * inserts were skipped earlier).
+	 *
+	 * We avoid this using the 2 part logic: (1) Wait for all tablesync workers
+	 * to reach SYNCDONE/READY state; (2) If the begin_prepare lsn is now
+	 * behind any tablesync lsn then spool the prepared messages to a file
+	 * to be replayed later at commit_prepared time.
+	 *
+	 * -----
+	 *
+	 * XXX - The 2PC protocol needs the publisher to be aware when the PREPARE
+	 * has been successfully acted on. But because of this "empty prepare"
+	 * case now the prepared messages may be spooled to a file and, when
+	 * that happens the PREPARE would not happen at the usual time, but would
+	 * be deferred until COMMIT PREPARED time. This quirk could only happen
+	 * immediately after the initial table synchronization phase; once all
+	 * tables have achieved READY state the 2PC protocol will behave normally.
+	 *
+	 * A future release may be able to detect when all tables are READY and set
+	 * a flag to indicate this subscription/slot is ready for two_phase
+	 * decoding. Then at the publisher-side, we could enable wait-for-prepares
+	 * only when all the slots of WALSender have that flag set.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN, then all messages (until the
+		 * prepare) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+
+			/*
+			 * Create the spoolfile.
+			 */
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * From now, until the handle_prepare we are spooling to the
+			 * current psf.
+			 */
+			psf_cur.is_spooling = true;
+		}
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_cur.is_spooling)
+	{
+		PsfHashEntry *hentry;
+
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * Flush the spoolfile, so changes can survive a restart.
+		 *
+		 * If the publisher resends the same data again after a restart (e.g.
+		 * if subscriber origin has not moved past this prepare), then the same
+		 * named psf file will be overwritten with the same data. See
+		 * prepare_spoolfile_create.
+		 */
+		FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);
+
+		/* We are finished spooling to the current psf. */
+		psf_cur.is_spooling = false;
+
+		/*
+		 * The commit_prepare will need the spoolfile, so unregister it for
+		 * removal on proc-exit just in case there is an unexpected restart
+		 * between now and when commit_prepared happens.
+		 */
+		hentry = (PsfHashEntry *) hash_search(psf_hash, psf_cur.name,
+											  HASH_FIND, NULL);
+		Assert(hentry);
+		hentry->allow_delete = false;
+
+		/*
+		 * The psf_cur.vfd is meaningful only between begin_prepare and prepared.
+		 * So close it now. Any messages written to the psf will be applied
+		 * later during handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		in_remote_transaction = false;
+		return;
+	}
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	/*
+	 * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+	 * was dispatched via the psf spoolfile replay then the remote_final_lsn
+	 * is set to commit lsn instead. Hence the <= instead of == check below.
+	 */
+	Assert(prepare_data.prepare_lsn <= remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+
+		/*
+		 * Replay/dispatch the spooled messages.
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		/* After replaying the psf it is no longer needed. Just delete it. */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		/* We are finished with this spoolfile. Delete it. */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/*
+	 * It is possible that we haven't received the prepare because it occurred
+	 * before walsender reached a consistent point, in which case we need to
+	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
+	 */
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -732,6 +1113,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		!psf_cur.is_spooling &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1150,6 +1532,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1271,6 +1656,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1429,6 +1817,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1798,6 +2189,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1954,6 +2348,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2061,6 +2477,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	TimeLineID	tli;
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL     hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2180,7 +2613,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && !psf_cur.is_spooling)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2927,6 +3360,9 @@ ApplyWorkerMain(Datum main_arg)
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
 
+	/* Arrange to delete any unwanted psf file(s) at proc-exit */
+	on_proc_exit(prepare_spoolfile_on_proc_exit, 0);
+
 	/* Setup signal handling */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
@@ -3103,3 +3539,332 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_cur.is_spooling ? "Do" : "Don't");
+
+	if (!psf_cur.is_spooling)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	bool		file_found;
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(!psf_cur.is_spooling);
+
+	/* check if the file already exists. */
+	file_found = prepare_spoolfile_exists(path);
+
+	if (!file_found)
+	{
+		elog(DEBUG1, "File \"%s\" not found. Create it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m", path)));
+	}
+	else
+	{
+		/*
+		 * Open the file with O_TRUNC, discarding any old contents, because
+		 * we always want to create/overwrite this file.
+		 */
+		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", path)));
+	}
+
+	/* Create/Find the spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash, path,
+										  HASH_ENTER, NULL);
+	Assert(hentry);
+	memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+	psf_cur.cur_offset = 0;
+	hentry->allow_delete = true;
+
+	/* Sanity checks */
+	Assert(psf_cur.vfd >= 0);
+	Assert(psf_cur.cur_offset == 0);
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close(void)
+{
+	if (psf_cur.vfd >= 0)
+		FileClose(psf_cur.vfd);
+
+	/* Mark this fd as not valid to use anymore. */
+	psf_cur.is_spooling = false;
+	psf_cur.vfd = -1;
+	psf_cur.cur_offset = 0;
+}
+
+/*
+ * Delete the specified psf spoolfile, and any psf_hash entry associated with it.
+ */
+static void
+prepare_spoolfile_delete(char *path)
+{
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Delete the file off the disk. */
+	unlink(path);
+
+	/* Remove any entry from the psf_hash, if present */
+	hash_search(psf_hash, path, HASH_REMOVE, NULL);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length field itself), action code (identifying the message type) and
+ * message contents.
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+	int			bytes_written;
+
+	Assert(psf_cur.is_spooling);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(len));
+	psf_cur.cur_offset += bytes_written;
+
+	/* then the action */
+	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(action));
+	psf_cur.cur_offset += bytes_written;
+
+	/* and finally the remaining part of the buffer (after the action byte) */
+	len = (s->len - s->cursor);
+
+	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == len);
+	psf_cur.cur_offset += bytes_written;
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool		found;
+
+	File		fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+	found = fd >= 0;
+	if (fd >= 0)
+		FileClose(fd);
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	psf.vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (psf.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open prepared spoolfile \"%s\": %m",
+						path)));
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Make sure the apply_dispatch handler methods are aware we're in a
+	 * remote transaction.
+	 */
+	remote_final_lsn = final_lsn;
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		nbytes = FileRead(psf.vfd, buffer, len,
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+		if (nbytes != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldctx2);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	FileClose(psf.vfd);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly the same length and
+	 * padded with '\0' so garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "pg_twophase/%u-%s.prep_changes", subid, gid);
+}
+
+/*
+ * proc_exit callback to remove unwanted psf files.
+ */
+static void
+prepare_spoolfile_on_proc_exit(int status, Datum arg)
+{
+	HASH_SEQ_STATUS seq_status;
+	PsfHashEntry *hentry;
+
+	/* Iterate the HTAB looking for files that can be deleted. */
+	if (psf_hash)
+	{
+		hash_seq_init(&seq_status, psf_hash);
+		while ((hentry = (PsfHashEntry *) hash_seq_search(&seq_status)) != NULL)
+		{
+			char *path = hentry->name;
+
+			if (hentry->allow_delete)
+				prepare_spoolfile_delete(path);
+		}
+	}
+}
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2e4b39f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -322,8 +342,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +362,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +383,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool        send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +843,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1250,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..c5154ae 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -242,15 +244,16 @@ create_replication_slot:
 					$$ = (Node *) cmd;
 				}
 			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..410326d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure is used to
+ * hold the information for both prepare and commit prepared. For commit
+ * prepared, prepare_lsn and preparetime store the commit LSN and commit
+ * time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -170,5 +237,4 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 										  TransactionId subxid);
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
-
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95ae..4ffcef5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
@@ -1955,6 +1958,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
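
The `rbtxn_commit_prepared()` and `rbtxn_rollback_prepared()` macros added to reorderbuffer.h above are plain bit tests on `txn_flags`. A minimal standalone sketch of that pattern follows. Everything prefixed `DEMO_`/`demo_` is hypothetical: the bit values, the one-field struct, and the helper that records the outcome are assumptions made for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the real RBTXN_* constants are defined in reorderbuffer.h. */
#define DEMO_RBTXN_PREPARE            0x0040
#define DEMO_RBTXN_COMMIT_PREPARED    0x0100
#define DEMO_RBTXN_ROLLBACK_PREPARED  0x0200

/* Stand-in for ReorderBufferTXN, reduced to the one field the macros read. */
typedef struct DemoTXN
{
	uint32_t	txn_flags;
} DemoTXN;

/* Has this prepared transaction been committed? */
#define demo_rbtxn_commit_prepared(txn) \
( \
	((txn)->txn_flags & DEMO_RBTXN_COMMIT_PREPARED) != 0 \
)

/* Has this prepared transaction been rolled back? */
#define demo_rbtxn_rollback_prepared(txn) \
( \
	((txn)->txn_flags & DEMO_RBTXN_ROLLBACK_PREPARED) != 0 \
)

/*
 * Record the final outcome of a prepared transaction (assumed helper).
 * In this sketch a transaction must already carry the PREPARE bit and
 * gets exactly one of the two outcome bits.
 */
static void
demo_finish_prepared(DemoTXN *txn, bool is_commit)
{
	assert((txn->txn_flags & DEMO_RBTXN_PREPARE) != 0);
	assert(!demo_rbtxn_commit_prepared(txn));
	assert(!demo_rbtxn_rollback_prepared(txn));

	txn->txn_flags |= is_commit ? DEMO_RBTXN_COMMIT_PREPARED
		: DEMO_RBTXN_ROLLBACK_PREPARED;
}
```

Wrapping the test in `!= 0` means callers get a clean boolean no matter which bit position the flag occupies.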

v54-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 612edcfc36a1accfa56a94bf8704be48f5e92a67 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 9 Mar 2021 14:59:37 +1100
Subject: [PATCH v54] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 332 ++++++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl | 282 ++++++++++++++++++++
 2 files changed, 614 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9aa483c
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,332 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
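
Both TAP scripts above wait on `pg_current_wal_lsn() <= replay_lsn` before checking the subscriber. An LSN is a 64-bit WAL position that is conventionally printed as two hexadecimal halves (`%X/%X`). The sketch below shows that comparison and formatting in isolation. The `DemoLsn` type, the `DEMO_LSN_FORMAT_ARGS` macro, and both helper functions are hypothetical names for illustration only, not the server's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a WAL position: a plain 64-bit byte offset. */
typedef uint64_t DemoLsn;

/* Split an LSN into its high and low 32-bit halves for "%X/%X" output. */
#define DEMO_LSN_FORMAT_ARGS(lsn) \
	(unsigned int) ((lsn) >> 32), (unsigned int) ((lsn) & 0xFFFFFFFFu)

/*
 * Mirrors the condition the tests poll for: the subscriber has caught up
 * once the publisher's current WAL position is not ahead of the position
 * the subscriber reports as replayed.
 */
static bool
demo_caught_up(DemoLsn current_wal_lsn, DemoLsn replay_lsn)
{
	return current_wal_lsn <= replay_lsn;
}

/* Format an LSN into buf; returns the number of characters snprintf wants. */
static int
demo_format_lsn(char *buf, size_t buflen, DemoLsn lsn)
{
	assert(buf != NULL && buflen > 0);
	return snprintf(buf, buflen, "%X/%X", DEMO_LSN_FORMAT_ARGS(lsn));
}
```

Comparing the positions as unsigned 64-bit integers is sufficient here because a larger value always means a later point in the WAL stream.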

v54-0004-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From 1a179fcdd563d4eb6895ea44f4e7769a9a6dbca3 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 9 Mar 2021 16:03:36 +1100
Subject: [PATCH v54] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for
debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 ++++++++++---
 src/backend/replication/logical/worker.c    | 66 ++++++++++++++++++++++-------
 2 files changed, 74 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 97fc399..f3984d4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4cbe50e..78f0c13 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -828,14 +828,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			process_syncing_tables(begin_data.final_lsn);
 
 			/* This latch is to prevent 100% CPU looping. */
@@ -853,7 +855,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 
@@ -895,6 +902,8 @@ apply_handle_prepare(StringInfo s)
 	{
 		PsfHashEntry *hentry;
 
+		elog(LOG, "!!> apply_handle_prepare: SPOOLING");
+
 		/* Write the PREPARE info to the psf file. */
 		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
 
@@ -916,6 +925,8 @@ apply_handle_prepare(StringInfo s)
 		 * removal on proc-exit just in case there is an unexpected restart
 		 * between now and when commit_prepared happens.
 		 */
+		elog(LOG,
+			"!!> apply_handle_prepare: Make sure the spoolfile is not removed on proc-exit");
 		hentry = (PsfHashEntry *) hash_search(psf_hash, psf_cur.name,
 											  HASH_FIND, NULL);
 		Assert(hentry);
@@ -1000,6 +1011,8 @@ apply_handle_commit_prepared(StringInfo s)
 	{
 		int			nchanges;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * Replay/dispatch the spooled messages.
 		 */
@@ -1007,8 +1020,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		/* After replaying the psf it is no longer needed. Just delete it. */
@@ -1072,6 +1085,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2350,18 +2364,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3555,8 +3573,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_cur.is_spooling ? "Do" : "Don't");
 
@@ -3580,7 +3598,7 @@ prepare_spoolfile_create(char *path)
 	bool		file_found;
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(!psf_cur.is_spooling);
 
@@ -3589,7 +3607,7 @@ prepare_spoolfile_create(char *path)
 
 	if (!file_found)
 	{
-		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 			ereport(ERROR,
@@ -3602,7 +3620,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the beginning because we always want to
 		 * create/overwrite this file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		elog(LOG, "!!> Found file \"%s\". Overwrite it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 			ereport(ERROR,
@@ -3630,6 +3648,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_cur.vfd >= 0)
 		FileClose(psf_cur.vfd);
 
@@ -3645,6 +3664,8 @@ prepare_spoolfile_close()
 static void
 prepare_spoolfile_delete(char *path)
 {
+	elog(LOG, "!!> prepare_spoolfile_delete: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3670,18 +3691,20 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_cur.is_spooling);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(len));
 	psf_cur.cur_offset += bytes_written;
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(action));
@@ -3690,6 +3713,7 @@ prepare_spoolfile_write(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == len);
@@ -3710,6 +3734,11 @@ prepare_spoolfile_exists(char *path)
 	if (fd >= 0)
 		FileClose(fd);
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
 	return found;
 }
 
@@ -3726,8 +3755,8 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 				oldctx2;
 	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3772,6 +3801,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3790,6 +3820,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		nbytes = FileRead(psf.vfd, buffer, len,
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
@@ -3806,7 +3837,9 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		/* Ensure we are reading the data into our memory context. */
 		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
 
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: Before dispatch");
 		apply_dispatch(&s2);
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: After dispatch");
 
 		MemoryContextReset(ApplyMessageContext);
 
@@ -3815,13 +3848,13 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(LOG, "!!> replayed %d changes from file '%s'",
 				 nchanges, path);
 	}
 
 	FileClose(psf.vfd);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
@@ -3857,6 +3890,8 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 	HASH_SEQ_STATUS seq_status;
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_on_proc_exit");
+
 	/* Iterate the HTAB looking for what file can be deleted. */
 	if (psf_hash)
 	{
@@ -3865,6 +3900,7 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 		{
 			char *path = hentry->name;
 
+			elog(LOG, "!!> prepare_spoolfile_proc_exit: found '%s'", path);
 			if (hentry->allow_delete)
 				prepare_spoolfile_delete(path);
 		}
-- 
1.8.3.1
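
For readers following the debug output above, the on-disk record layout that prepare_spoolfile_write() produces (total length, then the action byte, then the rest of the message) can be sketched outside the server. This is only a minimal illustration under stated assumptions: it uses plain stdio instead of PostgreSQL's File API, and the names spool_write and spool_replay are hypothetical, not functions from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-ins for the patch's spool-file routines. They use
 * stdio rather than PostgreSQL's FileWrite()/FileRead().
 */

/* Append one record: total length, then the action byte, then the payload. */
void
spool_write(FILE *fp, char action, const char *payload, int payload_len)
{
	/* on-disk size includes the action type character */
	int			len = payload_len + (int) sizeof(char);
	size_t		n;

	n = fwrite(&len, sizeof(len), 1, fp);
	assert(n == 1);
	n = fwrite(&action, sizeof(action), 1, fp);
	assert(n == 1);
	if (payload_len > 0)
	{
		n = fwrite(payload, 1, (size_t) payload_len, fp);
		assert(n == (size_t) payload_len);
	}
	(void) n;					/* silence unused warning under NDEBUG */
}

/*
 * Read every record from the start of the file and collect the action
 * bytes into 'actions' (capacity 'cap', including the terminating NUL).
 * Returns the number of records read, or -1 on a truncated record.
 */
int
spool_replay(FILE *fp, char *actions, int cap)
{
	int			nchanges = 0;
	int			len;
	char	   *buffer = NULL;

	rewind(fp);
	for (;;)
	{
		char	   *tmp;
		size_t		nread = fread(&len, 1, sizeof(len), fp);

		if (nread == 0)
			break;				/* clean end of file */
		if (nread != sizeof(len) || len < 1)
		{
			free(buffer);
			return -1;
		}
		tmp = realloc(buffer, (size_t) len);
		if (tmp == NULL)
		{
			free(buffer);
			return -1;
		}
		buffer = tmp;
		if (fread(buffer, 1, (size_t) len, fp) != (size_t) len)
		{
			free(buffer);
			return -1;
		}
		/* the first byte of each record is the action */
		if (nchanges < cap - 1)
			actions[nchanges] = buffer[0];
		nchanges++;
	}
	if (cap > 0)
		actions[nchanges < cap ? nchanges : cap - 1] = '\0';
	free(buffer);
	return nchanges;
}
```

In the patch the replay loop hands each buffer to apply_dispatch(); the sketch only records the action byte so that the framing can be checked in isolation.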

Attachment: v54-0003-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 966b28dcaadb9e390dcf7eaad3ab6fcd70feb5b8 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 9 Mar 2021 15:21:00 +1100
Subject: [PATCH v54] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option, "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/create_subscription.sgml          | 27 +++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 72 +++++++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 +
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 ++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 +++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 ++-
 src/bin/psql/tab-complete.c                        |  2 +-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 +
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 93 +++++++++++++++-------
 src/test/regress/sql/subscription.sql              | 25 ++++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 18 files changed, 260 insertions(+), 48 deletions(-)
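
The option rules this patch adds to parse_subscription_options() (two_phase may be given only once, it cannot be changed by ALTER SUBSCRIPTION, and it cannot be combined with streaming) can be modelled in a few lines of standalone C. This is only a sketch under stated assumptions: the Opt struct, the OptsResult codes and check_subscription_opts() are hypothetical simplifications, and the real code works on DefElem lists and reports errors through ereport().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a parsed option (DefElem). */
typedef struct
{
	const char *name;
	bool		value;
} Opt;

/* Hypothetical result codes standing in for the ereport(ERROR, ...) calls. */
typedef enum
{
	OPTS_OK,
	OPTS_REDUNDANT,				/* "conflicting or redundant options" */
	OPTS_CANNOT_ALTER,			/* "cannot alter two_phase option" */
	OPTS_EXCLUSIVE,				/* two_phase and streaming both true */
	OPTS_UNKNOWN				/* unrecognized parameter */
} OptsResult;

/*
 * allow_twophase is true for CREATE SUBSCRIPTION and false for ALTER
 * SUBSCRIPTION, mirroring the NULL "twophase" pointer passed by the
 * ALTER callers in the patch.
 */
OptsResult
check_subscription_opts(const Opt *opts, size_t nopts, bool allow_twophase,
						bool *streaming, bool *twophase)
{
	bool		streaming_given = false;
	bool		twophase_given = false;

	assert(streaming != NULL && twophase != NULL);
	*streaming = false;
	*twophase = false;

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "streaming") == 0)
		{
			if (streaming_given)
				return OPTS_REDUNDANT;
			streaming_given = true;
			*streaming = opts[i].value;
		}
		else if (strcmp(opts[i].name, "two_phase") == 0)
		{
			if (!allow_twophase)
				return OPTS_CANNOT_ALTER;
			if (twophase_given)
				return OPTS_REDUNDANT;
			twophase_given = true;
			*twophase = opts[i].value;
		}
		else
			return OPTS_UNKNOWN;
	}

	/* the combination is rejected only when both are actually enabled */
	if (*twophase && *streaming)
		return OPTS_EXCLUSIVE;

	return OPTS_OK;
}
```

Note that the sketch covers only the parse-time check. The patch additionally rejects ALTER SUBSCRIPTION ... SET (streaming = true) on a subscription that already has two_phase enabled, which needs the stored subscription state and is not modelled here.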

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..eeb7e35 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,33 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fc94a73..060fab4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1168,7 +1168,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f6793f0..96fcf49 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option; doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * the current implementation has some issues that could cause a
+	 * streaming prepared transaction to be incorrectly missed in the initial
+	 * syncing phase. Hence, this combination is disabled until that issue can
+	 * be addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +576,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +883,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +918,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if (sub->twophase && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +947,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +993,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1039,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 61f04ed..4cbe50e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2872,6 +2872,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3521,6 +3522,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2e4b39f..91ecc55 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..96c878b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two_phase are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 9f0208a..34c70a1 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2763,7 +2763,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 410326d..d4af491 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..67b3358 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9aa483c..d56789d 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
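
For readers skimming the subscription.out changes above, the three rejections that the new regression tests exercise can be modelled in a few lines of standalone C. This is only a sketch under stated assumptions: SubOptions and the three function names are hypothetical and are not taken from subscriptioncmds.c; only the error strings are copied from the expected output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the two options under test. */
typedef struct SubOptions
{
	bool		streaming;
	bool		two_phase;
} SubOptions;

/*
 * CREATE SUBSCRIPTION: returns NULL when the option combination is
 * acceptable, otherwise the error text shown in the expected output.
 */
const char *
check_create_options(const SubOptions *opts)
{
	assert(opts != NULL);
	if (opts->streaming && opts->two_phase)
		return "two_phase = true and streaming = true are mutually exclusive options";
	return NULL;
}

/*
 * ALTER SUBSCRIPTION ... SET (streaming = ...): stores the new value and
 * returns NULL on success, or returns the error text and changes nothing.
 */
const char *
alter_streaming(SubOptions *sub, bool new_streaming)
{
	assert(sub != NULL);
	if (new_streaming && sub->two_phase)
		return "cannot set streaming = true for two-phase enabled subscription";
	sub->streaming = new_streaming;
	return NULL;
}

/*
 * ALTER SUBSCRIPTION ... SET (two_phase = ...): the tests above expect
 * this to be rejected regardless of the requested value.
 */
const char *
alter_two_phase(SubOptions *sub, bool new_two_phase)
{
	assert(sub != NULL);
	(void) new_two_phase;
	return "cannot alter two_phase option";
}
```

A caller would treat a non-NULL return as the message to report and leave the subscription untouched, which mirrors the \dRs+ output staying unchanged after each failed ALTER.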

#235Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#230)

On Mon, Mar 8, 2021 at 8:09 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, Mar 8, 2021 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think in case of the two_phase option, replicatedPtr and sentPtr never
become the same, which causes this process to hang.

The reason is that on the subscriber you have created a situation
(PK violation) where it is not able to proceed with the initial tablesync,
and the apply worker is then waiting for the tablesync to complete, so it
is not able to process new messages. I think as soon as you remove the
duplicate row from the table it will be able to proceed.

Now, we can see a similar situation even in HEAD without 2PC though it
is a bit tricky to reproduce. Basically, when the tablesync worker is
in SUBREL_STATE_CATCHUP state and it has a lot of WAL to process then
the apply worker is just waiting for it to finish applying all the WAL
and won't process any message. So at that time, if you try to stop the
publisher you will see the same behavior. I have simulated a lot of
WAL processing by manually debugging the tablesync and not proceeding
for some time. You can also try by adding sleep after the tablesync
worker has set the state as SUBREL_STATE_CATCHUP.

So, I feel this is just an expected behavior and users need to
manually fix the situation where tablesync worker is not able to
proceed due to PK violation. Does this make sense?

--
With Regards,
Amit Kapila.

#236vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#235)

On Tue, Mar 9, 2021 at 11:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 8, 2021 at 8:09 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, Mar 8, 2021 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think in case of the two_phase option, replicatedPtr and sentPtr never
become the same, which causes this process to hang.

The reason is that on the subscriber you have created a situation
(PK violation) where it is not able to proceed with the initial tablesync,
and the apply worker is then waiting for the tablesync to complete, so it
is not able to process new messages. I think as soon as you remove the
duplicate row from the table it will be able to proceed.

Now, we can see a similar situation even in HEAD without 2PC though it
is a bit tricky to reproduce. Basically, when the tablesync worker is
in SUBREL_STATE_CATCHUP state and it has a lot of WAL to process then
the apply worker is just waiting for it to finish applying all the WAL
and won't process any message. So at that time, if you try to stop the
publisher you will see the same behavior. I have simulated a lot of
WAL processing by manually debugging the tablesync and not proceeding
for some time. You can also try by adding sleep after the tablesync
worker has set the state as SUBREL_STATE_CATCHUP.

So, I feel this is just an expected behavior and users need to
manually fix the situation where tablesync worker is not able to
proceed due to PK violation. Does this make sense?

Thanks for the detailed explanation. This behavior looks similar to
the issue you described, so we can ignore it, as it does not seem to be
caused by this patch. I also noticed that if we handle the PK violation
error by deleting the record that causes it, the server is able to stop
immediately without any issue.

Regards,
Vignesh

#237vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#234)

On Tue, Mar 9, 2021 at 10:46 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v54*

Differences from v53* are:

* Rebased to HEAD @ today

* Addresses some recent feedback issues for patch 0001

Feedback from Amit @ 7/March [ak]
- (36) Fixed. Comment about the psf replay.
- (37) Fixed. prepare_spoolfile_create, check file already exists (on
disk) instead of just checking HTAB.
- (38) Fixed. Added comment about potential overwrite of existing file.

Feedback from Vignesh @ 8/March [vc]
- (45) Fixed. Changed some comments to be single-line comments (e.g. if
they only apply to a single following stmt)
- (46) Fixed. prepare_spoolfile_create, refactored slightly to make
more use of common code in if/else
- (47) Skipped. This was feedback suggesting using ints instead of
character values for message type enum.

Thanks for the updated patch.
Few comments:

+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+       "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+       "ALTER PUBLICATION tap_pub ADD TABLE tab_full");

This can be changed to :
$node_publisher->safe_psql('postgres',
"CREATE PUBLICATION tap_pub FOR TABLE tab_full");

We can make similar changes in:
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+       "CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+       "ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres', "
+       CREATE SUBSCRIPTION tap_sub_B
+       CONNECTION '$node_A_connstr application_name=$appname_B'
+       PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+       "CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+       "ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+       "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+       or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+       "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows inserted via 2PC are visible on the subscriber');

"Rows inserted via 2PC are visible on the subscriber"
should be something like:
"Rows rolled back are not on the subscriber"

git diff --check
src/backend/replication/logical/worker.c:3704: trailing whitespace.

Regards,
Vignesh

#238Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#223)

On Mon, Mar 8, 2021 at 4:58 PM vignesh C <vignesh21@gmail.com> wrote:

+               while (AnyTablesyncInProgress())
+               {
+                       process_syncing_tables(begin_data.final_lsn);
+
+                       /* This latch is to prevent 100% CPU looping. */
+                       (void) WaitLatch(MyLatch,
+                                                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+                                                        1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+                       ResetLatch(MyLatch);
+               }
Should we have CHECK_FOR_INTERRUPTS inside the while loop?

The process_syncing_tables will end up in the
process_syncing_tables_for_apply() function. And in that function IIUC
the apply worker is spending most of the time waiting for the
tablesync to achieve SYNCDONE state.
See wait_for_relation_state_change(rstate->relid, SUBREL_STATE_SYNCDONE);

Now, notice the wait_for_relation_state_change already has
CHECK_FOR_INTERRUPTS();

So, AFAIK it isn't necessary to put another CHECK_FOR_INTERRUPTS at
the outer loop.

Thoughts?

------
Kind Regards,
Peter Smith.
Fujitsu Australia.

#239Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#238)

On Tue, Mar 9, 2021 at 3:02 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, Mar 8, 2021 at 4:58 PM vignesh C <vignesh21@gmail.com> wrote:

+               while (AnyTablesyncInProgress())
+               {
+                       process_syncing_tables(begin_data.final_lsn);
+
+                       /* This latch is to prevent 100% CPU looping. */
+                       (void) WaitLatch(MyLatch,
+                                                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+                                                        1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+                       ResetLatch(MyLatch);
+               }
Should we have CHECK_FOR_INTERRUPTS inside the while loop?

The process_syncing_tables will end up in the
process_syncing_tables_for_apply() function. And in that function IIUC
the apply worker is spending most of the time waiting for the
tablesync to achieve SYNCDONE state.
See wait_for_relation_state_change(rstate->relid, SUBREL_STATE_SYNCDONE);

But I think for a large copy it won't wait in that state, because the
tablesync worker will still be in SUBREL_STATE_DATASYNC state and we
wait for SUBREL_STATE_SYNCDONE state only after the initial copy is
finished. So, I think it is a good idea to call CHECK_FOR_INTERRUPTS
in this loop.
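
To make the difference concrete, here is a small standalone model of such a polling loop. It is a sketch only: SimWorker, WaitResult and wait_for_tablesyncs are hypothetical names, the counter stands in for a long initial copy, and no PostgreSQL API is used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simulation state; none of this is PostgreSQL code. */
typedef struct SimWorker
{
	int			copy_steps_left;	/* iterations until the tablesync finishes */
	int			interrupt_at;		/* iteration at which a stop request arrives, -1 for never */
	int			iterations;			/* loop iterations executed so far */
} SimWorker;

typedef enum WaitResult
{
	WAIT_SYNC_DONE,
	WAIT_INTERRUPTED
} WaitResult;

/*
 * Models the while (AnyTablesyncInProgress()) loop. With check_interrupts
 * set, a pending stop request is noticed at the top of every iteration;
 * without it, the loop only ends once the copy has finished.
 */
WaitResult
wait_for_tablesyncs(SimWorker *w, bool check_interrupts)
{
	assert(w != NULL);

	while (w->copy_steps_left > 0)
	{
		if (check_interrupts &&
			w->interrupt_at >= 0 &&
			w->iterations >= w->interrupt_at)
			return WAIT_INTERRUPTED;

		/* stands in for process_syncing_tables() plus the WaitLatch() timeout */
		w->copy_steps_left--;
		w->iterations++;
	}
	return WAIT_SYNC_DONE;
}
```

With a stop request arriving at iteration 3 of a 1000-step copy, the checking variant returns after 3 iterations, while the non-checking variant runs all 1000. That is the shutdown delay being discussed, in miniature.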

--
With Regards,
Amit Kapila.

#240Ajin Cherian
itsajin@gmail.com
In reply to: vignesh C (#223)
4 attachment(s)

On Mon, Mar 8, 2021 at 4:59 PM vignesh C <vignesh21@gmail.com> wrote:

On Mon, Mar 8, 2021 at 7:17 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v52*

Few comments:

+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+       /* read fields */
+       begin_data->final_lsn = pq_getmsgint64(in);
+       if (begin_data->final_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "final_lsn not set in begin message");
+       begin_data->end_lsn = pq_getmsgint64(in);
+       if (begin_data->end_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "end_lsn not set in begin message");
+       begin_data->committime = pq_getmsgint64(in);
+       begin_data->xid = pq_getmsgint(in, 4);
+
+       /* read gid (copy it into a pre-allocated buffer) */
+       strcpy(begin_data->gid, pq_getmsgstring(in));
+}
In logicalrep_read_begin_prepare we validate final_lsn & end_lsn. But
this validation is not done in logicalrep_read_commit_prepared and
logicalrep_read_rollback_prepared. Should we keep it consistent?

Updated.
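
One way to keep the three readers consistent is to route them through a single check. The following standalone sketch illustrates the idea; PreparedMsgLsns and first_unset_lsn are hypothetical names, the typedef only mimics PostgreSQL's 64-bit XLogRecPtr, and the patch itself may well do this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins that mimic PostgreSQL's LSN type and its invalid value. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical, trimmed-down view of the LSN fields the readers share. */
typedef struct PreparedMsgLsns
{
	XLogRecPtr	final_lsn;
	XLogRecPtr	end_lsn;
} PreparedMsgLsns;

/*
 * Returns NULL if both LSNs are set, otherwise the name of the first
 * field that is missing. A real reader would raise an error with
 * elog(ERROR, ...) instead of returning a string.
 */
const char *
first_unset_lsn(const PreparedMsgLsns *msg)
{
	assert(msg != NULL);

	if (msg->final_lsn == InvalidXLogRecPtr)
		return "final_lsn";
	if (msg->end_lsn == InvalidXLogRecPtr)
		return "end_lsn";
	return NULL;
}
```

Calling the same helper from the begin-prepare, commit-prepared and rollback-prepared readers would give all three identical validation.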

@@ -170,5 +237,4 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
TransactionId subxid);
extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
TransactionId *subxid);
-
#endif /* LOGICAL_PROTO_H */
This change is not required.

Removed.

@@ -242,15 +244,16 @@ create_replication_slot:
$$ = (Node *) cmd;
}
/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-                       | K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+                       | K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
{
CreateReplicationSlotCmd *cmd;
cmd = makeNode(CreateReplicationSlotCmd);
cmd->kind = REPLICATION_KIND_LOGICAL;
cmd->slotname = $2;
cmd->temporary = $3;
-                                       cmd->plugin = $5;
-                                       cmd->options = $6;
+                                       cmd->two_phase = $4;
+                                       cmd->plugin = $6;
+                                       cmd->options = $7;
$$ = (Node *) cmd;
}
Should we document two_phase in the below section:
CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL [
RESERVE_WAL ] | LOGICAL output_plugin [ EXPORT_SNAPSHOT |
NOEXPORT_SNAPSHOT | USE_SNAPSHOT ] }
Create a physical or logical replication slot. See Section 27.2.6 for
more about replication slots.

Updated in protocol.sgml as well as the comment above.
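
For illustration, this is roughly how a client could assemble the command in the keyword order that the grammar change in this patch version accepts (TEMPORARY, then TWO_PHASE, then LOGICAL and the plugin name). It is a standalone sketch: format_create_logical_slot is a hypothetical helper, it is not libpqrcv_create_slot, and it does no identifier quoting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Formats
 *   CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] [ TWO_PHASE ] LOGICAL output_plugin
 * into buf. Returns the number of characters written (excluding the
 * terminating NUL), or -1 if buf is too small.
 */
int
format_create_logical_slot(char *buf, size_t buflen,
						   const char *slot_name, const char *plugin,
						   bool temporary, bool two_phase)
{
	int			n;

	assert(buf != NULL && slot_name != NULL && plugin != NULL);

	n = snprintf(buf, buflen, "CREATE_REPLICATION_SLOT %s%s%s LOGICAL %s",
				 slot_name,
				 temporary ? " TEMPORARY" : "",
				 two_phase ? " TWO_PHASE" : "",
				 plugin);

	/* snprintf reports the length it needed; treat truncation as failure */
	if (n < 0 || (size_t) n >= buflen)
		return -1;
	return n;
}
```

For example, passing slot name tap_sub, plugin pgoutput, temporary = false and two_phase = true produces the string "CREATE_REPLICATION_SLOT tap_sub TWO_PHASE LOGICAL pgoutput".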

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v55-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 4b0d31d9462d4c18a508be5461f2cc16894501a1 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 9 Mar 2021 04:38:22 -0500
Subject: [PATCH v55] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time to the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle two-phase
transactions by replaying them on prepare.

* The prepare API for streaming transactions is not supported.

We must, however, explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

* This patch also adds a new option to enable two_phase while creating a slot.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/protocol.sgml                         |  14 +-
 src/backend/access/transam/twophase.c              |  68 ++
 src/backend/commands/subscriptioncmds.c            |   2 +-
 .../libpqwalreceiver/libpqwalreceiver.c            |   6 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 206 ++++++
 src/backend/replication/logical/tablesync.c        | 180 ++++-
 src/backend/replication/logical/worker.c           | 767 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 157 ++++-
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/replication/logicalproto.h             |  69 +-
 src/include/replication/reorderbuffer.h            |  12 +
 src/include/replication/walreceiver.h              |   5 +-
 src/include/replication/worker_internal.h          |   3 +
 src/tools/pgindent/typedefs.list                   |   5 +
 18 files changed, 1438 insertions(+), 84 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..9694713 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,18 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase transactions.
+         Two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED
+         are also decoded and transmitted. In two-phase transactions, the transaction is
+         decoded and transmitted at PREPARE TRANSACTION time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char* buf;
+			TwoPhaseFileHeader* hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
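
(Reader's note, not part of the patch.) The matching rule that LookupGXact applies can be sketched in isolation as below. This is only a minimal illustration under stated assumptions: SketchGXact and sketch_lookup_gxact are hypothetical stand-ins, the real function also takes TwoPhaseStateLock and reads the 2PC file header to obtain origin_lsn and origin_timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a prepared-xact entry; not the real GlobalTransaction. */
typedef struct SketchGXact
{
	bool		valid;				/* entry fully set up? */
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* end-of-prepare LSN on the origin node */
	int64_t		origin_timestamp;	/* prepare time on the origin node */
} SketchGXact;

/*
 * Return true only if a valid entry matches on GID *and* origin LSN *and*
 * origin timestamp. A GID match alone is not enough, because a locally
 * prepared xact may happen to use the same GID.
 */
static bool
sketch_lookup_gxact(const SketchGXact *xacts, size_t n, const char *gid,
					uint64_t prepare_end_lsn, int64_t prepare_timestamp)
{
	size_t		i;

	assert(gid != NULL);

	for (i = 0; i < n; i++)
	{
		const SketchGXact *x = &xacts[i];

		/* Ignore not-yet-valid entries and entries with a different GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == prepare_timestamp)
			return true;
	}
	return false;
}
```

With entries ("tx1", 100, 5) and ("tx1", 300, 7), a lookup for ("tx1", 300, 7) succeeds, while ("tx1", 300, 8) does not, which is the behaviour the function comment describes.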
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..f6793f0 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..488b2a2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,212 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
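
(Reader's note, not part of the patch.) For anyone checking the reader/writer pairs above against each other, here is a minimal standalone sketch of the PREPARE message body as written by logicalrep_write_prepare: one flags byte, three 64-bit fields in network byte order, then the NUL-terminated GID. The message-type byte sent first is left out, since apply_dispatch consumes it. All sketch_* names and SKETCH_GIDSIZE (200, an assumed buffer size) are hypothetical. Unlike the strcpy calls in the patch, the reader here refuses a GID that does not fit the destination buffer, which is one possible way to bound the copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumed size of the pre-allocated gid buffer */

typedef struct SketchPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Store a 64-bit value most significant byte first (network byte order). */
static void
sketch_put_u64(uint8_t *buf, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t) (v >> (56 - 8 * i));
}

static uint64_t
sketch_get_u64(const uint8_t *buf)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
static size_t
sketch_write_prepare(uint8_t *buf, size_t buflen, const SketchPrepare *p)
{
	size_t		gidlen = strlen(p->gid) + 1;	/* include the NUL */
	size_t		need = 1 + 3 * 8 + gidlen;

	if (buflen < need)
		return 0;

	buf[0] = 0;					/* flags, currently always zero */
	sketch_put_u64(buf + 1, p->prepare_lsn);
	sketch_put_u64(buf + 9, p->end_lsn);
	sketch_put_u64(buf + 17, (uint64_t) p->prepare_time);
	memcpy(buf + 25, p->gid, gidlen);
	return need;
}

/* Returns 1 on success, 0 if the message is malformed or the gid is too long. */
static int
sketch_read_prepare(const uint8_t *buf, size_t len, SketchPrepare *p)
{
	const uint8_t *nul;
	size_t		gidlen;

	if (len < 1 + 3 * 8 + 1 || buf[0] != 0)
		return 0;				/* too short, or unrecognized flags */

	p->prepare_lsn = sketch_get_u64(buf + 1);
	p->end_lsn = sketch_get_u64(buf + 9);
	p->prepare_time = (int64_t) sketch_get_u64(buf + 17);
	if (p->prepare_lsn == 0 || p->end_lsn == 0)
		return 0;				/* zero plays the role of InvalidXLogRecPtr */

	nul = memchr(buf + 25, '\0', len - 25);
	if (nul == NULL)
		return 0;				/* gid is not terminated inside the message */
	gidlen = (size_t) (nul - (buf + 25)) + 1;
	if (gidlen > sizeof(p->gid))
		return 0;				/* would overflow the destination buffer */
	memcpy(p->gid, buf + 25, gidlen);
	return 1;
}
```

Writing a message and reading it back should return the same three fields and GID; truncating the buffer by one byte removes the terminator and makes the reader reject it.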
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..97fc399 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1052,7 +1026,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs that have not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When process_syncing_tables_for_apply changes the state from
+		 * SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN among all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
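
(Reader's note, not part of the patch.) The two helpers above feed the decision that apply_handle_begin_prepare makes: first wait until no tablesync is still busy, then spool the prepared messages if the begin-prepare LSN lies behind the largest tablesync LSN. A minimal sketch of that decision follows, under stated assumptions: the SketchState values and all sketch_* names are hypothetical and only mirror the SUBREL_STATE_* checks in the patch; the real code works on the cached table_states lists and loops on a latch while waiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of per-table sync states, in the order they are reached. */
typedef enum SketchState
{
	SKETCH_INIT,
	SKETCH_DATASYNC,
	SKETCH_SYNCDONE,
	SKETCH_READY
} SketchState;

typedef struct SketchRel
{
	SketchState state;
	uint64_t	lsn;			/* LSN the tablesync reached; 0 if none yet */
} SketchRel;

/* Is any table still short of SYNCDONE/READY? */
static bool
sketch_any_in_progress(const SketchRel *rels, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (rels[i].state != SKETCH_SYNCDONE && rels[i].state != SKETCH_READY)
			return true;
	}
	return false;
}

/* Largest LSN over all tables; 0 when there are none. */
static uint64_t
sketch_biggest_lsn(const SketchRel *rels, size_t n)
{
	uint64_t	biggest = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (rels[i].lsn > biggest)
			biggest = rels[i].lsn;
	}
	return biggest;
}

/*
 * Part 2 of the logic: spool only if some tablesync already went past the
 * begin-prepare LSN. The caller must have finished part 1 (waiting) first.
 */
static bool
sketch_should_spool(const SketchRel *rels, size_t n, uint64_t begin_final_lsn)
{
	assert(!sketch_any_in_progress(rels, n));
	return begin_final_lsn < sketch_biggest_lsn(rels, n);
}
```

For tables at LSNs 100 (READY) and 250 (SYNCDONE), a begin-prepare at LSN 200 would be spooled, while one at LSN 250 or later would be applied directly.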
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 18d0528..1cdfc91 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -59,6 +59,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -208,6 +209,54 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. It
+ * maps the name of the prepared spoolfile to the bookkeeping kept for that
+ * file (currently, whether it may be deleted at proc-exit).
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	bool		allow_delete;	/* ok to delete? */
+} PsfHashEntry;
+
+/*
+ * Information about the "current" psf spoolfile.
+ */
+typedef struct PsfFile
+{
+	char	name[MAXPGPATH];/* psf name - same as the HTAB key. */
+	bool	is_spooling;	/* are we currently spooling to this file? */
+	File 	vfd;			/* -1 when the file is closed. */
+	off_t	cur_offset;		/* offset for the next write or read. Reset to 0
+							 * when file is opened. */
+} PsfFile;
+
+/*
+ * Hash table for storing the Prepared spoolfile info along with shared fileset.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * Information about the 'current' open spoolfile is only valid when spooling.
+ * This is flagged as 'is_spooling' only between begin_prepare and prepare.
+ */
+static PsfFile psf_cur = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_delete(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+static void prepare_spoolfile_on_proc_exit(int status, Datum arg);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -720,6 +769,338 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				errmsg("transaction identifier \"%s\" is already in use",
+					   begin_data.gid)));
+
+	/*
+	 * Through unlucky timing of the apply/tablesync workers it is possible to
+	 * have a "consistent snapshot" that spans prepare/commit in such a way
+	 * that the tablesync did not do the prepare (because its snapshot was not
+	 * yet consistent) and the apply worker does the begin prepare ('b') but
+	 * skips all the prepared operations [e.g. inserts] while the tablesync was
+	 * still busy (see the condition of should_apply_changes_for_rel).
+	 *
+	 * This can lead to an "empty prepare", because later when the apply
+	 * worker does the commit prepared ('K'), there is nothing in it (the
+	 * inserts were skipped earlier).
+	 *
+	 * We avoid this using two-part logic: (1) Wait for all tablesync workers
+	 * to reach SYNCDONE/READY state; (2) If the begin_prepare lsn is now
+	 * behind any tablesync lsn then spool the prepared messages to a file
+	 * to be replayed later at commit_prepared time.
+	 *
+	 * -----
+	 *
+	 * XXX - The 2PC protocol needs the publisher to be aware when the PREPARE
+	 * has been successfully acted on. But because of this "empty prepare"
+	 * case the prepared messages may now be spooled to a file and, when
+	 * that happens, the PREPARE would not happen at the usual time but would
+	 * be deferred until COMMIT PREPARED time. This quirk can only happen
+	 * immediately after the initial table synchronization phase; once all
+	 * tables have achieved READY state the 2PC protocol will behave normally.
+	 *
+	 * A future release may be able to detect when all tables are READY and set
+	 * a flag to indicate this subscription/slot is ready for two_phase
+	 * decoding. Then at the publisher-side, we could enable wait-for-prepares
+	 * only when all the slots of WALSender have that flag set.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN, then all messages (until
+		 * prepare) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+
+			/*
+			 * Create the spoolfile.
+			 */
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * From now until apply_handle_prepare, we are spooling to the
+			 * current psf.
+			 */
+			psf_cur.is_spooling = true;
+		}
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_cur.is_spooling)
+	{
+		PsfHashEntry *hentry;
+
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * Flush the spoolfile, so changes can survive a restart.
+		 *
+		 * If the publisher resends the same data again after a restart (e.g.
+		 * if subscriber origin has not moved past this prepare), then the same
+		 * named psf file will be overwritten with the same data. See
+		 * prepare_spoolfile_create.
+		 */
+		FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);
+
+		/* We are finished spooling to the current psf. */
+		psf_cur.is_spooling = false;
+
+		/*
+		 * The commit_prepare will need the spoolfile, so unregister it for
+		 * removal on proc-exit just in case there is an unexpected restart
+		 * between now and when commit_prepared happens.
+		 */
+		hentry = (PsfHashEntry *) hash_search(psf_hash, psf_cur.name,
+											  HASH_FIND, NULL);
+		Assert(hentry);
+		hentry->allow_delete = false;
+
+		/*
+		 * The psf_cur.vfd is meaningful only between begin_prepare and prepare.
+		 * So close it now. Any messages written to the psf will be applied
+		 * later during handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		in_remote_transaction = false;
+		return;
+	}
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	/*
+	 * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+	 * was dispatched via the psf spoolfile replay then the remote_final_lsn
+	 * is set to commit lsn instead. Hence the <= instead of == check below.
+	 */
+	Assert(prepare_data.prepare_lsn <= remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+
+		/*
+		 * Replay/dispatch the spooled messages.
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		/* After replaying the psf it is no longer needed. Just delete it. */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		/* We are finished with this spoolfile. Delete it. */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
+	 */
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -732,6 +1113,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		!psf_cur.is_spooling &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1150,6 +1532,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1271,6 +1656,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1429,6 +1817,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1798,6 +2189,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1954,6 +2348,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2061,6 +2477,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	TimeLineID	tli;
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL		hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2180,7 +2613,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && !psf_cur.is_spooling)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2927,6 +3360,9 @@ ApplyWorkerMain(Datum main_arg)
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
 
+	/* Arrange to delete any unwanted psf file(s) at proc-exit */
+	on_proc_exit(prepare_spoolfile_on_proc_exit, 0);
+
 	/* Setup signal handling */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
@@ -3103,3 +3539,332 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_cur.is_spooling ? "Do" : "Don't");
+
+	if (!psf_cur.is_spooling)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	bool		file_found;
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(!psf_cur.is_spooling);
+
+	/* check if the file already exists. */
+	file_found = prepare_spoolfile_exists(path);
+
+	if (!file_found)
+	{
+		elog(DEBUG1, "File \"%s\" not found. Create it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m", path)));
+	}
+	else
+	{
+		/*
+		 * Open the existing file with O_TRUNC because we always want to
+		 * create/overwrite this file, starting from offset 0.
+		 */
+		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+		if (psf_cur.vfd < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", path)));
+	}
+
+	/* Create/Find the spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash, path,
+											  HASH_ENTER, NULL);
+	Assert(hentry);
+	memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+	psf_cur.cur_offset = 0;
+	hentry->allow_delete = true;
+
+	/* Sanity checks */
+	Assert(psf_cur.vfd >= 0);
+	Assert(psf_cur.cur_offset == 0);
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close(void)
+{
+	if (psf_cur.vfd >= 0)
+		FileClose(psf_cur.vfd);
+
+	/* Mark this fd as not valid to use anymore. */
+	psf_cur.is_spooling = false;
+	psf_cur.vfd = -1;
+	psf_cur.cur_offset = 0;
+}
+
+/*
+ * Delete the specified psf spoolfile, and any psf_hash entry associated with it.
+ */
+static void
+prepare_spoolfile_delete(char *path)
+{
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Delete the file off the disk. */
+	unlink(path);
+
+	/* Remove any entry from the psf_hash, if present */
+	hash_search(psf_hash, path, HASH_REMOVE, NULL);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+	int			bytes_written;
+
+	Assert(psf_cur.is_spooling);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	bytes_written = FileWrite(psf_cur.vfd, (char *) &len, sizeof(len),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(len));
+	psf_cur.cur_offset += bytes_written;
+
+	/* then the action */
+	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(action));
+	psf_cur.cur_offset += bytes_written;
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == len);
+	psf_cur.cur_offset += bytes_written;
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	bool		found;
+
+	File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+	found = fd >= 0;
+	if (fd >= 0)
+		FileClose(fd);
+
+	return found;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	psf.vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (psf.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open prepared spoolfile \"%s\": %m",
+						path)));
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Make sure the apply_dispatch handler functions are aware we're in a
+	 * remote transaction.
+	 */
+	remote_final_lsn = final_lsn;
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		nbytes = FileRead(psf.vfd, buffer, len,
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+		if (nbytes != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldctx2);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file \"%s\"",
+				 nchanges, path);
+	}
+
+	FileClose(psf.vfd);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB.
+	 *
+	 * Therefore, the name and the key must be exactly the same length and
+	 * padded with '\0' so garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "pg_twophase/%u-%s.prep_changes", subid, gid);
+}
+
+/*
+ * proc_exit callback to remove unwanted psf files.
+ */
+static void
+prepare_spoolfile_on_proc_exit(int status, Datum arg)
+{
+	HASH_SEQ_STATUS seq_status;
+	PsfHashEntry *hentry;
+
+	/* Iterate the HTAB looking for files that can be deleted. */
+	if (psf_hash)
+	{
+		hash_seq_init(&seq_status, psf_hash);
+		while ((hentry = (PsfHashEntry *) hash_seq_search(&seq_status)) != NULL)
+		{
+			char *path = hentry->name;
+
+			if (hentry->allow_delete)
+				prepare_spoolfile_delete(path);
+		}
+	}
+}
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2e4b39f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -322,8 +342,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +362,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +383,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool        send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +843,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1250,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..232af01 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure is used to
+ * hold the information for both PREPARE and COMMIT PREPARED. For COMMIT
+ * PREPARED, prepare_lsn and preparetime store the commit LSN and the
+ * commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime is used to check whether the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95ae..4ffcef5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
@@ -1955,6 +1958,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
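
For readers following the spoolfile part of the patch above: prepare_spoolfile_write() stores each message as a length, an action byte and the payload, and prepare_spoolfile_replay_messages() reads the records back in order. Below is a minimal standalone sketch of that record layout. It assumes plain stdio instead of the backend's File/FileWrite/FileRead API, and the names spool_write_record, spool_replay, spool_apply_fn and collect_action are hypothetical, not part of the patch. Unlike the patch, which hands the whole record to apply_dispatch(), the sketch splits off the action byte before invoking the callback.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the apply worker's per-message dispatch. */
typedef void (*spool_apply_fn) (char action, const char *payload,
								int payload_len, void *arg);

/*
 * Append one record: int length (action byte + payload, not counting the
 * length field itself), then the action byte, then the payload.
 * Returns 0 on success, -1 on a short write.
 */
static int
spool_write_record(FILE *fp, char action, const char *payload, int payload_len)
{
	int			len = payload_len + (int) sizeof(char);

	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Replay every record from the start of the file, in order.
 * Returns the number of records replayed, or -1 if a record is truncated.
 */
static int
spool_replay(FILE *fp, spool_apply_fn apply, void *arg)
{
	char	   *buffer = NULL;
	int			nchanges = 0;

	rewind(fp);
	for (;;)
	{
		int			len;
		char	   *tmp;
		size_t		nbytes = fread(&len, 1, sizeof(len), fp);

		if (nbytes == 0)
			break;				/* clean end of file at a record boundary */
		if (nbytes != sizeof(len) || len <= 0)
			goto fail;			/* partial or nonsensical length field */

		tmp = realloc(buffer, (size_t) len);
		if (tmp == NULL)
			goto fail;
		buffer = tmp;

		if (fread(buffer, 1, (size_t) len, fp) != (size_t) len)
			goto fail;			/* record body is truncated */

		apply(buffer[0], buffer + 1, len - 1, arg);
		nchanges++;
	}
	free(buffer);
	return nchanges;

fail:
	free(buffer);
	return -1;
}

/* Example callback: append each action code to a NUL-terminated string. */
static void
collect_action(char action, const char *payload, int payload_len, void *arg)
{
	char	   *seen = arg;
	size_t		n = strlen(seen);

	(void) payload;
	(void) payload_len;
	seen[n] = action;
	seen[n + 1] = '\0';
}
```

Writing three records with actions 'I', 'U' and 'P' to a tmpfile() and then calling spool_replay(fp, collect_action, seen) on a zeroed char buffer should invoke the callback three times in write order. Note that the sketch treats a short read as an error rather than relying on an Assert, since a spoolfile can be left truncated by a crash.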

v55-0004-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From a40b5e9b8abc5de3ea84ff0da08f84e1d7a63c11 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 9 Mar 2021 04:45:10 -0500
Subject: [PATCH v55] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch only adds some developer logging, which may help when
debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 ++++++++++---
 src/backend/replication/logical/worker.c    | 66 ++++++++++++++++++++++-------
 2 files changed, 74 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 97fc399..f3984d4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 34e18a8..995222a 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -828,14 +828,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			process_syncing_tables(begin_data.final_lsn);
 
 			/* This latch is to prevent 100% CPU looping. */
@@ -853,7 +855,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 
@@ -895,6 +902,8 @@ apply_handle_prepare(StringInfo s)
 	{
 		PsfHashEntry *hentry;
 
+		elog(LOG, "!!> apply_handle_prepare: SPOOLING");
+
 		/* Write the PREPARE info to the psf file. */
 		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
 
@@ -916,6 +925,8 @@ apply_handle_prepare(StringInfo s)
 		 * removal on proc-exit just in case there is an unexpected restart
 		 * between now and when commit_prepared happens.
 		 */
+		elog(LOG,
+			"!!> apply_handle_prepare: Make sure the spoolfile is not removed on proc-exit");
 		hentry = (PsfHashEntry *) hash_search(psf_hash, psf_cur.name,
 											  HASH_FIND, NULL);
 		Assert(hentry);
@@ -1000,6 +1011,8 @@ apply_handle_commit_prepared(StringInfo s)
 	{
 		int			nchanges;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * Replay/dispatch the spooled messages.
 		 */
@@ -1007,8 +1020,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		/* After replaying the psf it is no longer needed. Just delete it. */
@@ -1072,6 +1085,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2350,18 +2364,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3555,8 +3573,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_cur.is_spooling ? "Do" : "Don't");
 
@@ -3580,7 +3598,7 @@ prepare_spoolfile_create(char *path)
 	bool		file_found;
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(!psf_cur.is_spooling);
 
@@ -3589,7 +3607,7 @@ prepare_spoolfile_create(char *path)
 
 	if (!file_found)
 	{
-		elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+		elog(LOG, "!!> Not found file \"%s\". Create it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 			ereport(ERROR,
@@ -3602,7 +3620,7 @@ prepare_spoolfile_create(char *path)
 		 * Open the file and seek to the beginning because we always want to
 		 * create/overwrite this file.
 		 */
-		elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+		elog(LOG, "!!> Found file \"%s\". Overwrite it.", path);
 		psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
 		if (psf_cur.vfd < 0)
 			ereport(ERROR,
@@ -3630,6 +3648,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_cur.vfd >= 0)
 		FileClose(psf_cur.vfd);
 
@@ -3645,6 +3664,8 @@ prepare_spoolfile_close()
 static void
 prepare_spoolfile_delete(char *path)
 {
+	elog(LOG, "!!> prepare_spoolfile_delete: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3670,18 +3691,20 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_cur.is_spooling);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(len));
 	psf_cur.cur_offset += bytes_written;
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(action));
@@ -3690,6 +3713,7 @@ prepare_spoolfile_write(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == len);
@@ -3710,6 +3734,11 @@ prepare_spoolfile_exists(char *path)
 	if (fd >= 0)
 		FileClose(fd);
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 found ? "found" : "not found");
+
 	return found;
 }
 
@@ -3726,8 +3755,8 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 				oldctx2;
 	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3772,6 +3801,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3790,6 +3820,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		nbytes = FileRead(psf.vfd, buffer, len,
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
@@ -3806,7 +3837,9 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		/* Ensure we are reading the data into our memory context. */
 		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
 
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: Before dispatch");
 		apply_dispatch(&s2);
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: After dispatch");
 
 		MemoryContextReset(ApplyMessageContext);
 
@@ -3815,13 +3848,13 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(LOG, "!!> replayed %d changes from file '%s'",
 				 nchanges, path);
 	}
 
 	FileClose(psf.vfd);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
@@ -3857,6 +3890,8 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 	HASH_SEQ_STATUS seq_status;
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_on_proc_exit");
+
 	/* Iterate the HTAB looking for what file can be deleted. */
 	if (psf_hash)
 	{
@@ -3865,6 +3900,7 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 		{
 			char *path = hentry->name;
 
+			elog(LOG, "!!> prepare_spoolfile_on_proc_exit: found '%s'", path);
 			if (hentry->allow_delete)
 				prepare_spoolfile_delete(path);
 		}
-- 
1.8.3.1
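
For readers following the spoolfile logging in the patch above: prepare_spoolfile_write() frames each message as a length, then the action byte, then the rest of the buffer, and prepare_spoolfile_replay_messages() reads the length first and then that many bytes. Below is a minimal standalone sketch of that framing. It uses plain stdio rather than the backend's FileWrite()/FileRead(), and the names psf_write_record, psf_replay and psf_demo are hypothetical, made up for illustration only; they are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the patch): append one record laid out as
 * [int len][char action][payload], where len counts the action byte plus
 * the payload, mirroring the order of writes in prepare_spoolfile_write().
 */
int
psf_write_record(FILE *fp, char action, const char *payload, int payload_len)
{
	int			len;

	assert(payload_len >= 0);

	/* total size following the length field, including the action byte */
	len = payload_len + (int) sizeof(char);

	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Hypothetical helper: read records until end of file. Returns the number
 * of records read, or -1 on a truncated or malformed record. The action
 * byte of each record is stored in actions[] (up to max_actions entries).
 */
int
psf_replay(FILE *fp, char *actions, int max_actions)
{
	int			nchanges = 0;
	char	   *buffer = NULL;

	for (;;)
	{
		int			len;
		char	   *tmp;
		size_t		nread = fread(&len, 1, sizeof(len), fp);

		/* zero bytes read means we reached the end of the file cleanly */
		if (nread == 0)
			break;
		if (nread != sizeof(len) || len < 1)
		{
			free(buffer);
			return -1;
		}

		/* grow the buffer to hold the action byte plus the payload */
		tmp = realloc(buffer, (size_t) len);
		if (tmp == NULL)
		{
			free(buffer);
			return -1;
		}
		buffer = tmp;

		if (fread(buffer, 1, (size_t) len, fp) != (size_t) len)
		{
			free(buffer);
			return -1;
		}

		/* the first byte of every record is the action type */
		if (nchanges < max_actions)
			actions[nchanges] = buffer[0];
		nchanges++;
	}

	free(buffer);
	return nchanges;
}

/* Hypothetical demo: write two records to a temp file and replay them. */
int
psf_demo(char *actions, int max_actions)
{
	FILE	   *fp = tmpfile();
	int			n;

	if (fp == NULL)
		return -1;
	if (psf_write_record(fp, 'I', "row-1", 5) != 0 ||
		psf_write_record(fp, 'P', "", 0) != 0)
	{
		fclose(fp);
		return -1;
	}
	rewind(fp);
	n = psf_replay(fp, actions, max_actions);
	fclose(fp);
	return n;
}
```

Calling psf_demo() with a small char array writes an 'I' record with a five-byte payload and an empty 'P' record, then replays both. The real code differs in that it tracks cur_offset explicitly and dispatches each record through apply_dispatch(); the sketch only illustrates the on-disk framing.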

Attachment: v55-0003-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 4a107f75ae786e06473dfca0a7539eb1b060b6db Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 9 Mar 2021 04:44:23 -0500
Subject: [PATCH v55] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/create_subscription.sgml          | 27 +++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 72 +++++++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 +
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 ++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 +++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 ++-
 src/bin/psql/tab-complete.c                        |  2 +-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 +
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 93 +++++++++++++++-------
 src/test/regress/sql/subscription.sql              | 25 ++++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 18 files changed, 260 insertions(+), 48 deletions(-)

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..eeb7e35 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,33 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..b77378d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f6793f0..96fcf49 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so
+			 * could cause transactions to be missed and lead to an
+			 * inconsistent replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported
+	 * together, the current implementation has issues that could cause a
+	 * streamed prepared transaction to be incorrectly missed during the
+	 * initial syncing phase. Hence, this combination is disabled until that
+	 * issue can be addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +576,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +883,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +918,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if (sub->twophase && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +947,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +993,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1039,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1cdfc91..34e18a8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2872,6 +2872,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3521,6 +3522,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2e4b39f..91ecc55 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default,
+		 * in which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient protocol version, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..96c878b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode and streaming and two_phase are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 9f0208a..34c70a1 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2763,7 +2763,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 232af01..a5bb4de 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..67b3358 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index 9aa483c..d56789d 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
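
For anyone reading the subscription.out changes above without the C side at hand, here is a minimal sketch of the option rules those expected-output lines exercise. It is not the patch's code: the struct and function names (SubOptsSketch, check_create_options, check_alter_options) are made up for illustration, and only the error strings are copied from the regression output. It also assumes that any attempt to SET two_phase is rejected, whereas the test only shows the two_phase = false case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a subscription's options; not the patch's struct. */
typedef struct SubOptsSketch
{
	bool		streaming;
	bool		twophase;
} SubOptsSketch;

/*
 * CREATE SUBSCRIPTION check: returns NULL when the combination is accepted,
 * otherwise the error text shown in the expected output.
 */
const char *
check_create_options(const SubOptsSketch *opts)
{
	if (opts->twophase && opts->streaming)
		return "two_phase = true and streaming = true are mutually exclusive options";
	return NULL;
}

/*
 * ALTER SUBSCRIPTION ... SET (...) check against the current options.
 * sets_twophase: the statement mentions two_phase at all (assumed to be
 * rejected regardless of the value given).
 * sets_streaming / new_streaming: the statement mentions streaming, and the
 * value it asks for.
 */
const char *
check_alter_options(const SubOptsSketch *current,
					bool sets_twophase,
					bool sets_streaming, bool new_streaming)
{
	if (sets_twophase)
		return "cannot alter two_phase option";
	if (sets_streaming && new_streaming && current->twophase)
		return "cannot set streaming = true for two-phase enabled subscription";
	return NULL;
}
```

With a subscription created as two_phase = true, check_alter_options(&cur, false, true, true) returns the "cannot set streaming" message, while setting streaming = false or changing an unrelated option such as slot_name passes through, which matches the sequence in the regression test.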

v55-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 7cdf1b03b2232bd81c888eaa202e1e48d653fab3 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 9 Mar 2021 04:43:33 -0500
Subject: [PATCH v55] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 332 ++++++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl | 282 ++++++++++++++++++++
 2 files changed, 614 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..9aa483c
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,332 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#241Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#240)

On Tue, Mar 9, 2021 at 3:22 PM Ajin Cherian <itsajin@gmail.com> wrote:

Few comments:
==================

1.
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *

I think the last sentence ("If needed, this is the ...") in the above
comment is not required.

2.
+prepare_spoolfile_exists(char *path)
+{
+ bool found;
+
+ File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+ found = fd >= 0;
+ if (fd >= 0)
+ FileClose(fd);

Can we avoid the bool variable in the above code, with something like the following?

File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);

if (fd >= 0)
{
FileClose(fd);
return true;
}

return false;

3. In prepare_spoolfile_replay_messages(), it is better to free the
memory allocated for temporary strings buffer and s2.

4.
+ /* check if the file already exists. */
+ file_found = prepare_spoolfile_exists(path);
+
+ if (!file_found)
+ {
+ elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+ psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+ if (psf_cur.vfd < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", path)));
+ }
+ else
+ {
+ /*
+ * Open the file and seek to the beginning because we always want to
+ * create/overwrite this file.
+ */
+ elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+ psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+ if (psf_cur.vfd < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+ }

Here, whether or not the file exists, you open it with the same flags.
That seems correct to me, but the code looks a bit odd. Why do we even
bother to check whether the file exists in this case? If it is only for
the DEBUG message, I am not sure that is worth it. I am also wondering
why not have a function prepare_spoolfile_open, similar to *_close, and
call it from all the places, with a mode that indicates whether you want
to create or open the file.

5. I think prepare_spoolfile_close can be extended to take PsfFile as
input and then it can be also used from
prepare_spoolfile_replay_messages.

6. I think we should also write some commentary about prepared
transactions atop worker.c, as we have done for streamed
transactions.
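
Points 2 and 4 above can be illustrated with a small standalone sketch. It is not the patch's code: it uses plain POSIX open(2) in place of PathNameOpenFile() (and drops PG_BINARY) so that it compiles outside the backend, and the names psf_open, psf_exists and PsfOpenMode are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical mode selector; not part of the patch. */
typedef enum PsfOpenMode
{
	PSF_OPEN_CREATE,			/* create or truncate, for spooling */
	PSF_OPEN_READ				/* open an existing file, for replay */
} PsfOpenMode;

/*
 * Stand-in for the suggested prepare_spoolfile_open(). The caller states
 * its intent through "mode", so no prior existence check is needed.
 * Returns a file descriptor, or -1 on failure (errno is set by open).
 */
int
psf_open(const char *path, PsfOpenMode mode)
{
	assert(path != NULL);

	if (mode == PSF_OPEN_CREATE)
		return open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	return open(path, O_RDONLY);
}

/* Existence check without the extra bool variable, as in point 2. */
bool
psf_exists(const char *path)
{
	int			fd = psf_open(path, PSF_OPEN_READ);

	if (fd >= 0)
	{
		close(fd);
		return true;
	}

	return false;
}
```

With a single open function, the found/not-found branches in point 4 collapse into one call with the create mode, because O_CREAT | O_TRUNC gives the same result either way.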

--
With Regards,
Amit Kapila.

#242Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#241)
2 attachment(s)

I ran a cascaded setup of 5 pub-sub servers on the latest patchset. The
test starts pgbench on the first server and waits until the data on the
fifth server matches the first.
This is based on a test script created by Erik Rijkers. The tests run fine
and the 5th server achieves data consistency in around a minute.
I am attaching the test script and the test run log.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

erik_5_cascade.sh (text/x-sh)
erik_5_cascade_run.log (application/octet-stream)
#243Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#241)

On Tue, Mar 9, 2021 at 9:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 9, 2021 at 3:22 PM Ajin Cherian <itsajin@gmail.com> wrote:

Few comments:
==================

3. In prepare_spoolfile_replay_messages(), it is better to free the
memory allocated for temporary strings buffer and s2.

I guess this was suggested because it is what the
apply_handle_stream_commit() function does for very similar code.
But the same approach cannot work for the *_replay_messages() function,
because those buffers are allocated in TopTransactionContext and are
already freed as a side-effect when the last psf message (the
LOGICAL_REP_MSG_PREPARE) is replayed/dispatched, which ends the
transaction. So attempting to free them again causes a segmentation
violation (I already fixed this exact problem last week, when the pfree
calls were still in the code).
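
To illustrate the lifetime issue described above, here is a toy sketch. It is not PostgreSQL's memory context implementation; ToyContext, toy_alloc and toy_reset are hypothetical names, and the chunks are only meant to hold byte strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every allocation; links it into its context. */
typedef struct ToyChunk
{
	struct ToyChunk *next;
} ToyChunk;

/* Toy stand-in for a memory context such as TopTransactionContext. */
typedef struct ToyContext
{
	ToyChunk   *chunks;			/* every live allocation in this context */
	size_t		nchunks;
} ToyContext;

/* Allocate "size" bytes that belong to the context. */
void *
toy_alloc(ToyContext *cxt, size_t size)
{
	ToyChunk   *chunk = malloc(sizeof(ToyChunk) + size);

	assert(chunk != NULL);
	chunk->next = cxt->chunks;
	cxt->chunks = chunk;
	cxt->nchunks++;
	return chunk + 1;			/* usable memory starts after the header */
}

/*
 * Release everything the context owns in one go. This is the analog of
 * what happens when the transaction ends: every pointer handed out by
 * toy_alloc() is dangling afterwards and must not be freed again.
 */
void
toy_reset(ToyContext *cxt)
{
	while (cxt->chunks != NULL)
	{
		ToyChunk   *next = cxt->chunks->next;

		free(cxt->chunks);
		cxt->chunks = next;
	}
	cxt->nchunks = 0;
}
```

In this picture, buffer and s2 are two toy_alloc() results, and dispatching the final PREPARE message is the toy_reset(). An explicit free of either pointer after that point would be a double free.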

5. I think prepare_spoolfile_close can be extended to take PsfFile as
input and then it can be also used from
prepare_spoolfile_replay_messages.

No, the *_close() is intended only for ending the "current" psf (the
global psf_cur) which was being spooled. The function comment says the
same. The *_close() is paired with the *_create() which created the
psf_cur.

Whereas, the replay fd closed at commit time is just a locally opened fd
unrelated to psf_cur. This close is deliberately self-contained in the
*_replay_messages() function, which is not dissimilar to what the
other streaming spool file code does - e.g. notice that the
apply_handle_stream_commit function simply closes its own fd using
BufFileClose; it doesn't delegate to stream_close_file() to do it.
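
The separation described above can be sketched in standalone C. This is only an analogy using stdio, not the worker.c code; spool_cur plays the role of psf_cur, all function names are hypothetical, and messages are assumed to be short single lines.

```c
#include <assert.h>
#include <stdio.h>

/* Analog of the global psf_cur: the one spool file currently being written. */
FILE	   *spool_cur = NULL;

/* Begin spooling; pairs with spool_close(). Returns 0 on success, -1 on failure. */
int
spool_create(const char *path)
{
	assert(spool_cur == NULL);	/* only one "current" spool at a time */
	spool_cur = fopen(path, "wb");
	return spool_cur != NULL ? 0 : -1;
}

/* Append one message as a line to the current spool file. */
int
spool_write(const char *msg)
{
	assert(spool_cur != NULL);
	return fprintf(spool_cur, "%s\n", msg) < 0 ? -1 : 0;
}

/* End spooling of the current file only; it knows nothing about replay. */
void
spool_close(void)
{
	if (spool_cur != NULL)
	{
		fclose(spool_cur);
		spool_cur = NULL;
	}
}

/*
 * Replay: opens its own local handle, reads every line, and closes that
 * handle itself. Returns the number of messages read, or -1 if the file
 * cannot be opened.
 */
int
spool_replay(const char *path)
{
	FILE	   *fp = fopen(path, "rb");
	char		line[256];
	int			nmsgs = 0;

	if (fp == NULL)
		return -1;
	while (fgets(line, sizeof(line), fp) != NULL)
		nmsgs++;
	fclose(fp);
	return nmsgs;
}
```

Here spool_replay() never touches spool_cur, so it can run at commit time regardless of whether another transaction is being spooled.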

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#244Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#234)
4 attachment(s)

Please find attached the latest patch set v56*

Differences from v55* are:

* Rebased to HEAD @ today

* Addresses the following feedback issues:

(35) [ak-0307] Skipped. Suggestion to replace HTAB with List for
tracking what psf files to delete at proc-exit. Although the idea had
merit at the time, it turned out that, due to a separate bugfix from a
colleague, we also need to know the count of psf files still to be
replayed. This count is easy to get from the existing HTAB entries, but
a List entry would already have been deleted at prepare time, so the
same could not be done easily with a List. So we will keep HTAB instead
of List for now.

(44) [vc-0308] Fixed. Add CHECK_FOR_INTERRUPTS() to apply worker loop.

(51) [ak-0308] Fixed. New location for psf files "pg_logical/twophase".

(54) [vc-0309] Fixed. Change rollback test description text.

(55) [ak-0309] Fixed. Change to comment text of prepare_spoolfile_handler.

(56) [ak-0309] Fixed. Remove boolean variable from prepare_spoolfile_exists.

(57) [ak-0309] Skipped. Suggestion to pfree memory; it is already freed.

(58) [ak-0309] Fixed. Common code for found/not-found psf at *_create() time.

(59) [ak-0309] Skipped. Suggestion to use *_close() from *_replay();
not compatible with intent.

(60) [ak-0309] Fixed. General comment about PSF added top of worker.c

-----
[vc-0308] /messages/by-id/CALDaNm29gOsCUtNkvHgqbbD1kbM8m67h4AqfmUWG1oTnfuPFxA@mail.gmail.com
[vc-0309] /messages/by-id/CALDaNm0QuncAis5OqtjzOxAPTZRn545JLqfjFEJwyRjUH-XvEw@mail.gmail.com
[ak-0307] /messages/by-id/CAA4eK1+dO07RrQwfHAK5jDP9qiXik4-MVzy+coEG09shWTJFGg@mail.gmail.com
[ak-0308] /messages/by-id/CAA4eK1+oSUU77T92FueDJWsp=FjTroNaNC-K45Dgdr7f18aBFA@mail.gmail.com
[ak-0309] /messages/by-id/CAA4eK1Jra658uuT8zo1DcZLzpNvo4oeorMcCuSeyY2zvr3_KBA@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v56-0003-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 1723729e023e81640b4a82a5da6102e39d31cd3a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Mar 2021 22:14:51 +1100
Subject: [PATCH v56] Support 2PC txn - Subscription option.

This patch implements new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/create_subscription.sgml          | 27 +++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 72 +++++++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 +
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 ++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 +++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 ++-
 src/bin/psql/tab-complete.c                        |  2 +-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 +
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 93 +++++++++++++++-------
 src/test/regress/sql/subscription.sql              | 25 ++++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 18 files changed, 260 insertions(+), 48 deletions(-)

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..eeb7e35 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,33 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on publisher is decoded as normal transaction at commit.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..b77378d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f6793f0..96fcf49 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option, this could
+			 * cause missing of transactions and lead to an inconsistent
+			 * replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * the current implementation has some issues that could lead to a
+	 * streaming prepared transaction to be incorrectly missed in the initial
+	 * syncing phase. Hence, disabling this combination till that issue can
+	 * be addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +576,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +883,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +918,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if (sub->twophase && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +947,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +993,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1039,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9eccc25..057d471 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2925,6 +2925,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3574,6 +3575,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2e4b39f..91ecc55 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in decoding context. Otherwise
+		 * we only allow it with sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..96c878b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two_phase are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 9f0208a..34c70a1 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2763,7 +2763,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 232af01..a5bb4de 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..67b3358 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index ab41d2d..aa3455b 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
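
For readers skimming the regression tests in the patch above, here is a minimal sketch of the option rules that the expected output exercises. It is not the code from subscriptioncmds.c. The struct and function names (SubOptions, check_create_options, check_alter_options) are hypothetical, and the order of the checks is an assumption. Only the error strings are taken from subscription.out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two subscription options involved. */
typedef struct SubOptions
{
	bool		streaming;
	bool		two_phase;
} SubOptions;

/*
 * Check the options given to CREATE SUBSCRIPTION. Returns NULL when the
 * combination is accepted, otherwise the error text from subscription.out.
 */
static const char *
check_create_options(const SubOptions *opts)
{
	assert(opts != NULL);

	if (opts->two_phase && opts->streaming)
		return "two_phase = true and streaming = true are mutually exclusive options";
	return NULL;
}

/*
 * Check ALTER SUBSCRIPTION ... SET (...). The *_given flags say whether the
 * option appeared in the SET list at all. The order of the two checks is an
 * assumption made for this sketch.
 */
static const char *
check_alter_options(const SubOptions *current,
					bool streaming_given, bool new_streaming,
					bool two_phase_given)
{
	assert(current != NULL);

	/* two_phase can only be chosen at CREATE time. */
	if (two_phase_given)
		return "cannot alter two_phase option";

	/* streaming cannot be turned on for a two-phase enabled subscription. */
	if (streaming_given && new_streaming && current->two_phase)
		return "cannot set streaming = true for two-phase enabled subscription";

	return NULL;
}
```

With these rules, setting streaming = false on a two-phase enabled subscription is still accepted, which matches the tests only rejecting streaming = true.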

v56-0004-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From 43d6696e8dae0ff45ceda9ce1faed4013688e7b1 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Mar 2021 22:51:28 +1100
Subject: [PATCH v56] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch only adds some developer logging which may help with
debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 +++++++++++---
 src/backend/replication/logical/worker.c    | 62 +++++++++++++++++++++++------
 2 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 97fc399..f3984d4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 057d471..a4642ad 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -871,14 +871,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			CHECK_FOR_INTERRUPTS();
 
 			process_syncing_tables(begin_data.final_lsn);
@@ -898,7 +900,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 
@@ -940,6 +947,8 @@ apply_handle_prepare(StringInfo s)
 	{
 		PsfHashEntry *hentry;
 
+		elog(LOG, "!!> apply_handle_prepare: SPOOLING");
+
 		/* Write the PREPARE info to the psf file. */
 		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
 
@@ -961,6 +970,8 @@ apply_handle_prepare(StringInfo s)
 		 * removal on proc-exit just in case there is an unexpected restart
 		 * between now and when commit_prepared happens.
 		 */
+		elog(LOG,
+			"!!> apply_handle_prepare: Make sure the spoolfile is not removed on proc-exit");
 		hentry = (PsfHashEntry *) hash_search(psf_hash, psf_cur.name,
 											  HASH_FIND, NULL);
 		Assert(hentry);
@@ -1045,6 +1056,8 @@ apply_handle_commit_prepared(StringInfo s)
 	{
 		int			nchanges;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * Replay/dispatch the spooled messages.
 		 */
@@ -1052,8 +1065,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		/* After replaying the psf it is no longer needed. Just delete it. */
@@ -1117,6 +1130,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2401,18 +2415,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3608,8 +3626,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_cur.is_spooling ? "Do" : "Don't");
 
@@ -3632,7 +3650,7 @@ prepare_spoolfile_create(char *path)
 {
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(!psf_cur.is_spooling);
 
@@ -3672,6 +3690,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_cur.vfd >= 0)
 		FileClose(psf_cur.vfd);
 
@@ -3687,6 +3706,8 @@ prepare_spoolfile_close()
 static void
 prepare_spoolfile_delete(char *path)
 {
+	elog(LOG, "!!> prepare_spoolfile_delete: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3712,18 +3733,20 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_cur.is_spooling);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(len));
 	psf_cur.cur_offset += bytes_written;
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(action));
@@ -3732,6 +3755,7 @@ prepare_spoolfile_write(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == len);
@@ -3746,6 +3770,11 @@ prepare_spoolfile_exists(char *path)
 {
 	File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 fd >= 0 ? "found" : "not found");
+
 	if (fd >= 0)
 		FileClose(fd);
 
@@ -3765,8 +3794,8 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 				oldctx2;
 	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3811,6 +3840,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3829,6 +3859,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		nbytes = FileRead(psf.vfd, buffer, len,
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
@@ -3845,7 +3876,9 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		/* Ensure we are reading the data into our memory context. */
 		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
 
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: Before dispatch");
 		apply_dispatch(&s2);
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: After dispatch");
 
 		MemoryContextReset(ApplyMessageContext);
 
@@ -3854,13 +3887,13 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(LOG, "!!> replayed %d changes from file '%s'",
 				 nchanges, path);
 	}
 
 	FileClose(psf.vfd);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
@@ -3895,6 +3928,8 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 	HASH_SEQ_STATUS seq_status;
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_on_proc_exit");
+
 	/* Iterate the HTAB looking for what files can be deleted. */
 	if (psf_hash)
 	{
@@ -3903,6 +3938,7 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 		{
 			char *path = hentry->name;
 
+			elog(LOG, "!!> prepare_spoolfile_proc_exit: found '%s'", path);
 			if (hentry->delete_on_exit)
 				prepare_spoolfile_delete(path);
 		}
-- 
1.8.3.1
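
For readers following the debug output above: the replay loop in prepare_spoolfile_replay_messages() reads a length word, then that many bytes, and stops when a read of the length word returns zero bytes. Below is a minimal standalone sketch of that length-prefixed pattern, assuming plain stdio instead of PostgreSQL's FileRead()/FileWrite(). The names write_record() and replay_records() are hypothetical and are not part of the patch, and the on-disk layout is simplified to a native int length followed by the payload.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not in the patch): append one record as a native int
 * length followed by the payload bytes. Returns 0 on success, -1 on error.
 */
static int
write_record(FILE *fp, const char *data, int len)
{
	assert(fp != NULL && len >= 0);

	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (len > 0 && fwrite(data, 1, (size_t) len, fp) != (size_t) len)
		return -1;
	return 0;
}

/*
 * Hypothetical helper: read records until a clean end of file. Returns the
 * number of records read, or -1 if the file is truncated or unreadable.
 */
static int
replay_records(FILE *fp)
{
	char	   *buffer = NULL;
	int			nchanges = 0;

	assert(fp != NULL);

	for (;;)
	{
		int			len;
		size_t		nread = fread(&len, 1, sizeof(len), fp);

		/* Nothing left to read and no error: clean end of file. */
		if (nread == 0 && !ferror(fp))
			break;

		/* A partial length word or a negative length means corruption. */
		if (nread != sizeof(len) || len < 0)
			goto fail;

		if (len > 0)
		{
			char	   *tmp = realloc(buffer, (size_t) len);

			if (tmp == NULL)
				goto fail;
			buffer = tmp;

			/* Read exactly len payload bytes into the buffer. */
			if (fread(buffer, 1, (size_t) len, fp) != (size_t) len)
				goto fail;
		}

		/* The patch hands the buffer to apply_dispatch() at this point. */
		nchanges++;
	}

	free(buffer);
	return nchanges;

fail:
	free(buffer);
	return -1;
}
```

To try it, write a few records to a tmpfile(), rewind() it, and call replay_records(); a file cut off in the middle of a payload should come back as -1 rather than as a short count.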

Attachment: v56-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From b048d2148672c6b83605c501cbff6905df74b421 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Mar 2021 21:59:10 +1100
Subject: [PATCH v56] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 332 ++++++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl | 282 ++++++++++++++++++++
 2 files changed, 614 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..ab41d2d
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,332 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

Attachment: v56-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 0ac166db91a2ce115fa68535d4a2d73a50ea05dc Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Mar 2021 21:48:05 +1100
Subject: [PATCH v56] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* The prepare API for streaming transactions is not supported.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

* This patch also adds a new option to enable two_phase while creating a slot.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/protocol.sgml                         |  14 +-
 src/backend/access/transam/twophase.c              |  68 ++
 src/backend/commands/subscriptioncmds.c            |   2 +-
 .../libpqwalreceiver/libpqwalreceiver.c            |   6 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 206 ++++++
 src/backend/replication/logical/tablesync.c        | 180 ++++-
 src/backend/replication/logical/worker.c           | 807 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 157 +++-
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/replication/logicalproto.h             |  69 +-
 src/include/replication/reorderbuffer.h            |  12 +
 src/include/replication/walreceiver.h              |   5 +-
 src/include/replication/worker_internal.h          |   3 +
 src/tools/pgindent/typedefs.list                   |   5 +
 18 files changed, 1477 insertions(+), 85 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..9694713 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,18 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase transactions.
+         Two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED
+         are also decoded and transmitted. In two-phase transactions, the transaction is
+         decoded and transmitted at PREPARE TRANSACTION time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is	around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..f6793f0 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transactions that commit between prepare and commit prepared.
+	 * See ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..488b2a2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,212 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..97fc399 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1052,7 +1026,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When process_syncing_tables_for_apply changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 21d304a..9eccc25 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1,4 +1,5 @@
 /*-------------------------------------------------------------------------
+ * If needed, this is the common function to do that file redirection.
  * worker.c
  *	   PostgreSQL logical replication worker (apply)
  *
@@ -49,6 +50,43 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ *
+ * PREPARED SPOOLFILE (PSF) LOGIC
+ * ------------------------------
+ * It can happen that the apply worker is processing a
+ * LOGICAL_REP_MSG_BEGIN_PREPARE message at the same time as table
+ * synchronization is happening. To avoid any chance of an "empty prepare"
+ * situation arising, the apply worker waits for all the tablesyncs to reach at
+ * least SYNCDONE state, but this in turn can lead to another case where the
+ * tablesync lsn has got ahead of the prepare lsn the apply worker is
+ * currently processing. Refer to the comment in the apply_handle_begin_prepare
+ * function for more details.
+ *
+ * When this happens the prepared messages are spooled into a "prepare
+ * spoolfile" (aka psf). The messages written to this file are all the prepared
+ * messages up to and including the LOGICAL_REP_MSG_PREPARE. All this psf
+ * content is then replayed later at commit time (apply_handle_commit_prepared),
+ * where the messages are all dispatched in the usual way.
+ *
+ * The psf files reside in the "pg_logical/twophase" directory and they are
+ * uniquely named. This is necessary because there may be multiple psf files
+ * co-existing, and so the correct psf must be re-discoverable (using subid and
+ * gid).
+ *
+ * Furthermore, to cope with the possibility of error between the end of spooling
+ * (in apply_handle_prepare) and the commit (in apply_handle_commit_prepared), a
+ * psf file must be able to survive a PG restart. So we cannot use the
+ * same (temporary file based) BufFile API that the streamed transactions use.
+ * Instead, the psf file handling uses the File API (PathNameOpenFile and
+ * friends). But this means the code now has to take responsibility for psf file
+ * cleanup. An HTAB is used to track if a particular psf file can or cannot be
+ * deleted, and a proc-exit handler is registered to take the appropriate
+ * action. Refer to function prepare_spoolfile_on_proc_exit.
+ *
+ * Upon restart, any uncommitted psf files are still present and so their commit
+ * can proceed as before.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +97,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -208,6 +247,59 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/* psf files will be written here. */
+#define PSF_DIR "pg_logical/twophase"
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and a
+ * flag indicating if it is OK to delete this psf at proc-exit time or not.
+ *
+ * Each PsfHashEntry is created at prepare and removed at commit/rollback.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	bool 		delete_on_exit; 	/* ok to delete at proc-exit? */
+}			PsfHashEntry;
+
+/*
+ * Information about the "current" psf spoolfile.
+ */
+typedef struct PsfFile
+{
+	char	name[MAXPGPATH];/* psf name - same as the HTAB key. */
+	bool	is_spooling;	/* are we currently spooling to this file? */
+	File 	vfd;			/* -1 when the file is closed. */
+	off_t	cur_offset;		/* offset for the next write or read. Reset to 0
+							 * when file is opened. */
+} PsfFile;
+
+/*
+ * Hash table for tracking the prepared spoolfiles (see PsfHashEntry).
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * Information about the 'current' open spoolfile is only valid when spooling.
+ * This is flagged as 'is_spooling' only between begin_prepare and prepare.
+ */
+static PsfFile psf_cur = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_delete(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+static void prepare_spoolfile_on_proc_exit(int status, Datum arg);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -720,6 +812,340 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				errmsg("transaction identifier \"%s\" is already in use",
+					   begin_data.gid)));
+
+	/*
+	 * Through unlucky timing of the apply/tablesync workers it is possible to
+	 * have a "consistent snapshot" that spans prepare/commit in such a way
+	 * that the tablesync did not do the prepare (the snapshot was not yet
+	 * consistent) and the apply worker does the begin prepare ('b') but skips
+	 * all the prepared operations [e.g. inserts] while the tablesync was still
+	 * busy (see the condition of should_apply_changes_for_rel).
+	 *
+	 * This can lead to an "empty prepare", because later when the apply
+	 * worker does the commit prepare ('K'), there is nothing in it (the
+	 * inserts were skipped earlier).
+	 *
+	 * We avoid this using two-part logic: (1) Wait for all tablesync workers
+	 * to reach SYNCDONE/READY state; (2) If the begin_prepare lsn is now
+	 * behind any tablesync lsn then spool the prepared messages to a file
+	 * to be replayed later at commit_prepared time.
+	 *
+	 * -----
+	 *
+	 * XXX - The 2PC protocol needs the publisher to be aware when the PREPARE
+	 * has been successfully acted on. But because of this "empty prepare"
+	 * case the prepared messages may now be spooled to a file and, when
+	 * that happens, the PREPARE would not happen at the usual time, but would
+	 * be deferred until COMMIT PREPARED time. This quirk could only happen
+	 * immediately after the initial table synchronization phase; once all
+	 * tables have achieved READY state the 2PC protocol will behave normally.
+	 *
+	 * A future release may be able to detect when all tables are READY and set
+	 * a flag to indicate this subscription/slot is ready for two_phase
+	 * decoding. Then at the publisher-side, we could enable wait-for-prepares
+	 * only when all the slots of WALSender have that flag set.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, relstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			CHECK_FOR_INTERRUPTS();
+
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN then all messages (up to and
+		 * including the prepare) will be saved to a spoolfile for replay
+		 * later at commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+
+			/*
+			 * Create the spoolfile.
+			 */
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * From now until apply_handle_prepare, we are spooling to the
+			 * current psf.
+			 */
+			psf_cur.is_spooling = true;
+		}
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_cur.is_spooling)
+	{
+		PsfHashEntry *hentry;
+
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * Flush the spoolfile, so changes can survive a restart.
+		 *
+		 * If the publisher resends the same data again after a restart (e.g.
+		 * if subscriber origin has not moved past this prepare), then the same
+		 * named psf file will be overwritten with the same data. See
+		 * prepare_spoolfile_create.
+		 */
+		FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);
+
+		/* We are finished spooling to the current psf. */
+		psf_cur.is_spooling = false;
+
+		/*
+		 * The commit_prepared will need the spoolfile, so unregister it for
+		 * removal on proc-exit just in case there is an unexpected restart
+		 * between now and when commit_prepared happens.
+		 */
+		hentry = (PsfHashEntry *) hash_search(psf_hash, psf_cur.name,
+											  HASH_FIND, NULL);
+		Assert(hentry);
+		hentry->delete_on_exit = false;
+
+		/*
+		 * The psf_cur.vfd is meaningful only between begin_prepare and prepare.
+		 * So close it now. Any messages written to the psf will be applied
+		 * later during handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		in_remote_transaction = false;
+		return;
+	}
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	/*
+	 * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+	 * was dispatched via the psf spoolfile replay then the remote_final_lsn
+	 * is set to commit lsn instead. Hence the <= instead of == check below.
+	 */
+	Assert(prepare_data.prepare_lsn <= remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+
+		/*
+		 * Replay/dispatch the spooled messages.
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		/* After replaying the psf it is no longer needed. Just delete it. */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then clean up
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		/* We are finished with this spoolfile. Delete it. */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
+	 */
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -732,6 +1158,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		!psf_cur.is_spooling &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1092,6 +1519,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_RELATION, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_RELATION, s))
 		return;
 
@@ -1113,6 +1543,9 @@ apply_handle_type(StringInfo s)
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TYPE, s))
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TYPE, s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -1150,6 +1583,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1271,6 +1707,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1429,6 +1868,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1798,6 +2240,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1954,6 +2399,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2012,7 +2479,9 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
 		}
 	}
 
-	*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+	/* consider entries in prepare spool file as not flushed */
+	*have_pending_txes = (!dlist_is_empty(&lsn_mapping) ||
+						 (psf_hash && hash_get_num_entries(psf_hash)));
 }
 
 /*
@@ -2061,6 +2530,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	TimeLineID	tli;
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL     hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2180,7 +2666,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && !psf_cur.is_spooling)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2927,6 +3413,9 @@ ApplyWorkerMain(Datum main_arg)
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
 
+	/* Arrange to delete any unwanted psf file(s) at proc-exit */
+	on_proc_exit(prepare_spoolfile_on_proc_exit, 0);
+
 	/* Setup signal handling */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
@@ -3103,3 +3592,317 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_cur.is_spooling ? "Do" : "Don't");
+
+	if (!psf_cur.is_spooling)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(!psf_cur.is_spooling);
+
+	/* Make sure the PSF_DIR subdirectory exists. */
+	if (MakePGDirectory(PSF_DIR) < 0 && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						PSF_DIR)));
+
+	/*
+	 * Open the file with O_TRUNC because we always want to create/overwrite
+	 * this file.
+	 */
+	psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+	if (psf_cur.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m", path)));
+
+	/* Create/Find the spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash, path,
+											  HASH_ENTER, NULL);
+	Assert(hentry);
+	memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+	psf_cur.cur_offset = 0;
+	hentry->delete_on_exit = true;
+
+	/* Sanity checks */
+	Assert(psf_cur.vfd >= 0);
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close()
+{
+	if (psf_cur.vfd >= 0)
+		FileClose(psf_cur.vfd);
+
+	/* Mark this fd as not valid to use anymore. */
+	psf_cur.is_spooling = false;
+	psf_cur.vfd = -1;
+	psf_cur.cur_offset = 0;
+}
+
+/*
+ * Delete the specified psf spoolfile, and any HTAB associated with it.
+ */
+static void
+prepare_spoolfile_delete(char *path)
+{
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Delete the file off the disk. */
+	unlink(path);
+
+	/* Remove any entry from the psf_hash, if present */
+	hash_search(psf_hash, path, HASH_REMOVE, NULL);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+	int			bytes_written;
+
+	Assert(psf_cur.is_spooling);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(len));
+	psf_cur.cur_offset += bytes_written;
+
+	/* then the action */
+	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(action));
+	psf_cur.cur_offset += bytes_written;
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == len);
+	psf_cur.cur_offset += bytes_written;
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (fd >= 0)
+		FileClose(fd);
+
+	return fd >= 0;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	psf.vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (psf.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from prepared spoolfile \"%s\": %m",
+						path)));
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	remote_final_lsn = final_lsn;
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		nbytes = FileRead(psf.vfd, buffer, len,
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+		if (nbytes != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldctx2);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file \"%s\"",
+				 nchanges, path);
+	}
+
+	FileClose(psf.vfd);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB. Therefore, the name
+	 * and the key must be exactly same lengths and padded with '\0' so garbage
+	 * does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "%s/psf_%u-%s.changes", PSF_DIR, subid, gid);
+}
+
+/*
+ * proc_exit callback to remove unwanted psf files.
+ */
+static void
+prepare_spoolfile_on_proc_exit(int status, Datum arg)
+{
+	HASH_SEQ_STATUS seq_status;
+	PsfHashEntry *hentry;
+
+	/* Iterate the HTAB looking for what files can be deleted. */
+	if (psf_hash)
+	{
+		hash_seq_init(&seq_status, psf_hash);
+		while ((hentry = (PsfHashEntry *) hash_seq_search(&seq_status)) != NULL)
+		{
+			char *path = hentry->name;
+
+			if (hentry->delete_on_exit)
+				prepare_spoolfile_delete(path);
+		}
+	}
+}
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2e4b39f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext* ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -322,8 +342,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +362,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +383,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool        send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +843,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1250,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..232af01 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for prepare, and commit prepared transaction.
+ * prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e4d2deb..9c49409 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
@@ -1955,6 +1958,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
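
The comment above prepare_spoolfile_write in this patch describes the on-disk record layout: a length, then the action code, then the message contents, where the length counts the action byte plus the contents but not itself. The following is only a minimal standalone sketch of that framing, assuming plain stdio in place of the backend's FileWrite/FileRead. The names psf_demo_write, psf_demo_read and psf_demo_roundtrip are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Append one record as [int len][char action][payload].
 * len covers the action byte plus the payload, but not len itself.
 * Hypothetical helper, not part of the patch.
 */
static int
psf_demo_write(FILE *fp, char action, const char *payload, int payload_len)
{
	int			len = payload_len + (int) sizeof(char);

	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Read the next record (action byte followed by payload) into buf.
 * Returns the record length, 0 at a clean end of file, -1 on a short
 * or oversized record.
 */
static int
psf_demo_read(FILE *fp, char *buf, int bufsize)
{
	int			len;
	size_t		n = fread(&len, 1, sizeof(len), fp);

	if (n == 0)
		return 0;				/* no more records */
	if (n != sizeof(len) || len <= 0 || len > bufsize)
		return -1;
	if (fread(buf, 1, (size_t) len, fp) != (size_t) len)
		return -1;
	return len;
}

/* Spool two messages to a temp file, rewind, and count what comes back. */
static int
psf_demo_roundtrip(void)
{
	FILE	   *fp = tmpfile();
	char		buf[64];
	int			n;
	int			nchanges = 0;

	if (fp == NULL)
		return -1;
	if (psf_demo_write(fp, 'I', "row-1", 5) != 0 ||
		psf_demo_write(fp, 'D', "row-2", 5) != 0)
	{
		fclose(fp);
		return -1;
	}
	rewind(fp);
	while ((n = psf_demo_read(fp, buf, (int) sizeof(buf))) > 0)
		nchanges++;
	fclose(fp);
	return (n == 0) ? nchanges : -1;
}
```

As in the patch, the length is written as a raw host-endian int, which is acceptable only because the same process on the same machine reads the file back. The reader treats a zero-byte read of the length as end of file and anything else short as an error, which is the distinction prepare_spoolfile_replay_messages makes.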

#245Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#244)
4 attachment(s)

Please find attached the latest patch set v57*

Differences from v56* are:

* Rebased to HEAD @ today

* Addresses the following feedback issues:

(24) [vc-0305] Done. Ran pgindent for all patch 0001 source files.

(49) [ak-0308] Fixed. In apply_handle_begin_prepare, don't set
in_remote_transaction if psf spooling.

(50) [ak-0308] Fixed. In apply_handle_prepare, assert
!in_remote_transaction if psf spooling.

(52) [vc-0309] Done. Patch 0002. Simplify the way test 020 creates the
publication.

(53) [vc-0309] Done. Patch 0002. Simplify the way test 022 creates the
publication.
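
For readers following items (49) and (50), here is a minimal standalone sketch of the intended flag handling. It is not code from the patch: ApplyState, sketch_begin_prepare and sketch_prepare are hypothetical names, and the non-spooling branch of the prepare step (simply ending the remote transaction) is an assumption made only to keep the example complete.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the apply worker's state; not the patch's types. */
typedef struct ApplyState
{
	bool		in_remote_transaction;
	bool		is_spooling;
} ApplyState;

/* BEGIN PREPARE: enter a remote transaction only when not spooling (item 49). */
void
sketch_begin_prepare(ApplyState *st, bool must_spool)
{
	st->is_spooling = must_spool;
	if (!must_spool)
		st->in_remote_transaction = true;
}

/* PREPARE: when spooling, no remote transaction may be open (item 50). */
void
sketch_prepare(ApplyState *st)
{
	if (st->is_spooling)
	{
		assert(!st->in_remote_transaction);
		st->is_spooling = false;	/* spooling ends at PREPARE */
		return;
	}

	/* Assumption: the normal path ends the remote transaction here. */
	st->in_remote_transaction = false;
}
```

In the spooling case the messages are only written to the psf, so nothing on the subscriber is inside a remote transaction until the spoolfile is replayed at commit prepared time.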

-----
[vc-0305] /messages/by-id/CALDaNm1rRG2EUus+mFrqRzEshZwJZtxja0rn_n3qXGAygODfOA@mail.gmail.com
[vc-0309] /messages/by-id/CALDaNm0QuncAis5OqtjzOxAPTZRn545JLqfjFEJwyRjUH-XvEw@mail.gmail.com
[ak-0308] /messages/by-id/CAA4eK1+oSUU77T92FueDJWsp=FjTroNaNC-K45Dgdr7f18aBFA@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v56-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 0ac166db91a2ce115fa68535d4a2d73a50ea05dc Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Mar 2021 21:48:05 +1100
Subject: [PATCH v56] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* prepare API for streaming transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

* This patch also adds a new option to enable two_phase while creating a slot.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/protocol.sgml                         |  14 +-
 src/backend/access/transam/twophase.c              |  68 ++
 src/backend/commands/subscriptioncmds.c            |   2 +-
 .../libpqwalreceiver/libpqwalreceiver.c            |   6 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 206 ++++++
 src/backend/replication/logical/tablesync.c        | 180 ++++-
 src/backend/replication/logical/worker.c           | 807 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 157 +++-
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/replication/logicalproto.h             |  69 +-
 src/include/replication/reorderbuffer.h            |  12 +
 src/include/replication/walreceiver.h              |   5 +-
 src/include/replication/worker_internal.h          |   3 +
 src/tools/pgindent/typedefs.list                   |   5 +
 18 files changed, 1477 insertions(+), 85 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..9694713 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,18 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase transactions.
+         Two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED
+         are also decoded and transmitted. In two-phase transactions, the transaction is
+         decoded and transmitted at PREPARE TRANSACTION time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..81cb765 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char* buf;
+			TwoPhaseFileHeader* hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no other
+			 * backend commits the prepared xact in the meantime. We can do
+			 * this optimization if we encounter many collisions in GID between
+			 * publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..f6793f0 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transactions commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..488b2a2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,212 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..97fc399 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1052,7 +1026,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have still not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When the process_syncing_tables_for_apply changes the state
+		 * from SYNCDONE to READY, that change is actually written directly
+		 * into the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 21d304a..9eccc25 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1,4 +1,5 @@
 /*-------------------------------------------------------------------------
+ * If needed, this is the common function to do that file redirection.
  * worker.c
  *	   PostgreSQL logical replication worker (apply)
  *
@@ -49,6 +50,43 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ *
+ * PREPARED SPOOLFILE (PSF) LOGIC
+ * ------------------------------
+ * It can happen that the apply worker is processing a
+ * LOGICAL_REP_MSG_BEGIN_PREPARE message at the same time as table
+ * synchronization is happening. To avoid any chance of an "empty prepare"
+ * situation arising the apply worker waits for all the tablesyncs to reach at
+ * least SYNCDONE state, but this in turn can lead to another case where the
+ * tablesync lsn has got ahead of the prepare lsn the apply worker is
+ * currently processing. Refer to the comment in the apply_handle_begin_prepare
+ * function for more details.
+ *
+ * When this happens the prepared messages are spooled into a "prepare
+ * spoolfile" (aka psf). The messages written to this file are all the prepared
+ * messages up to and including the LOGICAL_REP_MSG_PREPARE. All this psf
+ * content is then replayed later at commit time (apply_handle_commit_prepared),
+ * where the messages are all dispatched in the usual way.
+ *
+ * The psf files reside in the "pg_logical/twophase" directory and they are
+ * uniquely named. This is necessary because there may be multiple psf files
+ * co-existing, and so the correct psf must be re-discoverable (using subid and
+ * gid).
+ *
+ * Furthermore, to cope with a possible error between the end of spooling
+ * (in apply_handle_prepare) and the commit (in apply_handle_commit_prepared) a
+ * psf file must be able to survive a PG restart. So we cannot use the
+ * same (temporary file based) BufFile API that the streamed transactions use.
+ * Instead, the psf file handling uses the File API (PathNameOpenFile and
+ * friends). But this means the code now has to take responsibility for psf file
+ * cleanup. An HTAB is used to track if a particular psf file can or cannot be
+ * deleted, and a proc-exit handler is registered to take the appropriate
+ * action. Refer to function prepare_spoolfile_on_proc_exit.
+ *
+ * Upon restart, any uncommitted psf files are still present and so their commit
+ * can proceed as before.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +97,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -208,6 +247,59 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/* psf files will be written here. */
+#define PSF_DIR "pg_logical/twophase"
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and a
+ * flag indicating if it is OK to delete this psf at proc-exit time or not.
+ *
+ * Each PsfHashEntry is created at prepare and removed at commit/rollback.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	bool 		delete_on_exit; 	/* ok to delete at proc-exit? */
+}			PsfHashEntry;
+
+/*
+ * Information about the "current" psf spoolfile.
+ */
+typedef struct PsfFile
+{
+	char	name[MAXPGPATH];/* psf name - same as the HTAB key. */
+	bool	is_spooling;	/* are we currently spooling to this file? */
+	File 	vfd;			/* -1 when the file is closed. */
+	off_t	cur_offset;		/* offset for the next write or read. Reset to 0
+							 * when file is opened. */
+} PsfFile;
+
+/*
+ * Hash table for storing the Prepared spoolfile info along with shared fileset.
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * Information about the 'current' open spoolfile is only valid when spooling.
+ * This is flagged as 'is_spooling' only between begin_prepare and prepare.
+ */
+static PsfFile psf_cur = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_delete(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+static void prepare_spoolfile_on_proc_exit(int status, Datum arg);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -720,6 +812,340 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				errmsg("transaction identifier \"%s\" is already in use",
+					   begin_data.gid)));
+
+	/*
+	 * Due to unlucky timing of apply/tablesync workers it is possible to have a
+	 * "consistent snapshot" that spans prepare/commit in such a way that
+	 * the tablesync did not do the prepare (because snapshot not consistent)
+	 * and the apply worker does the begin prepare ('b') but it skips all
+	 * the prepared operations [e.g. inserts] while the tablesync was still
+	 * busy (see the condition of should_apply_changes_for_rel).
+	 *
+	 * This can lead to an "empty prepare", because later when the apply
+	 * worker does the commit prepare ('K'), there is nothing in it (the
+	 * inserts were skipped earlier).
+	 *
+	 * We avoid this using the 2 part logic: (1) Wait for all tablesync workers
+	 * to reach SYNCDONE/READY state; (2) If the begin_prepare lsn is now
+	 * behind any tablesync lsn then spool the prepared messages to a file
+	 * to be replayed later at commit_prepared time.
+	 *
+	 * -----
+	 *
+	 * XXX - The 2PC protocol needs the publisher to be aware when the PREPARE
+	 * has been successfully acted on. But because of this "empty prepare"
+	 * case now the prepared messages may be spooled to a file and, when
+	 * that happens the PREPARE would not happen at the usual time, but would
+	 * be deferred until COMMIT PREPARED time. This quirk could only happen
+	 * immediately after the initial table synchronization phase; once all
+	 * tables have achieved READY state the 2PC protocol will behave normally.
+	 *
+	 * A future release may be able to detect when all tables are READY and set
+	 * a flag to indicate this subscription/slot is ready for two_phase
+	 * decoding. Then at the publisher-side, we could enable wait-for-prepares
+	 * only when all the slots of WALSender have that flag set.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			CHECK_FOR_INTERRUPTS();
+
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN then all messages (until
+		 * prepare) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+
+			/*
+			 * Create the spoolfile.
+			 */
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * From now, until the handle_prepare we are spooling to the
+			 * current psf.
+			 */
+			psf_cur.is_spooling = true;
+		}
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_cur.is_spooling)
+	{
+		PsfHashEntry *hentry;
+
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * Flush the spoolfile, so changes can survive a restart.
+		 *
+		 * If the publisher resends the same data again after a restart (e.g.
+		 * if subscriber origin has not moved past this prepare), then the same
+		 * named psf file will be overwritten with the same data. See
+		 * prepare_spoolfile_create.
+		 */
+		FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);
+
+		/* We are finished spooling to the current psf. */
+		psf_cur.is_spooling = false;
+
+		/*
+		 * The commit_prepare will need the spoolfile, so unregister it for
+		 * removal on proc-exit just in case there is an unexpected restart
+		 * between now and when commit_prepared happens.
+		 */
+		hentry = (PsfHashEntry *) hash_search(psf_hash, psf_cur.name,
+											  HASH_FIND, NULL);
+		Assert(hentry);
+		hentry->delete_on_exit = false;
+
+		/*
+		 * The psf_cur.vfd is meaningful only between begin_prepare and prepared.
+		 * So close it now. Any messages written to the psf will be applied
+		 * later during handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		in_remote_transaction = false;
+		return;
+	}
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	/*
+	 * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+	 * was dispatched via the psf spoolfile replay then the remote_final_lsn
+	 * is set to commit lsn instead. Hence the <= instead of == check below.
+	 */
+	Assert(prepare_data.prepare_lsn <= remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+
+		/*
+		 * Replay/dispatch the spooled messages.
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		/* After replaying the psf it is no longer needed. Just delete it. */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		/* We are finished with this spoolfile. Delete it. */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
+	 */
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -732,6 +1158,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		!psf_cur.is_spooling &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1092,6 +1519,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_RELATION, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_RELATION, s))
 		return;
 
@@ -1113,6 +1543,9 @@ apply_handle_type(StringInfo s)
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TYPE, s))
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TYPE, s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -1150,6 +1583,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1271,6 +1707,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1429,6 +1868,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1798,6 +2240,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1954,6 +2399,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2012,7 +2479,9 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
 		}
 	}
 
-	*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+	/* consider entries in prepare spool file as not flushed */
+	*have_pending_txes = (!dlist_is_empty(&lsn_mapping) ||
+						 (psf_hash && hash_get_num_entries(psf_hash)));
 }
 
 /*
@@ -2061,6 +2530,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	TimeLineID	tli;
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL     hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2180,7 +2666,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && !psf_cur.is_spooling)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2927,6 +3413,9 @@ ApplyWorkerMain(Datum main_arg)
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
 
+	/* Arrange to delete any unwanted psf file(s) at proc-exit */
+	on_proc_exit(prepare_spoolfile_on_proc_exit, 0);
+
 	/* Setup signal handling */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
@@ -3103,3 +3592,317 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_cur.is_spooling ? "Do" : "Don't");
+
+	if (!psf_cur.is_spooling)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(!psf_cur.is_spooling);
+
+	/* Make sure the PSF_DIR subdirectory exists. */
+	if (MakePGDirectory(PSF_DIR) < 0 && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						PSF_DIR)));
+
+	/*
+	 * Open the file and seek to the beginning because we always want to
+	 * create/overwrite this file.
+	 */
+	psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+	if (psf_cur.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m", path)));
+
+	/* Create/Find the spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash, path,
+											  HASH_ENTER, NULL);
+	Assert(hentry);
+	memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+	psf_cur.cur_offset = 0;
+	hentry->delete_on_exit = true;
+
+	/* Sanity checks */
+	Assert(psf_cur.vfd >= 0);
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close(void)
+{
+	if (psf_cur.vfd >= 0)
+		FileClose(psf_cur.vfd);
+
+	/* Mark this fd as not valid to use anymore. */
+	psf_cur.is_spooling = false;
+	psf_cur.vfd = -1;
+	psf_cur.cur_offset = 0;
+}
+
+/*
+ * Delete the specified psf spoolfile, and any psf_hash entry associated with it.
+ */
+static void
+prepare_spoolfile_delete(char *path)
+{
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Delete the file off the disk. */
+	unlink(path);
+
+	/* Remove any entry from the psf_hash, if present */
+	hash_search(psf_hash, path, HASH_REMOVE, NULL);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (everything after the action byte).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+	int			bytes_written;
+
+	Assert(psf_cur.is_spooling);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(len));
+	psf_cur.cur_offset += bytes_written;
+
+	/* then the action */
+	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(action));
+	psf_cur.cur_offset += bytes_written;
+
+	/* and finally the remaining part of the buffer (after the action byte) */
+	len = (s->len - s->cursor);
+
+	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == len);
+	psf_cur.cur_offset += bytes_written;
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (fd >= 0)
+		FileClose(fd);
+
+	return fd >= 0;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	psf.vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (psf.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open prepared spoolfile \"%s\": %m",
+						path)));
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Make sure the handle apply_dispatch methods are aware we're in a remote
+	 * transaction.
+	 */
+	remote_final_lsn = final_lsn;
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		nbytes = FileRead(psf.vfd, buffer, len,
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+		if (nbytes != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldctx2);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file \"%s\"",
+				 nchanges, path);
+	}
+
+	FileClose(psf.vfd);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB. Therefore, the name
+	 * and the key must be exactly the same length and padded with '\0' so
+	 * garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "%s/psf_%u-%s.changes", PSF_DIR, subid, gid);
+}
+
+/*
+ * proc_exit callback to remove unwanted psf files.
+ */
+static void
+prepare_spoolfile_on_proc_exit(int status, Datum arg)
+{
+	HASH_SEQ_STATUS seq_status;
+	PsfHashEntry *hentry;
+
+	/* Iterate the HTAB looking for what files can be deleted. */
+	if (psf_hash)
+	{
+		hash_seq_init(&seq_status, psf_hash);
+		while ((hentry = (PsfHashEntry *) hash_seq_search(&seq_status)) != NULL)
+		{
+			char *path = hentry->name;
+
+			if (hentry->delete_on_exit)
+				prepare_spoolfile_delete(path);
+		}
+	}
+}
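
The record framing used by prepare_spoolfile_write and
prepare_spoolfile_replay_messages above can be sketched in isolation. This
is only an illustration, not code from the patch: the SketchSpool type and
the sketch_spool_* helpers are hypothetical names, and an in-memory buffer
with an explicit offset stands in for the vfd and psf_cur.cur_offset. Each
record is an int length (not counting itself), one action byte, then the
payload.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the spoolfile: a byte buffer plus its size. */
typedef struct SketchSpool
{
	char		data[256];
	int			size;			/* bytes written so far */
} SketchSpool;

/* Append one record: [int len][char action][payload]. */
static int
sketch_spool_write(SketchSpool *sp, char action,
				   const char *payload, int payload_len)
{
	/* len covers the action byte and the payload, but not len itself */
	int			len = payload_len + (int) sizeof(char);

	if (sp->size + (int) sizeof(len) + len > (int) sizeof(sp->data))
		return -1;				/* buffer full */

	memcpy(sp->data + sp->size, &len, sizeof(len));
	sp->size += (int) sizeof(len);
	sp->data[sp->size++] = action;
	memcpy(sp->data + sp->size, payload, (size_t) payload_len);
	sp->size += payload_len;
	return 0;
}

/*
 * Read the record at *offset and advance *offset past it.
 * Returns the record length, 0 at a clean end of data, -1 if truncated.
 */
static int
sketch_spool_read(const SketchSpool *sp, int *offset,
				  char *action, const char **payload)
{
	int			len;

	if (*offset == sp->size)
		return 0;				/* no more records */
	if (sp->size - *offset < (int) sizeof(len))
		return -1;				/* partial length header */

	memcpy(&len, sp->data + *offset, sizeof(len));
	*offset += (int) sizeof(len);
	if (len <= 0 || len > sp->size - *offset)
		return -1;				/* bad or truncated record */

	*action = sp->data[*offset];
	*payload = sp->data + *offset + 1;
	*offset += len;
	return len;
}
```

A replay loop over this format reads until sketch_spool_read returns 0,
which corresponds to the nbytes == 0 end-of-file check in the patch; a
negative return corresponds to the short-read ERROR paths.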
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2e4b39f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -322,8 +342,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +362,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +383,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool        send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +843,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1250,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
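
For reference, the grammar change above places the new keyword between
TEMPORARY and LOGICAL. A minimal sketch of how a client might assemble such
a command follows. The helper name sketch_create_slot_cmd is hypothetical,
the real caller is not part of this section, and identifier quoting and the
trailing option list are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build a CREATE_REPLICATION_SLOT command in the
 * keyword order accepted by the patched grammar:
 *   CREATE_REPLICATION_SLOT slot [TEMPORARY] [TWO_PHASE] LOGICAL plugin
 * Returns the value of snprintf (the length the full string needs).
 */
static int
sketch_create_slot_cmd(char *buf, size_t bufsz, const char *slotname,
					   bool temporary, bool two_phase, const char *plugin)
{
	return snprintf(buf, bufsz, "CREATE_REPLICATION_SLOT %s%s%s LOGICAL %s",
					slotname,
					temporary ? " TEMPORARY" : "",
					two_phase ? " TWO_PHASE" : "",
					plugin);
}
```

Because opt_two_phase has an empty alternative that yields false, commands
built without the keyword still parse as before.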
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
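
The body of LookupGXact is not shown in this part of the patch, so the
following is only a sketch of the matching rule implied by its signature
and by apply_handle_rollback_prepared: a prepared transaction counts as
"ours" only when the gid, the prepare end LSN and the prepare timestamp all
match. The SketchGXact type and sketch_lookup_gxact are hypothetical names,
and plain integers stand in for XLogRecPtr and TimestampTz.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of a locally prepared transaction. */
typedef struct SketchGXact
{
	const char *gid;
	uint64_t	prepare_end_lsn;	/* stands in for XLogRecPtr */
	int64_t		prepare_time;		/* stands in for TimestampTz */
} SketchGXact;

/*
 * Return true only if some entry matches on all three fields. Matching on
 * gid alone is not enough, because the subscriber may hold an unrelated
 * prepared transaction that happens to use the same identifier.
 */
static bool
sketch_lookup_gxact(const SketchGXact *xacts, int nxacts, const char *gid,
					uint64_t prepare_end_lsn, int64_t prepare_time)
{
	for (int i = 0; i < nxacts; i++)
	{
		if (strcmp(xacts[i].gid, gid) == 0 &&
			xacts[i].prepare_end_lsn == prepare_end_lsn &&
			xacts[i].prepare_time == prepare_time)
			return true;
	}
	return false;
}
```

With a check of this shape, a ROLLBACK PREPARED for a transaction whose
PREPARE was never received is skipped rather than applied to a local
transaction that merely shares the gid.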
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..232af01 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for prepare, and commit prepared transaction.
+ * prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime is used to check whether the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e4d2deb..9c49409 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,12 +1341,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
@@ -1955,6 +1958,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
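
The comment on LogicalRepRollbackPreparedTxnData in the logicalproto.h hunk above says the gid alone is not enough to decide whether a ROLLBACK PREPARED should be applied downstream, which is why LookupGXact() also takes the prepare LSN and the origin prepare timestamp. The following is only a minimal sketch of that matching rule, not the patch's implementation: the PreparedXact struct, the lookup_prepared_xact() helper and the in-memory array are hypothetical, and the typedefs are stand-ins for the PostgreSQL types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for PostgreSQL's 64-bit XLogRecPtr and TimestampTz types. */
typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;

/* Hypothetical record of a prepared transaction known to the subscriber. */
typedef struct PreparedXact
{
	const char *gid;				/* global transaction identifier */
	XLogRecPtr	prepare_end_lsn;	/* end LSN of the PREPARE on the origin */
	TimestampTz preparetime;		/* PREPARE timestamp on the origin */
} PreparedXact;

/*
 * Return true only if an entry matches on gid, prepare LSN and prepare time.
 * A gid-only match is rejected, because an unrelated local prepared
 * transaction could be using the same identifier.
 */
static bool
lookup_prepared_xact(const PreparedXact *xacts, size_t nxacts,
					 const char *gid, XLogRecPtr prepare_end_lsn,
					 TimestampTz preparetime)
{
	assert(gid != NULL);

	for (size_t i = 0; i < nxacts; i++)
	{
		if (strcmp(xacts[i].gid, gid) == 0 &&
			xacts[i].prepare_end_lsn == prepare_end_lsn &&
			xacts[i].preparetime == preparetime)
			return true;
	}
	return false;
}
```

Under these assumptions, a subscriber would apply the rollback when lookup_prepared_xact() returns true for the values carried in the rollback message, and skip it otherwise.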

Attachment: v56-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From b048d2148672c6b83605c501cbff6905df74b421 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Mar 2021 21:59:10 +1100
Subject: [PATCH v56] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 332 ++++++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl | 282 ++++++++++++++++++++
 2 files changed, 614 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..ab41d2d
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,332 @@
+# Test logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..0f95530
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,282 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A");
+$node_A->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_A ADD TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B");
+$node_B->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub_B ADD TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

Attachment: v56-0004-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From 43d6696e8dae0ff45ceda9ce1faed4013688e7b1 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Mar 2021 22:51:28 +1100
Subject: [PATCH v56] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for
debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 +++++++++++---
 src/backend/replication/logical/worker.c    | 62 +++++++++++++++++++++++------
 2 files changed, 72 insertions(+), 19 deletions(-)
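
The developer logging in the hunks below prints LSNs with the "%X/%X" format and LSN_FORMAT_ARGS. As a reading aid for those log lines, here is a small sketch of how a 64-bit LSN maps to that text and back. It assumes LSN_FORMAT_ARGS supplies the high and low 32-bit halves of the LSN; format_lsn() and parse_lsn() are hypothetical helpers, not PostgreSQL functions, and the typedef is a stand-in for XLogRecPtr.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for PostgreSQL's 64-bit LSN type. */
typedef uint64_t XLogRecPtr;

/* Write an LSN as "HI/LO" in upper-case hex, mirroring "%X/%X" output. */
static int
format_lsn(char *buf, size_t buflen, XLogRecPtr lsn)
{
	assert(buf != NULL);

	return snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
					(uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/* Parse "HI/LO" hex text into a 64-bit LSN; returns 1 on success, 0 on failure. */
static int
parse_lsn(const char *text, XLogRecPtr *lsn)
{
	uint32_t	hi;
	uint32_t	lo;

	if (sscanf(text, "%" SCNx32 "/%" SCNx32, &hi, &lo) != 2)
		return 0;

	*lsn = ((XLogRecPtr) hi << 32) | lo;
	return 1;
}
```

Parsing both values from a log line into 64-bit integers makes a comparison such as current_lsn >= relstate_lsn easy to check by hand, since comparing the printed strings directly would not order them correctly.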

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 97fc399..f3984d4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress()
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress()
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress()
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress()
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 057d471..a4642ad 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -871,14 +871,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			CHECK_FOR_INTERRUPTS();
 
 			process_syncing_tables(begin_data.final_lsn);
@@ -898,7 +900,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 
@@ -940,6 +947,8 @@ apply_handle_prepare(StringInfo s)
 	{
 		PsfHashEntry *hentry;
 
+		elog(LOG, "!!> apply_handle_prepare: SPOOLING");
+
 		/* Write the PREPARE info to the psf file. */
 		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
 
@@ -961,6 +970,8 @@ apply_handle_prepare(StringInfo s)
 		 * removal on proc-exit just in case there is an unexpected restart
 		 * between now and when commit_prepared happens.
 		 */
+		elog(LOG,
+			"!!> apply_handle_prepare: Make sure the spoolfile is not removed on proc-exit");
 		hentry = (PsfHashEntry *) hash_search(psf_hash, psf_cur.name,
 											  HASH_FIND, NULL);
 		Assert(hentry);
@@ -1045,6 +1056,8 @@ apply_handle_commit_prepared(StringInfo s)
 	{
 		int			nchanges;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * Replay/dispatch the spooled messages.
 		 */
@@ -1052,8 +1065,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		/* After replaying the psf it is no longer needed. Just delete it. */
@@ -1117,6 +1130,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2401,18 +2415,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3608,8 +3626,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_cur.is_spooling ? "Do" : "Don't");
 
@@ -3632,7 +3650,7 @@ prepare_spoolfile_create(char *path)
 {
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(!psf_cur.is_spooling);
 
@@ -3672,6 +3690,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_cur.vfd >= 0)
 		FileClose(psf_cur.vfd);
 
@@ -3687,6 +3706,8 @@ prepare_spoolfile_close()
 static void
 prepare_spoolfile_delete(char *path)
 {
+	elog(LOG, "!!> prepare_spoolfile_delete: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3712,18 +3733,20 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_cur.is_spooling);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, (char *)&len, sizeof(len),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(len));
 	psf_cur.cur_offset += bytes_written;
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(action));
@@ -3732,6 +3755,7 @@ prepare_spoolfile_write(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == len);
@@ -3746,6 +3770,11 @@ prepare_spoolfile_exists(char *path)
 {
 	File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 fd >= 0 ? "found" : "not found");
+
 	if (fd >= 0)
 		FileClose(fd);
 
@@ -3765,8 +3794,8 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 				oldctx2;
 	PsfFile		psf = { .is_spooling = false, .vfd = -1, .cur_offset = 0 };
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3811,6 +3840,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3829,6 +3859,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		nbytes = FileRead(psf.vfd, buffer, len,
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
@@ -3845,7 +3876,9 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		/* Ensure we are reading the data into our memory context. */
 		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
 
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: Before dispatch");
 		apply_dispatch(&s2);
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: After dispatch");
 
 		MemoryContextReset(ApplyMessageContext);
 
@@ -3854,13 +3887,13 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(LOG, "!!> replayed %d changes from file '%s'",
 				 nchanges, path);
 	}
 
 	FileClose(psf.vfd);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
@@ -3895,6 +3928,8 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 	HASH_SEQ_STATUS seq_status;
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_on_proc_exit");
+
 	/* Iterate the HTAB looking for what files can be deleted. */
 	if (psf_hash)
 	{
@@ -3903,6 +3938,7 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 		{
 			char *path = hentry->name;
 
+			elog(LOG, "!!> prepare_spoolfile_on_proc_exit: found '%s'", path);
 			if (hentry->delete_on_exit)
 				prepare_spoolfile_delete(path);
 		}
-- 
1.8.3.1
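
The prepare_spoolfile_write() and prepare_spoolfile_replay_messages() hunks
above log the three steps of the spoolfile record layout: first the length,
then the action character, then the remaining message bytes. The following is
a minimal standalone sketch of that framing, assuming plain stdio instead of
PostgreSQL's FileWrite()/FileRead(). The names spool_write, spool_read and
spool_replay_count are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Append one record as [int len][char action][payload bytes].
 * As in the patch, len counts the action character plus the payload.
 * Returns 0 on success, -1 on a short write.
 */
static int
spool_write(FILE *fp, char action, const char *payload, int payload_len)
{
	int			len;

	assert(payload_len >= 0);
	len = payload_len + (int) sizeof(char);

	/* first write the size */
	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	/* then the action */
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	/* and finally the payload */
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Read the next record into *buf (grown with realloc).
 * Returns 1 on success, 0 at a clean end of file, -1 on a short or
 * malformed record. On success (*buf)[0] is the action character.
 */
static int
spool_read(FILE *fp, char **buf, int *len)
{
	size_t		nbytes = fread(len, 1, sizeof(*len), fp);
	char	   *tmp;

	/* have we reached end of the file? */
	if (nbytes == 0)
		return 0;
	if (nbytes != sizeof(*len) || *len < 1)
		return -1;

	tmp = realloc(*buf, (size_t) *len);
	if (tmp == NULL)
		return -1;
	*buf = tmp;

	if (fread(*buf, 1, (size_t) *len, fp) != (size_t) *len)
		return -1;
	return 1;
}

/* Rewind and walk all records; returns their count, or -1 on error. */
static int
spool_replay_count(FILE *fp)
{
	char	   *buf = NULL;
	int			len = 0;
	int			nchanges = 0;
	int			rc;

	rewind(fp);
	while ((rc = spool_read(fp, &buf, &len)) == 1)
		nchanges++;			/* a real replay would dispatch buf here */
	free(buf);
	return (rc < 0) ? -1 : nchanges;
}
```

A caller would open a file (for example with tmpfile()), call spool_write()
once per message, and later call spool_replay_count() or loop over
spool_read() to process the records in order. Storing the length first lets
the reader size its buffer before reading each record, which is why a zero
byte read of the length field is treated as the normal end of the file.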

v56-0003-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From 1723729e023e81640b4a82a5da6102e39d31cd3a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 10 Mar 2021 22:14:51 +1100
Subject: [PATCH v56] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option, "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/create_subscription.sgml          | 27 +++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 72 +++++++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 +
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 ++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 +++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 ++-
 src/bin/psql/tab-complete.c                        |  2 +-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 +
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 93 +++++++++++++++-------
 src/test/regress/sql/subscription.sql              | 25 ++++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 18 files changed, 260 insertions(+), 48 deletions(-)

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..eeb7e35 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,33 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, the decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..b77378d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f6793f0..96fcf49 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option; doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * the current implementation has some issues that could cause a
+	 * streaming prepared transaction to be incorrectly missed in the initial
+	 * syncing phase. Hence, this combination is disabled until that issue
+	 * can be addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +576,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +883,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +918,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if (sub->twophase && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +947,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +993,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1039,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9eccc25..057d471 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2925,6 +2925,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3574,6 +3575,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 2e4b39f..91ecc55 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise,
+		 * we only allow it with a sufficient version of the protocol, and when
+		 * the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..96c878b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two_phase are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 9f0208a..34c70a1 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2763,7 +2763,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 232af01..a5bb4de 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..67b3358 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index ab41d2d..aa3455b 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -47,7 +47,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 0f95530..9fb461b 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -54,7 +54,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -66,7 +67,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
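
For readers skimming the regression output above, the option rules it exercises can be summarised in a small standalone sketch. This is not the code from the patch: the SubOpts struct and the two check_* helpers are hypothetical names made up for illustration, and only the error strings are copied from the expected output in subscription.out. Whether "streaming = false" is accepted on a two-phase subscription is not shown by the tests, so the sketch simply assumes it is.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the two subscription options involved. */
typedef struct SubOpts
{
	bool		streaming;
	bool		two_phase;
} SubOpts;

/*
 * CREATE SUBSCRIPTION: two_phase and streaming cannot both be enabled.
 * Returns NULL when the combination is acceptable, otherwise the error
 * text seen in the expected regression output.
 */
static const char *
check_create_opts(const SubOpts *opts)
{
	if (opts->streaming && opts->two_phase)
		return "two_phase = true and streaming = true are mutually exclusive options";
	return NULL;
}

/*
 * ALTER SUBSCRIPTION ... SET (...): two_phase can never be changed, and
 * streaming cannot be switched on while two_phase is enabled.
 * (Assumption: switching streaming off is always allowed.)
 */
static const char *
check_alter_opts(const SubOpts *current, bool sets_two_phase,
				 bool sets_streaming, bool new_streaming)
{
	if (sets_two_phase)
		return "cannot alter two_phase option";
	if (sets_streaming && new_streaming && current->two_phase)
		return "cannot set streaming = true for two-phase enabled subscription";
	return NULL;
}
```

Called with a subscription created using two_phase = true, check_alter_opts() rejects both "SET (two_phase = false)" and "SET (streaming = true)", which matches the two failing ALTER statements in the test.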

#246Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#245)
4 attachment(s)

On Thu, Mar 11, 2021 at 12:46 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v57*

Differences from v56* are:

* Rebased to HEAD @ today

* Addresses the following feedback issues:

(24) [vc-0305] Done. Ran pgindent for all patch 0001 source files.

(49) [ak-0308] Fixed. In apply_handle_begin_prepare, don't set
in_remote_transaction if psf spooling.

(50) [ak-0308] Fixed. In apply_handle_prepare, assert
!in_remote_transaction if psf spooling.

(52) [vc-0309] Done. Patch 0002. Simplify the way test 020 creates the
publication.

(53) [vc-0309] Done. Patch 0002. Simplify the way test 022 creates the
publication.

-----
[vc-0305] /messages/by-id/CALDaNm1rRG2EUus+mFrqRzEshZwJZtxja0rn_n3qXGAygODfOA@mail.gmail.com
[vc-0309] /messages/by-id/CALDaNm0QuncAis5OqtjzOxAPTZRn545JLqfjFEJwyRjUH-XvEw@mail.gmail.com
[ak-0308] /messages/by-id/CAA4eK1+oSUU77T92FueDJWsp=FjTroNaNC-K45Dgdr7f18aBFA@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Oops. I posted the wrong patch set in my previous email.

Here are the correct ones for v57*.

Sorry for any confusion.

Attachments:

v57-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From c8478d95d90fcb444db1132816e9608950824a72 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 11 Mar 2021 12:12:08 +1100
Subject: [PATCH v57] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 330 ++++++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl | 278 ++++++++++++++++++++
 2 files changed, 608 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..b7a07be
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,330 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..92bb655
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,278 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
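
The cascade test above repeatedly polls "pg_current_wal_lsn() <= replay_lsn" to decide that a subscriber has caught up. As a rough illustration of that condition only, here is a minimal sketch in C. It assumes the usual textual LSN form of two hexadecimal numbers separated by a slash (high 32 bits, then low 32 bits). The helper names parse_lsn and subscriber_caught_up are hypothetical; they are not part of this patch or of PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a PostgreSQL function): parse an LSN given in
 * its textual "hi/lo" hexadecimal form into a 64-bit value.
 * Returns false if the text is not of that form.
 */
static bool
parse_lsn(const char *text, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;
	int			consumed = 0;

	assert(text != NULL && lsn != NULL);

	/* %n records how many characters were consumed by the match. */
	if (sscanf(text, "%X/%X%n", &hi, &lo, &consumed) != 2)
		return false;

	/* Reject trailing garbage after the low half. */
	if (text[consumed] != '\0')
		return false;

	*lsn = ((uint64_t) hi << 32) | lo;
	return true;
}

/*
 * Hypothetical helper: mirrors the test's polling condition, i.e. the
 * publisher's current WAL position is not ahead of what the subscriber
 * has replayed. Unparseable input is treated as "not caught up".
 */
static bool
subscriber_caught_up(const char *current_wal_lsn, const char *replay_lsn)
{
	uint64_t	current;
	uint64_t	replay;

	if (!parse_lsn(current_wal_lsn, &current) ||
		!parse_lsn(replay_lsn, &replay))
		return false;

	return current <= replay;
}
```

The comparison is done on the 64-bit values rather than on the strings, since a plain string comparison would order "0/9" after "0/10". In the cascaded setup the same check is applied per hop: node_A is polled for subscriber B, then node_B for subscriber C.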

v57-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From dbf246e19a35970dc45b2f7364dcd46302c3119e Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 11 Mar 2021 12:01:55 +1100
Subject: [PATCH v57] Add support for apply at prepare time to built-in logical
  replication.

To add support for streaming transactions at prepare time to built-in
logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* The prepare API for streaming transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

* This patch also adds a new option to enable two_phase while creating a slot.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/protocol.sgml                         |  14 +-
 src/backend/access/transam/twophase.c              |  68 ++
 src/backend/commands/subscriptioncmds.c            |  10 +-
 .../libpqwalreceiver/libpqwalreceiver.c            |   6 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 206 ++++++
 src/backend/replication/logical/tablesync.c        | 180 ++++-
 src/backend/replication/logical/worker.c           | 812 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 157 +++-
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/replication/logicalproto.h             |  69 +-
 src/include/replication/reorderbuffer.h            |  12 +
 src/include/replication/walreceiver.h              |   5 +-
 src/include/replication/worker_internal.h          |   3 +
 src/tools/pgindent/typedefs.list                   |   5 +
 18 files changed, 1486 insertions(+), 89 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..9694713 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,18 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase transactions.
+         Two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED
+         are also decoded and transmitted. In two-phase transactions, the transaction is
+         decoded and transmitted at PREPARE TRANSACTION time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..c58c46d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid the possibility of matching
+ * a prepared xact from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..fa23bcb 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -648,7 +648,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
 										InvalidXLogRecPtr);
 				ereport(DEBUG1,
 						(errmsg_internal("table \"%s.%s\" added to subscription \"%s\"",
-								rv->schemaname, rv->relname, sub->name)));
+										 rv->schemaname, rv->relname, sub->name)));
 			}
 		}
 
@@ -722,9 +722,9 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
 
 				ereport(DEBUG1,
 						(errmsg_internal("table \"%s.%s\" removed from subscription \"%s\"",
-								get_namespace_name(get_rel_namespace(relid)),
-								get_rel_name(relid),
-								sub->name)));
+										 get_namespace_name(get_rel_namespace(relid)),
+										 get_rel_name(relid),
+										 sub->name)));
 			}
 		}
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..488b2a2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,212 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..3c2c9fc 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1052,7 +1026,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have still not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When the process_syncing_tables_for_apply changes the state from
+		 * SYNCDONE to READY, that change is actually written directly into
+		 * the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 21d304a..8bf273c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1,4 +1,5 @@
 /*-------------------------------------------------------------------------
+ * If needed, this is the common function to do that file redirection.
  * worker.c
  *	   PostgreSQL logical replication worker (apply)
  *
@@ -49,6 +50,43 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ *
+ * PREPARED SPOOLFILE (PSF) LOGIC
+ * ------------------------------
+ * It can happen that the apply worker is processing a
+ * LOGICAL_REP_MSG_BEGIN_PREPARE message at the same time as table
+ * synchronization is happening. To avoid any chance of an "empty prepare"
+ * situation arising the apply worker waits for all the tablesyncs to reach at
+ * least SYNCDONE state, but this in turn can lead to another case where the
+ * tablesync lsn has got ahead of the prepare lsn the apply worker is
+ * currently processing. Refer to the comment in the apply_handle_begin_prepare
+ * function for more details.
+ *
+ * When this happens the prepared messages are spooled into a "prepare
+ * spoolfile" (aka psf). The messages written to this file are all the prepared
+ * messages up to and including the LOGICAL_REP_MSG_PREPARE. All this psf
+ * content is then replayed later at commit time (apply_handle_commit_prepared),
+ * where the messages are all dispatched in the usual way.
+ *
+ * The psf files reside in the "pg_logical/twophase" directory and they are
+ * uniquely named. This is necessary because there may be multiple psf files
+ * co-existing, and so the correct psf must be re-discoverable (using subid and
+ * gid).
+ *
+ * Furthermore, to cope with a possible error between the end of spooling
+ * (in apply_handle_prepare) and the commit (in apply_handle_commit_prepared) a
+ * psf file must be able to survive a PG restart. So we cannot utilize the
+ * same (temporary file based) BufFile API that the streamed transactions use.
+ * Instead, the psf file handling uses the File API (PathNameOpenFile and
+ * friends). But this means the code now has to take responsibility for psf file
+ * cleanup. An HTAB is used to track if a particular psf file can or cannot be
+ * deleted, and a proc-exit handler is registered to take the appropriate
+ * action. Refer to function prepare_spoolfile_on_proc_exit.
+ *
+ * Upon restart, any uncommitted psf files are still present and so their commit
+ * can proceed as before.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +97,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -208,6 +247,59 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/* psf files will be written here. */
+#define PSF_DIR "pg_logical/twophase"
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and a
+ * flag indicating if it is OK to delete this psf at proc-exit time or not.
+ *
+ * Each PsfHashEntry is created at prepare and removed at commit/rollback.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	bool		delete_on_exit; /* ok to delete at proc-exit? */
+} PsfHashEntry;
+
+/*
+ * Information about the "current" psf spoolfile.
+ */
+typedef struct PsfFile
+{
+	char		name[MAXPGPATH];	/* psf name - same as the HTAB key. */
+	bool		is_spooling;	/* are we currently spooling to this file? */
+	File		vfd;			/* -1 when the file is closed. */
+	off_t		cur_offset;		/* offset for the next write or read. Reset to
+								 * 0 when file is opened. */
+} PsfFile;
+
+/*
+ * Hash table for tracking the prepared spoolfiles (keyed by psf name).
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * Information about the 'current' open spoolfile is only valid when spooling.
+ * This is flagged as 'is_spooling' only between begin_prepare and prepare.
+ */
+static PsfFile psf_cur = {.is_spooling = false,.vfd = -1,.cur_offset = 0};
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_delete(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+static void prepare_spoolfile_on_proc_exit(int status, Datum arg);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -720,6 +812,345 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("transaction identifier \"%s\" is already in use",
+						begin_data.gid)));
+
+	/*
+	 * By sad timing of apply/tablesync workers it is possible to have a
+	 * "consistent snapshot" that spans prepare/commit in such a way that
+	 * the tablesync did not do the prepare (because snapshot not consistent)
+	 * and the apply worker does the begin prepare ('b') but it skips all
+	 * the prepared operations [e.g. inserts] while the tablesync was still
+	 * busy (see the condition of should_apply_changes_for_rel).
+	 *
+	 * This can lead to an "empty prepare", because later when the apply
+	 * worker does the commit prepare ('K'), there is nothing in it (the
+	 * inserts were skipped earlier).
+	 *
+	 * We avoid this using the 2 part logic: (1) Wait for all tablesync
+	 * workers to reach SYNCDONE/READY state; (2) If the begin_prepare lsn is
+	 * now behind any tablesync lsn then spool the prepared messages to a file
+	 * to be replayed later at commit_prepared time.
+	 *
+	 * -----
+	 *
+	 * XXX - The 2PC protocol needs the publisher to be aware when the PREPARE
+	 * has been successfully acted on. But because of this "empty prepare"
+	 * case now the prepared messages may be spooled to a file and, when that
+	 * happens the PREPARE would not happen at the usual time, but would be
+	 * deferred until COMMIT PREPARED time. This quirk could only happen
+	 * immediately after the initial table synchronization phase; once all
+	 * tables have achieved READY state the 2PC protocol will behave normally.
+	 *
+	 * A future release may be able to detect when all tables are READY and
+	 * set a flag to indicate this subscription/slot is ready for two_phase
+	 * decoding. Then at the publisher-side, we could enable wait-for-prepares
+	 * only when all the slots of WALSender have that flag set.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			CHECK_FOR_INTERRUPTS();
+
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN then all messages (until the
+		 * prepare) will be saved to a spoolfile for replay later at
+		 * commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+
+			/*
+			 * Create the spoolfile.
+			 */
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * From now until apply_handle_prepare, we are spooling to the
+			 * current psf.
+			 */
+			psf_cur.is_spooling = true;
+
+			pgstat_report_activity(STATE_RUNNING, NULL);
+			return;
+		}
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_cur.is_spooling)
+	{
+		PsfHashEntry *hentry;
+
+		Assert(!in_remote_transaction);
+
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * Flush the spoolfile, so changes can survive a restart.
+		 *
+		 * If the publisher resends the same data again after a restart (e.g.
+		 * if subscriber origin has not moved past this prepare), then the
+		 * same named psf file will be overwritten with the same data. See
+		 * prepare_spoolfile_create.
+		 */
+		FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);
+
+		/* We are finished spooling to the current psf. */
+		psf_cur.is_spooling = false;
+
+		/*
+		 * The commit_prepare will need the spoolfile, so unregister it for
+		 * removal on proc-exit just in case there is an unexpected restart
+		 * between now and when commit_prepared happens.
+		 */
+		hentry = (PsfHashEntry *) hash_search(psf_hash, psf_cur.name,
+											  HASH_FIND, NULL);
+		Assert(hentry);
+		hentry->delete_on_exit = false;
+
+		/*
+		 * The psf_cur.vfd is meaningful only between begin_prepare and
+		 * prepare. So close it now. Any messages written to the psf will be
+		 * applied later during handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		in_remote_transaction = false;
+		return;
+	}
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	/*
+	 * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+	 * was dispatched via the psf spoolfile replay then the remote_final_lsn
+	 * is set to commit lsn instead. Hence the <= instead of == check below.
+	 */
+	Assert(prepare_data.prepare_lsn <= remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+
+		/*
+		 * Replay/dispatch the spooled messages.
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		/* After replaying the psf it is no longer needed. Just delete it. */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		/* We are finished with this spoolfile. Delete it. */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
+	 */
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -732,6 +1163,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		!psf_cur.is_spooling &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1092,6 +1524,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_RELATION, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_RELATION, s))
 		return;
 
@@ -1113,6 +1548,9 @@ apply_handle_type(StringInfo s)
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TYPE, s))
 		return;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TYPE, s))
+		return;
+
 	logicalrep_read_typ(s, &typ);
 	logicalrep_typmap_update(&typ);
 }
@@ -1150,6 +1588,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1271,6 +1712,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1429,6 +1873,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1798,6 +2245,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1954,6 +2404,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2012,7 +2484,9 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
 		}
 	}
 
-	*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+	/* consider entries in prepare spool file as not flushed */
+	*have_pending_txes = (!dlist_is_empty(&lsn_mapping) ||
+						  (psf_hash && hash_get_num_entries(psf_hash)));
 }
 
 /*
@@ -2061,6 +2535,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	TimeLineID	tli;
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL		hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2180,7 +2671,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && !psf_cur.is_spooling)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2927,6 +3418,9 @@ ApplyWorkerMain(Datum main_arg)
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
 
+	/* Arrange to delete any unwanted psf file(s) at proc-exit */
+	on_proc_exit(prepare_spoolfile_on_proc_exit, 0);
+
 	/* Setup signal handling */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
@@ -3103,3 +3597,317 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_cur.is_spooling ? "Do" : "Don't");
+
+	if (!psf_cur.is_spooling)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(!psf_cur.is_spooling);
+
+	/* Make sure the PSF_DIR subdirectory exists. */
+	if (MakePGDirectory(PSF_DIR) < 0 && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						PSF_DIR)));
+
+	/*
+	 * Open the file, truncating any existing content, because we always
+	 * want to create/overwrite this file.
+	 */
+	psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+	if (psf_cur.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m", path)));
+
+	/* Create/Find the spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash, path,
+										  HASH_ENTER, NULL);
+	Assert(hentry);
+	memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+	psf_cur.cur_offset = 0;
+	hentry->delete_on_exit = true;
+
+	/* Sanity checks */
+	Assert(psf_cur.vfd >= 0);
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close(void)
+{
+	if (psf_cur.vfd >= 0)
+		FileClose(psf_cur.vfd);
+
+	/* Mark this fd as not valid to use anymore. */
+	psf_cur.is_spooling = false;
+	psf_cur.vfd = -1;
+	psf_cur.cur_offset = 0;
+}
+
+/*
+ * Delete the specified psf spoolfile, and any HTAB entry associated with it.
+ */
+static void
+prepare_spoolfile_delete(char *path)
+{
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Delete the file off the disk. */
+	unlink(path);
+
+	/* Remove any entry from the psf_hash, if present */
+	hash_search(psf_hash, path, HASH_REMOVE, NULL);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (the unread remainder of the message buffer).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+	int			bytes_written;
+
+	Assert(psf_cur.is_spooling);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	bytes_written = FileWrite(psf_cur.vfd, (char *) &len, sizeof(len),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(len));
+	psf_cur.cur_offset += bytes_written;
+
+	/* then the action */
+	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(action));
+	psf_cur.cur_offset += bytes_written;
+
+	/* and finally the remaining part of the buffer (from the cursor on) */
+	len = (s->len - s->cursor);
+
+	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == len);
+	psf_cur.cur_offset += bytes_written;
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	File		fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (fd >= 0)
+		FileClose(fd);
+
+	return fd >= 0;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	PsfFile		psf = {.is_spooling = false,.vfd = -1,.cur_offset = 0};
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	psf.vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (psf.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open prepared spoolfile \"%s\": %m",
+						path)));
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Make sure the apply_handle_* functions called via apply_dispatch are
+	 * aware we're in a remote transaction.
+	 */
+	remote_final_lsn = final_lsn;
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		nbytes = FileRead(psf.vfd, buffer, len,
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+		if (nbytes != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldctx2);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file '%s'",
+				 nchanges, path);
+	}
+
+	FileClose(psf.vfd);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB. Therefore, the name
+	 * and the key must be exactly the same length and padded with '\0' so
+	 * garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "%s/psf_%u-%s.changes", PSF_DIR, subid, gid);
+}
+
+/*
+ * proc_exit callback to remove unwanted psf files.
+ */
+static void
+prepare_spoolfile_on_proc_exit(int status, Datum arg)
+{
+	HASH_SEQ_STATUS seq_status;
+	PsfHashEntry *hentry;
+
+	/* Iterate the HTAB looking for what files can be deleted. */
+	if (psf_hash)
+	{
+		hash_seq_init(&seq_status, psf_hash);
+		while ((hentry = (PsfHashEntry *) hash_seq_search(&seq_status)) != NULL)
+		{
+			char	   *path = hentry->name;
+
+			if (hentry->delete_on_exit)
+				prepare_spoolfile_delete(path);
+		}
+	}
+}
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..ede252b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -322,8 +342,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +362,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +383,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +843,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1250,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..232af01 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure holds the
+ * information for both PREPARE and COMMIT PREPARED messages. For
+ * COMMIT PREPARED, prepare_lsn and preparetime store the commit LSN and
+ * the commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. prepare_end_lsn and
+ * preparetime are used to check whether the downstream has received this
+ * prepared transaction. If it has, the downstream can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction
+ * with the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..95d78e9 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 574a8a9..544b352 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1342,12 +1342,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
@@ -1956,6 +1959,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
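
A note on the LogicalRepRollbackPreparedTxnData comment in the patch above: it says the gid alone is not enough to decide whether a ROLLBACK PREPARED applies to a locally prepared transaction. The following is only a minimal standalone sketch of that matching rule, not the LookupGXact() implementation. SketchPreparedXact, SKETCH_GIDSIZE and sketch_prepared_xact_matches are hypothetical names, and the integer types merely stand in for XLogRecPtr and TimestampTz.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumed size, stands in for GIDSIZE */

/* Hypothetical stand-in for a prepared transaction known to the subscriber. */
typedef struct SketchPreparedXact
{
	char		gid[SKETCH_GIDSIZE];
	uint64_t	prepare_end_lsn;	/* stands in for XLogRecPtr */
	int64_t		prepare_time;		/* stands in for TimestampTz */
} SketchPreparedXact;

/*
 * Return true only when the gid AND the prepare position/time all match.
 * A gid-only comparison could match an unrelated local transaction that
 * happens to reuse the same identifier.
 */
static bool
sketch_prepared_xact_matches(const SketchPreparedXact *local,
							 const char *gid,
							 uint64_t prepare_end_lsn,
							 int64_t prepare_time)
{
	return strcmp(local->gid, gid) == 0 &&
		local->prepare_end_lsn == prepare_end_lsn &&
		local->prepare_time == prepare_time;
}
```

With this rule, a rollback message whose gid matches but whose prepare_end_lsn differs is treated as "not received here", which corresponds to the skip case described in the comment.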

Attachment: v57-0004-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From 428d8583cb4618ab7d00fdc20298e9e20b9fe0ee Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 11 Mar 2021 12:33:49 +1100
Subject: [PATCH v57] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for
debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c | 29 +++++++++++---
 src/backend/replication/logical/worker.c    | 62 +++++++++++++++++++++++------
 2 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 3c2c9fc..4491432 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress(void)
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress(void)
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress(void)
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress(void)
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 89e2c31..a7b5220 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -871,14 +871,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			CHECK_FOR_INTERRUPTS();
 
 			process_syncing_tables(begin_data.final_lsn);
@@ -898,7 +900,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 
@@ -943,6 +950,8 @@ apply_handle_prepare(StringInfo s)
 	{
 		PsfHashEntry *hentry;
 
+		elog(LOG, "!!> apply_handle_prepare: SPOOLING");
+
 		Assert(!in_remote_transaction);
 
 		/* Write the PREPARE info to the psf file. */
@@ -966,6 +975,8 @@ apply_handle_prepare(StringInfo s)
 		 * removal on proc-exit just in case there is an unexpected restart
 		 * between now and when commit_prepared happens.
 		 */
+		elog(LOG,
+			"!!> apply_handle_prepare: Make sure the spoolfile is not removed on proc-exit");
 		hentry = (PsfHashEntry *) hash_search(psf_hash, psf_cur.name,
 											  HASH_FIND, NULL);
 		Assert(hentry);
@@ -1050,6 +1061,8 @@ apply_handle_commit_prepared(StringInfo s)
 	{
 		int			nchanges;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * Replay/dispatch the spooled messages.
 		 */
@@ -1057,8 +1070,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		/* After replaying the psf it is no longer needed. Just delete it. */
@@ -1122,6 +1135,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2406,18 +2420,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3613,8 +3631,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_cur.is_spooling ? "Do" : "Don't");
 
@@ -3637,7 +3655,7 @@ prepare_spoolfile_create(char *path)
 {
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(!psf_cur.is_spooling);
 
@@ -3677,6 +3695,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_cur.vfd >= 0)
 		FileClose(psf_cur.vfd);
 
@@ -3692,6 +3711,8 @@ prepare_spoolfile_close()
 static void
 prepare_spoolfile_delete(char *path)
 {
+	elog(LOG, "!!> prepare_spoolfile_delete: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3717,18 +3738,20 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_cur.is_spooling);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, (char *) &len, sizeof(len),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(len));
 	psf_cur.cur_offset += bytes_written;
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(action));
@@ -3737,6 +3760,7 @@ prepare_spoolfile_write(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == len);
@@ -3751,6 +3775,11 @@ prepare_spoolfile_exists(char *path)
 {
 	File		fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 fd >= 0 ? "found" : "not found");
+
 	if (fd >= 0)
 		FileClose(fd);
 
@@ -3770,8 +3799,8 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 				oldctx2;
 	PsfFile		psf = {.is_spooling = false,.vfd = -1,.cur_offset = 0};
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3816,6 +3845,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3834,6 +3864,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		nbytes = FileRead(psf.vfd, buffer, len,
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
@@ -3850,7 +3881,9 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		/* Ensure we are reading the data into our memory context. */
 		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
 
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: Before dispatch");
 		apply_dispatch(&s2);
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: After dispatch");
 
 		MemoryContextReset(ApplyMessageContext);
 
@@ -3859,13 +3892,13 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(LOG, "!!> replayed %d changes from file '%s'",
 				 nchanges, path);
 	}
 
 	FileClose(psf.vfd);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
@@ -3900,6 +3933,8 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 	HASH_SEQ_STATUS seq_status;
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_on_proc_exit");
+
 	/* Iterate the HTAB looking for what files can be deleted. */
 	if (psf_hash)
 	{
@@ -3908,6 +3943,7 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 		{
 			char	   *path = hentry->name;
 
+			elog(LOG, "!!> prepare_spoolfile_on_proc_exit: found '%s'", path);
 			if (hentry->delete_on_exit)
 				prepare_spoolfile_delete(path);
 		}
-- 
1.8.3.1
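
For anyone reading the dev logs above: the "A/B/C" messages in prepare_spoolfile_write correspond to a length-prefixed record, that is, first the size (payload plus one byte for the action), then the action character, then the remaining message bytes. Below is a rough standalone sketch of that framing using plain stdio instead of the File/FileWrite/FileRead API. The names sketch_spool_write and sketch_spool_read are hypothetical, and unlike the patch the reader here returns the action byte separately from the payload.

```c
#include <stdio.h>

/* Append one record: [int len][char action][payload], len = payload + 1. */
static int
sketch_spool_write(FILE *f, char action, const char *payload, int payload_len)
{
	/* total on-disk size, including the action type character */
	int			len = payload_len + (int) sizeof(char);

	if (fwrite(&len, sizeof(len), 1, f) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, f) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, f) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Read one record. Returns the payload length, or -1 at end of file, on a
 * short read, or when the payload does not fit into buf.
 */
static int
sketch_spool_read(FILE *f, char *action, char *buf, int bufsize)
{
	int			len;

	if (fread(&len, sizeof(len), 1, f) != 1)
		return -1;				/* end of file, or truncated length */
	if (len < 1 || len - 1 > bufsize)
		return -1;
	if (fread(action, sizeof(*action), 1, f) != 1)
		return -1;
	if (len > 1 && fread(buf, 1, (size_t) (len - 1), f) != (size_t) (len - 1))
		return -1;
	return len - 1;
}
```

Writing the length first is what lets the replay loop detect end of file cleanly: a read of the length that returns nothing means there are no more records.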

Attachment: v57-0003-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From e87bd7eb0aeb5019b9a7f5978ef48c0d46c0f4c2 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 11 Mar 2021 12:26:11 +1100
Subject: [PATCH v57] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
 doc/src/sgml/ref/create_subscription.sgml          | 27 +++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 72 +++++++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 +
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 ++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 +++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 ++-
 src/bin/psql/tab-complete.c                        |  2 +-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 +
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 93 +++++++++++++++-------
 src/test/regress/sql/subscription.sql              | 25 ++++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 18 files changed, 260 insertions(+), 48 deletions(-)

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..eeb7e35 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,33 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..b77378d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index fa23bcb..19fa87d 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option; doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * the current implementation has some issues that could cause a
+	 * streaming prepared transaction to be incorrectly missed in the initial
+	 * syncing phase. Hence, this combination is disabled until that issue can
+	 * be addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +576,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +883,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +918,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if (sub->twophase && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +947,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +993,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1039,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 8bf273c..89e2c31 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2930,6 +2930,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3579,6 +3580,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ede252b..2b9e7b8 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default,
+		 * in which case we just update the flag in the decoding context.
+		 * Otherwise, we only allow it with a sufficient version of the protocol,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..96c878b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two_phase are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 9f0208a..34c70a1 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2763,7 +2763,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 232af01..a5bb4de 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..67b3358 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index b7a07be..ba60998 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -45,7 +45,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 92bb655..570f62d 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -52,7 +52,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -62,7 +63,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1

#247vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#246)

On Thu, Mar 11, 2021 at 7:20 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Mar 11, 2021 at 12:46 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v57*

Differences from v56* are:

* Rebased to HEAD @ today

* Addresses the following feedback issues:

(24) [vc-0305] Done. Ran pgindent for all patch 0001 source files.

(49) [ak-0308] Fixed. In apply_handle_begin_prepare, don't set
in_remote_transaction if psf spooling

(50) [ak-0308] Fixed. In apply_handle_prepare, assert
!in_remote_transaction if psf spooling.

(52) [vc-0309] Done. Patch 0002. Simplify the way test 020 creates the
publication.

(53) [vc-0309] Done. Patch 0002. Simplify the way test 022 creates the
publication.

-----
[vc-0305] /messages/by-id/CALDaNm1rRG2EUus+mFrqRzEshZwJZtxja0rn_n3qXGAygODfOA@mail.gmail.com
[vc-0309] /messages/by-id/CALDaNm0QuncAis5OqtjzOxAPTZRn545JLqfjFEJwyRjUH-XvEw@mail.gmail.com
[ak-0308] /messages/by-id/CAA4eK1+oSUU77T92FueDJWsp=FjTroNaNC-K45Dgdr7f18aBFA@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Oops. I posted the wrong patch set in my previous email.

Here are the correct ones for v57*.

Thanks for the updated patch. A few comments:
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
                                                   char **synchronous_commit,
                                                   bool *refresh,
                                                   bool *binary_given, bool *binary,
-                                                  bool *streaming_given, bool *streaming)
+                                                  bool *streaming_given, bool *streaming,
+                                                  bool *twophase_given, bool *twophase)

I feel twophase_given can be a local variable; it need not be added as
a function parameter because it is not used outside the function.

The corresponding changes can be done here too:
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
        bool            copy_data;
        bool            streaming;
        bool            streaming_given;
+       bool            twophase;
+       bool            twophase_given;
        char       *synchronous_commit;
        char       *conninfo;
        char       *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
                                                   &synchronous_commit,
                                                   NULL,       /* no "refresh" */
                                                   &binary_given, &binary,
-                                                  &streaming_given, &streaming);
+                                                  &streaming_given, &streaming,
+                                                  &twophase_given, &twophase);

--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2930,6 +2930,7 @@ maybe_reread_subscription(void)
                strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
                newsub->binary != MySubscription->binary ||
                newsub->stream != MySubscription->stream ||
+               newsub->twophase != MySubscription->twophase ||
                !equal(newsub->publications, MySubscription->publications))

I think this condition is not possible. Should this be an assert instead?
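
To illustrate the suggestion, here is a minimal standalone sketch; it is not the patch's code. It assumes, as the regression tests in this patch set indicate ("cannot alter two_phase option"), that two_phase cannot change after the subscription is created, so a differing value would point to a bug rather than a reason to restart the worker. DemoSubscription and demo_subscription_changed are hypothetical names; the real code compares Subscription structs in maybe_reread_subscription and would use Assert() rather than assert().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for the fields compared in the hunk above. */
typedef struct DemoSubscription
{
	bool		binary;
	bool		stream;
	bool		twophase;
} DemoSubscription;

/*
 * Return true if a re-read subscription differs in a way that would
 * require the apply worker to restart.
 */
static bool
demo_subscription_changed(const DemoSubscription *newsub,
						  const DemoSubscription *cursub)
{
	/* two_phase is assumed immutable after creation: check it, don't compare. */
	assert(newsub->twophase == cursub->twophase);

	return newsub->binary != cursub->binary ||
		newsub->stream != cursub->stream;
}
```

With this shape, a changed two_phase value trips the assertion in debug builds instead of silently triggering a worker restart.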

@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,

                        *enable_streaming = defGetBoolean(defel);
                }
+               else if (strcmp(defel->defname, "two_phase") == 0)
+               {
+                       if (twophase_given)
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_SYNTAX_ERROR),
+                                                errmsg("conflicting or redundant options")));
+                       twophase_given = true;
+
+                       *enable_twophase = defGetBoolean(defel);
+               }

We have the following check in parse_subscription_options:
        if (twophase && *twophase_given && *twophase)
        {
                if (streaming && *streaming_given && *streaming)
                        ereport(ERROR,
                                        (errcode(ERRCODE_SYNTAX_ERROR),
                                         errmsg("%s and %s are mutually exclusive options",
                                                        "two_phase = true", "streaming = true")));
        }
Should we have a similar check in parse_output_parameters?
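
As a rough illustration of what such a check could look like on the output-plugin side, here is a minimal standalone sketch. It is not PostgreSQL code: DemoOption, DemoParseResult and demo_parse_output_parameters are hypothetical stand-ins for the DefElem list, ereport(ERROR, ...) and parse_output_parameters. It also keeps the "given" flags as local variables, in line with the first comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one parsed boolean option (not a DefElem). */
typedef struct DemoOption
{
	const char *name;
	bool		value;
} DemoOption;

/* Hypothetical result codes; the real code would ereport(ERROR, ...). */
typedef enum DemoParseResult
{
	DEMO_OK,
	DEMO_REDUNDANT_OPTION,
	DEMO_MUTUALLY_EXCLUSIVE
} DemoParseResult;

static DemoParseResult
demo_parse_output_parameters(const DemoOption *opts, size_t nopts,
							 bool *enable_streaming, bool *enable_twophase)
{
	/* The "given" flags are only needed here, so they stay local. */
	bool		streaming_given = false;
	bool		twophase_given = false;
	size_t		i;

	assert(opts != NULL || nopts == 0);
	*enable_streaming = false;
	*enable_twophase = false;

	for (i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "streaming") == 0)
		{
			if (streaming_given)
				return DEMO_REDUNDANT_OPTION;
			streaming_given = true;
			*enable_streaming = opts[i].value;
		}
		else if (strcmp(opts[i].name, "two_phase") == 0)
		{
			if (twophase_given)
				return DEMO_REDUNDANT_OPTION;
			twophase_given = true;
			*enable_twophase = opts[i].value;
		}
		/* Other option names are ignored in this sketch. */
	}

	/* Reject the combination once all options have been seen. */
	if (*enable_twophase && *enable_streaming)
		return DEMO_MUTUALLY_EXCLUSIVE;

	return DEMO_OK;
}
```

Doing the cross-option check after the loop means the result does not depend on the order in which the two options appear.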

Regards,
Vignesh

#248Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#247)
4 attachment(s)

Please find attached the latest patch set v58*

Differences from v57* are:

* Rebased to HEAD @ today

* Addresses the following feedback issues:

(15) [ak-0301] Done. DROP SUBSCRIPTION cleans up any psf files related
to the subscription

* Bugs fixed:

- the psf proc-exit handler is now only registered for apply workers

- apply_handle_type was missing a call to prepare_spoolfile_handler

-----
[ak-0301] /messages/by-id/CAA4eK1J=i16+DVpdkBjzgWQazYwVdcMJWQF0RAeCgLkCxm40=A@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v58-0004-Fix-apply-worker-empty-prepare-dev-logs.patch (application/octet-stream)
From db289a85df0617aae09b2d03cb330e2b814b1bc2 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 12 Mar 2021 18:04:31 +1100
Subject: [PATCH v58] Fix apply worker empty prepare (dev logs).

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for
debugging/testing the "Fix apply worker empty prepare" patch.
---
 src/backend/replication/logical/tablesync.c |  29 ++++++--
 src/backend/replication/logical/worker.c    | 106 ++++++++++++++++++++++++----
 2 files changed, 116 insertions(+), 19 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 3c2c9fc..4491432 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -282,6 +282,12 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
 	SpinLockAcquire(&MyLogicalRepWorker->relmutex);
 
+	elog(LOG,
+		 "!!> process_syncing_tables_for_sync: state = '%c', current_lsn = %X/%X, relstate_lsn = %X/%X",
+		 MyLogicalRepWorker->relstate,
+		 LSN_FORMAT_ARGS(current_lsn),
+		 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
 	if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
 		current_lsn >= MyLogicalRepWorker->relstate_lsn)
 	{
@@ -1127,6 +1133,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
 		table_states_all = NIL;
@@ -1149,6 +1157,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1156,12 +1165,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1175,6 +1189,8 @@ AnyTablesyncInProgress(void)
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncInProgress?");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1186,8 +1202,8 @@ AnyTablesyncInProgress(void)
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
-		elog(DEBUG1,
-			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+		elog(LOG,
+			 "!!> AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
 			 count,
 			 rstate->relid,
 			 rstate->state);
@@ -1204,6 +1220,7 @@ AnyTablesyncInProgress(void)
 		if (rstate->state != SUBREL_STATE_SYNCDONE &&
 			rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncInProgress?: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1215,8 +1232,8 @@ AnyTablesyncInProgress(void)
 		pgstat_report_stat(false);
 	}
 
-	elog(DEBUG1,
-		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+	elog(LOG,
+		 "!!> AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
 		 count,
 		 found_busy ? "true" : "false");
 
@@ -1242,8 +1259,8 @@ BiggestTablesyncLSN()
 			biggest_lsn = rstate->lsn;
 	}
 
-	elog(DEBUG1,
-		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+	elog(LOG,
+		 "!!> BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
 		 count,
 		 LSN_FORMAT_ARGS(biggest_lsn));
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6340a0f..cc5cc88 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -831,6 +831,38 @@ apply_handle_begin_prepare(StringInfo s)
 				 errmsg("transaction identifier \"%s\" is already in use",
 						begin_data.gid)));
 
+#if 0
+	if (!am_tablesync_worker())
+	{
+		/* This is hacky test code for ability to discover/delete all psf files. */
+
+		char path[MAXPGPATH];
+
+		/* Make some test files */
+		prepare_spoolfile_name(path, sizeof(path), 123, "test1"); prepare_spoolfile_create(path); prepare_spoolfile_close();
+		prepare_spoolfile_name(path, sizeof(path), 123, "test2"); prepare_spoolfile_create(path); prepare_spoolfile_close();
+		prepare_spoolfile_name(path, sizeof(path), 123, "test3"); prepare_spoolfile_create(path); prepare_spoolfile_close();
+		prepare_spoolfile_name(path, sizeof(path), 999, "testA"); prepare_spoolfile_create(path); prepare_spoolfile_close();
+		prepare_spoolfile_name(path, sizeof(path), 999, "testB"); prepare_spoolfile_create(path); prepare_spoolfile_close();
+		prepare_spoolfile_name(path, sizeof(path), 999, "testC"); prepare_spoolfile_create(path); prepare_spoolfile_close();
+
+		// Lists
+		prepare_spoolfiles(InvalidOid, false);	// Everything!
+		prepare_spoolfiles(123, false); 		// Only those with subscription 123
+		prepare_spoolfiles(999, false); 		// Only those with subscription 999
+		prepare_spoolfiles(666, false); 		// This subscription does not exist
+
+		// Deletes
+		prepare_spoolfiles(123, true); 			// Delete all with subscription 123
+		prepare_spoolfiles(InvalidOid, false); 	// List everything again
+		//prepare_spoolfiles(InvalidOid, true); 	// Delete all
+		//prepare_spoolfiles(InvalidOid, false); 	// List everything again
+		prepare_spoolfiles(666, true); 			// Delete non-existing
+		prepare_spoolfiles(999, true); 			// Delete all with subscription 999
+		prepare_spoolfiles(InvalidOid, false); 	// List everything again
+	}
+#endif
+
 	/*
 	 * By sad timing of apply/tablesync workers it is possible to have a
 	 * “consistent snapshot” that spans prepare/commit in such a way that
@@ -871,14 +903,16 @@ apply_handle_begin_prepare(StringInfo s)
 		 * Make sure every tablesync has reached at least SYNCDONE state
 		 * before letting the apply worker proceed.
 		 */
-		elog(DEBUG1,
-			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
+		elog(LOG,
+			 "!!> apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, lstate_lsn = %X/%X",
 			 LSN_FORMAT_ARGS(begin_data.end_lsn),
 			 LSN_FORMAT_ARGS(begin_data.final_lsn),
 			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
 
 		while (AnyTablesyncInProgress())
 		{
+			elog(LOG, "!!> apply_handle_begin_prepare - waiting for all sync workers to be DONE/READY");
+
 			CHECK_FOR_INTERRUPTS();
 
 			process_syncing_tables(begin_data.final_lsn);
@@ -898,7 +932,12 @@ apply_handle_begin_prepare(StringInfo s)
 		 * prepared) will be saved to a spoolfile for replay later at
 		 * commit_prepared time.
 		 */
-		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		if (begin_data.final_lsn < BiggestTablesyncLSN()
+#if 0
+			|| true				/* XXX - Add this line to force psf (for
+								 * easier debugging) */
+#endif
+			)
 		{
 			char		psfpath[MAXPGPATH];
 
@@ -943,6 +982,8 @@ apply_handle_prepare(StringInfo s)
 	{
 		PsfHashEntry *hentry;
 
+		elog(LOG, "!!> apply_handle_prepare: SPOOLING");
+
 		Assert(!in_remote_transaction);
 
 		/* Write the PREPARE info to the psf file. */
@@ -966,6 +1007,8 @@ apply_handle_prepare(StringInfo s)
 		 * removal on proc-exit just in case there is an unexpected restart
 		 * between now and when commit_prepared happens.
 		 */
+		elog(LOG,
+			"!!> apply_handle_prepare: Make sure the spoolfile is not removed on proc-exit");
 		hentry = (PsfHashEntry *) hash_search(psf_hash, psf_cur.name,
 											  HASH_FIND, NULL);
 		Assert(hentry);
@@ -1050,6 +1093,8 @@ apply_handle_commit_prepared(StringInfo s)
 	{
 		int			nchanges;
 
+		elog(LOG, "!!> apply_handle_commit_prepared: replaying the spooled messages");
+
 		/*
 		 * Replay/dispatch the spooled messages.
 		 */
@@ -1057,8 +1102,8 @@ apply_handle_commit_prepared(StringInfo s)
 		ensure_transaction();
 
 		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
-		elog(DEBUG1,
-			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+		elog(LOG,
+			 "!!> apply_handle_commit_prepared: replayed %d (all) changes.",
 			 nchanges);
 
 		/* After replaying the psf it is no longer needed. Just delete it. */
@@ -1122,6 +1167,7 @@ apply_handle_rollback_prepared(StringInfo s)
 	 * Prepare Spoolfile (using_psf) because in that case there is no matching
 	 * PrepareTransactionBlock done yet.
 	 */
+	elog(LOG, "!!> apply_handle_rollback_prepared: using_psf = %d", using_psf);
 	if (!using_psf &&
 		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
@@ -2406,18 +2452,22 @@ apply_dispatch(StringInfo s)
 			return;
 
 		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_begin_prepare ------");
 			apply_handle_begin_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_PREPARE:
+			elog(LOG, "!!> ------ apply_handle_prepare ------");
 			apply_handle_prepare(s);
 			return;
 
 		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_commit_prepared ------");
 			apply_handle_commit_prepared(s);
 			return;
 
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			elog(LOG, "!!> ------ apply_handle_rollback_prepared ------");
 			apply_handle_rollback_prepared(s);
 			return;
 
@@ -3614,8 +3664,8 @@ IsLogicalWorker(void)
 static bool
 prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
 {
-	elog(DEBUG1,
-		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+	elog(LOG,
+		 "!!> prepare_spoolfile_handler for action '%c'. %s write to spool file",
 		 action,
 		 psf_cur.is_spooling ? "Do" : "Don't");
 
@@ -3638,7 +3688,7 @@ prepare_spoolfile_create(char *path)
 {
 	PsfHashEntry *hentry;
 
-	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+	elog(LOG, "!!> creating file \"%s\" for prepare changes", path);
 
 	Assert(!psf_cur.is_spooling);
 
@@ -3678,6 +3728,7 @@ prepare_spoolfile_create(char *path)
 static void
 prepare_spoolfile_close()
 {
+	elog(LOG, "!!> prepare_spoolfile_close");
 	if (psf_cur.vfd >= 0)
 		FileClose(psf_cur.vfd);
 
@@ -3693,6 +3744,8 @@ prepare_spoolfile_close()
 static void
 prepare_spoolfile_delete(char *path)
 {
+	elog(LOG, "!!> prepare_spoolfile_delete: \"%s\"", path);
+
 	/* The current psf should be closed already, but make sure anyway. */
 	prepare_spoolfile_close();
 
@@ -3718,18 +3771,20 @@ prepare_spoolfile_write(char action, StringInfo s)
 
 	Assert(psf_cur.is_spooling);
 
-	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+	elog(LOG, "!!> prepare_spoolfile_write: writing action '%c'", action);
 
 	/* total on-disk size, including the action type character */
 	len = (s->len - s->cursor) + sizeof(char);
 
 	/* first write the size */
+	elog(LOG, "!!> prepare_spoolfile_write: A writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, (char *) &len, sizeof(len),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(len));
 	psf_cur.cur_offset += bytes_written;
 
 	/* then the action */
+	elog(LOG, "!!> prepare_spoolfile_write: B writing action = %c, %d bytes", action, (int)sizeof(action));
 	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == sizeof(action));
@@ -3738,6 +3793,7 @@ prepare_spoolfile_write(char action, StringInfo s)
 	/* and finally the remaining part of the buffer (after the XID) */
 	len = (s->len - s->cursor);
 
+	elog(LOG, "!!> prepare_spoolfile_write: C writing len bytes = %d", len);
 	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
 							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
 	Assert(bytes_written == len);
@@ -3752,6 +3808,11 @@ prepare_spoolfile_exists(char *path)
 {
 	File		fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
 
+	elog(LOG,
+		 "!!> prepare_spoolfile_exists: Prepared spoolfile \"%s\" was %s",
+		 path,
+		 fd >= 0 ? "found" : "not found");
+
 	if (fd >= 0)
 		FileClose(fd);
 
@@ -3771,8 +3832,8 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 				oldctx2;
 	PsfFile		psf = {.is_spooling = false,.vfd = -1,.cur_offset = 0};
 
-	elog(DEBUG1,
-		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+	elog(LOG,
+		 "!!> prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
 		 path);
 
 	/*
@@ -3817,6 +3878,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: nbytes = %d, len = %d", nbytes, len);
 
 		/* have we reached end of the file? */
 		if (nbytes == 0)
@@ -3835,6 +3897,7 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		buffer = repalloc(buffer, len);
 
 		/* and finally read the data into the buffer */
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: read %d bytes into buffer", len);
 		nbytes = FileRead(psf.vfd, buffer, len,
 						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
 		psf.cur_offset += nbytes;
@@ -3851,7 +3914,9 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		/* Ensure we are reading the data into our memory context. */
 		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
 
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: Before dispatch");
 		apply_dispatch(&s2);
+		elog(LOG, "!!> prepare_spoolfile_replay_messages: After dispatch");
 
 		MemoryContextReset(ApplyMessageContext);
 
@@ -3860,13 +3925,13 @@ prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
 		nchanges++;
 
 		if (nchanges % 1000 == 0)
-			elog(DEBUG1, "replayed %d changes from file '%s'",
+			elog(LOG, "!!> replayed %d changes from file '%s'",
 				 nchanges, path);
 	}
 
 	FileClose(psf.vfd);
 
-	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+	elog(LOG, "!!> replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
 	return nchanges;
@@ -3901,6 +3966,8 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 	HASH_SEQ_STATUS seq_status;
 	PsfHashEntry *hentry;
 
+	elog(LOG, "!!> prepare_spoolfile_on_proc_exit");
+
 	/* Iterate the HTAB looking for what files can be deleted. */
 	if (psf_hash)
 	{
@@ -3909,6 +3976,7 @@ prepare_spoolfile_on_proc_exit(int status, Datum arg)
 		{
 			char	   *path = hentry->name;
 
+			elog(LOG, "!!> prepare_spoolfile_proc_exit: found '%s'", path);
 			if (hentry->delete_on_exit)
 				prepare_spoolfile_delete(path);
 		}
@@ -3928,12 +3996,18 @@ prepare_spoolfiles(Oid subid, bool unlink_flag)
 	struct dirent *dent;
 	int count = 0;
 
+	elog(LOG,
+		 "!!> prepare_spoolfiles: subid = %u, unlink = %d",
+		 subid, unlink_flag);
+
 	dir = AllocateDir(PSF_DIR);
 	while ((dent = ReadDirExtended(dir, PSF_DIR, DEBUG1)) != NULL)
 	{
 		char		path[MAXPGPATH];
 		char		prefix[MAXPGPATH];
 
+		elog(LOG, "!!> d_name = \"%s\"", dent->d_name);
+
 		/* Only process files if they have matching subid prefix. */
 		if (OidIsValid(subid))
 			sprintf(prefix, "psf_%u_", subid);
@@ -3944,13 +4018,19 @@ prepare_spoolfiles(Oid subid, bool unlink_flag)
 			continue;
 
 		snprintf(path, MAXPGPATH, PSF_DIR "/%s", dent->d_name);
+		elog(LOG, "!!> match! \"%s\"", path);
 		count++;
 
 		if (unlink_flag)
+		{
 			unlink(path);
+			elog(LOG, "!!> removed: \"%s\"", path);
+		}
 	}
 
 	FreeDir(dir);
 
+	elog(LOG, "!!> returning count = %d", count);
+
 	return count;
 }
-- 
1.8.3.1

v58-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 331d639b2a9a1104936676ec9a83da322a2d14a5 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 12 Mar 2021 17:18:26 +1100
Subject: [PATCH v58] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 330 ++++++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl | 278 ++++++++++++++++++++
 2 files changed, 608 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..b7a07be
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,330 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..92bb655
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,278 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Wait for initial table syncs to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_B->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber B to synchronize data";
+$node_C->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber C to synchronize data";
+
+is(1, 1, "Cascaded setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
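
For readers who want to follow the cascade test above by hand, the round trip it checks (node_A -> node_B -> node_C) can be sketched in plain SQL. This is only an illustration assembled from the statements the TAP test itself issues; the table, GID and application names are the ones used in 022_twophase_cascade.pl, and the "expects" comments repeat the test's own assertions rather than output captured from a run.

```sql
-- On node_A (publisher): prepare a transaction but do not commit it yet.
BEGIN;
INSERT INTO tab_full VALUES (11);
PREPARE TRANSACTION 'test_prepared_tab_full';

-- Catch-up check, polled until it returns true.
-- Run on node_A for subscriber B; run on node_B with 'tap_sub_C' for subscriber C.
SELECT pg_current_wal_lsn() <= replay_lsn
  FROM pg_stat_replication
 WHERE application_name = 'tap_sub_B';

-- On node_B and node_C: the test expects a count of 1 (transaction is prepared).
SELECT count(*) FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab_full';

-- On node_A: finish the transaction, then repeat the catch-up check on each hop.
COMMIT PREPARED 'test_prepared_tab_full';

-- On node_B and node_C: the test expects 1 row and no remaining prepared xact.
SELECT count(*) FROM tab_full WHERE a = 11;
SELECT count(*) FROM pg_prepared_xacts WHERE gid = 'test_prepared_tab_full';
```

The ROLLBACK PREPARED case is the same sequence with the final statement on node_A swapped, after which the test expects both counts to be 0 on each subscriber.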

Attachment: v58-0003-Support-2PC-txn-Subscription-option.patch (application/octet-stream)
From d1ebc46491c785c340fdf37cd42706b29b4e479b Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 12 Mar 2021 17:49:20 +1100
Subject: [PATCH v58] Support 2PC txn - Subscription option.

This patch implements a new SUBSCRIPTION option "two_phase".

Usage: CREATE SUBSCRIPTION ... WITH (two_phase = on)

Default is off.
---
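
As a rough illustration of the usage line above, the option and its restrictions might be exercised as follows. The subscription name, publication name and connection string are hypothetical; the error texts are read off the errmsg() strings in this patch and are not output captured from a running server.

```sql
-- Hypothetical names and connection string, for illustration only.
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (two_phase = on);

-- two_phase and streaming cannot both be enabled at creation time:
CREATE SUBSCRIPTION mysub2
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (two_phase = on, streaming = on);
-- ERROR:  two_phase = true and streaming = true are mutually exclusive options

-- The option cannot be changed after creation:
ALTER SUBSCRIPTION mysub SET (two_phase = off);
-- ERROR:  cannot alter two_phase option

-- Streaming cannot be turned on for a two-phase enabled subscription:
ALTER SUBSCRIPTION mysub SET (streaming = on);
-- ERROR:  cannot set streaming = true for two-phase enabled subscription

-- The new catalog column is readable by ordinary users:
SELECT subname, subtwophase FROM pg_subscription;
```
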
 doc/src/sgml/ref/create_subscription.sgml          | 27 +++++++
 src/backend/catalog/pg_subscription.c              |  1 +
 src/backend/catalog/system_views.sql               |  2 +-
 src/backend/commands/subscriptioncmds.c            | 72 +++++++++++++++--
 .../libpqwalreceiver/libpqwalreceiver.c            |  4 +
 src/backend/replication/logical/worker.c           |  2 +
 src/backend/replication/pgoutput/pgoutput.c        | 36 ++++++++-
 src/bin/pg_dump/pg_dump.c                          | 16 +++-
 src/bin/pg_dump/pg_dump.h                          |  1 +
 src/bin/psql/describe.c                            | 10 ++-
 src/bin/psql/tab-complete.c                        |  2 +-
 src/include/catalog/pg_subscription.h              |  3 +
 src/include/replication/logicalproto.h             |  4 +
 src/include/replication/walreceiver.h              |  1 +
 src/test/regress/expected/subscription.out         | 93 +++++++++++++++-------
 src/test/regress/sql/subscription.sql              | 25 ++++++
 src/test/subscription/t/020_twophase.pl            |  3 +-
 src/test/subscription/t/022_twophase_cascade.pl    |  6 +-
 18 files changed, 260 insertions(+), 48 deletions(-)

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..eeb7e35 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,33 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..b77378d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index b5b4d57..ccd85e9 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option: doing so
+			 * could cause transactions to be missed and lead to an
+			 * inconsistent replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported
+	 * together, the current implementation has some issues that could lead to
+	 * a streaming prepared transaction being incorrectly missed in the initial
+	 * syncing phase. Hence, disallow this combination until that issue is
+	 * addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] = BoolGetDatum(twophase);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +576,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false, true,
+				walrcv_create_slot(wrconn, slotname, false, twophase,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +883,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +918,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if (sub->twophase && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +947,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +993,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -982,7 +1039,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9e822f9..1daa585 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -428,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 51ff8c5..6340a0f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2930,6 +2930,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3580,6 +3581,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
 
 	/* Start normal logical streaming replication. */
 	walrcv_startstreaming(wrconn, &options);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ede252b..2b9e7b8 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -178,13 +178,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -265,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -289,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -330,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default,
+		 * in which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient version of the protocol,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..bc033d2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4228,6 +4228,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4272,14 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBufferStr(query, " false AS subtwophase\n");
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4300,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4326,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4387,6 +4396,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, "f") != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..96c878b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two_phase are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index ecdb8d7..8f13e20 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2763,7 +2763,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..45d8a34 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -54,6 +54,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	bool		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +93,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	bool		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 232af01..a5bb4de 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -28,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index f55b07c..0ed8e9d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -179,6 +179,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..67b3358 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | f                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | f                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | t                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index b7a07be..ba60998 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -45,7 +45,8 @@ my $appname = 'tap_sub';
 $node_subscriber->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub
 	CONNECTION '$publisher_connstr application_name=$appname'
-	PUBLICATION tap_pub");
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
 
 # Wait for subscriber to finish initialization
 my $caughtup_query =
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 92bb655..570f62d 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -52,7 +52,8 @@ my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
 	CONNECTION '$node_A_connstr application_name=$appname_B'
-	PUBLICATION tap_pub_A");
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
 
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
@@ -62,7 +63,8 @@ my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
 	CONNECTION '$node_B_connstr application_name=$appname_C'
-	PUBLICATION tap_pub_B");
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
 
 # Wait for subscribers to finish initialization
 my $caughtup_query_B =
-- 
1.8.3.1
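
For readers skimming the pgoutput_startup() hunk above, the two_phase option check can be summarised as a standalone function. This is only a minimal sketch, not code from the patch: the names decide_twophase, TwoPhaseDecision and SKETCH_PROTO_2PC_VERSION are hypothetical, the real code raises ereport(ERROR, ...) instead of returning a value, and the constant simply mirrors LOGICALREP_PROTO_2PC_VERSION_NUM (2) as defined in this patch version.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for LOGICALREP_PROTO_2PC_VERSION_NUM (2 in this patch). */
#define SKETCH_PROTO_2PC_VERSION 2

/* Possible outcomes of the check; the patch reports the two error cases via ereport(). */
typedef enum
{
	TWOPHASE_DISABLED,			/* option not requested: flag is cleared */
	TWOPHASE_ENABLED,			/* option requested and prerequisites met */
	TWOPHASE_ERR_PROTO,			/* requested proto_version is too old */
	TWOPHASE_ERR_PLUGIN			/* decoding context does not support it */
} TwoPhaseDecision;

/*
 * Same order of tests as the hunk: first "not requested", then the protocol
 * version, then whether the decoding context (ctx->twophase) allows it.
 */
static TwoPhaseDecision
decide_twophase(bool requested, unsigned proto_version, bool ctx_twophase)
{
	if (!requested)
		return TWOPHASE_DISABLED;
	if (proto_version < SKETCH_PROTO_2PC_VERSION)
		return TWOPHASE_ERR_PROTO;
	if (!ctx_twophase)
		return TWOPHASE_ERR_PLUGIN;
	return TWOPHASE_ENABLED;
}

/* Small usage example covering each branch once. */
void
sketch_self_check(void)
{
	assert(decide_twophase(false, 1, true) == TWOPHASE_DISABLED);
	assert(decide_twophase(true, 1, true) == TWOPHASE_ERR_PROTO);
	assert(decide_twophase(true, 2, false) == TWOPHASE_ERR_PLUGIN);
	assert(decide_twophase(true, 2, true) == TWOPHASE_ENABLED);
}
```

Note that the protocol version is tested before plugin support, so a subscriber asking for two_phase with proto_version=1 gets the protocol error even when the plugin could not have supported it anyway.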

Attachment: v58-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 0f7e7c0c9e2171270ca2d6ccc87f2c8ec3183494 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 12 Mar 2021 17:06:36 +1100
Subject: [PATCH v58] Add support for apply at prepare time to built-in logical
   replication.

To support streaming transactions at prepare time in built-in logical
replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle two-phase
transactions by replaying them on prepare.

* The prepare API for streaming transactions is not supported.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase and,
moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.

* This patch also adds a new option to enable two_phase while creating a slot.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/protocol.sgml                         |  14 +-
 src/backend/access/transam/twophase.c              |  68 ++
 src/backend/commands/subscriptioncmds.c            |  13 +-
 .../libpqwalreceiver/libpqwalreceiver.c            |   6 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 206 +++++
 src/backend/replication/logical/tablesync.c        | 180 ++++-
 src/backend/replication/logical/worker.c           | 853 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 157 +++-
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/replication/logicalproto.h             |  69 +-
 src/include/replication/reorderbuffer.h            |  12 +
 src/include/replication/walreceiver.h              |   5 +-
 src/include/replication/worker_internal.h          |   4 +
 src/tools/pgindent/typedefs.list                   |   5 +
 18 files changed, 1531 insertions(+), 89 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..9694713 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,18 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase transactions.
+         Two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED
+         are also decoded and transmitted. In two-phase transactions, the transaction is
+         decoded and transmitted at PREPARE TRANSACTION time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..c58c46d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, LSN and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check whether the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only the
+ * GID is not sufficient because a different prepared xact with the same GID
+ * can exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so for simplicity we perform all I/O while
+			 * holding TwoPhaseStateLock.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..b5b4d57 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -528,7 +528,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				walrcv_create_slot(wrconn, slotname, false, true,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -648,7 +648,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
 										InvalidXLogRecPtr);
 				ereport(DEBUG1,
 						(errmsg_internal("table \"%s.%s\" added to subscription \"%s\"",
-								rv->schemaname, rv->relname, sub->name)));
+										 rv->schemaname, rv->relname, sub->name)));
 			}
 		}
 
@@ -722,9 +722,9 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
 
 				ereport(DEBUG1,
 						(errmsg_internal("table \"%s.%s\" removed from subscription \"%s\"",
-								get_namespace_name(get_rel_namespace(relid)),
-								get_rel_name(relid),
-								sub->name)));
+										 get_namespace_name(get_rel_namespace(relid)),
+										 get_rel_name(relid),
+										 sub->name)));
 			}
 		}
 
@@ -1191,6 +1191,9 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 	snprintf(originname, sizeof(originname), "pg_%u", subid);
 	replorigin_drop_by_name(originname, true, false);
 
+	/* Remove any psf files belonging to this subscription. */
+	prepare_spoolfiles(subid, true);
+
 	/*
 	 * If there is no slot associated with the subscription, we can finish
 	 * here.
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..9e822f9 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -827,7 +828,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +842,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..488b2a2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,212 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
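
To make the wire layout of the new BEGIN PREPARE message easier to review, here is a minimal standalone sketch of what logicalrep_write_begin_prepare and logicalrep_read_begin_prepare in proto.c above put on and take off the stream. It is not part of the patch and does not use the pq_* API: the Sketch* names, put_be/get_be and SKETCH_GIDSIZE are made up for illustration, and the 'b' type byte is taken from the comment in apply_handle_begin_prepare. Unlike the patch, the reader here checks the type byte itself (the real one is called after dispatch has consumed it) and bounds the gid copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumed size of the pre-allocated gid buffer */

/* Hypothetical stand-in for LogicalRepBeginPrepareData. */
typedef struct SketchBeginPrepare
{
	uint64_t	final_lsn;
	uint64_t	end_lsn;
	int64_t		committime;
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} SketchBeginPrepare;

/* Append an nbytes-wide integer in network (big-endian) byte order. */
static size_t
put_be(uint8_t *buf, size_t off, uint64_t v, int nbytes)
{
	for (int i = nbytes - 1; i >= 0; i--)
		buf[off++] = (uint8_t) (v >> (8 * i));
	return off;
}

/* Read an nbytes-wide big-endian integer and advance *off. */
static uint64_t
get_be(const uint8_t *buf, size_t *off, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[(*off)++];
	return v;
}

/* Layout: 'b', final_lsn, end_lsn, commit time, xid, NUL-terminated gid. */
static size_t
sketch_write_begin_prepare(uint8_t *buf, const SketchBeginPrepare *m)
{
	size_t		off = 0;
	size_t		glen = strlen(m->gid) + 1;

	buf[off++] = 'b';
	off = put_be(buf, off, m->final_lsn, 8);
	off = put_be(buf, off, m->end_lsn, 8);
	off = put_be(buf, off, (uint64_t) m->committime, 8);
	off = put_be(buf, off, m->xid, 4);
	memcpy(buf + off, m->gid, glen);
	return off + glen;
}

/* Returns 0 on success, -1 if the message is malformed. */
static int
sketch_read_begin_prepare(const uint8_t *buf, size_t len, SketchBeginPrepare *m)
{
	size_t		off = 1;
	size_t		glen;
	const uint8_t *nul;

	if (len < 1 + 8 + 8 + 8 + 4 + 1 || buf[0] != 'b')
		return -1;
	m->final_lsn = get_be(buf, &off, 8);
	m->end_lsn = get_be(buf, &off, 8);
	m->committime = (int64_t) get_be(buf, &off, 8);
	m->xid = (uint32_t) get_be(buf, &off, 4);
	if (m->final_lsn == 0 || m->end_lsn == 0)	/* 0 plays the role of InvalidXLogRecPtr */
		return -1;

	/* Bounded gid copy: reject a missing terminator or an oversized gid. */
	nul = memchr(buf + off, '\0', len - off);
	if (nul == NULL)
		return -1;
	glen = (size_t) (nul - (buf + off));
	if (glen >= SKETCH_GIDSIZE)
		return -1;
	memcpy(m->gid, buf + off, glen + 1);
	return 0;
}
```

A bounded copy along these lines might be worth considering in place of the strcpy calls in the read functions, assuming the gid buffers have a fixed size.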
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..3c2c9fc 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -116,6 +116,9 @@
 #include "utils/snapmgr.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +362,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +369,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,7 +390,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
@@ -425,7 +399,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1052,7 +1026,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
@@ -1137,3 +1111,141 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are there any tablesyncs which have still not yet reached SYNCDONE/READY state?
+ */
+bool
+AnyTablesyncInProgress(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		elog(DEBUG1,
+			 "AnyTablesyncInProgress?: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
+		/*
+		 * When the process_syncing_tables_for_apply changes the state from
+		 * SYNCDONE to READY, that change is actually written directly into
+		 * the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_SYNCDONE &&
+			rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	elog(DEBUG1,
+		 "AnyTablesyncInProgress?: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
+	return found_busy;
+}
+
+/*
+ * What is the biggest LSN from all the known tablesyncs?
+ */
+XLogRecPtr
+BiggestTablesyncLSN(void)
+{
+	XLogRecPtr	biggest_lsn = InvalidXLogRecPtr;
+	ListCell   *lc;
+	int			count = 0;
+
+	foreach(lc, table_states_all)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		if (rstate->lsn > biggest_lsn)
+			biggest_lsn = rstate->lsn;
+	}
+
+	elog(DEBUG1,
+		 "BiggestTablesyncLSN: Scanned %d tables. Biggest lsn found = %X/%X",
+		 count,
+		 LSN_FORMAT_ARGS(biggest_lsn));
+
+	return biggest_lsn;
+}
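
For reviewers, the way AnyTablesyncInProgress and BiggestTablesyncLSN are meant to combine in apply_handle_begin_prepare can be sketched in isolation as below. This is only an illustration, not code from the patch: the Sketch* types and function names are hypothetical, the state letters ('s' for SYNCDONE, 'r' for READY, 'd' for a table still copying) are assumed, and LSNs are reduced to plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for SubscriptionRelState. */
typedef struct SketchRelState
{
	char		state;			/* assumed: 's' = SYNCDONE, 'r' = READY */
	uint64_t	lsn;
} SketchRelState;

typedef enum SketchAction
{
	SKETCH_WAIT,				/* part 1: some tablesync is still busy */
	SKETCH_SPOOL,				/* part 2: spool messages to a psf */
	SKETCH_APPLY				/* apply the prepared transaction normally */
} SketchAction;

/* True if any table has not yet reached SYNCDONE or READY. */
static bool
sketch_any_tablesync_busy(const SketchRelState *rels, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (rels[i].state != 's' && rels[i].state != 'r')
			return true;
	return false;
}

/* Largest LSN recorded for any table (0 if there are none). */
static uint64_t
sketch_biggest_lsn(const SketchRelState *rels, size_t n)
{
	uint64_t	biggest = 0;

	for (size_t i = 0; i < n; i++)
		if (rels[i].lsn > biggest)
			biggest = rels[i].lsn;
	return biggest;
}

/* The two-part decision made when a BEGIN PREPARE arrives. */
static SketchAction
sketch_begin_prepare_action(const SketchRelState *rels, size_t n,
							uint64_t final_lsn)
{
	if (sketch_any_tablesync_busy(rels, n))
		return SKETCH_WAIT;
	if (final_lsn < sketch_biggest_lsn(rels, n))
		return SKETCH_SPOOL;
	return SKETCH_APPLY;
}
```

In the patch itself the wait is a WaitLatch loop rather than a return value, and the spool case creates the psf before returning.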
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 21d304a..51ff8c5 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1,4 +1,5 @@
 /*-------------------------------------------------------------------------
+ * If needed, this is the common function to do that file redirection.
  * worker.c
  *	   PostgreSQL logical replication worker (apply)
  *
@@ -49,6 +50,43 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ *
+ * PREPARED SPOOLFILE (PSF) LOGIC
+ * ------------------------------
+ * It can happen that the apply worker is processing a
+ * LOGICAL_REP_MSG_BEGIN_PREPARE message at the same time as table
+ * synchronization is happening. To avoid any chance of an "empty prepare"
+ * situation arising the apply worker waits for all the tablesyncs to reach at
+ * least SYNCDONE state, but this in turn can lead to another case where the
+ * tablesync lsn has got ahead of the prepare lsn the apply worker is
+ * currently processing. Refer to the comment in the apply_handle_begin_prepare
+ * function for more details.
+ *
+ * When this happens the prepared messages are spooled into a "prepare
+ * spoolfile" (aka psf). The messages written to this file are all the prepared
+ * messages up to and including the LOGICAL_REP_MSG_PREPARE. All this psf
+ * content is then replayed later at commit time (apply_handle_commit_prepared),
+ * where the messages are all dispatched in the usual way.
+ *
+ * The psf files reside in the "pg_logical/twophase" directory and they are
+ * uniquely named. This is necessary because there may be multiple psf files
+ * co-existing, and so the correct psf must be re-discoverable (using subid and
+ * gid).
+ *
+ * Furthermore, to cope with a possible error between the end of spooling
+ * (in apply_handle_prepare) and the commit (in apply_handle_commit_prepared) a
+ * psf file must be able to survive a PG restart. So we cannot utilize the
+ * same (temporary file based) BufFile API that the streamed transactions use.
+ * Instead, the psf file handling uses the File API (PathNameOpenFile and
+ * friends). But this means the code now has to take responsibility for psf file
+ * cleanup. An HTAB is used to track if a particular psf file can or cannot be
+ * deleted, and a proc-exit handler is registered to take the appropriate
+ * action. Refer to function prepare_spoolfile_on_proc_exit.
+ *
+ * Upon restart, any uncommitted psf files are still present and so their commit
+ * can proceed as before.
+ *
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +97,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -208,6 +247,59 @@ static void subxact_info_add(TransactionId xid);
 static inline void cleanup_subxact_info(void);
 
 /*
+ * The following are for the support of a spoolfile for prepared messages.
+ */
+
+/* psf files will be written here. */
+#define PSF_DIR "pg_logical/twophase"
+
+/*
+ * A Prepare spoolfile hash entry. We create this entry in the psf_hash. This is
+ * for maintaining a mapping between the name of the prepared spoolfile, and a
+ * flag indicating if it is OK to delete this psf at proc-exit time or not.
+ *
+ * Each PsfHashEntry is created at prepare and removed at commit/rollback.
+ */
+typedef struct PsfHashEntry
+{
+	char		name[MAXPGPATH];	/* Hash key --- must be first */
+	bool		delete_on_exit; /* ok to delete at proc-exit? */
+} PsfHashEntry;
+
+/*
+ * Information about the "current" psf spoolfile.
+ */
+typedef struct PsfFile
+{
+	char		name[MAXPGPATH];	/* psf name - same as the HTAB key. */
+	bool		is_spooling;	/* are we currently spooling to this file? */
+	File		vfd;			/* -1 when the file is closed. */
+	off_t		cur_offset;		/* offset for the next write or read. Reset to
+								 * 0 when file is opened. */
+} PsfFile;
+
+/*
+ * Hash table for tracking the prepared spoolfiles (keyed by psf name).
+ */
+static HTAB *psf_hash = NULL;
+
+/*
+ * Information about the 'current' open spoolfile is only valid when spooling.
+ * This is flagged as 'is_spooling' only between begin_prepare and prepare.
+ */
+static PsfFile psf_cur = {.is_spooling = false,.vfd = -1,.cur_offset = 0};
+
+static void prepare_spoolfile_create(char *path);
+static void prepare_spoolfile_write(char action, StringInfo s);
+static void prepare_spoolfile_close(void);
+static void prepare_spoolfile_delete(char *path);
+static bool prepare_spoolfile_exists(char *path);
+static void prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid);
+static int	prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn);
+static bool prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s);
+static void prepare_spoolfile_on_proc_exit(int status, Datum arg);
+
+/*
  * Serialize and deserialize changes for a toplevel transaction.
  */
 static void stream_cleanup_files(Oid subid, TransactionId xid);
@@ -720,6 +812,345 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("transaction identifier \"%s\" is already in use",
+						begin_data.gid)));
+
+	/*
+	 * By unlucky timing of apply/tablesync workers it is possible to have a
+	 * "consistent snapshot" that spans prepare/commit in such a way that
+	 * the tablesync did not do the prepare (because snapshot not consistent)
+	 * and the apply worker does the begin prepare ('b') but it skips all
+	 * the prepared operations [e.g. inserts] while the tablesync was still
+	 * busy (see the condition of should_apply_changes_for_rel).
+	 *
+	 * This can lead to an "empty prepare", because later when the apply
+	 * worker does the commit prepared ('K'), there is nothing in it (the
+	 * inserts were skipped earlier).
+	 *
+	 * We avoid this using two-part logic: (1) Wait for all tablesync
+	 * workers to reach SYNCDONE/READY state; (2) If the begin_prepare lsn is
+	 * now behind any tablesync lsn then spool the prepared messages to a file
+	 * to be replayed later at commit_prepared time.
+	 *
+	 * -----
+	 *
+	 * XXX - The 2PC protocol needs the publisher to be aware when the PREPARE
+	 * has been successfully acted on. But because of this "empty prepare"
+	 * case now the prepared messages may be spooled to a file and, when that
+	 * happens, the PREPARE would not happen at the usual time, but would be
+	 * deferred until COMMIT PREPARED time. This quirk could only happen
+	 * immediately after the initial table synchronization phase; once all
+	 * tables have achieved READY state the 2PC protocol will behave normally.
+	 *
+	 * A future release may be able to detect when all tables are READY and
+	 * set a flag to indicate this subscription/slot is ready for two_phase
+	 * decoding. Then at the publisher-side, we could enable wait-for-prepares
+	 * only when all the slots of WALSender have that flag set.
+	 */
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Part 1 of 2:
+		 *
+		 * Make sure every tablesync has reached at least SYNCDONE state
+		 * before letting the apply worker proceed.
+		 */
+		elog(DEBUG1,
+			 "apply_handle_begin_prepare, end_lsn = %X/%X, final_lsn = %X/%X, relstate_lsn = %X/%X",
+			 LSN_FORMAT_ARGS(begin_data.end_lsn),
+			 LSN_FORMAT_ARGS(begin_data.final_lsn),
+			 LSN_FORMAT_ARGS(MyLogicalRepWorker->relstate_lsn));
+
+		while (AnyTablesyncInProgress())
+		{
+			CHECK_FOR_INTERRUPTS();
+
+			process_syncing_tables(begin_data.final_lsn);
+
+			/* This latch is to prevent 100% CPU looping. */
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
+			ResetLatch(MyLatch);
+		}
+
+		/*
+		 * Part 2 of 2:
+		 *
+		 * If (when getting to SYNCDONE/READY state) some tablesync went
+		 * beyond this begin_prepare LSN then all messages (up to and
+		 * including the prepare) will be saved to a spoolfile for replay
+		 * later at commit_prepared time.
+		 */
+		if (begin_data.final_lsn < BiggestTablesyncLSN())
+		{
+			char		psfpath[MAXPGPATH];
+
+			/*
+			 * Create the spoolfile.
+			 */
+			prepare_spoolfile_name(psfpath, sizeof(psfpath),
+								   MyLogicalRepWorker->subid, begin_data.gid);
+			prepare_spoolfile_create(psfpath);
+
+			/*
+			 * From now until apply_handle_prepare we are spooling to the
+			 * current psf.
+			 */
+			psf_cur.is_spooling = true;
+
+			pgstat_report_activity(STATE_RUNNING, NULL);
+			return;
+		}
+	}
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	/*
+	 * If we were using a psf spoolfile, then write the PREPARE as the final
+	 * message. This prepare information will be used at commit_prepared time.
+	 */
+	if (psf_cur.is_spooling)
+	{
+		PsfHashEntry *hentry;
+
+		Assert(!in_remote_transaction);
+
+		/* Write the PREPARE info to the psf file. */
+		prepare_spoolfile_handler(LOGICAL_REP_MSG_PREPARE, s);
+
+		/*
+		 * Flush the spoolfile, so changes can survive a restart.
+		 *
+		 * If the publisher resends the same data again after a restart (e.g.
+		 * if subscriber origin has not moved past this prepare), then the
+		 * same named psf file will be overwritten with the same data. See
+		 * prepare_spoolfile_create.
+		 */
+		FileSync(psf_cur.vfd, WAIT_EVENT_DATA_FILE_SYNC);
+
+		/* We are finished spooling to the current psf. */
+		psf_cur.is_spooling = false;
+
+		/*
+		 * The commit_prepared will need the spoolfile, so unregister it for
+		 * removal on proc-exit just in case there is an unexpected restart
+		 * between now and when commit_prepared happens.
+		 */
+		hentry = (PsfHashEntry *) hash_search(psf_hash, psf_cur.name,
+											  HASH_FIND, NULL);
+		Assert(hentry);
+		hentry->delete_on_exit = false;
+
+		/*
+		 * The psf_cur.vfd is meaningful only between begin_prepare and
+		 * prepare. So close it now. Any messages written to the psf will be
+		 * applied later during handle_commit_prepared.
+		 */
+		prepare_spoolfile_close();
+
+		in_remote_transaction = false;
+		return;
+	}
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	/*
+	 * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+	 * was dispatched via the psf spoolfile replay then the remote_final_lsn
+	 * is set to commit lsn instead. Hence the <= instead of == check below.
+	 */
+	Assert(prepare_data.prepare_lsn <= remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then replay
+	 * them all now.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, prepare_data.gid);
+	if (prepare_spoolfile_exists(psfpath))
+	{
+		int			nchanges;
+
+		/*
+		 * Replay/dispatch the spooled messages.
+		 */
+
+		ensure_transaction();
+
+		nchanges = prepare_spoolfile_replay_messages(psfpath, prepare_data.prepare_lsn);
+		elog(DEBUG1,
+			 "apply_handle_commit_prepared: replayed %d (all) changes.",
+			 nchanges);
+
+		/* After replaying the psf it is no longer needed. Just delete it. */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	bool		using_psf;
+	char		psfpath[MAXPGPATH];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * If this prepare's messages were being spooled to a file, then cleanup
+	 * the file.
+	 */
+	prepare_spoolfile_name(psfpath, sizeof(psfpath),
+						   MyLogicalRepWorker->subid, rollback_data.gid);
+	using_psf = prepare_spoolfile_exists(psfpath);
+	if (using_psf)
+	{
+		/* We are finished with this spoolfile. Delete it. */
+		prepare_spoolfile_delete(psfpath);
+	}
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
+	 */
+	if (!using_psf &&
+		LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -732,6 +1163,7 @@ apply_handle_origin(StringInfo s)
 	 * remote transaction and before any actual writes.
 	 */
 	if (!in_streamed_transaction &&
+		!psf_cur.is_spooling &&
 		(!in_remote_transaction ||
 		 (IsTransactionState() && !am_tablesync_worker())))
 		ereport(ERROR,
@@ -1092,6 +1524,9 @@ apply_handle_relation(StringInfo s)
 {
 	LogicalRepRelation *rel;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_RELATION, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_RELATION, s))
 		return;
 
@@ -1110,6 +1545,9 @@ apply_handle_type(StringInfo s)
 {
 	LogicalRepTyp typ;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TYPE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TYPE, s))
 		return;
 
@@ -1150,6 +1588,9 @@ apply_handle_insert(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_INSERT, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_INSERT, s))
 		return;
 
@@ -1271,6 +1712,9 @@ apply_handle_update(StringInfo s)
 	RangeTblEntry *target_rte;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_UPDATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_UPDATE, s))
 		return;
 
@@ -1429,6 +1873,9 @@ apply_handle_delete(StringInfo s)
 	TupleTableSlot *remoteslot;
 	MemoryContext oldctx;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_DELETE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_DELETE, s))
 		return;
 
@@ -1798,6 +2245,9 @@ apply_handle_truncate(StringInfo s)
 	List	   *relids_logged = NIL;
 	ListCell   *lc;
 
+	if (prepare_spoolfile_handler(LOGICAL_REP_MSG_TRUNCATE, s))
+		return;
+
 	if (handle_streamed_transaction(LOGICAL_REP_MSG_TRUNCATE, s))
 		return;
 
@@ -1954,6 +2404,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2012,7 +2484,9 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
 		}
 	}
 
-	*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+	/* consider entries in prepare spool file as not flushed */
+	*have_pending_txes = (!dlist_is_empty(&lsn_mapping) ||
+						  (psf_hash && hash_get_num_entries(psf_hash)));
 }
 
 /*
@@ -2061,6 +2535,23 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	TimeLineID	tli;
 
 	/*
+	 * Initialize the psf_hash table if we haven't yet. This will be used for
+	 * the entire duration of the apply worker so create it in permanent
+	 * context.
+	 */
+	if (psf_hash == NULL)
+	{
+		HASHCTL		hash_ctl;
+		PsfHashEntry *hentry;
+
+		hash_ctl.keysize = sizeof(hentry->name);
+		hash_ctl.entrysize = sizeof(PsfHashEntry);
+		hash_ctl.hcxt = ApplyContext;
+		psf_hash = hash_create("PrepareSpoolfileHash", 1024, &hash_ctl,
+							   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+
+	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
 	 * protocol message.
 	 */
@@ -2180,7 +2671,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 		/* confirm all writes so far */
 		send_feedback(last_received, false, false);
 
-		if (!in_remote_transaction && !in_streamed_transaction)
+		if (!in_remote_transaction && !in_streamed_transaction && !psf_cur.is_spooling)
 		{
 			/*
 			 * If we didn't get any transactions for a while there might be
@@ -2927,6 +3418,10 @@ ApplyWorkerMain(Datum main_arg)
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
 
+	/* Arrange to delete any unwanted psf file(s) at proc-exit */
+	if (!am_tablesync_worker())
+		on_proc_exit(prepare_spoolfile_on_proc_exit, 0);
+
 	/* Setup signal handling */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGTERM, die);
@@ -3103,3 +3598,357 @@ IsLogicalWorker(void)
 {
 	return MyLogicalRepWorker != NULL;
 }
+
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time.
+ *
+ * Returns true if the message was redirected to the spoolfile, false
+ * otherwise (regular mode).
+ */
+static bool
+prepare_spoolfile_handler(LogicalRepMsgType action, StringInfo s)
+{
+	elog(DEBUG1,
+		 "prepare_spoolfile_handler for action '%c'. %s write to spool file",
+		 action,
+		 psf_cur.is_spooling ? "Do" : "Don't");
+
+	if (!psf_cur.is_spooling)
+		return false;
+
+	Assert(!in_streamed_transaction);
+
+	/* write the change to the current file */
+	prepare_spoolfile_write(action, s);
+
+	return true;
+}
+
+/*
+ * Create the spoolfile used to serialize the prepare messages.
+ */
+static void
+prepare_spoolfile_create(char *path)
+{
+	PsfHashEntry *hentry;
+
+	elog(DEBUG1, "creating file \"%s\" for prepare changes", path);
+
+	Assert(!psf_cur.is_spooling);
+
+	/* Make sure the PSF_DIR subdirectory exists. */
+	if (MakePGDirectory(PSF_DIR) < 0 && errno != EEXIST)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create directory \"%s\": %m",
+						PSF_DIR)));
+
+	/*
+	 * Open the file, truncating any existing contents, because we always
+	 * want to create/overwrite this file.
+	 */
+	psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+	if (psf_cur.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m", path)));
+
+	/* Create/Find the spoolfile entry in the psf_hash */
+	hentry = (PsfHashEntry *) hash_search(psf_hash, path,
+										  HASH_ENTER, NULL);
+	Assert(hentry);
+	memcpy(psf_cur.name, path, sizeof(psf_cur.name));
+	psf_cur.cur_offset = 0;
+	hentry->delete_on_exit = true;
+
+	/* Sanity checks */
+	Assert(psf_cur.vfd >= 0);
+	Assert(prepare_spoolfile_exists(path));
+}
+
+/*
+ * Close the "current" spoolfile and unset the fd.
+ */
+static void
+prepare_spoolfile_close(void)
+{
+	if (psf_cur.vfd >= 0)
+		FileClose(psf_cur.vfd);
+
+	/* Mark this fd as not valid to use anymore. */
+	psf_cur.is_spooling = false;
+	psf_cur.vfd = -1;
+	psf_cur.cur_offset = 0;
+}
+
+/*
+ * Delete the specified psf spoolfile, and any HTAB entry associated with it.
+ */
+static void
+prepare_spoolfile_delete(char *path)
+{
+	/* The current psf should be closed already, but make sure anyway. */
+	prepare_spoolfile_close();
+
+	/* Delete the file off the disk. */
+	unlink(path);
+
+	/* Remove any entry from the psf_hash, if present */
+	hash_search(psf_hash, path, HASH_REMOVE, NULL);
+}
+
+/*
+ * Serialize a change to the prepare spoolfile for the current toplevel transaction.
+ *
+ * The change is serialized in a simple format: the length (not counting
+ * the length field itself), the action code (identifying the message type)
+ * and the message contents (without the subxact TransactionId value).
+ */
+static void
+prepare_spoolfile_write(char action, StringInfo s)
+{
+	int			len;
+	int			bytes_written;
+
+	Assert(psf_cur.is_spooling);
+
+	elog(DEBUG1, "prepare_spoolfile_write: writing action '%c'", action);
+
+	/* total on-disk size, including the action type character */
+	len = (s->len - s->cursor) + sizeof(char);
+
+	/* first write the size */
+	bytes_written = FileWrite(psf_cur.vfd, (char *) &len, sizeof(len),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(len));
+	psf_cur.cur_offset += bytes_written;
+
+	/* then the action */
+	bytes_written = FileWrite(psf_cur.vfd, &action, sizeof(action),
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == sizeof(action));
+	psf_cur.cur_offset += bytes_written;
+
+	/* and finally the remaining part of the buffer (after the XID) */
+	len = (s->len - s->cursor);
+
+	bytes_written = FileWrite(psf_cur.vfd, &s->data[s->cursor], len,
+							  psf_cur.cur_offset, WAIT_EVENT_DATA_FILE_WRITE);
+	Assert(bytes_written == len);
+	psf_cur.cur_offset += bytes_written;
+}
+
+/*
+ * Is there a prepare spoolfile for the specified path?
+ */
+static bool
+prepare_spoolfile_exists(char *path)
+{
+	File		fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+	if (fd >= 0)
+		FileClose(fd);
+
+	return fd >= 0;
+}
+
+/*
+ * Replay (apply) all the prepared messages that are in the prepare spoolfile.
+ */
+static int
+prepare_spoolfile_replay_messages(char *path, XLogRecPtr final_lsn)
+{
+	StringInfoData s2;
+	int			nchanges = 0;
+	char	   *buffer = NULL;
+	MemoryContext oldctx,
+				oldctx2;
+	PsfFile		psf = {.is_spooling = false,.vfd = -1,.cur_offset = 0};
+
+	elog(DEBUG1,
+		 "prepare_spoolfile_replay_messages: replaying changes from file \"%s\"",
+		 path);
+
+	/*
+	 * Allocate memory required to process all the messages in
+	 * TopTransactionContext to avoid it getting reset after each message is
+	 * processed.
+	 */
+	oldctx = MemoryContextSwitchTo(TopTransactionContext);
+
+	psf.vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+	if (psf.vfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open prepared spoolfile \"%s\": %m",
+						path)));
+
+	buffer = palloc(BLCKSZ);
+	initStringInfo(&s2);
+
+	MemoryContextSwitchTo(oldctx);
+
+	/*
+	 * Make sure the handler methods called by apply_dispatch are aware we're
+	 * in a remote transaction.
+	 */
+	remote_final_lsn = final_lsn;
+	in_remote_transaction = true;
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
+	/*
+	 * Read the entries one by one and pass them through the same logic as in
+	 * apply_dispatch.
+	 */
+	while (true)
+	{
+		int			nbytes;
+		int			len;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* read length of the on-disk record */
+		nbytes = FileRead(psf.vfd, (char *) &len, sizeof(len),
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+
+		/* have we reached end of the file? */
+		if (nbytes == 0)
+			break;
+
+		/* do we have a correct length? */
+		if (nbytes != sizeof(len))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		Assert(len > 0);
+
+		/* make sure we have sufficiently large buffer */
+		buffer = repalloc(buffer, len);
+
+		/* and finally read the data into the buffer */
+		nbytes = FileRead(psf.vfd, buffer, len,
+						  psf.cur_offset, WAIT_EVENT_DATA_FILE_READ);
+		psf.cur_offset += nbytes;
+		if (nbytes != len)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read from prepared spoolfile \"%s\": %m",
+							path)));
+
+		/* copy the buffer to the stringinfo and call apply_dispatch */
+		resetStringInfo(&s2);
+		appendBinaryStringInfo(&s2, buffer, len);
+
+		/* Ensure we are reading the data into our memory context. */
+		oldctx2 = MemoryContextSwitchTo(ApplyMessageContext);
+
+		apply_dispatch(&s2);
+
+		MemoryContextReset(ApplyMessageContext);
+
+		MemoryContextSwitchTo(oldctx2);
+
+		nchanges++;
+
+		if (nchanges % 1000 == 0)
+			elog(DEBUG1, "replayed %d changes from file \"%s\"",
+				 nchanges, path);
+	}
+
+	FileClose(psf.vfd);
+
+	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
+		 nchanges, path);
+
+	return nchanges;
+}
+
+/*
+ * Format the filename for the prepare spoolfile.
+ */
+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB. Therefore, the name
+	 * and the key must be exactly the same length and padded with '\0' so
+	 * that garbage does not affect the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "%s/psf_%u_%s.changes", PSF_DIR, subid, gid);
+}
+
+/*
+ * proc_exit callback to remove unwanted psf files.
+ */
+static void
+prepare_spoolfile_on_proc_exit(int status, Datum arg)
+{
+	HASH_SEQ_STATUS seq_status;
+	PsfHashEntry *hentry;
+
+	/* Iterate the HTAB looking for what files can be deleted. */
+	if (psf_hash)
+	{
+		hash_seq_init(&seq_status, psf_hash);
+		while ((hentry = (PsfHashEntry *) hash_seq_search(&seq_status)) != NULL)
+		{
+			char	   *path = hentry->name;
+
+			if (hentry->delete_on_exit)
+				prepare_spoolfile_delete(path);
+		}
+	}
+}
+
+/*
+ * Find if there are any psf files belonging to the specified subscription.
+ * InvalidOid subid param means "all files".
+ *
+ * Optionally delete (unlink) all that match the subid.
+ */
+int
+prepare_spoolfiles(Oid subid, bool unlink_flag)
+{
+	DIR		   *dir;
+	struct dirent *dent;
+	int count = 0;
+
+	dir = AllocateDir(PSF_DIR);
+	while ((dent = ReadDirExtended(dir, PSF_DIR, DEBUG1)) != NULL)
+	{
+		char		path[MAXPGPATH];
+		char		prefix[MAXPGPATH];
+
+		/* Only process files if they have matching subid prefix. */
+		if (OidIsValid(subid))
+			sprintf(prefix, "psf_%u_", subid);
+		else
+			sprintf(prefix, "psf_"); /* all psf files */
+
+		if (strstr(dent->d_name, prefix) != dent->d_name)
+			continue;
+
+		snprintf(path, MAXPGPATH, PSF_DIR "/%s", dent->d_name);
+		count++;
+
+		if (unlink_flag)
+			unlink(path);
+	}
+
+	FreeDir(dir);
+
+	return count;
+}
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..ede252b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,6 +171,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -322,8 +342,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +362,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +383,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +843,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1250,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e5f8a06..e40d2d0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -363,7 +363,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..232af01 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -54,10 +55,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +120,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +128,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure is used to
+ * hold the information for both PREPARE and COMMIT PREPARED. For COMMIT
+ * PREPARED, prepare_lsn and preparetime store the commit LSN and commit
+ * time instead.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime is used to check whether the downstream has
+ * received this prepared transaction: if so, it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +177,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..0c95dc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index a97a59a..f55b07c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -345,6 +345,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -418,8 +419,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..9015329 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,10 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AnyTablesyncInProgress(void);
+extern XLogRecPtr BiggestTablesyncLSN(void);
+extern int	prepare_spoolfiles(Oid suboid, bool unlink_flag);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e017557..2b38b7a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1343,12 +1343,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
@@ -1958,6 +1961,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
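
For readers following the spoolfile code in the patch above: the on-disk record layout shared by prepare_spoolfile_write() and prepare_spoolfile_replay_messages() (an int length, then the action byte, then the message contents) can be tried outside the backend. The sketch below is only a standalone approximation under stated assumptions: it uses stdio instead of the backend's File API, and the demo_* names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Append one record: an int length (counting the action byte plus the
 * payload, but not the length field itself), the action byte, then the
 * payload. Returns 0 on success, -1 on a short write.
 */
static int
demo_spool_write(FILE *fp, char action, const char *payload, int payload_len)
{
	int			len = payload_len + (int) sizeof(char);

	assert(payload_len >= 0);

	if (fwrite(&len, sizeof(len), 1, fp) != 1)
		return -1;
	if (fwrite(&action, sizeof(action), 1, fp) != 1)
		return -1;
	if (payload_len > 0 &&
		fwrite(payload, 1, (size_t) payload_len, fp) != (size_t) payload_len)
		return -1;
	return 0;
}

/*
 * Read the next record into a malloc'd buffer (action byte first, then the
 * payload). Returns the record length, 0 at a clean end of file, or -1 if
 * the record is truncated or the length is invalid.
 */
static int
demo_spool_read(FILE *fp, char **buf)
{
	int			len;
	size_t		n = fread(&len, 1, sizeof(len), fp);

	*buf = NULL;
	if (n == 0)
		return 0;				/* end of file, no more records */
	if (n != sizeof(len) || len <= 0)
		return -1;

	*buf = malloc((size_t) len);
	if (*buf == NULL)
		return -1;
	if (fread(*buf, 1, (size_t) len, fp) != (size_t) len)
	{
		free(*buf);
		*buf = NULL;
		return -1;
	}
	return len;
}
```

Unlike the Assert-based checks in the patch, this sketch reports short writes through the return value, which is one possible way to keep the checks active in non-assert builds.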

#249Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#247)

On Fri, Mar 12, 2021 at 4:07 PM vignesh C <vignesh21@gmail.com> wrote:

Hi Vignesh,

Thanks for the review comments.

But can you please resend it with each feedback item enumerated as 1. 2.
3., or have some other clear separation for each comment.

(Because everything is mushed together I am not 100% sure if your
comment text applies to the code above or below it)

TIA.

----
Kind Regards,
Peter Smith.
Fujitsu Australia

#250vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#249)

On Fri, Mar 12, 2021 at 2:29 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Mar 12, 2021 at 4:07 PM vignesh C <vignesh21@gmail.com> wrote:

Hi Vignesh,

Thanks for the review comments.

But can you please resend it with each feedback item enumerated as 1. 2.
3., or have some other clear separation for each comment.

(Because everything is mushed together I am not 100% sure if your
comment text applies to the code above or below it)

1) I think twophase_given can be a local variable; it need not be added
as a function parameter because it is not used outside the function.
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
                                                   char **synchronous_commit,
                                                   bool *refresh,
                                                   bool *binary_given, bool *binary,
-                                                  bool *streaming_given, bool *streaming)
+                                                  bool *streaming_given, bool *streaming,
+                                                  bool *twophase_given, bool *twophase)

The corresponding changes should be done here too:
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
        bool            copy_data;
        bool            streaming;
        bool            streaming_given;
+       bool            twophase;
+       bool            twophase_given;
        char       *synchronous_commit;
        char       *conninfo;
        char       *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
                                                           &synchronous_commit,
                                                           NULL,        /* no "refresh" */
                                                           &binary_given, &binary,
-                                                          &streaming_given, &streaming);
+                                                          &streaming_given, &streaming,
+                                                          &twophase_given, &twophase);

2) I think this is not possible because we don't allow changing the
twophase option. Should this be an assert?
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2930,6 +2930,7 @@ maybe_reread_subscription(void)
                strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
                newsub->binary != MySubscription->binary ||
                newsub->stream != MySubscription->stream ||
+               newsub->twophase != MySubscription->twophase ||
                !equal(newsub->publications, MySubscription->publications))

3) We have the following check in parse_subscription_options:
        if (twophase && *twophase_given && *twophase)
        {
                if (streaming && *streaming_given && *streaming)
                        ereport(ERROR,
                                        (errcode(ERRCODE_SYNTAX_ERROR),
                                         errmsg("%s and %s are mutually exclusive options",
                                                        "two_phase = true", "streaming = true")));
        }

Should we have a similar check in parse_output_parameters?
@@ -252,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,

                        *enable_streaming = defGetBoolean(defel);
                }
+               else if (strcmp(defel->defname, "two_phase") == 0)
+               {
+                       if (twophase_given)
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_SYNTAX_ERROR),
+                                                errmsg("conflicting or redundant options")));
+                       twophase_given = true;
+
+                       *enable_twophase = defGetBoolean(defel);
+               }
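
To make point 3 concrete, here is a minimal standalone sketch of the kind of cross-check being asked about. It is not taken from the patch: the OutputOptions struct and check_output_options() are hypothetical stand-ins, and the real parse_output_parameters() walks a List of DefElem nodes and would report the problem with ereport(ERROR, ...) rather than return a string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the option state collected while parsing.
 * The field names mirror the *_given / value pairs used in the patch.
 */
typedef struct OutputOptions
{
	bool		streaming_given;
	bool		streaming;
	bool		twophase_given;
	bool		twophase;
} OutputOptions;

/*
 * Return NULL if the combination is acceptable, otherwise a message
 * describing the conflict. (Assumption: the real code would raise an
 * error instead of returning a string.)
 */
static const char *
check_output_options(const OutputOptions *opts)
{
	assert(opts != NULL);

	/* Both options explicitly given and both switched on: reject. */
	if (opts->twophase_given && opts->twophase &&
		opts->streaming_given && opts->streaming)
		return "two_phase = true and streaming = true are mutually exclusive options";

	return NULL;
}
```

The check only fires when both options were explicitly given and both are true, which matches the condition in parse_subscription_options quoted above.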

Regards,
Vignesh

#251osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Smith (#248)
RE: [HACKERS] logical decoding of two-phase transactions

Hi

On Friday, March 12, 2021 5:40 PM Peter Smith <smithpb2250@gmail.com>

Please find attached the latest patch set v58*

Thank you for updating those. I'm testing the patchset,
and I think it would be preferable to add two more simple types of tests to 020_twophase.pl,
because those cases aren't checked by v58.

(1) Execute a single PREPARE TRANSACTION
that affects several tables (connected to corresponding publications)
at the same time, and confirm they are synced correctly.

(2) Execute a single PREPARE TRANSACTION that affects multiple subscribers,
and confirm they are synced correctly.
This doesn't mean cascading standbys like 022_twophase_cascade.pl.
Imagine that there is one publisher with two subscribers to it.

I checked both cases in my environment, though, and the results were fine.

Best Regards,
Takamichi Osumi

#252wangsh.fnst@fujitsu.com
wangsh.fnst@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#251)
RE: [HACKERS] logical decoding of two-phase transactions

Hi,

I noticed in patch v58-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch

+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+	PsfHashEntry *hentry;
+
+	/*
+	 * This name is used as the key in the psf_hash HTAB. Therefore, the name
+	 * and the key must be exactly same lengths and padded with '\0' so
+	 * garbage does not impact the HTAB lookups.
+	 */
+	Assert(sizeof(hentry->name) == MAXPGPATH);
+	Assert(szpath == MAXPGPATH);
+	memset(path, '\0', MAXPGPATH);
+
+	snprintf(path, MAXPGPATH, "%s/psf_%u_%s.changes", PSF_DIR, subid, gid);
+}

The variable hentry is only used when --enable-cassert is specified, so the compiler emits an
unused-variable warning if I run configure without --enable-cassert.

Also, since the comment says the lengths are the same, I think 'Assert(sizeof(hentry->name) == szpath)' would be better.
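
A rough standalone sketch of both suggestions follows. It is only an illustration, not the patch code: it uses the standard assert() in place of PostgreSQL's Assert(), the PSF_DIR value is a placeholder, and the PsfHashEntry layout is assumed to contain just the name key. Inside the PostgreSQL tree, marking the variable with PG_USED_FOR_ASSERTS_ONLY would be another way to silence the warning.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXPGPATH 1024			/* same value PostgreSQL uses */
#define PSF_DIR "spool_dir"		/* placeholder; the patch defines its own */

/* Assumed layout: only the hash key matters for this sketch. */
typedef struct PsfHashEntry
{
	char		name[MAXPGPATH];
} PsfHashEntry;

static void
prepare_spoolfile_name(char *path, size_t szpath, unsigned int subid,
					   const char *gid)
{
	/*
	 * sizeof does not evaluate its operand, so no local variable is needed
	 * and nothing is left unused when assertions are compiled out. The
	 * check also compares directly against szpath, as the comment in the
	 * patch requires the key and the buffer to have the same length.
	 */
	assert(sizeof(((PsfHashEntry *) 0)->name) == szpath);

	/* Zero-pad the whole buffer so trailing garbage cannot affect lookups. */
	memset(path, '\0', szpath);
	snprintf(path, szpath, "%s/psf_%u_%s.changes", PSF_DIR, subid, gid);
}
```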

Best regards.
Shenhao Wang

#253Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#248)

On Fri, Mar 12, 2021 at 2:09 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v58*

In this patch series, I see a problem with synchronous replication
when the GUC 'synchronous_standby_names' is configured to use the subscriber.
This makes Prepares and Commits wait for the subscriber to
finish. Before this patch, we never sent prepare because two-phase was not
enabled by a subscriber, so the publisher did not wait for it; rather, it made
progress because we send keep_alive messages. But after this patch, it
will start waiting for Prepare to finish. Now, without the spool-file
logic, it will work because prepares are decoded on the subscriber and a
corresponding ack will be sent to the publisher, but in the spool-file
case, we will wait for the publisher to send commit prepared, while on the
publisher the prepare is not finished because it is waiting for the ack.
So, it will create a sort of deadlock. This is related to the problem
mentioned in the below comments in the patch:
+ * A future release may be able to detect when all tables are READY and set
+ * a flag to indicate this subscription/slot is ready for two_phase
+ * decoding. Then at the publisher-side, we could enable wait-for-prepares
+ * only when all the slots of WALSender have that flag set.

The difference is that it can happen right now: prepares
automatically wait if 'synchronous_standby_names' is set. Now, we can
imagine a solution where, after spooling to a file the changes which
can't be applied during the syncup phase, we update the flush location so
that the publisher can proceed with that prepare. But I think that won't
work because once we have updated the flush location those prepares
won't be sent again and it is quite possible that we don't have
complete relation information as the schema is not sent with each
transaction. Now, we can go one step further and try to remember the
schema information the first time it is sent so that it can be reused
after restart but I think that will complicate the patch and overall
design.

I think there is a simpler solution to these problems. The idea is to
enable two_phase after the initial sync is over (all relations are in
a READY state). If we switch on 2PC only after all the relations
come to the READY state, then we shouldn't get any prepare before
the sync-point. However, it is quite possible that before reaching
the sync-point, the slot corresponding to the apply-worker has skipped a
prepare because 2PC was not enabled, and afterward, that prepare would be
skipped because by that time start_decoding_at might have moved. See the explanation in an
email: /messages/by-id/CAA4eK1LuK4t-ZYYCY7k9nMoYP+dwi-JyqUdtcffQMoB_g5k6Hw@mail.gmail.com.
Now, even the initial_consistent_point won't help because for the
apply-worker, it will be different from the tablesync slot's
initial_consistent_point and we would have reached initial consistency
earlier for apply-workers.

To solve the main problem (how to detect the prepares that are skipped
when we toggled the two_pc option) in the above idea, we can mark an
LSN position in the slot (two_phase_at, this will be the same as
start_decoding_at point when we receive slot with 2PC option) where we
enable two_pc. If we encounter any commit prepared whose prepare LSN
is less than two_phase_at, then we need to send prepare for the
transaction along with commit prepared.
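
As a rough illustration of that rule (not code from any patch), the decision at commit-prepared time could look like the sketch below. The function name is hypothetical; the only facts taken from the description above are that two_phase_at is an LSN stored in the slot and that a prepare whose LSN is less than it must be sent together with the commit prepared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* PostgreSQL represents LSNs as 64-bit integers */

#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper: when decoding COMMIT PREPARED, decide whether the
 * PREPARE itself was skipped earlier (because two-phase was not yet enabled
 * at that point) and therefore has to be sent along with the commit.
 */
static bool
must_send_prepare_with_commit(XLogRecPtr prepare_lsn, XLogRecPtr two_phase_at)
{
	/* The rule only makes sense once two-phase has been enabled. */
	assert(two_phase_at != InvalidXLogRecPtr);

	return prepare_lsn < two_phase_at;
}
```

A prepare at or after two_phase_at was already decoded as a prepare, so only the commit prepared needs to be sent for it.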

For this solution on the subscriber-side, I think we need a tri-state
column (two_phase) in pg_subscription. It can have three values
'disable', 'can_enable', 'enable'. By default, it will be 'disable'.
If the user enables 2PC, then we can set it to 'can_enable' and once
we see all relations are in a READY state, restart the apply-worker
and this time while starting the streaming, send the two_pc option and
then we can change the state to 'enable' so that future restarts won't
send this option again. Now on the publisher side, if this option is
present, it will change the value of two_phase_at in the slot to
start_decoding_at. I think something on these lines should be much
easier than the spool-file implementation unless we see any problem
with this idea.
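
The tri-state handling described above can be sketched as a small state machine. This is only an illustration under assumptions: the enum and function names are made up for the example, and the real implementation would keep the state in pg_subscription and restart the apply-worker around the transition.

```c
#include <assert.h>
#include <stdbool.h>

/* The three proposed values of the two_phase column (names are assumptions). */
typedef enum TwoPhaseState
{
	TWOPHASE_DISABLE,
	TWOPHASE_CAN_ENABLE,
	TWOPHASE_ENABLE
} TwoPhaseState;

/* State stored when the subscription is created. */
static TwoPhaseState
twophase_initial_state(bool user_requested)
{
	return user_requested ? TWOPHASE_CAN_ENABLE : TWOPHASE_DISABLE;
}

/* Should the apply-worker send the two_pc option when it starts streaming? */
static bool
twophase_send_option(TwoPhaseState state, bool all_tables_ready)
{
	return state == TWOPHASE_CAN_ENABLE && all_tables_ready;
}

/* State to store after streaming has been started. */
static TwoPhaseState
twophase_after_start(TwoPhaseState state, bool all_tables_ready)
{
	assert(state == TWOPHASE_DISABLE ||
		   state == TWOPHASE_CAN_ENABLE ||
		   state == TWOPHASE_ENABLE);

	/* The option is sent exactly once; afterwards the state stays 'enable'. */
	if (twophase_send_option(state, all_tables_ready))
		return TWOPHASE_ENABLE;

	return state;
}
```

Once the state is 'enable', twophase_send_option() returns false, which corresponds to future restarts not sending the option again.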

--
With Regards,
Amit Kapila.

#254Peter Smith
smithpb2250@gmail.com
In reply to: wangsh.fnst@fujitsu.com (#252)

On Sun, Mar 14, 2021 at 1:52 PM wangsh.fnst@fujitsu.com
<wangsh.fnst@fujitsu.com> wrote:

Hi,

I noticed in patch v58-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch

+static void
+prepare_spoolfile_name(char *path, int szpath, Oid subid, char *gid)
+{
+     PsfHashEntry *hentry;
+
+     /*
+      * This name is used as the key in the psf_hash HTAB. Therefore, the name
+      * and the key must be exactly same lengths and padded with '\0' so
+      * garbage does not impact the HTAB lookups.
+      */
+     Assert(sizeof(hentry->name) == MAXPGPATH);
+     Assert(szpath == MAXPGPATH);
+     memset(path, '\0', MAXPGPATH);
+
+     snprintf(path, MAXPGPATH, "%s/psf_%u_%s.changes", PSF_DIR, subid, gid);
+}

The variable hentry is only used when --enable-cassert is specified, so the compiler emits an
unused-variable warning if I run configure without --enable-cassert.

Also, since the comment says the lengths are the same, I think 'Assert(sizeof(hentry->name) == szpath)' would be better.

Thanks for your feedback.

But today Amit suggested [ak0315] that the current psf logic should
all be replaced, after which the function you commented about will no
longer exist.

----
[ak0315] /messages/by-id/CAA4eK1LVEdPYnjdajYzu3k6KEii1+F0jdQ6sWnYugiHcSGZD6Q@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

#255Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#253)
3 attachment(s)

On Mon, Mar 15, 2021 at 2:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think something on these lines should be much
easier than the spool-file implementation unless we see any problem
with this idea.

Here's a new patch-set that implements this new solution proposed by Amit.
Patchset-v60 implements:
* renamed initial_consistent_point to two_phase_at and set it when a stream
is started with two_phase on or slot is created with two_phase on.
* replication slots are created with two_phase off on start.
* start stream with two_phase on only after all tables are in READY state.
* Initially the two_phase parameter of the subscription defaults to PENDING
and is only enabled once all tables are in READY state.
* restrict REFRESH PUBLICATION with copy = true on subscriptions with
two_phase enabled.
* documentation updates

Pending work:
* add documentation for START REPLICATION syntax change.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v60-0002-Support-2PC-txn-subscriber-tests.patch
From 46ead4e143f264424d9474d30d4365db8e2b4153 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 15 Mar 2021 08:27:12 -0400
Subject: [PATCH v60] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 337 ++++++++++++++++++++++++
 src/test/subscription/t/021_twophase_cascade.pl | 280 ++++++++++++++++++++
 2 files changed, 617 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..a17bf21
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,337 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophase NOT IN ('y');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_cascade.pl b/src/test/subscription/t/021_twophase_cascade.pl
new file mode 100644
index 0000000..c96f328
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_cascade.pl
@@ -0,0 +1,280 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophase NOT IN ('y');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

Attachment: v60-0003-Fix-apply-worker-dev-logs.patch (application/octet-stream)
From 415ad469b751f64a42c214ee497d10a196bc5440 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 15 Mar 2021 08:32:45 -0400
Subject: [PATCH v60] Fix apply worker (dev logs)

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for
debugging/testing the patch.
---
 src/backend/replication/logical/tablesync.c | 27 +++++++++++++++++++++++++++
 src/backend/replication/logical/worker.c    |  1 +
 2 files changed, 28 insertions(+)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index d946b59..35d2637 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -409,6 +409,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 */
 	if (MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING)
 	{
+		elog(LOG, "!!> two_phase enable is still pending");
 		if (AllTablesyncsREADY())
 		{
 			ereport(LOG,
@@ -1150,6 +1151,7 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
 
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
@@ -1173,6 +1175,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1180,12 +1183,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1208,6 +1216,8 @@ AnyTablesyncsNotREADY(void)
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncsNotREADY");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1219,6 +1229,12 @@ AnyTablesyncsNotREADY(void)
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
+		elog(LOG,
+			 "!!> AnyTablesyncsNotREADY: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
 		/*
 		 * When the process_syncing_tables_for_apply changes the state from
 		 * SYNCDONE to READY, that change is actually written directly into
@@ -1230,6 +1246,7 @@ AnyTablesyncsNotREADY(void)
 		 */
 		if (rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncsNotREADY: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1241,6 +1258,11 @@ AnyTablesyncsNotREADY(void)
 		pgstat_report_stat(false);
 	}
 
+	elog(LOG,
+		 "!!> AnyTablesyncsNotREADY: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
 	return found_busy;
 }
 
@@ -1292,6 +1314,11 @@ UpdateTwoPhaseTriState(char new_tristate)
 
 		StartTransactionCommand();
 		new_s = GetSubscription(MySubscription->oid, false);
+		elog(LOG,
+			 "!!> 2PC Tri-state for \"%s\": '%c' ==> '%c'",
+			 MySubscription->name,
+			 MySubscription->twophase,
+			 new_s->twophase);
 		CommitTransactionCommand();
 	}
 #endif
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f0e0b11..98ad27e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3203,6 +3203,7 @@ ApplyWorkerMain(Datum main_arg)
 						MyLogicalRepWorker->subid)));
 		proc_exit(0);
 	}
+	elog(LOG, "!!> MAIN: MySubscription twophase = '%c'", MySubscription->twophase);
 
 	MySubscriptionValid = true;
 	MemoryContextSwitchTo(oldctx);
-- 
1.8.3.1

Attachment: v60-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 48dda51254e7f2660d328e307c30a966a78c605d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 15 Mar 2021 08:24:56 -0400
Subject: [PATCH v60] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time to the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* The prepare API for streaming transactions is not supported.

* Implement the new SUBSCRIPTION option "two_phase".

* Add a new option to enable two_phase while creating a slot.

* Introduce a tri-state for the subtwophase column of pg_subscription.

* Restrict ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase is enabled.

* Restrict ALTER SUBSCRIPTION SET PUBLICATION WITH (refresh = true) when two_phase is enabled.

* Include documentation updates.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase, and
moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.
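
The tri-state mentioned above can be pictured with a small standalone C
sketch. It is only an illustration under stated assumptions: the state
characters 'n', 'p' and 'y' are the ones documented for subtwophase in this
patch, but the enum, both helper functions and the all_tablesyncs_ready
flag are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* State characters as documented for pg_subscription.subtwophase. */
typedef enum TwoPhaseState
{
	TWOPHASE_DISABLED = 'n',	/* two_phase was not requested */
	TWOPHASE_PENDING = 'p',		/* requested, waiting for table sync */
	TWOPHASE_ENABLED = 'y'		/* requested and enabled */
} TwoPhaseState;

/* Hypothetical helper: state stored when the subscription is created. */
TwoPhaseState
initial_twophase_state(bool two_phase_requested)
{
	return two_phase_requested ? TWOPHASE_PENDING : TWOPHASE_DISABLED;
}

/*
 * Hypothetical helper: a pending subscription becomes enabled only once
 * every table sync has reached READY; all other states are left unchanged.
 */
TwoPhaseState
next_twophase_state(TwoPhaseState cur, bool all_tablesyncs_ready)
{
	assert(cur == TWOPHASE_DISABLED ||
		   cur == TWOPHASE_PENDING ||
		   cur == TWOPHASE_ENABLED);

	if (cur == TWOPHASE_PENDING && all_tablesyncs_ready)
		return TWOPHASE_ENABLED;
	return cur;
}
```

In this sketch a subscription created without two_phase stays at 'n'
forever, which mirrors the rule that the option cannot be toggled later.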

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  14 ++
 doc/src/sgml/protocol.sgml                         |  14 +-
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  36 +++
 src/backend/access/transam/twophase.c              |  68 ++++++
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 121 +++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  17 +-
 src/backend/replication/logical/decode.c           |   2 +-
 src/backend/replication/logical/logical.c          |  27 ++-
 src/backend/replication/logical/logicalfuncs.c     |   2 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 206 ++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |   4 +-
 src/backend/replication/logical/snapbuild.c        |  10 +-
 src/backend/replication/logical/tablesync.c        | 227 +++++++++++++++---
 src/backend/replication/logical/worker.c           | 264 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 193 ++++++++++++---
 src/backend/replication/repl_gram.y                |  21 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slotfuncs.c                |   2 +-
 src/backend/replication/walreceiver.c              |   4 +-
 src/backend/replication/walsender.c                |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  10 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |   8 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |   3 +-
 src/include/replication/logicalproto.h             |  73 +++++-
 src/include/replication/reorderbuffer.h            |  14 +-
 src/include/replication/slot.h                     |   6 +-
 src/include/replication/snapbuild.h                |   4 +-
 src/include/replication/walreceiver.h              |  12 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         |  93 +++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/tools/pgindent/typedefs.list                   |   5 +
 41 files changed, 1364 insertions(+), 167 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index b1de6d0..fa3fd77 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7567,6 +7567,20 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophase</structfield> <type>char</type>
+      </para>
+      <para>
+       The current state of <literal>two_phase</literal> commit:
+       <itemizedlist>
+        <listitem><para><literal>'n'</literal> = two_phase mode was not requested, so is disabled.</para></listitem>
+        <listitem><para><literal>'p'</literal> = two_phase mode was requested, but is pending enablement.</para></listitem>
+        <listitem><para><literal>'y'</literal> = two_phase mode was requested, and is enabled.</para></listitem>
+       </itemizedlist>
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..9694713 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,18 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase transactions.
+         Two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED
+         are also decoded and transmitted. In two-phase transactions, the transaction is
+         decoded and transmitted at PREPARE TRANSACTION time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
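
As a rough illustration of the synopsis in the protocol.sgml hunk above, the
following standalone C sketch formats a logical CREATE_REPLICATION_SLOT
command with the optional TWO_PHASE keyword. build_create_slot_cmd() is a
hypothetical helper, not a function from the patch; it does no identifier
quoting and always appends NOEXPORT_SNAPSHOT, both of which are
simplifications chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the command into buf and return the
 * snprintf() result.  Keyword order follows the documented synopsis:
 * slot_name [TEMPORARY] [TWO_PHASE] LOGICAL output_plugin [snapshot action].
 */
int
build_create_slot_cmd(char *buf, size_t buflen, const char *slot_name,
					  bool temporary, bool two_phase,
					  const char *output_plugin)
{
	assert(buf != NULL && slot_name != NULL && output_plugin != NULL);
	assert(strlen(slot_name) > 0);

	return snprintf(buf, buflen,
					"CREATE_REPLICATION_SLOT %s%s%s LOGICAL %s NOEXPORT_SNAPSHOT",
					slot_name,
					temporary ? " TEMPORARY" : "",
					two_phase ? " TWO_PHASE" : "",
					output_plugin);
}
```

Note that CreateSubscription in this patch deliberately passes two_phase as
false when it creates the slot, so the keyword would only be emitted by
callers that explicitly ask for it.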
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..85cc8bb 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -64,7 +64,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
   <para>
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
-   option as true cannot be executed inside a transaction block.
+   option as true cannot be executed inside a transaction block. They also
+   cannot be executed with <literal>copy_data = true</literal> if the
+   subscription is using <literal>two_phase</literal> commit.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..a5c9158 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,42 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means even when
+          two_phase is enabled for the subscription, the internal two-phase state remains
+          temporarily "pending" until the initialization phase is completed. See column
+          <literal>subtwophase</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..c58c46d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and LSN exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we would need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
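
The matching rule that LookupGXact applies (same GID, same prepare end LSN,
same prepare timestamp) can be shown in isolation with a simplified,
standalone C sketch. The PreparedXact struct and lookup_prepared() are
hypothetical stand-ins invented for this example; unlike the real function
they take no lock and read no two-phase file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a prepared transaction entry. */
typedef struct PreparedXact
{
	bool		valid;				/* entry is fully set up */
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* LSN where the remote PREPARE ended */
	int64_t		origin_timestamp;	/* remote prepare time */
} PreparedXact;

/*
 * Return true only if a valid entry matches on all three keys.  A GID match
 * alone is not enough, because an unrelated prepared transaction on the same
 * node may use the same GID.
 */
bool
lookup_prepared(const PreparedXact *xacts, int n, const char *gid,
				uint64_t prepare_end_lsn, int64_t prepare_timestamp)
{
	assert(gid != NULL);

	for (int i = 0; i < n; i++)
	{
		const PreparedXact *x = &xacts[i];

		/* Skip not-yet-valid entries and entries with a different GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == prepare_timestamp)
			return true;
	}
	return false;
}
```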
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..b77378d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..b1f27ec 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * the current implementation has some issues that could cause a streamed
+	 * prepared transaction to be incorrectly missed in the initial syncing
+	 * phase. Hence, this combination is disabled until that issue can be
+	 * addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -527,8 +578,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			if (create_slot)
 			{
 				Assert(slotname);
-
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with two-phase
+				 * enabled. It will be enabled once all the tables are synced and ready.
+				 * This avoids race conditions that might occur during the initial table sync.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -648,7 +703,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
 										InvalidXLogRecPtr);
 				ereport(DEBUG1,
 						(errmsg_internal("table \"%s.%s\" added to subscription \"%s\"",
-								rv->schemaname, rv->relname, sub->name)));
+										 rv->schemaname, rv->relname, sub->name)));
 			}
 		}
 
@@ -722,9 +777,9 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
 
 				ereport(DEBUG1,
 						(errmsg_internal("table \"%s.%s\" removed from subscription \"%s\"",
-								get_namespace_name(get_rel_namespace(relid)),
-								get_rel_name(relid),
-								sub->name)));
+										 get_namespace_name(get_rel_namespace(relid)),
+										 get_rel_name(relid),
+										 sub->name)));
 			}
 		}
 
@@ -835,7 +890,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +925,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophase != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +954,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +1000,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -953,6 +1017,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/* See ALTER_SUBSCRIPTION_REFRESH for details of why this is not allowed. */
+					if (sub->twophase != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -982,7 +1054,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state we
+				 * must not allow any subsequent table initialization to occur.
+				 * So the ALTER SUBSCRIPTION ... REFRESH is disallowed when the
+				 * user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data = false,
+				 * because when copy_data is false the tablesync will start
+				 * already in READY state and will exit directly without doing
+				 * anything which could interfere with the apply worker's
+				 * message handling.
+				 */
+				if (sub->twophase != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
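
The option restrictions added in the subscriptioncmds.c hunks above boil
down to two checks, sketched here as standalone C. Both function names are
hypothetical and the returned strings only paraphrase the patch's error
messages; the real code raises errors through ereport() rather than
returning text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check for CREATE SUBSCRIPTION: two_phase and streaming may
 * not both be true.  Returns NULL when the combination is acceptable.
 */
const char *
check_subscription_options(bool two_phase, bool streaming)
{
	if (two_phase && streaming)
		return "two_phase = true and streaming = true are mutually exclusive options";
	return NULL;
}

/*
 * Hypothetical check for ALTER SUBSCRIPTION ... REFRESH: refused when
 * copy_data is true and the subscription's subtwophase state is anything
 * other than 'n' (that is, pending 'p' or enabled 'y').
 */
const char *
check_refresh_allowed(char subtwophase, bool copy_data)
{
	assert(subtwophase == 'n' || subtwophase == 'p' || subtwophase == 'y');

	if (subtwophase != 'n' && copy_data)
		return "REFRESH with copy_data is not allowed when two_phase is enabled";
	return NULL;
}
```

With copy_data = false the refresh is accepted in every state, matching the
exception described in the comment in AlterSubscription.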
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..47826ec 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -63,7 +63,7 @@ static void libpqrcv_readtimelinehistoryfile(WalReceiverConn *conn,
 											 TimeLineID tli, char **filename,
 											 char **content, int *len);
 static bool libpqrcv_startstreaming(WalReceiverConn *conn,
-									const WalRcvStreamOptions *options);
+									const WalRcvStreamOptions *options, bool two_phase);
 static void libpqrcv_endstreaming(WalReceiverConn *conn,
 								  TimeLineID *next_tli);
 static int	libpqrcv_receive(WalReceiverConn *conn, char **buffer,
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -387,7 +388,7 @@ libpqrcv_server_version(WalReceiverConn *conn)
  */
 static bool
 libpqrcv_startstreaming(WalReceiverConn *conn,
-						const WalRcvStreamOptions *options)
+						const WalRcvStreamOptions *options, bool two_phase)
 {
 	StringInfoData cmd;
 	PGresult   *res;
@@ -427,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -453,6 +458,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, " TIMELINE %u",
 						 options->proto.physical.startpointTLI);
 
+	if (options->logical && two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	/* Start streaming. */
 	res = libpqrcv_PQexec(conn->streamConn, cmd.data);
 	pfree(cmd.data);
@@ -827,7 +835,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +849,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f59613..5ba9e1e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -730,7 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..7b72ec7 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -479,7 +479,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 					  XLogReaderRoutine *xl_routine,
 					  LogicalOutputPluginWriterPrepareWrite prepare_write,
 					  LogicalOutputPluginWriterWrite do_write,
-					  LogicalOutputPluginWriterUpdateProgress update_progress)
+					  LogicalOutputPluginWriterUpdateProgress update_progress,
+					  bool two_phase)
 {
 	LogicalDecodingContext *ctx;
 	ReplicationSlot *slot;
@@ -526,6 +527,20 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 		start_lsn = slot->data.confirmed_flush;
 	}
 
+	/*
+	 * If starting with two_phase enabled then set two_phase_at point.
+	 * Also update the slot to be two_phase enabled and save the slot
+	 * to disk.
+	 */
+	if (two_phase)
+	{
+		slot->data.two_phase_at = start_lsn;
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
+
+
 	ctx = StartupDecodingContext(output_plugin_options,
 								 start_lsn, InvalidTransactionId, false,
 								 fast_forward, xl_routine, prepare_write,
@@ -538,10 +553,10 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions if the two_phase option is
+	 * enabled at the time of slot creation or at restart.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase = slot->data.two_phase || two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +617,7 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index f7e0558..366c50e 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -237,7 +237,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 											   .segment_open = wal_segment_open,
 											   .segment_close = wal_segment_close),
 									LogicalOutputPrepareWrite,
-									LogicalOutputWrite, NULL);
+									LogicalOutputWrite, NULL, false);
 
 		/*
 		 * After the sanity checks in CreateDecodingContext, make sure the
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
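
The guard in the replorigin_advance hunk above can be sketched in isolation. This is only an illustration, not the patch's code: the sketch_ names and the plain integer LSNs are assumptions. It shows why an older position arriving late (a prepare sent together with its commit prepared, after another transaction has already committed) must not move the origin backwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr, for illustration only. */
typedef uint64_t SketchLsn;

/*
 * Return the position the origin should remember after seeing "incoming".
 * The remembered position only moves forward, unless the caller explicitly
 * allows it to go backward.
 */
static SketchLsn
sketch_origin_advance(SketchLsn remembered, SketchLsn incoming, bool go_backward)
{
	if (go_backward || remembered < incoming)
		return incoming;
	return remembered;
}
```

For example, if an unrelated commit has already advanced the origin to 200 and a delayed prepare then reports 110, the remembered value stays at 200.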
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..488b2a2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,212 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
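
For readers following the wire format, here is a rough standalone sketch of the PREPARE message layout written by logicalrep_write_prepare above: one type byte, one flags byte, three 64-bit fields in network byte order, then the NUL-terminated gid. It does not use the pq_* routines. The sketch_ names, the 'P' type byte and the 200-byte gid limit are assumptions made for illustration. Unlike the strcpy in the patch, the reader checks the gid length before copying.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE		200		/* assumed gid buffer size, including NUL */
#define SKETCH_MSG_PREPARE	'P'		/* assumed message type byte */

typedef struct SketchPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Store a 64-bit value at buf+off in big-endian order; return the new offset. */
static size_t
sketch_put_u64(uint8_t *buf, size_t off, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[off + i] = (uint8_t) (v >> (56 - 8 * i));
	return off + 8;
}

/* Load a big-endian 64-bit value from buf+off. */
static uint64_t
sketch_get_u64(const uint8_t *buf, size_t off)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[off + i];
	return v;
}

/* Serialize a PREPARE message; returns bytes written, or 0 if buf is too small. */
static size_t
sketch_write_prepare(uint8_t *buf, size_t cap, const SketchPrepare *p)
{
	size_t		gidlen = strlen(p->gid) + 1;
	size_t		off = 0;

	if (cap < 1 + 1 + 24 + gidlen)
		return 0;
	buf[off++] = SKETCH_MSG_PREPARE;
	buf[off++] = 0;				/* flags, currently always zero */
	off = sketch_put_u64(buf, off, p->prepare_lsn);
	off = sketch_put_u64(buf, off, p->end_lsn);
	off = sketch_put_u64(buf, off, (uint64_t) p->prepare_time);
	memcpy(buf + off, p->gid, gidlen);
	return off + gidlen;
}

/* Parse a PREPARE message; returns 0 on success, -1 if it is malformed. */
static int
sketch_read_prepare(const uint8_t *buf, size_t len, SketchPrepare *out)
{
	const uint8_t *gid;
	const uint8_t *nul;

	if (len < 1 + 1 + 24 + 1 || buf[0] != SKETCH_MSG_PREPARE || buf[1] != 0)
		return -1;
	out->prepare_lsn = sketch_get_u64(buf, 2);
	out->end_lsn = sketch_get_u64(buf, 10);
	out->prepare_time = (int64_t) sketch_get_u64(buf, 18);
	if (out->prepare_lsn == 0 || out->end_lsn == 0)
		return -1;				/* 0 plays the role of InvalidXLogRecPtr */

	/* The gid must be NUL-terminated inside the message and fit the buffer. */
	gid = buf + 26;
	nul = memchr(gid, 0, len - 26);
	if (nul == NULL || (size_t) (nul - gid) >= SKETCH_GIDSIZE)
		return -1;
	memcpy(out->gid, gid, (size_t) (nul - gid) + 1);
	return 0;
}
```

COMMIT PREPARED uses the same field order in the patch, while ROLLBACK PREPARED carries one extra 64-bit timestamp before the gid.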
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 91600ac..10ad8a7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2672,7 +2672,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2703,7 +2703,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * prepare if it was not decoded earlier. We don't need to decode the xact
 	 * for aborts if it is not done already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed3acad..b6769f7 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -173,7 +173,7 @@ struct SnapBuild
 	 * needs to be sent later along with commit prepared and they must be
 	 * before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -373,9 +373,9 @@ SnapBuildCurrentState(SnapBuild *builder)
  * Return the LSN at which the snapshot was exported
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..d946b59 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,13 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
+static bool AnyTablesyncsNotREADY(void);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +365,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +372,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +393,36 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly 'enabled'
+	 * at that time.
+	 */
+	if (MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING)
+	{
+		if (AllTablesyncsREADY())
+		{
+			ereport(LOG,
+					(errmsg("logical replication apply worker for subscription \"%s\" will restart so 2PC can be enabled",
+					MySubscription->name)));
+
+			proc_exit(0);
+		}
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1052,7 +1049,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
@@ -1137,3 +1134,165 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are all tablesyncs READY?
+ */
+bool
+AllTablesyncsREADY(void)
+{
+	return !AnyTablesyncsNotREADY();
+}
+
+/*
+ * Are there any tablesyncs which are not yet READY?
+ */
+static bool
+AnyTablesyncsNotREADY(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables to see if any are also not-SYNCDONE
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		/*
+		 * When the process_syncing_tables_for_apply changes the state from
+		 * SYNCDONE to READY, that change is actually written directly into
+		 * the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	return found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase tri-state of the current subscription.
+ */
+void
+UpdateTwoPhaseTriState(char new_tristate)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_tristate == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_tristate == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_tristate == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	Assert(HeapTupleIsValid(tup));
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase tri-state to the requested value */
+	values[Anum_pg_subscription_subtwophase - 1] = CharGetDatum(new_tristate);
+	replaces[Anum_pg_subscription_subtwophase - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+
+#if 1
+	/* This is just debugging, for confirmation the update worked. */
+	{
+		Subscription *new_s;
+
+		StartTransactionCommand();
+		new_s = GetSubscription(MySubscription->oid, false);
+		CommitTransactionCommand();
+	}
+#endif
+}
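
As a reading aid for the tablesync.c changes above, the tri-state rules can be written down as a few small pure functions. This is a sketch under assumptions, not code from the patch: the sketch_ names are hypothetical and the 'd'/'p'/'e' catalog characters are assumed values for the LOGICALREP_TWOPHASE_STATE_* constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed character values for the tri-state, for illustration only. */
#define SKETCH_TWOPHASE_DISABLED	'd'
#define SKETCH_TWOPHASE_PENDING		'p'
#define SKETCH_TWOPHASE_ENABLED		'e'

/* A subscription created with two_phase = on starts out as PENDING. */
static char
sketch_initial_state(bool two_phase_requested)
{
	return two_phase_requested ? SKETCH_TWOPHASE_PENDING
		: SKETCH_TWOPHASE_DISABLED;
}

/*
 * A running apply worker exits (to be restarted) once every tablesync is
 * READY while the state is still PENDING.
 */
static bool
sketch_running_worker_must_restart(char state, bool all_tablesyncs_ready)
{
	return state == SKETCH_TWOPHASE_PENDING && all_tablesyncs_ready;
}

/*
 * At apply worker start-up: PENDING becomes ENABLED only if all tablesyncs
 * are READY. *stream_two_phase models only the extra bool argument passed
 * to walrcv_startstreaming in the patch.
 */
static char
sketch_state_at_worker_start(char state, bool all_tablesyncs_ready,
							 bool *stream_two_phase)
{
	if (state == SKETCH_TWOPHASE_PENDING && all_tablesyncs_ready)
	{
		*stream_two_phase = true;
		return SKETCH_TWOPHASE_ENABLED;
	}
	*stream_two_phase = false;
	return state;
}

/* ALTER SUBSCRIPTION ... REFRESH is rejected unless DISABLED or copy_data = false. */
static bool
sketch_refresh_allowed(char state, bool copy_data)
{
	return state == SKETCH_TWOPHASE_DISABLED || !copy_data;
}
```

Note that in this model PENDING never reverts to DISABLED by itself, and the refresh restriction applies to both PENDING and ENABLED, which matches the check added to AlterSubscription.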
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 21d304a..f0e0b11 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,48 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE COMMIT TRI-STATE LOGIC
+ * --------------------------------
+ * By sad timing of apply/tablesync workers it was previously possible to have a
+ * prepared transaction that arrives at the apply worker when the tablesync is
+ * busy doing the initial sync. In this case, the apply worker does the begin
+ * prepare ('b') but it skips all the prepared operations [e.g. inserts] while
+ * the tablesync was still busy (see the condition of
+ * should_apply_changes_for_rel).
+ *
+ * This would lead to an "empty prepare", because later when the apply worker
+ * does the commit prepare ('K'), there is nothing in it (the inserts were
+ * skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription two-phase
+ * commit is now implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to *properly*
+ * enable the publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED.
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophase).
+ *
+ * Finally, to avoid problems from any subsequent (not READY) tablesyncs
+ * interfering with the messages (same as the original problem) there is a
+ * restriction for ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not
+ * permitted for two_phase = on, except when copy_data = false.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +101,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -720,6 +763,168 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("transaction identifier \"%s\" is already in use",
+						begin_data.gid)));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	/*
+	 * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+	 * was dispatched via the psf spoolfile replay then the remote_final_lsn
+	 * is set to commit lsn instead. Hence the <= instead of == check below.
+	 */
+	Assert(prepare_data.prepare_lsn <= remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1954,6 +2159,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2439,6 +2666,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3085,9 +3313,43 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when two_phase mode is requested by the user, it remains in the
+		 * PENDING tri-state until all tablesyncs have reached READY state.
+		 * Only then can it become ENABLED.
+		 */
+		bool all_tables_ready = AllTablesyncsREADY();
 
+		if (MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING && all_tables_ready)
+		{
+			/* Start streaming with two_phase enabled */
+			walrcv_startstreaming(wrconn, &options, true);
+			UpdateTwoPhaseTriState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophase = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options, false);
+		}
+
+		ereport(LOG,
+			(errmsg("logical replication apply worker for subscription \"%s\" 2PC is %s",
+			MySubscription->name,
+			MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+			MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+			MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+			"?")));
+
+	}
+	else
+	{
 	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	walrcv_startstreaming(wrconn, &options, false);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2b9e7b8 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,18 +171,22 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -232,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -245,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -269,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -310,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default, in
+		 * which case we just update the flag in the decoding context. Otherwise
+		 * we only allow it with a sufficient protocol version, and when the
+		 * output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -322,8 +374,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +394,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +415,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +875,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1282,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send the replication origin message, if requested. */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..3fd5914 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -322,9 +325,9 @@ start_replication:
 				}
 			;
 
-/* START_REPLICATION SLOT slot LOGICAL %X/%X options */
+/* START_REPLICATION SLOT slot LOGICAL %X/%X options TWO_PHASE */
 start_logical_replication:
-			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options
+			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options opt_two_phase
 				{
 					StartReplicationCmd *cmd;
 					cmd = makeNode(StartReplicationCmd);
@@ -332,6 +335,7 @@ start_logical_replication:
 					cmd->slotname = $3;
 					cmd->startpoint = $5;
 					cmd->options = $6;
+					cmd->two_phase = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +369,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9817b44..951aa6c 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -515,7 +515,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 									XL_ROUTINE(.page_read = read_local_xlog_page,
 											   .segment_open = wal_segment_open,
 											   .segment_close = wal_segment_close),
-									NULL, NULL, NULL);
+									NULL, NULL, NULL, false);
 
 		/*
 		 * Start reading at the slot's restart_lsn, which we know to point to
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8532296..1325c29 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -364,7 +364,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
@@ -388,7 +388,7 @@ WalReceiverMain(void)
 		options.slotname = slotname[0] != '\0' ? slotname : NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		ThisTimeLineID = startpointTLI;
-		if (walrcv_startstreaming(wrconn, &options))
+		if (walrcv_startstreaming(wrconn, &options, false))
 		{
 			if (first_stream)
 				ereport(LOG,
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 23baa44..ac8a566 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1171,7 +1171,7 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 										 .segment_open = WalSndSegmentOpen,
 										 .segment_close = wal_segment_close),
 							  WalSndPrepareWrite, WalSndWriteData,
-							  WalSndUpdateProgress);
+							  WalSndUpdateProgress, cmd->two_phase);
 	xlogreader = logical_decoding_ctx->reader;
 
 	WalSndSetState(WALSNDSTATE_CATCHUP);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..102b012 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h" /* For 2PC tri-state. */
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4228,6 +4229,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4273,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophase\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4303,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4329,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4346,6 +4358,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = { LOGICALREP_TWOPHASE_STATE_DISABLED, '\0' };
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4387,6 +4400,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..96c878b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two_phase are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index ecdb8d7..8f13e20 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2763,7 +2763,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..4695647 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,11 @@
 
 #include "nodes/pg_list.h"
 
+/* two_phase tri-state values. */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'n'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'y'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +59,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +98,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index ebc43a0..2923c04 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -85,6 +85,7 @@ typedef struct StartReplicationCmd
 	TimeLineID	timeline;
 	XLogRecPtr	startpoint;
 	List	   *options;
+	bool		two_phase;
 } StartReplicationCmd;
 
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c253403..43d9de0 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -115,7 +115,8 @@ extern LogicalDecodingContext *CreateDecodingContext(XLogRecPtr start_lsn,
 													 XLogReaderRoutine *xl_routine,
 													 LogicalOutputPluginWriterPrepareWrite prepare_write,
 													 LogicalOutputPluginWriterWrite do_write,
-													 LogicalOutputPluginWriterUpdateProgress update_progress);
+													 LogicalOutputPluginWriterUpdateProgress update_progress,
+													 bool two_phase);
 extern void DecodingContextFindStartpoint(LogicalDecodingContext *ctx);
 extern bool DecodingContextReady(LogicalDecodingContext *ctx);
 extern void FreeDecodingContext(LogicalDecodingContext *ctx);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..a5bb4de 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -54,10 +59,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for both prepare and commit prepared. For commit
+ * prepared, prepare_lsn and preparetime store the commit LSN and the
+ * commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6280559 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -643,7 +655,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..1f4b253 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,9 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..8e5e9ed 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,7 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildTwoPhaseAt(SnapBuild *builder);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..e5b6329 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -301,7 +302,7 @@ typedef void (*walrcv_readtimelinehistoryfile_fn) (WalReceiverConn *conn,
  * didn't switch to copy-mode.
  */
 typedef bool (*walrcv_startstreaming_fn) (WalReceiverConn *conn,
-										  const WalRcvStreamOptions *options);
+										  const WalRcvStreamOptions *options, bool two_phase);
 
 /*
  * walrcv_endstreaming_fn
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -412,16 +414,16 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_server_version(conn)
 #define walrcv_readtimelinehistoryfile(conn, tli, filename, content, size) \
 	WalReceiverFunctions->walrcv_readtimelinehistoryfile(conn, tli, filename, content, size)
-#define walrcv_startstreaming(conn, options) \
-	WalReceiverFunctions->walrcv_startstreaming(conn, options)
+#define walrcv_startstreaming(conn, options, two_phase) \
+	WalReceiverFunctions->walrcv_startstreaming(conn, options, two_phase)
 #define walrcv_endstreaming(conn, next_tli) \
 	WalReceiverFunctions->walrcv_endstreaming(conn, next_tli)
 #define walrcv_receive(conn, buffer, wait_fd) \
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..4be47ad 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsREADY(void);
+extern void UpdateTwoPhaseTriState(char new_tristate);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..d752346 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | n                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | n                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | n                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | n                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | n                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | n                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | n                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 61cf4ea..ddb3cfe 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1343,12 +1343,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
@@ -1959,6 +1962,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1

#256vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#255)

On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Mar 15, 2021 at 2:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think something on these lines should be much
easier than the spool-file implementation unless we see any problem
with this idea.

Here's a new patch-set that implements this new solution proposed by Amit.

Thanks for the updated patch.
Few comments:
1) These are no longer needed, as the code that used them has been removed by the new changes.
@@ -1959,6 +1962,8 @@ ProtocolVersion
PrsStorage
PruneState
PruneStepResult
+PsfFile
+PsfHashEntry

2) "Binary mode and streaming and two_phase" should be "Binary mode,
streaming and two_phase" in the below code:
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)

        if (verbose)
        {
-               /* Binary mode and streaming are only supported in v14 and higher */
+               /* Binary mode and streaming and two_phase are only supported in v14 and higher */
                if (pset.sversion >= 140000)
                        appendPQExpBuffer(&buf,
3) There is still a reference to the psf spoolfile here; it should be
removed. Also check whether the assert should be <= or ==.
+       /*
+        * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+        * was dispatched via the psf spoolfile replay then the remote_final_lsn
+        * is set to commit lsn instead. Hence the <= instead of == check below.
+        */
+       Assert(prepare_data.prepare_lsn <= remote_final_lsn);
4) Similarly in below code:
+       /*
+        * It is possible that we haven't received prepare because it occurred
+        * before walsender reached a consistent point in which case we need to
+        * skip rollback prepared.
+        *
+        * And we also skip the FinishPreparedTransaction if we're using the
+        * Prepare Spoolfile (using_psf) because in that case there is no matching
+        * PrepareTransactionBlock done yet.
+        */
+       if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+                                       rollback_data.preparetime))
+       {
5) Should this be present:
+#if 1
+       /* This is just debugging, for confirmation the update worked. */
+       {
+               Subscription *new_s;
+
+               StartTransactionCommand();
+               new_s = GetSubscription(MySubscription->oid, false);
+               CommitTransactionCommand();
+       }
+#endif

Regards,
Vignesh

#257Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#255)

On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Mar 15, 2021 at 2:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think something on these lines should be much
easier than the spool-file implementation unless we see any problem
with this idea.

Here's a new patch-set that implements this new solution proposed by Amit.
Patchset-v60 implements:

I have reviewed the latest patch and below are my comments, some of
these might overlap with Vignesh's as I haven't looked at his comments
in detail.
Review comments
================
1.
+ * And we also skip the FinishPreparedTransaction if we're using the
+ * Prepare Spoolfile (using_psf) because in that case there is no matching
+ * PrepareTransactionBlock done yet.
+ */
+ if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+ rollback_data.preparetime))

The above comment is not required.

2.
While streaming and two_phase can theoretically be supported,
+ * the current implementation has some issues that could lead to a
+ * streaming prepared transaction to be incorrectly missed in the initial
+ * syncing phase. Hence, disabling this combination till that issue can
+ * be addressed.
+ */
+ if (twophase && *twophase_given && *twophase)

I don't think the above statement is correct as per the current patch.
We can say something like: "While streaming and two_phase can
theoretically be supported, it needs more analysis to allow them
together." or something on those lines.

3.
-
- walrcv_create_slot(wrconn, slotname, false,
+ /*
+ * Even if two_phase is set, don't create the slot with two-phase
+ * enabled. Will enable it once all the tables are synced and ready.
+ * This avoids race-conditions that might occur during initial table-sync.
+ */
+ walrcv_create_slot(wrconn, slotname, false, false,
     CRS_NOEXPORT_SNAPSHOT, NULL);

Can we please explain a bit more about the race conditions because of
which two_phase can be enabled only after the initial sync?

4.
@@ -648,7 +703,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
  InvalidXLogRecPtr);
  ereport(DEBUG1,
  (errmsg_internal("table \"%s.%s\" added to subscription \"%s\"",
- rv->schemaname, rv->relname, sub->name)));
+ rv->schemaname, rv->relname, sub->name)));
..
..
@@ -722,9 +777,9 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
  ereport(DEBUG1,
  (errmsg_internal("table \"%s.%s\" removed from subscription \"%s\"",
- get_namespace_name(get_rel_namespace(relid)),
- get_rel_name(relid),
- sub->name)));
+ get_namespace_name(get_rel_namespace(relid)),
+ get_rel_name(relid),
+ sub->name)));

Is there any reason for the above changes w.r.t this patch?

5.
+
+ /*
+ * The subscription two_phase commit implementation requires
+ * that replication has passed the initial table
+ * synchronization phase before the two_phase becomes properly
+ * enabled.
+ *
+ * But, having reached this two-phase commit "enabled" state we
+ * must not allow any subsequent table initialization to occur.
+ * So the ALTER SUBSCRIPTION ... REFRESH is disallowed when the
+ * the user had requested two_phase = on mode.

I suggest we expand the comments more here to specify what problem can
happen if we allow subsequent table initialization after the two_phase
is enabled for the subscription. Or you can point to comments atop
worker.c.

6.
@@ -526,6 +527,20 @@ CreateDecodingContext(XLogRecPtr start_lsn,
start_lsn = slot->data.confirmed_flush;
}

+ /*
+ * If starting with two_phase enabled then set two_phase_at point.
+ * Also update the slot to be two_phase enabled and save the slot
+ * to disk.
+ */
+ if (two_phase)
+ {
+ slot->data.two_phase_at = start_lsn;
+ slot->data.two_phase = true;
+ ReplicationSlotMarkDirty();
+ ReplicationSlotSave();
+ }

Do we want to Assert that two_phase variables are not already set as
we don't want those to be reset?

7.
/*
- * We allow decoding of prepared transactions iff the two_phase option is
- * enabled at the time of slot creation.
+ * We allow decoding of prepared transactions if the two_phase option is
+ * enabled at the time of slot creation or at restart.
  */

In the above comments, there is no need to change iff to if. iff means
'if and only if' which makes sense in the above comment.

- ctx->twophase &= MyReplicationSlot->data.two_phase;
+ ctx->twophase = slot->data.two_phase || two_phase;

Why have you removed '&' in the above assignment? It is possible that
the plugin doesn't provide two_phase APIs, in which case we can't
support two_phase even if asked by the user. I think we probably need
to write it as: ctx->twophase &= (slot->data.two_phase ||
two_phase);
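
To illustrate the point about '&=', here is a minimal standalone sketch.
It is not code from the patch: resolve_twophase() and its parameter names
are hypothetical, and plugin_has_2pc_callbacks merely stands in for the
value ctx->twophase holds once the output plugin's callbacks have been
inspected.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of the patch): models how ctx->twophase
 * could be derived so that the plugin capability is never overridden.
 */
static bool
resolve_twophase(bool plugin_has_2pc_callbacks,
				 bool slot_two_phase,
				 bool two_phase_requested)
{
	/* Start from what the output plugin is able to do. */
	bool		twophase = plugin_has_2pc_callbacks;

	/*
	 * "&=" keeps the plugin capability as an upper bound; a plain "="
	 * would turn two_phase on even for a plugin without the callbacks.
	 */
	twophase &= (slot_two_phase || two_phase_requested);

	return twophase;
}
```

With a plain assignment, the first argument would be ignored and a
request from the user alone would be enough to enable two_phase.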

8.
@@ -602,7 +617,7 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)

  SpinLockAcquire(&slot->mutex);
  slot->data.confirmed_flush = ctx->reader->EndRecPtr;
- slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+ slot->data.two_phase_at = ctx->reader->EndRecPtr;
  SpinLockRelease(&slot->mutex);

I think we need to set the two_phase_at only when the slot has
two_phase enabled? Previously, it was fine to set it because it was a
generic initial consistent point for a slot but after changing the
variable name it doesn't seem to make sense to assign it unless
two_phase is enabled.
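
A rough sketch of what this could look like, using a mock structure
rather than the real ReplicationSlotPersistentData. MockSlotData,
mock_record_startpoint() and the XLogRecPtr typedef below are stand-ins
made up for the example, and the spinlock handling is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL definitions, for illustration only. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct MockSlotData
{
	XLogRecPtr	confirmed_flush;
	XLogRecPtr	two_phase_at;
	bool		two_phase;
} MockSlotData;

/*
 * Hypothetical version of the assignment in
 * DecodingContextFindStartpoint(): confirmed_flush is always advanced,
 * but two_phase_at is recorded only for a two_phase enabled slot.
 */
static void
mock_record_startpoint(MockSlotData *data, XLogRecPtr end_rec_ptr)
{
	data->confirmed_flush = end_rec_ptr;

	if (data->two_phase)
		data->two_phase_at = end_rec_ptr;
}
```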

9.
* needs to be sent later along with commit prepared and they must be
* before this point.
*/
- XLogRecPtr initial_consistent_point;
+ XLogRecPtr two_phase_at;

I think the explanation of this variable also needs to be updated
because now this can be used even for the first time when we enable
two_phase during streaming start.

10.
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
  XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
- XLogRecPtr initial_consistent_point,
+ XLogRecPtr two_phase_at,
  TimestampTz commit_time, RepOriginId origin_id,
  XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2703,7 +2703,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
  * prepare if it was not decoded earlier. We don't need to decode the xact
  * for aborts if it is not done already.
  */
- if ((txn->final_lsn < initial_consistent_point) && is_commit)
+ if ((txn->final_lsn < two_phase_at) && is_commit)

How can this change work? During decode prepare processing the patch
only remembers the prepare info in DecodePrepare whereas we would have
skipped the prepare before that via FilterPrepare. I think we need to
remember the prepare info before calling DecodePrepare. If you have
not already tested this scenario, please test it before posting the
next version and also explain exactly how you tested it.
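
For reference, the condition being discussed can be modelled in
isolation as below. This is only a sketch: the function name and the
XLogRecPtr typedef are invented for the example, and it assumes the
prepare's final LSN has been remembered even when the prepare itself was
skipped, which is exactly the part that needs to be verified.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the PostgreSQL type */

/*
 * Hypothetical predicate: should the whole transaction be decoded at
 * COMMIT PREPARED time because its PREPARE was not decoded earlier?
 *
 * prepare_final_lsn must be valid for this to work, so the prepare info
 * has to be recorded before the prepare is filtered out.
 */
static bool
needs_replay_at_commit_prepared(XLogRecPtr prepare_final_lsn,
								XLogRecPtr two_phase_at,
								bool is_commit)
{
	/* Aborts are never replayed if the prepare was not decoded. */
	return (prepare_final_lsn < two_phase_at) && is_commit;
}
```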

11.
+/*
+ * Are all tablesyncs READY?
+ */
+bool
+AllTablesyncsREADY(void)
+{
+ return !AnyTablesyncsNotREADY();
+}
+
+/*
+ * Are there any tablesyncs which are not yet READY?
+ */
+static bool
+AnyTablesyncsNotREADY(void)
+{

I don't think we need separate functions here.

12.
+/*
+ * Update the p_subscription two_phase tri-state of the current subscription.
+ */
+void
+UpdateTwoPhaseTriState(char new_tristate)

I would prefer not to include 'Tri' in the above function or variable
name. We might want to extend the states in future or even without
that it would be better not to include 'tri' here.

13.
+void
+UpdateTwoPhaseTriState(char new_tristate)
{
..
+#if 1
+ /* This is just debugging, for confirmation the update worked. */
+ {
+ Subscription *new_s;
+
+ StartTransactionCommand();
+ new_s = GetSubscription(MySubscription->oid, false);
+ CommitTransactionCommand();
+ }
+#endif
..
}

Let's remove the debugging code from the main patch.

14.
/*
+ * Even when the two_phase mode is requested by the user, it remains as
+ * the tri-state PENDING until all tablesyncs have reached READY state.
+ * Only then, can it become properly ENABLED.
+ */
+ bool all_tables_ready = AllTablesyncsREADY();
+ if (MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING && all_tables_ready)
+ {
+ /* Start streaming with two_phase enabled */
+ walrcv_startstreaming(wrconn, &options, true);
+ UpdateTwoPhaseTriState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+ MySubscription->twophase = LOGICALREP_TWOPHASE_STATE_ENABLED;
+ }
+ else
+ {
+ walrcv_startstreaming(wrconn, &options, false);
+ }
+
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\" 2PC is %s.",
+ MySubscription->name,
+ MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+ MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+ MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+ "?")));

I think the two_phase code here is relevant only if we are talking to a
server of version >= 14. You can check that with
"walrcv_server_version(wrconn) >= 140000".

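A small sketch of how the version gate might be folded into the decision
quoted above. The enum and the function are hypothetical stand-ins (the
patch uses char constants such as LOGICALREP_TWOPHASE_STATE_PENDING),
and the logic otherwise just mirrors the quoted if/else.

```c
#include <stdbool.h>

/* Mock of the subscription two_phase states, for illustration only. */
typedef enum MockTwoPhaseState
{
	MOCK_TWOPHASE_DISABLED,
	MOCK_TWOPHASE_PENDING,
	MOCK_TWOPHASE_ENABLED
} MockTwoPhaseState;

/*
 * Hypothetical helper: decide whether to start streaming with two_phase
 * enabled. server_version is assumed to be in the numeric form returned
 * by walrcv_server_version(), e.g. 140000 for v14.
 */
static bool
should_start_with_two_phase(int server_version,
							MockTwoPhaseState state,
							bool all_tables_ready)
{
	/* Older publishers know nothing about two_phase. */
	if (server_version < 140000)
		return false;

	/* Same condition as in the quoted code: PENDING and all READY. */
	return state == MOCK_TWOPHASE_PENDING && all_tables_ready;
}
```
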
15.
+static void
+FetchTableStates(bool *started_tx)
+{
+ *started_tx = false;
+
+ if (!table_states_valid)
+ {
+ MemoryContext oldctx;
+ List    *rstates;
+ ListCell   *lc;
+ SubscriptionRelState *rstate;
+
+
+ /* Clean the old lists. */
+ list_free_deep(table_states_all);
+ table_states_all = NIL;

The patch doesn't seem to be using table_states_all, it might be
leftover from the previous version. Also, the logic in this function
can simply use GetSubscriptionNotReadyRelations as the existing code
is using.

16.
+static bool
+AnyTablesyncsNotREADY(void)
+{
+ bool found_busy = false;
+ bool started_tx = false;
+ int count = 0;
+ ListCell   *lc;
+
+ /* We need up-to-date sync state info for subscription tables here. */
+ FetchTableStates(&started_tx);
+
+ /*
+ * Process all not-READY tables to see if any are also not-SYNCDONE
+ */
+ foreach(lc, table_states_not_ready)
+ {
+ SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+ count++;
+ /*
+ * When the process_syncing_tables_for_apply changes the state from
+ * SYNCDONE to READY, that change is actually written directly into
+ * the list element of table_states_not_ready.
+ *
+ * So the "table_states_not_ready" list might end up having a READY
+ * state in it even though there was none when it was initially
+ * created. This is reason why we need to check for READY below.
+ */
+ if (rstate->state != SUBREL_STATE_READY)
+ {
+ found_busy = true;
+ break;
+ }
+ }

Do we really need to do this recheck in the for loop? How does it
matter? I guess if this is not required, we can simply check whether
the table_states_not_ready list is empty or not.
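
The simpler alternative could look roughly like the sketch below.
MockList is a made-up stand-in for PostgreSQL's List (where an empty
list is NIL, i.e. a NULL pointer), and the function name is invented for
the example; it deliberately ignores the in-place SYNCDONE to READY
update described in the quoted comment.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for PostgreSQL's List, for illustration only. */
typedef struct MockList
{
	int			length;
} MockList;

/*
 * Hypothetical replacement for the loop: all tablesyncs are treated as
 * READY when the not-ready list has no entries.
 */
static bool
mock_all_tablesyncs_ready(const MockList *table_states_not_ready)
{
	return table_states_not_ready == NULL ||
		table_states_not_ready->length == 0;
}
```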

17.
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\" will restart so 2PC can be enabled",

In the above message, I think it is better to write two_phase instead of 2PC.

18.
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rollbacked? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+

Are these macros used anywhere? If not, please remove them.

--
With Regards,
Amit Kapila.

#258vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#255)

On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Mon, Mar 15, 2021 at 2:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think something on these lines should be much
easier than the spool-file implementation unless we see any problem
with this idea.

Here's a new patch-set that implements this new solution proposed by Amit.

Another couple of comments:
1) Should Assert be changed to the following in the below code:
if (!HeapTupleIsValid(tup))
elog(ERROR, "cache lookup failed for subscription %u", MySubscription->oid);

+       rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+       tup = SearchSysCacheCopy1(SUBSCRIPTIONOID,
ObjectIdGetDatum(MySubscription->oid));
+       Assert(HeapTupleIsValid(tup));

2) The table_states_not_ready global variable is used immediately after
the call to FetchTableStates. We could make FetchTableStates return the
value, or pass it back through a function argument, and then remove
the global variables.
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;

Regards,
Vignesh

#259Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#258)

On Tue, Mar 16, 2021 at 6:22 PM vignesh C <vignesh21@gmail.com> wrote:

2) table_states_not_ready global variable is used immediately after
call to FetchTableStates, we can make FetchTableStates return the
value or get it as an argument to the function and the global
variables can be removed.
+static List *table_states_not_ready = NIL;

But we do update the states in the list table_states_not_ready in
function process_syncing_tables_for_apply. So, the current arrangement
looks good to me.

--
With Regards,
Amit Kapila.

#260vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#259)

On Tue, Mar 16, 2021 at 7:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

But we do update the states in the list table_states_not_ready in
function process_syncing_tables_for_apply. So, the current arrangement
looks good to me.

But I felt we can do this without using global variables.
table_states_not_ready is used immediately after calling
FetchTableStates in AnyTablesyncsNotREADY and
process_syncing_tables_for_apply functions. It is not used anywhere
else. My point was we do not need to store this in global variables as
it is not needed elsewhere. We could change the return type or return
it through the function argument in this case.
Thoughts?

Regards,
Vignesh

#261Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#260)

On Wed, Mar 17, 2021 at 8:07 AM vignesh C <vignesh21@gmail.com> wrote:

But I felt we can do this without using global variables.
table_states_not_ready is used immediately after calling
FetchTableStates in AnyTablesyncsNotREADY and
process_syncing_tables_for_apply functions. It is not used anywhere
else. My point was we do not need to store this in global variables as
it is not needed elsewhere.

It might be possible, but I am not sure that is better than what we are
currently doing. Moreover, that is existing code, and this patch has
just encapsulated it in a function. Even if you think there is a better
way, which I doubt, we can probably look at it as a separate patch.

--
With Regards,
Amit Kapila.

#262Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#257)
1 attachment(s)

On Tue, Mar 16, 2021 at 5:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 15, 2021 at 6:14 PM Ajin Cherian <itsajin@gmail.com> wrote:

Here's a new patch-set that implements this new solution proposed by Amit.
Patchset-v60 implements:

I have reviewed the latest patch and below are my comments, some of
these might overlap with Vignesh's as I haven't looked at his comments
in detail.
Review comments
================

Few more comments:
=================
1.
+       <structfield>subtwophase</structfield> <type>char</type>
+      </para>
+      <para>
+       The <varname>two_phase commit current state:</varname>
+       <itemizedlist>
+        <listitem><para><literal>'n'</literal> = two_phase mode was
not requested, so is disabled.</para></listitem>
+        <listitem><para><literal>'p'</literal> = two_phase mode was
requested, but is pending enablement.</para></listitem>
+        <listitem><para><literal>'y'</literal> = two_phase mode was
requested, and is enabled.</para></listitem>
+       </itemizedlist>
+      </para></entry>
+     </row>

Can we name the column as subtwophasestate? And then describe as we
are doing for srsubstate in pg_subscription_rel. Also, it might be
better to keep names as: 'd' disabled, 'p' pending twophase enablement
and 'e' twophase enabled.
<row>
<entry role="catalog_table_entry"><para role="column_definition">
<structfield>srsubstate</structfield> <type>char</type>
</para>
<para>
State code:
<literal>i</literal> = initialize,
<literal>d</literal> = data is being copied,
<literal>f</literal> = finished table copy,
<literal>s</literal> = synchronized,
<literal>r</literal> = ready (normal replication)
</para></entry>
</row>
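
As an illustration of the naming proposed in comment 1, the three states could be written out as below. This is only a sketch: the macro names are the ones used in the patch's log message, the letters are the 'd'/'p'/'e' values suggested here, and the helper twophase_state_name is hypothetical.

```c
#include <stddef.h>

/* Letters as proposed in the review; macro names as used in the patch. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
#define LOGICALREP_TWOPHASE_STATE_PENDING  'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED  'e'

/*
 * Hypothetical helper: map a state letter to a readable name.
 * Returns NULL for a letter that is not one of the three states.
 */
static const char *
twophase_state_name(char state)
{
    switch (state)
    {
        case LOGICALREP_TWOPHASE_STATE_DISABLED:
            return "disabled";
        case LOGICALREP_TWOPHASE_STATE_PENDING:
            return "pending";
        case LOGICALREP_TWOPHASE_STATE_ENABLED:
            return "enabled";
    }
    return NULL;
}
```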

2.
@@ -427,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
PQserverVersion(conn->streamConn) >= 140000)
appendStringInfoString(&cmd, ", streaming 'on'");

+ if (options->proto.logical.twophase &&
+ PQserverVersion(conn->streamConn) >= 140000)
+ appendStringInfoString(&cmd, ", two_phase 'on'");
+
  pubnames = options->proto.logical.publication_names;
  pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
  if (!pubnames_str)
@@ -453,6 +458,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
  appendStringInfo(&cmd, " TIMELINE %u",
  options->proto.physical.startpointTLI);
+ if (options->logical && two_phase)
+ appendStringInfoString(&cmd, " TWO_PHASE");
+

Why are we sending two_phase 'on' and " TWO_PHASE" separately? I think
we don't need to introduce TWO_PHASE token in grammar, let's handle it
via plugin_options similar to what we do for 'streaming'. Also, a
similar change would be required for Create_Replication_Slot.
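
A rough sketch of what handling two_phase purely as a plugin option could look like, next to 'streaming', is shown below. It is standalone C rather than the StringInfo-based code in libpqwalreceiver.c, and the function names, parameters, and return convention are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append opt to buf, or return -1 if it does not fit. */
static int
append_opt(char *buf, size_t buflen, const char *opt)
{
    size_t used = strlen(buf);
    size_t need = strlen(opt);

    if (used + need >= buflen)
        return -1;
    memcpy(buf + used, opt, need + 1);  /* copies the terminating NUL too */
    return 0;
}

/*
 * Hypothetical helper: build the option list that goes inside the
 * parentheses of START_REPLICATION ... LOGICAL. Both streaming and
 * two_phase are sent only to servers of version 14 or later, and no
 * separate TWO_PHASE keyword is used.
 */
static int
build_plugin_options(char *buf, size_t buflen, int server_version,
                     unsigned proto_version, bool streaming, bool twophase)
{
    int n = snprintf(buf, buflen, "proto_version '%u'", proto_version);

    if (n < 0 || (size_t) n >= buflen)
        return -1;
    if (streaming && server_version >= 140000 &&
        append_opt(buf, buflen, ", streaming 'on'") != 0)
        return -1;
    if (twophase && server_version >= 140000 &&
        append_opt(buf, buflen, ", two_phase 'on'") != 0)
        return -1;
    return 0;
}
```

With this shape, the replication grammar needs no new token; the output plugin parses two_phase the same way it parses streaming.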

3.
+ /*
+ * Do not allow toggling of two_phase option, this could
+ * cause missing of transactions and lead to an inconsistent
+ * replica.
+ */
+ if (!twophase)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("cannot alter two_phase option")));
+

I think here you can either refer to worker.c to explain how this
could lead to an inconsistent replica, or expand the comments here
if the information is not present elsewhere.

4.
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2

I think it is better to name the new define as
LOGICALREP_PROTO_TWOPHASE_VERSION_NUM. Also mention in comments in
some way that we are keeping the same version number for stream and
two-phase defines because they got introduced in the same release
(14).
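
For illustration, the renamed define and the comment asked for in comment 4 might look like the sketch below. The helper min_proto_version_for is hypothetical and only shows how a caller could pick the lowest protocol version that covers the requested features; it is not part of the patch.

```c
#include <stdbool.h>

#define LOGICALREP_PROTO_MIN_VERSION_NUM 1
#define LOGICALREP_PROTO_VERSION_NUM 1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
/*
 * Same number as the stream define, because streaming and two-phase
 * support were introduced in the same release (14).
 */
#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2

/* Hypothetical helper: lowest protocol version covering the options. */
static int
min_proto_version_for(bool streaming, bool twophase)
{
    int version = LOGICALREP_PROTO_VERSION_NUM;

    if (streaming && version < LOGICALREP_PROTO_STREAM_VERSION_NUM)
        version = LOGICALREP_PROTO_STREAM_VERSION_NUM;
    if (twophase && version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
        version = LOGICALREP_PROTO_TWOPHASE_VERSION_NUM;
    return version;
}
```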

5. I have modified the comments atop worker.c to explain the design
and some of the problems clearly. See attached. If you are fine with
this, please include it in the next version of the patch.

--
With Regards,
Amit Kapila.

Attachments:

change_two_phase_desc_1.patch (application/octet-stream)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f0e0b11678c..9bf347e2caa 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -50,22 +50,26 @@
  * the file we desired across multiple stream-open calls for the same
  * transaction.
  *
- * TWO_PHASE COMMIT TRI-STATE LOGIC
- * --------------------------------
- * By sad timing of apply/tablesync workers it was previously possible to have a
- * prepared transaction that arrives at the apply worker when the tablesync is
- * busy doing the initial sync. In this case, the apply worker does the begin
- * prepare ('b') but it skips all the prepared operations [e.g. inserts] while
- * the tablesync was still busy (see the condition of
- * should_apply_changes_for_rel).
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rollbacked at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead apply worker's current location.  This would lead to
+ * an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
  *
- * This would lead to an "empty prepare", because later when the apply worker
- * does the commit prepare ('K'), there is nothing in it (the inserts were
- * skipped earlier).
- *
- * To avoid this, and similar prepare confusions the subscription two-phase
- * commit is now implemented as a tri-state with values DISABLED, PENDING, and
- * ENABLED
+ * To avoid this, and similar prepare confusions the subscription two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
  *
  * Even if the user specifies they want a subscription with two_phase = on,
  * internally it will start with a tri-state of PENDING which only becomes
@@ -80,17 +84,28 @@
  * process_sync_tables_for_apply.
  *
  * When the (re-started) apply worker finds that all tablesyncs are READY for a
- * two_phase tri-state of PENDING it calls wal_startstreaming to *properly*
- * enable the publisher for two-phase commit and updates the tri-state value
- * PENDING -> ENABLED.
+ * two_phase tri-state of PENDING it calls wal_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enable two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
  *
  * If ever a user needs to be aware of the tri-state value, they can fetch it
  * from the pg_subscription catalog (see column subtwophase).
  *
- * Finally, to avoid problems from any subsequent (not READY) tablesyncs
- * interfering with the messages (same as the original problem) there is a
- * restriction for ALTER SUBSCRIPTION REFRESH  PUBLICATION. This command is not
- * permitted for two_phase = on, except when copy_data = false.
+ * We don't allow to toggle two_phase option of a subscription because it can
+ * lead to the inconsistent replica. Consider, initially, it was on and we have
+ * received some prepare then we turn it off, now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we can allow changing this option from off to on but not sure if
+ * that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted for
+ * two_phase = on, except when copy_data = false.
  *-------------------------------------------------------------------------
  */
 
#263Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#250)

On Fri, Mar 12, 2021 at 8:38 PM vignesh C <vignesh21@gmail.com> wrote:

...

1) I felt twophase_given can be a local variable, it need not be added
as a function parameter as it is not used outside the function.
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
char **synchronous_commit,
bool *refresh,
bool *binary_given,
bool *binary,
-                                                  bool
*streaming_given, bool *streaming)
+                                                  bool
*streaming_given, bool *streaming,
+                                                  bool
*twophase_given, bool *twophase)

The corresponding changes should be done here too:
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt,
bool isTopLevel)
bool copy_data;
bool streaming;
bool streaming_given;
+ bool twophase;
+ bool twophase_given;
char *synchronous_commit;
char *conninfo;
char *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt,
bool isTopLevel)
&synchronous_commit,
NULL,
/* no "refresh" */

&binary_given, &binary,
-
&streaming_given, &streaming);
+
&streaming_given, &streaming,
+
&twophase_given, &twophase);

It was deliberately coded this way for consistency with the other new
PG14 options - e.g. it mimics exactly binary_given, and
streaming_given.

I know the param is not currently used by the caller and so could be a
local (as you say), but I felt the code consistency and future-proofing
benefits outweighed the idea of reducing the code to the bare minimum
required to work just "because we can".

So I don't plan to change this, but if you still feel strongly that
the parameter must be removed please give a convincing reason.

----
Kind Regards,
Peter Smith.
Fujitsu Australia.

#264Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#262)
1 attachment(s)

On Wed, Mar 17, 2021 at 11:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

5. I have modified the comments atop worker.c to explain the design
and some of the problems clearly. See attached. If you are fine with
this, please include it in the next version of the patch.

I have further expanded these comments to explain the handling of
prepared transactions for multiple subscriptions on the same server
especially when the same prepared transaction operates on tables for
those subscriptions. See attached, this applies atop the patch sent by
me in the last email. I am not sure but I think it might be better to
add something on those lines in user-facing docs. What do you think?

Another comment:
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\" 2PC is %s.",
+ MySubscription->name,
+ MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+ MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+ MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+ "?")));

I don't think this is required in LOGs, maybe at some DEBUG level,
because users can check this in pg_subscription. If we keep this
message, there will be two consecutive messages like the ones below in
the logs for subscriptions that have the two_phase option enabled, which
looks a bit odd.
LOG: logical replication apply worker for subscription "mysub" has started
LOG: logical replication apply worker for subscription "mysub" 2PC is ENABLED.

--
With Regards,
Amit Kapila.

Attachments:

change_two_phase_desc_2.patch (application/octet-stream)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9bf347e2caa..ee3eb56d826 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -62,7 +62,7 @@
  * was prior to the initial consistent point but might have got some later
  * commits. Now, the tablesync worker will exit without doing anything for the
  * prepared transaction skipped by the apply worker as the sync location for it
- * will be already ahead apply worker's current location.  This would lead to
+ * will be already ahead apply worker's current location. This would lead to
  * an "empty prepare", because later when the apply worker does the commit
  * prepare, there is nothing in it (the inserts were skipped earlier).
  *
@@ -106,6 +106,16 @@
  * to 'off' and then again back to 'on') there is a restriction for
  * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted for
  * two_phase = on, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, one of the prepare will be successful and
+ * others will fail in which case the server will send them again. Once the
+ * commit prepared is done for the first one, the next prepare will be
+ * successful. We have thought of appending some unique identifier (like subid)
+ * with GID but that won't work for cascaded standby setup as the GID can
+ * become too long.
  *-------------------------------------------------------------------------
  */
 
#265Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#255)
3 attachment(s)

Please find attached the latest patch set v61*

Differences from v60* are:

* Rebased to HEAD @ today

* Addresses the following feedback issues:

----

Vignesh 12/Mar -
/messages/by-id/CALDaNm1p=KYcDc1s_Q0Lk2P8UYU-z4acW066gaeLfXvW_O-kBA@mail.gmail.com

(61) Skipped. twophase_given could be a local variable

----

Vignesh 16/Mar -
/messages/by-id/CALDaNm0qTRapggmUY_kgwNd14cec0i8mS5_PnrMcs_Y-_TXrgA@mail.gmail.com

(68) Fixed. Removed obsolete psf typedefs from typedefs.h.

(69) Done. Updated comment wording.

(70) Fixed. Removed references to psf in comments. Restored the Assert
to how it was before.

(71) Duplicate. See (73)

(72) Duplicate. See (86)

----

Amit 16/Mar - /messages/by-id/CAA4eK1Kwah+MimFMR3jPY5cSqpGFVh5zfV2g4=gTphaPsacoLw@mail.gmail.com

(73) Done. Removed comments referring to obsolete psf.

(76) Done. Removed whitespace changes unrelated to this patch set.

(77) Done. Updated comment of Alter Subscription ... REFRESH.

(84) Done. Removed the extra function AnyTablesyncsNotREADY.

(85) Done. Renamed the function UpdateTwoPhaseTriState.

(86) Fixed. Removed debugging code from the main patch.

(88) Done. Removed the unused table_states_all List.

(90) Fixed. Change the log message to say "two_phase" instead of "2PC".

----

Vignesh 16/Mar -
/messages/by-id/CALDaNm11A5wL0E-GDtqWY00iFzgUPsPLfA+L0zi4SEokEVtoFQ@mail.gmail.com

(92) Fixed. Replace cache failure Assert with ERROR

(93) Skipped. Suggested to remove the global variable for
table_states_not_ready.

----

Amit 17/Mar - /messages/by-id/CAA4eK1LNLA20ci3_qqNQv7BYRTy3HqiAsOfuieqo6tJ2GeYuJw@mail.gmail.com

(95) Done. Renamed the pg_subscription column. New state values d/p/e.
Updated PG docs.

(98) Done. Renamed the constant LOGICALREP_PROTO_2PC_VERSION_NUM.

(99) Fixed. Apply new (supplied) comments atop worker.c

----

Vignesh 17/Mar

(100) Fixed. Applied patch (supplied) to fix a multiple subscriber bug.

-----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v60-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 48dda51254e7f2660d328e307c30a966a78c605d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 15 Mar 2021 08:24:56 -0400
Subject: [PATCH v60] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* prepare API for streaming transactions is not supported.

* implement new SUBSCRIPTION option "two_phase".

* add new option to enable two_phase while creating a slot.

* introduction of tri-state for twophase pg_subscription column.

* restrict ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* restrict ALTER SUBSCRIPTION SET PUBLICATION WITH (refresh = true) when two_phase enabled.

* include documentation update.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  14 ++
 doc/src/sgml/protocol.sgml                         |  14 +-
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  36 +++
 src/backend/access/transam/twophase.c              |  68 ++++++
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 121 +++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  17 +-
 src/backend/replication/logical/decode.c           |   2 +-
 src/backend/replication/logical/logical.c          |  27 ++-
 src/backend/replication/logical/logicalfuncs.c     |   2 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 206 ++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |   4 +-
 src/backend/replication/logical/snapbuild.c        |  10 +-
 src/backend/replication/logical/tablesync.c        | 227 +++++++++++++++---
 src/backend/replication/logical/worker.c           | 264 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 193 ++++++++++++---
 src/backend/replication/repl_gram.y                |  21 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slotfuncs.c                |   2 +-
 src/backend/replication/walreceiver.c              |   4 +-
 src/backend/replication/walsender.c                |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  10 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |   8 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |   3 +-
 src/include/replication/logicalproto.h             |  73 +++++-
 src/include/replication/reorderbuffer.h            |  14 +-
 src/include/replication/slot.h                     |   6 +-
 src/include/replication/snapbuild.h                |   4 +-
 src/include/replication/walreceiver.h              |  12 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         |  93 +++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/tools/pgindent/typedefs.list                   |   5 +
 41 files changed, 1364 insertions(+), 167 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index b1de6d0..fa3fd77 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7567,6 +7567,20 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophase</structfield> <type>char</type>
+      </para>
+      <para>
+       The <varname>two_phase commit current state:</varname>
+       <itemizedlist>
+        <listitem><para><literal>'n'</literal> = two_phase mode was not requested, so is disabled.</para></listitem>
+        <listitem><para><literal>'p'</literal> = two_phase mode was requested, but is pending enablement.</para></listitem>
+        <listitem><para><literal>'y'</literal> = two_phase mode was requested, and is enabled.</para></listitem>
+       </itemizedlist>
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..9694713 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,18 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decode of two-phase transactions.
+         Two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED
+         are also decoded and transmitted. In two-phase transactions, the transaction is 
+         decoded and transmitted at PREPARE TRANSACTION time. 
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..85cc8bb 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -64,7 +64,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
   <para>
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
-   option as true cannot be executed inside a transaction block.
+   option as true cannot be executed inside a transaction block. They also
+   cannot be executed with <literal>copy_data = true</literal> if the
+   subscription is using <literal>two_phase</literal> commit.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..a5c9158 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,42 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on publisher is decoded as normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means even when
+          two_phase is enabled for the subscription, the internal two-phase state remains
+          temporarily "pending" until the initialization phase is completed. See column
+          <literal>subtwophase</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..c58c46d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is	around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match origin_lsn and origin_timestamp
+ * of the prepared xact to avoid matching a prepared xact that came from a
+ * different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..7a56e35 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophase = subform->subtwophase;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..b77378d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophase, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..b1f27ec 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * the current implementation has some issues that could cause a streamed
+	 * prepared transaction to be incorrectly missed in the initial syncing
+	 * phase. Hence, this combination is disabled until that issue can be
+	 * addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophase - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -527,8 +578,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			if (create_slot)
 			{
 				Assert(slotname);
-
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with two-phase
+				 * enabled. It will be enabled once all the tables are synced and
+				 * ready. This avoids race conditions during initial table sync.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -648,7 +703,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
 										InvalidXLogRecPtr);
 				ereport(DEBUG1,
 						(errmsg_internal("table \"%s.%s\" added to subscription \"%s\"",
-								rv->schemaname, rv->relname, sub->name)));
+										 rv->schemaname, rv->relname, sub->name)));
 			}
 		}
 
@@ -722,9 +777,9 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
 
 				ereport(DEBUG1,
 						(errmsg_internal("table \"%s.%s\" removed from subscription \"%s\"",
-								get_namespace_name(get_rel_namespace(relid)),
-								get_rel_name(relid),
-								sub->name)));
+										 get_namespace_name(get_rel_namespace(relid)),
+										 get_rel_name(relid),
+										 sub->name)));
 			}
 		}
 
@@ -835,7 +890,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +925,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophase != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +954,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +1000,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -953,6 +1017,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/* See ALTER_SUBSCRIPTION_REFRESH for details why this is not allowed. */
+					if (sub->twophase != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -982,7 +1054,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state we
+				 * must not allow any subsequent table initialization to occur.
+				 * So the ALTER SUBSCRIPTION ... REFRESH is disallowed when the
+				 * user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data = false,
+				 * because when copy_data is false the tablesync will start
+				 * already in READY state and will exit directly without doing
+				 * anything which could interfere with the apply worker's
+				 * message handling.
+				 */
+				if (sub->twophase != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5272eed..47826ec 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -63,7 +63,7 @@ static void libpqrcv_readtimelinehistoryfile(WalReceiverConn *conn,
 											 TimeLineID tli, char **filename,
 											 char **content, int *len);
 static bool libpqrcv_startstreaming(WalReceiverConn *conn,
-									const WalRcvStreamOptions *options);
+									const WalRcvStreamOptions *options, bool two_phase);
 static void libpqrcv_endstreaming(WalReceiverConn *conn,
 								  TimeLineID *next_tli);
 static int	libpqrcv_receive(WalReceiverConn *conn, char **buffer,
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -387,7 +388,7 @@ libpqrcv_server_version(WalReceiverConn *conn)
  */
 static bool
 libpqrcv_startstreaming(WalReceiverConn *conn,
-						const WalRcvStreamOptions *options)
+						const WalRcvStreamOptions *options, bool two_phase)
 {
 	StringInfoData cmd;
 	PGresult   *res;
@@ -427,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -453,6 +458,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, " TIMELINE %u",
 						 options->proto.physical.startpointTLI);
 
+	if (options->logical && two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	/* Start streaming. */
 	res = libpqrcv_PQexec(conn->streamConn, cmd.data);
 	pfree(cmd.data);
@@ -827,7 +835,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +849,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f59613..5ba9e1e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -730,7 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..7b72ec7 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -479,7 +479,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 					  XLogReaderRoutine *xl_routine,
 					  LogicalOutputPluginWriterPrepareWrite prepare_write,
 					  LogicalOutputPluginWriterWrite do_write,
-					  LogicalOutputPluginWriterUpdateProgress update_progress)
+					  LogicalOutputPluginWriterUpdateProgress update_progress,
+					  bool two_phase)
 {
 	LogicalDecodingContext *ctx;
 	ReplicationSlot *slot;
@@ -526,6 +527,20 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 		start_lsn = slot->data.confirmed_flush;
 	}
 
+	/*
+	 * If starting with two_phase enabled then set two_phase_at point.
+	 * Also update the slot to be two_phase enabled and save the slot
+	 * to disk.
+	 */
+	if (two_phase)
+	{
+		slot->data.two_phase_at = start_lsn;
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
+
+
 	ctx = StartupDecodingContext(output_plugin_options,
 								 start_lsn, InvalidTransactionId, false,
 								 fast_forward, xl_routine, prepare_write,
@@ -538,10 +553,10 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions if the two_phase option is
+	 * enabled at the time of slot creation or at restart.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase = slot->data.two_phase || two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +617,7 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index f7e0558..366c50e 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -237,7 +237,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 											   .segment_open = wal_segment_open,
 											   .segment_close = wal_segment_close),
 									LogicalOutputPrepareWrite,
-									LogicalOutputWrite, NULL);
+									LogicalOutputWrite, NULL, false);
 
 		/*
 		 * After the sanity checks in CreateDecodingContext, make sure the
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..488b2a2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,212 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 91600ac..10ad8a7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2672,7 +2672,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2703,7 +2703,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * prepare if it was not decoded earlier. We don't need to decode the xact
 	 * for aborts if it is not done already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed3acad..b6769f7 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -173,7 +173,7 @@ struct SnapBuild
 	 * needs to be sent later along with commit prepared and they must be
 	 * before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -373,9 +373,9 @@ SnapBuildCurrentState(SnapBuild *builder)
  * Return the LSN at which the snapshot was exported
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index feb634e..d946b59 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,13 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static List *table_states_all = NIL;
+static void FetchTableStates(bool *started_tx);
+static bool AnyTablesyncsNotREADY(void);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +365,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +372,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +393,36 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) the two_phase tri-state will become properly 'enabled'
+	 * at that time.
+	 */
+	if (MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING)
+	{
+		if (AllTablesyncsREADY())
+		{
+			ereport(LOG,
+					(errmsg("logical replication apply worker for subscription \"%s\" will restart so 2PC can be enabled",
+					MySubscription->name)));
+
+			proc_exit(0);
+		}
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1052,7 +1049,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * for the catchup phase after COPY is done, so tell it to use the
 	 * snapshot to make the final data consistent.
 	 */
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 
 	/*
@@ -1137,3 +1134,165 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_all);
+		table_states_all = NIL;
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all tables. */
+		rstates = GetSubscriptionRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			SubscriptionRelState *cur_rstate = (SubscriptionRelState *) lfirst(lc);
+
+			/* List of all states */
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+			table_states_all = lappend(table_states_all, rstate);
+
+			/* List of only not-ready states */
+			if (cur_rstate->state != SUBREL_STATE_READY)
+			{
+				rstate = palloc(sizeof(SubscriptionRelState));
+				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
+				table_states_not_ready = lappend(table_states_not_ready, rstate);
+			}
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are all tablesyncs READY?
+ */
+bool
+AllTablesyncsREADY(void)
+{
+	return !AnyTablesyncsNotREADY();
+}
+
+/*
+ * Are there any tablesyncs which are not yet READY?
+ */
+static bool
+AnyTablesyncsNotREADY(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Scan the not-READY list to see if any table is still not READY.
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		/*
+		 * When the process_syncing_tables_for_apply changes the state from
+		 * SYNCDONE to READY, that change is actually written directly into
+		 * the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. That is why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	return found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase tri-state of the current subscription.
+ */
+void
+UpdateTwoPhaseTriState(char new_tristate)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_tristate == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_tristate == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_tristate == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	Assert(HeapTupleIsValid(tup));
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* Set the new two_phase tri-state value. */
+	values[Anum_pg_subscription_subtwophase - 1] = CharGetDatum(new_tristate);
+	replaces[Anum_pg_subscription_subtwophase - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+
+#if 1
+	/* This is just debugging, for confirmation the update worked. */
+	{
+		Subscription *new_s;
+
+		StartTransactionCommand();
+		new_s = GetSubscription(MySubscription->oid, false);
+		CommitTransactionCommand();
+	}
+#endif
+}
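
The interplay between FetchTableStates() and AnyTablesyncsNotREADY() above can be illustrated outside the server. The following is a minimal standalone sketch, not the patch's code: RelState, RelSyncState, any_not_ready() and all_ready() are hypothetical stand-ins for SubscriptionRelState, the SUBREL_STATE_* values and the list scan. It tries to show why the scan re-checks for READY even though the cached list was built from not-ready entries only: an entry may be flipped to READY in place after the list was built.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the SUBREL_STATE_* values. */
typedef enum { STATE_INIT, STATE_SYNCDONE, STATE_READY } RelSyncState;

/* Hypothetical stand-in for SubscriptionRelState. */
typedef struct
{
	unsigned	relid;
	RelSyncState state;
} RelState;

/*
 * Return true if any entry of the cached "not ready" array is still not
 * READY.  Entries may have been changed to READY in place after the array
 * was built, so every element is checked instead of only testing whether
 * the array is non-empty.
 */
static bool
any_not_ready(const RelState *not_ready, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (not_ready[i].state != STATE_READY)
			return true;
	}
	return false;
}

/* Counterpart of AllTablesyncsREADY(): simply the negation. */
static bool
all_ready(const RelState *not_ready, size_t n)
{
	return !any_not_ready(not_ready, n);
}
```

An empty array counts as "all ready" in this sketch, which matches a subscription that has no tables left to sync.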
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 21d304a..f0e0b11 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,48 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE COMMIT TRI-STATE LOGIC
+ * --------------------------------
+ * Due to unlucky timing between the apply and tablesync workers, it was
+ * previously possible for a prepared transaction to arrive at the apply worker
+ * while a tablesync was still busy doing the initial sync. In that case, the
+ * apply worker did the begin prepare ('b') but skipped all the prepared
+ * operations [e.g. inserts] while the tablesync was still busy (see the
+ * condition of should_apply_changes_for_rel).
+ *
+ * This would lead to an "empty prepare", because later when the apply worker
+ * does the commit prepare ('K'), there is nothing in it (the inserts were
+ * skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription two-phase
+ * commit is now implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to *properly*
+ * enable the publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED.
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophase).
+ *
+ * Finally, to avoid problems from any subsequent (not READY) tablesyncs
+ * interfering with the messages (same as the original problem) there is a
+ * restriction for ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not
+ * permitted for two_phase = on, except when copy_data = false.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +101,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -720,6 +763,168 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("transaction identifier \"%s\" is already in use",
+						begin_data.gid)));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	/*
+	 * Normally, prepare_lsn == remote_final_lsn, but if this prepare message
+	 * was dispatched via the psf spoolfile replay then the remote_final_lsn
+	 * is set to commit lsn instead. Hence the <= instead of == check below.
+	 */
+	Assert(prepare_data.prepare_lsn <= remote_final_lsn);
+
+	if (IsTransactionState())
+	{
+		/*
+		 * BeginTransactionBlock is necessary to balance the
+		 * EndTransactionBlock called within the PrepareTransactionBlock
+		 * below.
+		 */
+		BeginTransactionBlock();
+		CommitTransactionCommand();
+
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+		PrepareTransactionBlock(prepare_data.gid);
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+
+		store_flush_position(prepare_data.end_lsn);
+	}
+	else
+	{
+		/* Process any invalidation messages that might have accumulated. */
+		AcceptInvalidationMessages();
+		maybe_reread_subscription();
+	}
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received the prepare because it occurred
+	 * before the walsender reached a consistent point, in which case we need
+	 * to skip the rollback prepared.
+	 *
+	 * And we also skip the FinishPreparedTransaction if we're using the
+	 * Prepare Spoolfile (using_psf) because in that case there is no matching
+	 * PrepareTransactionBlock done yet.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1954,6 +2159,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2439,6 +2666,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophase != MySubscription->twophase ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3085,9 +3313,43 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophase;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains as
+		 * the tri-state PENDING until all tablesyncs have reached READY state.
+		 * Only then can it become properly ENABLED.
+		 */
+		bool all_tables_ready = AllTablesyncsREADY();
 
+		if (MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING && all_tables_ready)
+		{
+			/* Start streaming with two_phase enabled */
+			walrcv_startstreaming(wrconn, &options, true);
+			UpdateTwoPhaseTriState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophase = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options, false);
+		}
+
+		ereport(LOG,
+			(errmsg("logical replication apply worker for subscription \"%s\" 2PC is %s",
+			MySubscription->name,
+			MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+			MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+			MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+			"?")));
+
+	}
+	else
+	{
 	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	walrcv_startstreaming(wrconn, &options, false);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
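
The tri-state logic described in the worker.c header comment boils down to two small decisions. The sketch below is only a standalone model under stated assumptions: TwoPhaseState and both function names are hypothetical, and the enum values stand in for the LOGICALREP_TWOPHASE_STATE_* constants, whose actual definitions are not shown in this part of the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum
{
	TWOPHASE_DISABLED,
	TWOPHASE_PENDING,
	TWOPHASE_ENABLED
} TwoPhaseState;

/*
 * Decision made while the apply worker is running: a PENDING subscription
 * whose tablesyncs are all READY exits so that it can be restarted.
 */
static bool
apply_worker_should_restart(TwoPhaseState state, bool all_tables_ready)
{
	return state == TWOPHASE_PENDING && all_tables_ready;
}

/*
 * Decision made at apply worker start-up: PENDING is promoted to ENABLED
 * only when every tablesync is READY; any other combination keeps its
 * current state.
 */
static TwoPhaseState
state_after_startup(TwoPhaseState state, bool all_tables_ready)
{
	if (state == TWOPHASE_PENDING && all_tables_ready)
		return TWOPHASE_ENABLED;
	return state;
}
```

In this model DISABLED never changes on its own, and ENABLED is only ever reached from PENDING, which is the behaviour the header comment describes.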
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..2b9e7b8 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,18 +171,22 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -232,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -245,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -269,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -310,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by
+		 * default, in which case we just update the flag in the decoding
+		 * context. Otherwise, we only allow it with a sufficient protocol
+		 * version, and when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_2PC_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_2PC_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -322,8 +374,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +394,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +415,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +875,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1282,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
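
The order of the two_phase checks in pgoutput_startup() can be read as a small decision function. The following is a rough standalone sketch, not the plugin's code: TwoPhaseCheck and check_twophase_request() are hypothetical names, and the minimum protocol version is passed in as a parameter because the value of LOGICALREP_PROTO_2PC_VERSION_NUM is not shown in this part of the patch. The two failure results correspond to the two ereport(ERROR) branches.

```c
#include <stdbool.h>

/* Possible outcomes of the check; the names are illustrative only. */
typedef enum
{
	CHECK_TWOPHASE_OFF,			/* not requested: the flag is cleared */
	CHECK_TWOPHASE_OK,			/* requested and allowed */
	CHECK_PROTO_TOO_OLD,		/* would be an ERROR in pgoutput */
	CHECK_NOT_SUPPORTED			/* would be an ERROR in pgoutput */
} TwoPhaseCheck;

/*
 * Apply the checks in the same order as the startup callback: first whether
 * the option was requested at all, then the protocol version, and finally
 * whether the decoding context has two-phase support.
 */
static TwoPhaseCheck
check_twophase_request(bool requested, unsigned proto_version,
					   unsigned min_2pc_version, bool ctx_twophase)
{
	if (!requested)
		return CHECK_TWOPHASE_OFF;
	if (proto_version < min_2pc_version)
		return CHECK_PROTO_TOO_OLD;
	if (!ctx_twophase)
		return CHECK_NOT_SUPPORTED;
	return CHECK_TWOPHASE_OK;
}
```

Because the version test comes first, a client that asks for two_phase with an old protocol version is told about the version problem even when the context would also lack support.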
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..3fd5914 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -322,9 +325,9 @@ start_replication:
 				}
 			;
 
-/* START_REPLICATION SLOT slot LOGICAL %X/%X options */
+/* START_REPLICATION SLOT slot LOGICAL %X/%X options TWO_PHASE */
 start_logical_replication:
-			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options
+			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options opt_two_phase
 				{
 					StartReplicationCmd *cmd;
 					cmd = makeNode(StartReplicationCmd);
@@ -332,6 +335,7 @@ start_logical_replication:
 					cmd->slotname = $3;
 					cmd->startpoint = $5;
 					cmd->options = $6;
+					cmd->two_phase = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +369,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9817b44..951aa6c 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -515,7 +515,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 									XL_ROUTINE(.page_read = read_local_xlog_page,
 											   .segment_open = wal_segment_open,
 											   .segment_close = wal_segment_close),
-									NULL, NULL, NULL);
+									NULL, NULL, NULL, false);
 
 		/*
 		 * Start reading at the slot's restart_lsn, which we know to point to
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8532296..1325c29 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -364,7 +364,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
@@ -388,7 +388,7 @@ WalReceiverMain(void)
 		options.slotname = slotname[0] != '\0' ? slotname : NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		ThisTimeLineID = startpointTLI;
-		if (walrcv_startstreaming(wrconn, &options))
+		if (walrcv_startstreaming(wrconn, &options, false))
 		{
 			if (first_stream)
 				ereport(LOG,
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 23baa44..ac8a566 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1171,7 +1171,7 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 										 .segment_open = WalSndSegmentOpen,
 										 .segment_close = wal_segment_close),
 							  WalSndPrepareWrite, WalSndWriteData,
-							  WalSndUpdateProgress);
+							  WalSndUpdateProgress, cmd->two_phase);
 	xlogreader = logical_decoding_ctx->reader;
 
 	WalSndSetState(WALSNDSTATE_CATCHUP);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..102b012 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h" /* For 2PC tri-state. */
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4228,6 +4229,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophase;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4273,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophase\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophase\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4303,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophase = PQfnumber(res, "subtwophase");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4329,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophase =
+			pg_strdup(PQgetvalue(res, i, i_subtwophase));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4346,6 +4358,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = { LOGICALREP_TWOPHASE_STATE_DISABLED, '\0' };
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4387,6 +4400,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophase, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..28e8dd8 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophase;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..96c878b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,15 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/* Binary mode, streaming, and two_phase are only supported in v14 and higher */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophase AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index ecdb8d7..8f13e20 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2763,7 +2763,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..4695647 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,11 @@
 
 #include "nodes/pg_list.h"
 
+/* two_phase tri-state values. */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'n'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'y'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +59,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophase;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +98,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophase;		/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index ebc43a0..2923c04 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -85,6 +85,7 @@ typedef struct StartReplicationCmd
 	TimeLineID	timeline;
 	XLogRecPtr	startpoint;
 	List	   *options;
+	bool		two_phase;
 } StartReplicationCmd;
 
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c253403..43d9de0 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -115,7 +115,8 @@ extern LogicalDecodingContext *CreateDecodingContext(XLogRecPtr start_lsn,
 													 XLogReaderRoutine *xl_routine,
 													 LogicalOutputPluginWriterPrepareWrite prepare_write,
 													 LogicalOutputPluginWriterWrite do_write,
-													 LogicalOutputPluginWriterUpdateProgress update_progress);
+													 LogicalOutputPluginWriterUpdateProgress update_progress,
+													 bool two_phase);
 extern void DecodingContextFindStartpoint(LogicalDecodingContext *ctx);
 extern bool DecodingContextReady(LogicalDecodingContext *ctx);
 extern void FreeDecodingContext(LogicalDecodingContext *ctx);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..a5bb4de 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,14 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_2PC_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_2PC_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -54,10 +59,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure holds the
+ * information for both PREPARE and COMMIT PREPARED. For COMMIT PREPARED,
+ * prepare_lsn and preparetime store the commit LSN and the commit time,
+ * respectively.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * (prepare_end_lsn and preparetime) is used to check whether the downstream
+ * has received this prepared transaction: if so, it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6280559 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -643,7 +655,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..1f4b253 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,9 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..8e5e9ed 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,7 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildTwoPhaseAt(SnapBuild *builder);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..e5b6329 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -301,7 +302,7 @@ typedef void (*walrcv_readtimelinehistoryfile_fn) (WalReceiverConn *conn,
  * didn't switch to copy-mode.
  */
 typedef bool (*walrcv_startstreaming_fn) (WalReceiverConn *conn,
-										  const WalRcvStreamOptions *options);
+										  const WalRcvStreamOptions *options, bool two_phase);
 
 /*
  * walrcv_endstreaming_fn
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -412,16 +414,16 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_server_version(conn)
 #define walrcv_readtimelinehistoryfile(conn, tli, filename, content, size) \
 	WalReceiverFunctions->walrcv_readtimelinehistoryfile(conn, tli, filename, content, size)
-#define walrcv_startstreaming(conn, options) \
-	WalReceiverFunctions->walrcv_startstreaming(conn, options)
+#define walrcv_startstreaming(conn, options, two_phase) \
+	WalReceiverFunctions->walrcv_startstreaming(conn, options, two_phase)
 #define walrcv_endstreaming(conn, next_tli) \
 	WalReceiverFunctions->walrcv_endstreaming(conn, next_tli)
 #define walrcv_receive(conn, buffer, wait_fd) \
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..4be47ad 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsREADY(void);
+extern void UpdateTwoPhaseTriState(char new_tristate);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..d752346 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | n                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | n                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | n                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | n                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | n                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | n                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | n                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 61cf4ea..ddb3cfe 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1343,12 +1343,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
@@ -1959,6 +1962,8 @@ ProtocolVersion
 PrsStorage
 PruneState
 PruneStepResult
+PsfFile
+PsfHashEntry
 PsqlScanCallbacks
 PsqlScanQuoteType
 PsqlScanResult
-- 
1.8.3.1
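
For readers following the pg_dump hunk above: the check in dumpSubscription() treats any subtwophase value other than 'n' as a request to dump ", two_phase = on", so a subscription still in the pending state ('p') is dumped the same way as an enabled one ('y'). A minimal standalone sketch of that decision follows. The three macro values are copied from the pg_subscription.h hunk; the helper name dump_needs_two_phase() is hypothetical and does not exist in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Values copied from the pg_subscription.h hunk in this patch. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'n'
#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED 'y'

/*
 * Hypothetical helper (not part of the patch): returns true when a dumped
 * subscription should carry ", two_phase = on".  subtwophase is the catalog
 * column as text, which is how pg_dump receives it from PQgetvalue().
 */
bool
dump_needs_two_phase(const char *subtwophase)
{
	/* Build a one-character string so the column text can be compared. */
	const char	two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};

	assert(subtwophase != NULL);

	/* Anything other than "n" (that is, "p" or "y") counts as requested. */
	return strcmp(subtwophase, two_phase_disabled) != 0;
}
```

Called with "n" the helper returns false; with "p" or "y" it returns true. That matches the regression output above, where a subscription created with connect = false and two_phase = true shows 'p' in the "Two phase commit" column.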

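The logicalproto.h hunk adds five message bytes, and two of them differ only in case ('P' for PREPARE, 'p' for STREAM PREPARE). As a reading aid, here is a small sketch that maps each new byte to a label. The byte values are copied from the LogicalRepMsgType hunk; the enum name TwoPhaseMsgType, the shortened member names, and the helper two_phase_msg_name() are assumptions made for illustration only and are not part of the patch.

```c
#include <stddef.h>

/* Message bytes copied from the LogicalRepMsgType hunk in this patch. */
typedef enum TwoPhaseMsgType
{
	MSG_BEGIN_PREPARE = 'b',
	MSG_PREPARE = 'P',
	MSG_COMMIT_PREPARED = 'K',
	MSG_ROLLBACK_PREPARED = 'r',
	MSG_STREAM_PREPARE = 'p'
} TwoPhaseMsgType;

/*
 * Hypothetical helper: return a label for a two-phase message byte, or NULL
 * if the byte is not one of the five added by this patch.  The comparison is
 * case-sensitive, which is what distinguishes 'P' from 'p'.
 */
const char *
two_phase_msg_name(char action)
{
	switch (action)
	{
		case MSG_BEGIN_PREPARE:
			return "BEGIN PREPARE";
		case MSG_PREPARE:
			return "PREPARE";
		case MSG_COMMIT_PREPARED:
			return "COMMIT PREPARED";
		case MSG_ROLLBACK_PREPARED:
			return "ROLLBACK PREPARED";
		case MSG_STREAM_PREPARE:
			return "STREAM PREPARE";
		default:
			return NULL;
	}
}
```

Bytes outside the five listed cases fall through to NULL, so a caller can use the result to tell 2PC traffic apart from the other message types.
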
v60-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 46ead4e143f264424d9474d30d4365db8e2b4153 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 15 Mar 2021 08:27:12 -0400
Subject: [PATCH v60] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 337 ++++++++++++++++++++++++
 src/test/subscription/t/021_twophase_cascade.pl | 280 ++++++++++++++++++++
 2 files changed, 617 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..a17bf21
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,337 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophase NOT IN ('y');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_cascade.pl b/src/test/subscription/t/021_twophase_cascade.pl
new file mode 100644
index 0000000..c96f328
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_cascade.pl
@@ -0,0 +1,280 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophase NOT IN ('y');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v60-0003-Fix-apply-worker-dev-logs.patch (application/octet-stream)
From 415ad469b751f64a42c214ee497d10a196bc5440 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 15 Mar 2021 08:32:45 -0400
Subject: [PATCH v60] Fix apply worker (dev logs)

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for
debugging/testing the patch.
---
 src/backend/replication/logical/tablesync.c | 27 +++++++++++++++++++++++++++
 src/backend/replication/logical/worker.c    |  1 +
 2 files changed, 28 insertions(+)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index d946b59..35d2637 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -409,6 +409,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 */
 	if (MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING)
 	{
+		elog(LOG, "!!> two_phase enable is still pending");
 		if (AllTablesyncsREADY())
 		{
 			ereport(LOG,
@@ -1150,6 +1151,7 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
 
 		/* Clean the old lists. */
 		list_free_deep(table_states_all);
@@ -1173,6 +1175,7 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 			table_states_all = lappend(table_states_all, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_all - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 
 			/* List of only not-ready states */
 			if (cur_rstate->state != SUBREL_STATE_READY)
@@ -1180,12 +1183,17 @@ FetchTableStates(bool *started_tx)
 				rstate = palloc(sizeof(SubscriptionRelState));
 				memcpy(rstate, cur_rstate, sizeof(SubscriptionRelState));
 				table_states_not_ready = lappend(table_states_not_ready, rstate);
+				elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 			}
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1208,6 +1216,8 @@ AnyTablesyncsNotREADY(void)
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AnyTablesyncsNotREADY");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1219,6 +1229,12 @@ AnyTablesyncsNotREADY(void)
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
+		elog(LOG,
+			 "!!> AnyTablesyncsNotREADY: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
 		/*
 		 * When the process_syncing_tables_for_apply changes the state from
 		 * SYNCDONE to READY, that change is actually written directly into
@@ -1230,6 +1246,7 @@ AnyTablesyncsNotREADY(void)
 		 */
 		if (rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AnyTablesyncsNotREADY: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1241,6 +1258,11 @@ AnyTablesyncsNotREADY(void)
 		pgstat_report_stat(false);
 	}
 
+	elog(LOG,
+		 "!!> AnyTablesyncsNotREADY: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
 	return found_busy;
 }
 
@@ -1292,6 +1314,11 @@ UpdateTwoPhaseTriState(char new_tristate)
 
 		StartTransactionCommand();
 		new_s = GetSubscription(MySubscription->oid, false);
+		elog(LOG,
+			 "!!> 2PC Tri-state for \"%s\": '%c' ==> '%c'",
+			 MySubscription->name,
+			 MySubscription->twophase,
+			 new_s->twophase);
 		CommitTransactionCommand();
 	}
 #endif
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f0e0b11..98ad27e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3203,6 +3203,7 @@ ApplyWorkerMain(Datum main_arg)
 						MyLogicalRepWorker->subid)));
 		proc_exit(0);
 	}
+	elog(LOG, "!!> MAIN: MySubscription twophase = '%c'", MySubscription->twophase);
 
 	MySubscriptionValid = true;
 	MemoryContextSwitchTo(oldctx);
-- 
1.8.3.1

#266 Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#265)
3 attachment(s)

On Thu, Mar 18, 2021 at 5:20 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v61*

Oops. Attaching the correct v61* patches this time...

---
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v61-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 95af9b5cc8636e53063ff848d6d76cc73aeaf6d1 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Mar 2021 16:27:41 +1100
Subject: [PATCH v61] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 337 ++++++++++++++++++++++++
 src/test/subscription/t/021_twophase_cascade.pl | 280 ++++++++++++++++++++
 2 files changed, 617 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..b135997
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,337 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_cascade.pl b/src/test/subscription/t/021_twophase_cascade.pl
new file mode 100644
index 0000000..8e9dfdc
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_cascade.pl
@@ -0,0 +1,280 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

Attachment: v61-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From a5363765704f37bb48d8e9b68e7504a104d0c32b Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Mar 2021 16:13:49 +1100
Subject: [PATCH v61] Add support for apply at prepare time to built-in logical
  replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* prepare API for streaming transactions is not supported.

* implement new SUBSCRIPTION option "two_phase".

* add new option to enable two_phase while creating a slot.

* introduction of tri-state for twophase pg_subscription column.

* restrict ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* restrict ALTER SUBSCRIPTION SET PUBLICATION WITH (refresh = true) when two_phase enabled.

* include documentation update.
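
The option restrictions in the list above can be pictured as two small validation checks. This is only a sketch based on the documentation hunks in this patch: the struct, the function names and the message strings are hypothetical, and the excerpt does not show whether the real refresh check keys on the requested option or on the enabled state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option bundle, not the struct used by subscriptioncmds.c. */
typedef struct SubOptions
{
	bool		streaming;
	bool		two_phase;
} SubOptions;

/*
 * CREATE SUBSCRIPTION: streaming and two_phase cannot be combined.
 * Returns NULL when the options are acceptable, else an error message.
 */
static const char *
check_create_options(const SubOptions *opts)
{
	assert(opts != NULL);
	if (opts->streaming && opts->two_phase)
		return "two_phase and streaming are mutually exclusive options";
	return NULL;
}

/*
 * ALTER SUBSCRIPTION ... REFRESH PUBLICATION (or SET PUBLICATION with
 * refresh = true): rejected with copy_data = true on a two_phase subscription.
 */
static const char *
check_refresh(bool sub_two_phase, bool copy_data)
{
	if (sub_two_phase && copy_data)
		return "refresh with copy_data = true is not allowed with two_phase";
	return NULL;
}
```

Both helpers return a message rather than raising an error, so a caller can decide how to report it; the patch itself would presumably use ereport(ERROR, ...) at the equivalent points.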

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.
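
The tri-state mentioned in the list above can be read as a small state machine: a subscription created with two_phase = on starts out pending and only becomes enabled once initial table synchronization has finished. A minimal sketch, assuming that reading; the macro and function names are hypothetical, and only the letters 'd', 'p' and 'e' come from the catalogs.sgml hunk of this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the letters are the documented subtwophasestate codes. */
#define TWOPHASE_DISABLED 'd'	/* two_phase was not requested */
#define TWOPHASE_PENDING  'p'	/* requested, waiting for initial table sync */
#define TWOPHASE_ENABLED  'e'	/* requested and enabled */

/* State stored when the subscription is created. */
static char
twophase_initial_state(bool two_phase_requested)
{
	return two_phase_requested ? TWOPHASE_PENDING : TWOPHASE_DISABLED;
}

/* Only a pending subscription moves on, and only after initial sync is done. */
static char
twophase_advance(char state, bool initial_sync_done)
{
	assert(state == TWOPHASE_DISABLED ||
		   state == TWOPHASE_PENDING ||
		   state == TWOPHASE_ENABLED);

	if (state == TWOPHASE_PENDING && initial_sync_done)
		return TWOPHASE_ENABLED;
	return state;
}
```

This is why the TAP tests in this thread poll pg_subscription until no row has a subtwophasestate other than 'e' before issuing PREPARE TRANSACTION.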

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         |  14 +-
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  36 +++
 src/backend/access/transam/twophase.c              |  68 ++++++
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 116 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  17 +-
 src/backend/replication/logical/decode.c           |   2 +-
 src/backend/replication/logical/logical.c          |  27 ++-
 src/backend/replication/logical/logicalfuncs.c     |   2 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 206 ++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |   4 +-
 src/backend/replication/logical/snapbuild.c        |  10 +-
 src/backend/replication/logical/tablesync.c        | 195 ++++++++++++---
 src/backend/replication/logical/worker.c           | 263 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 193 ++++++++++++---
 src/backend/replication/repl_gram.y                |  21 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slotfuncs.c                |   2 +-
 src/backend/replication/walreceiver.c              |   4 +-
 src/backend/replication/walsender.c                |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |   8 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |   3 +-
 src/include/replication/logicalproto.h             |  75 +++++-
 src/include/replication/reorderbuffer.h            |  14 +-
 src/include/replication/slot.h                     |   6 +-
 src/include/replication/snapbuild.h                |   4 +-
 src/include/replication/walreceiver.h              |  12 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         |  93 +++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 41 files changed, 1330 insertions(+), 164 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5c9f4af..4b430ff 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7577,6 +7577,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..9694713 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,18 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase transactions.
+         Two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED
+         are also decoded and transmitted. In two-phase transactions, the transaction is
+         decoded and transmitted at PREPARE TRANSACTION time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..85cc8bb 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -64,7 +64,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
   <para>
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
-   option as true cannot be executed inside a transaction block.
+   option as true cannot be executed inside a transaction block. They also
+   cannot be executed with <literal>copy_data = true</literal> if the
+   subscription is using <literal>two_phase</literal> commit.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..a5c9158 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,42 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that replication has
+          successfully passed the initial table synchronization phase. This means that even when
+          <literal>two_phase</literal> is enabled for the subscription, the internal two-phase state remains
+          temporarily "pending" until the initialization phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          for the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..c58c46d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid confusing prepared xacts
+ * from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
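
For readers following along, the matching rule in LookupGXact above can be sketched outside the backend. This is only a minimal standalone sketch, not code from the patch: PreparedXact and lookup_prepared() are hypothetical stand-ins for the shared TwoPhaseState array and the 2PC file header, plain integers replace XLogRecPtr and TimestampTz, and no locking or file I/O is modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a prepared xact entry plus its 2PC file header. */
typedef struct PreparedXact
{
	bool		valid;				/* entry fully set up? */
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* end-of-prepare LSN on the origin */
	int64_t		origin_timestamp;	/* prepare time on the origin */
} PreparedXact;

/*
 * Return true only if a valid entry matches the GID *and* the origin LSN
 * *and* the origin timestamp. A GID match alone is not enough, because a
 * local prepared xact may happen to use the same GID.
 */
static bool
lookup_prepared(const PreparedXact *xacts, size_t n, const char *gid,
				uint64_t prepare_end_lsn, int64_t origin_prepare_timestamp)
{
	for (size_t i = 0; i < n; i++)
	{
		const PreparedXact *x = &xacts[i];

		/* Skip not-yet-valid entries and entries with a different GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```

Note that the loop keeps scanning after a GID-only match, which mirrors the patch: several entries may share a GID, and only the origin LSN and timestamp tell them apart.
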
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..658d9f8 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..d453902 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..8a55447 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,26 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +309,24 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * the current implementation has some issues that could cause a streamed
+	 * prepared transaction to be incorrectly missed in the initial syncing
+	 * phase. Hence, this combination is disabled until that issue can be
+	 * addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -527,8 +578,12 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			if (create_slot)
 			{
 				Assert(slotname);
-
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with two-phase
+				 * enabled. It will be enabled once all the tables are synced and
+				 * ready. This avoids race conditions during the initial table sync.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +890,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +925,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +954,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +1000,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -953,6 +1017,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/* See ALTER_SUBSCRIPTION_REFRESH for details on why this is not allowed. */
+					if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -982,7 +1054,35 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state we
+				 * must not allow any subsequent table initialization to occur.
+				 * So ALTER SUBSCRIPTION ... REFRESH is disallowed when the user
+				 * has requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data = false,
+				 * because when copy_data is false the tablesync will start
+				 * already in READY state and will exit directly without doing
+				 * anything which could interfere with the apply worker's
+				 * message handling.
+				 *
+				 * For more details see the "TWO_PHASE COMMIT TRI-STATE LOGIC"
+				 * comment atop worker.c
+				 */
+				if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
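
The option rules spread over the subscriptioncmds.c hunks above can be summarised in a small standalone sketch. This is an illustration under stated assumptions, not the patch's code: the enum and the four helper functions are hypothetical names, and the real checks raise errors through ereport() rather than returning a bool.

```c
#include <stdbool.h>

/* Hypothetical mirror of the subscription's two-phase tri-state. */
typedef enum SketchTwoPhaseState
{
	SKETCH_TWOPHASE_DISABLED,
	SKETCH_TWOPHASE_PENDING,
	SKETCH_TWOPHASE_ENABLED
} SketchTwoPhaseState;

/* CREATE SUBSCRIPTION: two_phase and streaming are mutually exclusive. */
static bool
sketch_create_options_ok(bool two_phase, bool streaming)
{
	return !(two_phase && streaming);
}

/* CREATE SUBSCRIPTION: a requested two_phase starts out as "pending". */
static SketchTwoPhaseState
sketch_initial_state(bool two_phase)
{
	return two_phase ? SKETCH_TWOPHASE_PENDING : SKETCH_TWOPHASE_DISABLED;
}

/* ALTER ... SET (streaming = true) is refused unless two-phase is disabled. */
static bool
sketch_alter_streaming_ok(SketchTwoPhaseState state, bool streaming)
{
	return state == SKETCH_TWOPHASE_DISABLED || !streaming;
}

/* A refresh that copies data is refused unless two-phase is disabled. */
static bool
sketch_refresh_ok(SketchTwoPhaseState state, bool copy_data)
{
	return state == SKETCH_TWOPHASE_DISABLED || !copy_data;
}
```

Both the "pending" and the "enabled" state block streaming = true and a copying refresh, which is why the patch tests for "not disabled" rather than for "enabled".
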
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index f743781..86c211c 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -63,7 +63,7 @@ static void libpqrcv_readtimelinehistoryfile(WalReceiverConn *conn,
 											 TimeLineID tli, char **filename,
 											 char **content, int *len);
 static bool libpqrcv_startstreaming(WalReceiverConn *conn,
-									const WalRcvStreamOptions *options);
+									const WalRcvStreamOptions *options, bool two_phase);
 static void libpqrcv_endstreaming(WalReceiverConn *conn,
 								  TimeLineID *next_tli);
 static int	libpqrcv_receive(WalReceiverConn *conn, char **buffer,
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -387,7 +388,7 @@ libpqrcv_server_version(WalReceiverConn *conn)
  */
 static bool
 libpqrcv_startstreaming(WalReceiverConn *conn,
-						const WalRcvStreamOptions *options)
+						const WalRcvStreamOptions *options, bool two_phase)
 {
 	StringInfoData cmd;
 	PGresult   *res;
@@ -427,6 +428,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -453,6 +458,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 		appendStringInfo(&cmd, " TIMELINE %u",
 						 options->proto.physical.startpointTLI);
 
+	if (options->logical && two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	/* Start streaming. */
 	res = libpqrcv_PQexec(conn->streamConn, cmd.data);
 	pfree(cmd.data);
@@ -827,7 +835,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +849,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f59613..5ba9e1e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -730,7 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..7b72ec7 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -479,7 +479,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 					  XLogReaderRoutine *xl_routine,
 					  LogicalOutputPluginWriterPrepareWrite prepare_write,
 					  LogicalOutputPluginWriterWrite do_write,
-					  LogicalOutputPluginWriterUpdateProgress update_progress)
+					  LogicalOutputPluginWriterUpdateProgress update_progress,
+					  bool two_phase)
 {
 	LogicalDecodingContext *ctx;
 	ReplicationSlot *slot;
@@ -526,6 +527,20 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 		start_lsn = slot->data.confirmed_flush;
 	}
 
+	/*
+	 * If starting with two_phase enabled then set two_phase_at point.
+	 * Also update the slot to be two_phase enabled and save the slot
+	 * to disk.
+	 */
+	if (two_phase)
+	{
+		slot->data.two_phase_at = start_lsn;
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
+
+
 	ctx = StartupDecodingContext(output_plugin_options,
 								 start_lsn, InvalidTransactionId, false,
 								 fast_forward, xl_routine, prepare_write,
@@ -538,10 +553,10 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions if the two_phase option is
+	 * enabled at the time of slot creation or at restart.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase = slot->data.two_phase || two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +617,7 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index f7e0558..366c50e 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -237,7 +237,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 											   .segment_open = wal_segment_open,
 											   .segment_close = wal_segment_close),
 									LogicalOutputPrepareWrite,
-									LogicalOutputWrite, NULL);
+									LogicalOutputWrite, NULL, false);
 
 		/*
 		 * After the sanity checks in CreateDecodingContext, make sure the
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..488b2a2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,212 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
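
To make the wire layout of the PREPARE message above easier to follow, here is a standalone sketch of the same field order: one tag byte, one flags byte, three 64-bit fields in network byte order, then the NUL-terminated GID. It does not use the backend's pq_* routines. Everything prefixed with Sketch/sketch_ is hypothetical, the tag value 'P' and the GID buffer size of 200 are assumptions made for illustration, and the reader here also consumes the tag byte, which in the patch is presumably handled by the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MSG_PREPARE	'P' /* assumed tag value, illustration only */
#define SKETCH_GIDSIZE		200 /* assumed size of the GID buffer */

typedef struct SketchPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Store a 64-bit value most significant byte first (network order). */
static size_t
sketch_put_u64(unsigned char *buf, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (unsigned char) (v >> (56 - 8 * i));
	return 8;
}

/* Load a 64-bit value stored most significant byte first. */
static uint64_t
sketch_get_u64(const unsigned char *buf)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Encode a PREPARE; returns bytes written, or 0 if it does not fit. */
static size_t
sketch_write_prepare(unsigned char *buf, size_t cap, const SketchPrepare *p)
{
	size_t		gidlen = strlen(p->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (1 + 1 + 3 * 8 + gidlen > cap)
		return 0;

	buf[off++] = SKETCH_MSG_PREPARE;
	buf[off++] = 0;				/* flags, currently always zero */
	off += sketch_put_u64(buf + off, p->prepare_lsn);
	off += sketch_put_u64(buf + off, p->end_lsn);
	off += sketch_put_u64(buf + off, (uint64_t) p->prepare_time);
	memcpy(buf + off, p->gid, gidlen);
	return off + gidlen;
}

/* Decode a PREPARE; returns 0 on success, -1 on malformed input. */
static int
sketch_read_prepare(const unsigned char *buf, size_t len, SketchPrepare *p)
{
	const unsigned char *gid;
	const unsigned char *nul;
	size_t		off = 0;

	if (len < 1 + 1 + 3 * 8 + 1 || buf[off++] != SKETCH_MSG_PREPARE)
		return -1;
	if (buf[off++] != 0)		/* unrecognized flags */
		return -1;

	p->prepare_lsn = sketch_get_u64(buf + off);
	off += 8;
	p->end_lsn = sketch_get_u64(buf + off);
	off += 8;
	p->prepare_time = (int64_t) sketch_get_u64(buf + off);
	off += 8;
	if (p->prepare_lsn == 0 || p->end_lsn == 0) /* 0 plays InvalidXLogRecPtr */
		return -1;

	/* Bounded copy: reject a GID that is unterminated or too long. */
	gid = buf + off;
	nul = memchr(gid, '\0', len - off);
	if (nul == NULL || (size_t) (nul - gid) >= SKETCH_GIDSIZE)
		return -1;
	memcpy(p->gid, gid, (size_t) (nul - gid) + 1);
	return 0;
}
```

The sketch deliberately bounds the GID copy instead of using strcpy(); whether the patch's pre-allocated buffer needs the same guard depends on how the sender limits the GID length, which this excerpt does not show.
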
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 91600ac..10ad8a7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2672,7 +2672,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2703,7 +2703,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * prepare if it was not decoded earlier. We don't need to decode the xact
 	 * for aborts if it is not done already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed3acad..b6769f7 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -173,7 +173,7 @@ struct SnapBuild
 	 * needs to be sent later along with commit prepared and they must be
 	 * before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -373,9 +373,9 @@ SnapBuildCurrentState(SnapBuild *builder)
  * Return the LSN at which the snapshot was exported
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6ed3181..f64a2fb 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,36 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly 'enabled'
+	 * at that time.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
+	{
+		if (AllTablesyncsREADY())
+		{
+			ereport(LOG,
+					(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+					MySubscription->name)));
+
+			proc_exit(0);
+		}
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1053,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1139,135 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are all tablesyncs READY?
+ */
+bool
+AllTablesyncsREADY(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables.
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		/*
+		 * When the process_syncing_tables_for_apply changes the state from
+		 * SYNCDONE to READY, that change is actually written directly into
+		 * the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/* When no tablesyncs are busy, then all are READY */
+	return !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state to new_state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 21d304a..c897940 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,63 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two-phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will already be ahead of the apply worker's current location.  This would
+ * lead to an "empty prepare", because later when the apply worker does the
+ * commit prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider: initially it was on and we
+ * received some prepare, then we turn it off; now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we could allow changing this option from off to on, but it is not
+ * clear that that alone would be useful.
+ *
+ * Finally, to avoid the problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (which would need the two_phase option
+ * toggled from 'on' to 'off' and then back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted for
+ * two_phase = on, except when copy_data = false.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +116,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -720,6 +778,151 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("transaction identifier \"%s\" is already in use",
+						begin_data.gid)));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the
+	 * EndTransactionBlock called within the PrepareTransactionBlock
+	 * below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1954,6 +2157,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2439,6 +2664,7 @@ maybe_reread_subscription(void)
 		strcmp(newsub->slotname, MySubscription->slotname) != 0 ||
 		newsub->binary != MySubscription->binary ||
 		newsub->stream != MySubscription->stream ||
+		newsub->twophasestate != MySubscription->twophasestate ||
 		!equal(newsub->publications, MySubscription->publications))
 	{
 		ereport(LOG,
@@ -3085,9 +3311,42 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophasestate;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains as
+		 * the tri-state PENDING until all tablesyncs have reached READY state.
+		 * Only then, can it become properly ENABLED.
+		 */
+		bool all_tables_ready = AllTablesyncsREADY();
+
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING && all_tables_ready)
+		{
+			/* Start streaming with two_phase enabled */
+			walrcv_startstreaming(wrconn, &options, true);
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options, false);
+		}
+
+		ereport(LOG,
+			(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+			MySubscription->name,
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+			"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options, false);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..60d50d7 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,18 +171,22 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -232,6 +254,16 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -245,6 +277,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -269,7 +302,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -310,6 +344,24 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Decide whether to enable two-phase commit. It is disabled by default,
+		 * in which case we just update the flag in the decoding context.
+		 * Otherwise we only allow it with a sufficient version of the protocol,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -322,8 +374,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +394,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +415,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +875,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1282,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..3fd5914 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -322,9 +325,9 @@ start_replication:
 				}
 			;
 
-/* START_REPLICATION SLOT slot LOGICAL %X/%X options */
+/* START_REPLICATION SLOT slot LOGICAL %X/%X options TWO_PHASE */
 start_logical_replication:
-			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options
+			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options opt_two_phase
 				{
 					StartReplicationCmd *cmd;
 					cmd = makeNode(StartReplicationCmd);
@@ -332,6 +335,7 @@ start_logical_replication:
 					cmd->slotname = $3;
 					cmd->startpoint = $5;
 					cmd->options = $6;
+					cmd->two_phase = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +369,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9817b44..951aa6c 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -515,7 +515,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 									XL_ROUTINE(.page_read = read_local_xlog_page,
 											   .segment_open = wal_segment_open,
 											   .segment_close = wal_segment_close),
-									NULL, NULL, NULL);
+									NULL, NULL, NULL, false);
 
 		/*
 		 * Start reading at the slot's restart_lsn, which we know to point to
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8532296..1325c29 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -364,7 +364,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
@@ -388,7 +388,7 @@ WalReceiverMain(void)
 		options.slotname = slotname[0] != '\0' ? slotname : NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		ThisTimeLineID = startpointTLI;
-		if (walrcv_startstreaming(wrconn, &options))
+		if (walrcv_startstreaming(wrconn, &options, false))
 		{
 			if (first_stream)
 				ereport(LOG,
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 23baa44..ac8a566 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1171,7 +1171,7 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 										 .segment_open = WalSndSegmentOpen,
 										 .segment_close = wal_segment_close),
 							  WalSndPrepareWrite, WalSndWriteData,
-							  WalSndUpdateProgress);
+							  WalSndUpdateProgress, cmd->two_phase);
 	xlogreader = logical_decoding_ctx->reader;
 
 	WalSndSetState(WALSNDSTATE_CATCHUP);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..eeafdf8 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h" /* For 2PC tri-state. */
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4228,6 +4229,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4273,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4303,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4329,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4346,6 +4358,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = { LOGICALREP_TWOPHASE_STATE_DISABLED, '\0' };
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4387,6 +4400,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..98776db 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..286f1c9 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index d3fb734..bff3306 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2764,7 +2764,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..3f06d1f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,11 @@
 
 #include "nodes/pg_list.h"
 
+/* two_phase tri-state values. */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +59,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +98,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index ebc43a0..2923c04 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -85,6 +85,7 @@ typedef struct StartReplicationCmd
 	TimeLineID	timeline;
 	XLogRecPtr	startpoint;
 	List	   *options;
+	bool		two_phase;
 } StartReplicationCmd;
 
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c253403..43d9de0 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -115,7 +115,8 @@ extern LogicalDecodingContext *CreateDecodingContext(XLogRecPtr start_lsn,
 													 XLogReaderRoutine *xl_routine,
 													 LogicalOutputPluginWriterPrepareWrite prepare_write,
 													 LogicalOutputPluginWriterWrite do_write,
-													 LogicalOutputPluginWriterUpdateProgress update_progress);
+													 LogicalOutputPluginWriterUpdateProgress update_progress,
+													 bool two_phase);
 extern void DecodingContextFindStartpoint(LogicalDecodingContext *ctx);
 extern bool DecodingContextReady(LogicalDecodingContext *ctx);
 extern void FreeDecodingContext(LogicalDecodingContext *ctx);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..282cf61 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,16 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. This has the same protocol
+ * version requirement as LOGICALREP_PROTO_STREAM_VERSION_NUM because these
+ * features were both introduced in the same release (PG14).
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -54,10 +61,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +126,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +134,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure is used to
+ * hold the information for both PREPARE and COMMIT PREPARED. For COMMIT
+ * PREPARED, prepare_lsn and preparetime store the commit LSN and the
+ * commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime is used to check whether the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +183,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6280559 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -247,6 +247,18 @@ typedef struct ReorderBufferChange
 	((txn)->txn_flags & RBTXN_SKIPPED_PREPARE) != 0 \
 )
 
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rolled back? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+	((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+
 typedef struct ReorderBufferTXN
 {
 	/* See above */
@@ -643,7 +655,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..1f4b253 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,9 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..8e5e9ed 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,7 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildTwoPhaseAt(SnapBuild *builder);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..e5b6329 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -301,7 +302,7 @@ typedef void (*walrcv_readtimelinehistoryfile_fn) (WalReceiverConn *conn,
  * didn't switch to copy-mode.
  */
 typedef bool (*walrcv_startstreaming_fn) (WalReceiverConn *conn,
-										  const WalRcvStreamOptions *options);
+										  const WalRcvStreamOptions *options, bool two_phase);
 
 /*
  * walrcv_endstreaming_fn
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -412,16 +414,16 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_server_version(conn)
 #define walrcv_readtimelinehistoryfile(conn, tli, filename, content, size) \
 	WalReceiverFunctions->walrcv_readtimelinehistoryfile(conn, tli, filename, content, size)
-#define walrcv_startstreaming(conn, options) \
-	WalReceiverFunctions->walrcv_startstreaming(conn, options)
+#define walrcv_startstreaming(conn, options, two_phase) \
+	WalReceiverFunctions->walrcv_startstreaming(conn, options, two_phase)
 #define walrcv_endstreaming(conn, next_tli) \
 	WalReceiverFunctions->walrcv_endstreaming(conn, next_tli)
 #define walrcv_receive(conn, buffer, wait_fd) \
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..efd4533 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsREADY(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..a9664e8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d1d5d2..1f3038f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1343,12 +1343,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

v61-0003-Fix-apply-worker-dev-logs.patch (application/octet-stream)
From 0d593e1b0c14c4c6c0c0ca149baf9b3ce5c0541a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 18 Mar 2021 16:55:33 +1100
Subject: [PATCH v61] Fix apply worker (dev logs)

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for
debugging/testing the patch.
---
 src/backend/replication/logical/tablesync.c | 38 +++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index f64a2fb..7505a90 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -407,6 +407,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 */
 	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
 	{
+		elog(LOG, "!!> two_phase enable is still pending");
 		if (AllTablesyncsREADY())
 		{
 			ereport(LOG,
@@ -1155,6 +1156,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1172,11 +1175,16 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
 			table_states_not_ready = lappend(table_states_not_ready, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1190,6 +1198,8 @@ AllTablesyncsREADY(void)
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AllTablesyncsREADY");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1201,6 +1211,12 @@ AllTablesyncsREADY(void)
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
+		elog(LOG,
+			 "!!> AllTablesyncsREADY: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
 		/*
 		 * When the process_syncing_tables_for_apply changes the state from
 		 * SYNCDONE to READY, that change is actually written directly into
@@ -1212,6 +1228,7 @@ AllTablesyncsREADY(void)
 		 */
 		if (rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AllTablesyncsREADY: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1223,6 +1240,11 @@ AllTablesyncsREADY(void)
 		pgstat_report_stat(false);
 	}
 
+	elog(LOG,
+		 "!!> AllTablesyncsREADY: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
 	/* When no tablesyncs are busy, then all are READY */
 	return !found_busy;
 }
@@ -1270,4 +1292,20 @@ UpdateTwoPhaseState(char new_state)
 	table_close(rel, RowExclusiveLock);
 
 	CommitTransactionCommand();
+
+#if 1
+	/* This is just debugging, for confirmation the update worked. */
+	{
+		Subscription *new_s;
+
+		StartTransactionCommand();
+		new_s = GetSubscription(MySubscription->oid, false);
+		elog(LOG,
+			 "!!> 2PC Tri-state for \"%s\": '%c' ==> '%c'",
+			 MySubscription->name,
+			 MySubscription->twophasestate,
+			 new_s->twophasestate);
+		CommitTransactionCommand();
+	}
+#endif
 }
-- 
1.8.3.1

#267Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#266)
2 attachment(s)

On Thu, Mar 18, 2021 at 5:30 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Mar 18, 2021 at 5:20 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v61*

Please find attached the latest patch set v62

Differences from v61 are:

* Rebased to HEAD

* Addresses the following feedback issues:

Vignesh 12/Mar -
/messages/by-id/CALDaNm1p=KYcDc1s_Q0Lk2P8UYU-z4acW066gaeLfXvW_O-kBA@mail.gmail.com

(62) Fixed. Added assert for twophase alter check in
maybe_reread_subscription(void)

(63) Fixed. Changed parse_output_parameters to disable two-phase and
streaming combo
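
As an illustration of item (63), here is a minimal standalone sketch of a check that rejects the two-phase plus streaming combination. It is not the code from parse_output_parameters; the struct, the function name and the message text are hypothetical, chosen only to mirror the "mutually exclusive" behaviour the regression tests expect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option holder; the real patch keeps these flags elsewhere. */
typedef struct SketchOptions
{
	bool		streaming;
	bool		two_phase;
} SketchOptions;

/*
 * Returns NULL when the option combination is acceptable, otherwise a
 * static error string. Assumption: the only rule modelled here is that
 * streaming and two_phase cannot both be enabled.
 */
const char *
sketch_validate_options(const SketchOptions *opts)
{
	assert(opts != NULL);

	if (opts->streaming && opts->two_phase)
		return "two_phase and streaming are mutually exclusive";

	return NULL;
}
```

A caller would report the returned string as an error and stop, which corresponds to the failing CREATE SUBSCRIPTION ... WITH (streaming = true, two_phase = true) case in subscription.sql.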

Amit 16 Mar -
/messages/by-id/CAA4eK1Kwah+MimFMR3jPY5cSqpGFVh5zfV2g4=gTphaPsacoLw@mail.gmail.com

(74) Fixed. Modify comment about why not supporting combination of
two-phase and streaming

(75) Fixed. Added more comments about creating slot with two-phase race
conditions

(78) Skipped. This asked for an assert on the two-phase variables getting
reset, but the logic has since been changed, so skipping this.

(79) Changed. Reworded the comment about allowing decoding of prepared
transaction (restoring iff)

(80) Fixed. Added & in the assignment for ctx->twophase, logic is also
changed

(81) Fixed. Changed to conditional setting of two_phase_at only if
two_phase is enabled.

(82) Fixed. Better explanation for two_phase_at variable in
snapbuild.c

(83) Skipped. The comparison in ReorderBufferFinishPrepared was not changed;
it was tested and it works.
The reason it works is that even if the prepare is filtered out while
two-phase is not enabled, once the tablesync is
over and the tables are in READY state, the apply worker and the walsender
restart, and after the restart the prepare will
not be filtered out, but will be marked as a skipped prepare and also updated
in ReorderBufferRememberPrepareInfo.

(87) Fixed. Added server version check before two-phase enabled startstream
in ApplyWorkerMain.

(91) Fixed. Removed unused macros in reorderbuffer.h
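
Item (83) above depends on two-phase only becoming enabled once every table has reached READY state. The following is a rough standalone model of that transition, not the patch code: the function names are hypothetical, and the state letters ('r' for READY, 'p' for pending, 'e' for enabled) are assumptions based on the values the TAP tests poll for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed state codes; only 'r' and 'e' appear in the test queries. */
#define SKETCH_SUBREL_STATE_READY	'r'
#define SKETCH_TWOPHASE_PENDING		'p'
#define SKETCH_TWOPHASE_ENABLED		'e'

/*
 * True when no table is in a state other than READY. An empty list
 * counts as "all ready", matching the "return !found_busy" logic in
 * AllTablesyncsREADY.
 */
bool
sketch_all_tablesyncs_ready(const char *states, size_t ntables)
{
	assert(states != NULL || ntables == 0);

	for (size_t i = 0; i < ntables; i++)
	{
		if (states[i] != SKETCH_SUBREL_STATE_READY)
			return false;		/* found a busy table */
	}
	return true;
}

/*
 * Move from pending to enabled only when all tablesyncs are READY;
 * any other current state is returned unchanged.
 */
char
sketch_next_twophase_state(char current, const char *states, size_t ntables)
{
	if (current == SKETCH_TWOPHASE_PENDING &&
		sketch_all_tablesyncs_ready(states, ntables))
		return SKETCH_TWOPHASE_ENABLED;

	return current;
}
```

In the patch the equivalent decision is made in process_syncing_tables_for_apply, which is also where the apply worker restart mentioned in (83) is triggered; the sketch leaves the restart out.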

Amit 17/Mar -
/messages/by-id/CAA4eK1LNLA20ci3_qqNQv7BYRTy3HqiAsOfuieqo6tJ2GeYuJw@mail.gmail.com

(96) Fixed. Removed the token for twophase in Start Replication slot and
used the twophase option instead. But kept the token
in Create_Replication slot, as we gave plugins the option to enable
two-phase while creating a slot. This allows plugins without a
table-synchronization phase
to handle two-phase from the start.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v62-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 6ef2952862219a522b218eee7372bdc4ad4e9f5d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 18 Mar 2021 07:22:07 -0400
Subject: [PATCH v62] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 337 ++++++++++++++++++++++++
 src/test/subscription/t/021_twophase_cascade.pl | 280 ++++++++++++++++++++
 2 files changed, 617 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..b135997
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,337 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_cascade.pl b/src/test/subscription/t/021_twophase_cascade.pl
new file mode 100644
index 0000000..8e9dfdc
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_cascade.pl
@@ -0,0 +1,280 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v62-0003-Fix-apply-worker-dev-logs.patch (application/octet-stream)
From b13d2aed04ddf6d71b57ce87ecd3c98a9b0f66e8 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 18 Mar 2021 07:23:48 -0400
Subject: [PATCH v62] Fix apply worker (dev logs)

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for
debugging/testing the patch.
---
 src/backend/replication/logical/tablesync.c | 38 +++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index f64a2fb..7505a90 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -407,6 +407,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 */
 	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
 	{
+		elog(LOG, "!!> two_phase enable is still pending");
 		if (AllTablesyncsREADY())
 		{
 			ereport(LOG,
@@ -1155,6 +1156,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1172,11 +1175,16 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
 			table_states_not_ready = lappend(table_states_not_ready, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1190,6 +1198,8 @@ AllTablesyncsREADY(void)
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AllTablesyncsREADY");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1201,6 +1211,12 @@ AllTablesyncsREADY(void)
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
+		elog(LOG,
+			 "!!> AllTablesyncsREADY: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
 		/*
 		 * When the process_syncing_tables_for_apply changes the state from
 		 * SYNCDONE to READY, that change is actually written directly into
@@ -1212,6 +1228,7 @@ AllTablesyncsREADY(void)
 		 */
 		if (rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AllTablesyncsREADY: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1223,6 +1240,11 @@ AllTablesyncsREADY(void)
 		pgstat_report_stat(false);
 	}
 
+	elog(LOG,
+		 "!!> AllTablesyncsREADY: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
 	/* When no tablesyncs are busy, then all are READY */
 	return !found_busy;
 }
@@ -1270,4 +1292,20 @@ UpdateTwoPhaseState(char new_state)
 	table_close(rel, RowExclusiveLock);
 
 	CommitTransactionCommand();
+
+#if 1
+	/* This is just debugging, for confirmation the update worked. */
+	{
+		Subscription *new_s;
+
+		StartTransactionCommand();
+		new_s = GetSubscription(MySubscription->oid, false);
+		elog(LOG,
+			 "!!> 2PC Tri-state for \"%s\": '%c' ==> '%c'",
+			 MySubscription->name,
+			 MySubscription->twophasestate,
+			 new_s->twophasestate);
+		CommitTransactionCommand();
+	}
+#endif
 }
-- 
1.8.3.1

#268Ajin Cherian
itsajin@gmail.com
In reply to: Ajin Cherian (#267)
3 attachment(s)

Missed the patch - 0001, resending.

On Thu, Mar 18, 2021 at 10:58 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Mar 18, 2021 at 5:30 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Mar 18, 2021 at 5:20 PM Peter Smith <smithpb2250@gmail.com>
wrote:

Please find attached the latest patch set v61*

Please find attached the latest patch set v62

Differences from v61 are:

* Rebased to HEAD

* Addresses the following feedback issues:

Vignesh 12/Mar -

/messages/by-id/CALDaNm1p=KYcDc1s_Q0Lk2P8UYU-z4acW066gaeLfXvW_O-kBA@mail.gmail.com

(62) Fixed. Added assert for twophase alter check in
maybe_reread_subscription(void)

(63) Fixed. Changed parse_output_parameters to disable two-phase and
streaming combo

Amit 16 Mar -
/messages/by-id/CAA4eK1Kwah+MimFMR3jPY5cSqpGFVh5zfV2g4=gTphaPsacoLw@mail.gmail.com

(74) Fixed. Modify comment about why not supporting combination of
two-phase and streaming

(75) Fixed. Added more comments about creating slot with two-phase race
conditions

(78) Skipped. Adding assert for two-phase variables getting reset, the
logic has been changed, so skipping this.

(79) Changed. Reworded the comment about allowing decoding of prepared
transaction (restoring iff)

(80) Fixed. Added & in the assignment for ctx->twophase, logic is also
changed

(81) Fixed. Changed to conditional setting of two_phase_at only if
two_phase is enabled.

(82) Fixed. Better explanation for two_phase_at variable in
snapbuild.c

(83) Skipped. The comparison in ReorderBufferFinishPrepared was not
changed; it was tested and it works.
The reason it works is that even if the prepare is filtered out when
two-phase is not enabled, once the tablesync is
over and the tables are in READY state, the apply worker and the walsender
restart, and after the restart the prepare will
not be filtered out, but will be marked as a skipped prepare and also
updated in ReorderBufferRememberPrepareInfo

(87) Fixed. Added server version check before two-phase enabled
startstream in ApplyWorkerMain.

(91) Fixed. Removed unused macros in reorderbuffer.h

Amit 17/Mar -
/messages/by-id/CAA4eK1LNLA20ci3_qqNQv7BYRTy3HqiAsOfuieqo6tJ2GeYuJw@mail.gmail.com

(96) Fixed - Removed the token for twophase in START_REPLICATION SLOT and
used the twophase option instead. But kept the token
in CREATE_REPLICATION_SLOT, as we gave the option for plugins to enable
two-phase while creating a slot. This allows plugins without a
table-synchronization phase
to handle two-phase from the start.
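
As a rough illustration of the pending-to-enabled behaviour described in the
notes above (this is not code from the patch): the sketch below models the
three subtwophasestate codes ('d', 'p', 'e') and the rule that a pending
subscription becomes enabled only once every table has reached READY. All
type and function names are made up for the example, and 'r' is assumed to be
the READY state code for a table.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the codes match the documented subtwophasestate values. */
typedef enum SketchTwoPhaseState
{
	SKETCH_TWOPHASE_DISABLED = 'd',	/* two_phase was not requested */
	SKETCH_TWOPHASE_PENDING = 'p',	/* requested, waiting for tablesync */
	SKETCH_TWOPHASE_ENABLED = 'e'	/* requested and enabled */
} SketchTwoPhaseState;

/* Assumed state code for a table whose sync has reached READY. */
#define SKETCH_SUBREL_READY 'r'

/* State stored when the subscription is created. */
SketchTwoPhaseState
sketch_initial_state(bool two_phase_requested)
{
	return two_phase_requested ? SKETCH_TWOPHASE_PENDING
		: SKETCH_TWOPHASE_DISABLED;
}

/*
 * State after inspecting the per-table sync states. Only a pending
 * subscription can change, and only when no table is still busy.
 */
SketchTwoPhaseState
sketch_next_state(SketchTwoPhaseState cur, const char *table_states,
				  int ntables)
{
	int			i;

	assert(ntables >= 0);

	if (cur != SKETCH_TWOPHASE_PENDING)
		return cur;

	for (i = 0; i < ntables; i++)
	{
		if (table_states[i] != SKETCH_SUBREL_READY)
			return cur;			/* at least one table is still syncing */
	}

	return SKETCH_TWOPHASE_ENABLED;
}
```

With no tables at all the loop finds nothing busy, so the sketch reports
enabled; that mirrors the "return !found_busy" in AllTablesyncsREADY() as
posted in this version.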

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v62-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 929a6516a5f7e5048417242e85779559857ea6e3 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 18 Mar 2021 07:17:59 -0400
Subject: [PATCH v62] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* prepare API for streaming transactions is not supported.

* implement new SUBSCRIPTION option "two_phase".

* add new option to enable two_phase while creating a slot.

* introduction of tri-state for twophase pg_subscription column.

* restrict ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* restrict ALTER SUBSCRIPTION SET PUBLICATION WITH (refresh = true) when two_phase enabled.

* include documentation update.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, C Vignesh and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         |  14 +-
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  36 +++
 src/backend/access/transam/twophase.c              |  68 ++++++
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 125 +++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  15 +-
 src/backend/replication/logical/decode.c           |   2 +-
 src/backend/replication/logical/logical.c          |  40 +++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 206 ++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |   4 +-
 src/backend/replication/logical/snapbuild.c        |  30 ++-
 src/backend/replication/logical/tablesync.c        | 195 ++++++++++++---
 src/backend/replication/logical/worker.c           | 265 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 204 +++++++++++++---
 src/backend/replication/repl_gram.y                |  18 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   4 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |   8 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  75 +++++-
 src/include/replication/reorderbuffer.h            |   2 +-
 src/include/replication/slot.h                     |   6 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |  12 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         |  93 +++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 38 files changed, 1362 insertions(+), 169 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5c9f4af..4b430ff 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7577,6 +7577,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..9694713 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,18 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase transactions.
+         Two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED
+         are also decoded and transmitted. In two-phase transactions, the transaction is
+         decoded and transmitted at PREPARE TRANSACTION time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..85cc8bb 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -64,7 +64,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
   <para>
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
-   option as true cannot be executed inside a transaction block.
+   option as true cannot be executed inside a transaction block. They also
+   cannot be executed with <literal>copy_data = true</literal> if the
+   subscription is using <literal>two_phase</literal> commit.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..a5c9158 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,42 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means even when
+          two_phase is enabled for the subscription, the internal two-phase state remains
+          temporarily "pending" until the initialization phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..c58c46d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is	around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..658d9f8 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..d453902 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..015b2dc 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,33 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica.
+			 */
+			if (!twophase)
+			{
+				/*
+				 * This check is specifically for ALTER commands
+				 * with two_phase option.
+				 */
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			}
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +316,23 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * it needs more analysis to allow them together.
+	 * Hence, disabling this combination till that issue can
+	 * be addressed.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +408,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +434,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +503,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -527,8 +584,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			if (create_slot)
 			{
 				Assert(slotname);
-
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with two-phase
+				 * enabled. Will enable it once all the tables are synced and ready.
+				 * This avoids race conditions like prepared transactions being
+				 * skipped due to changes not being applied due to checks in
+				 * should_apply_changes_for_rel() when tablesync for the
+				 * corresponding tables are in progress.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +899,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +934,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +963,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +1009,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -953,6 +1026,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/* See ALTER_SUBSCRIPTION_REFRESH for details on why this is not allowed. */
+					if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -982,7 +1063,35 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state we
+				 * must not allow any subsequent table initialization to occur.
+				 * So ALTER SUBSCRIPTION ... REFRESH is disallowed when the
+				 * user has requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data = false,
+				 * because when copy_data is false the tablesync will start
+				 * already in READY state and will exit directly without doing
+				 * anything which could interfere with the apply worker's
+				 * message handling.
+				 *
+				 * For more details see the "TWO_PHASE COMMIT TRI-STATE LOGIC"
+				 * comment atop worker.c
+				 */
+				if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index f743781..37aa871 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -63,7 +63,7 @@ static void libpqrcv_readtimelinehistoryfile(WalReceiverConn *conn,
 											 TimeLineID tli, char **filename,
 											 char **content, int *len);
 static bool libpqrcv_startstreaming(WalReceiverConn *conn,
-									const WalRcvStreamOptions *options);
+									const WalRcvStreamOptions *options, bool two_phase);
 static void libpqrcv_endstreaming(WalReceiverConn *conn,
 								  TimeLineID *next_tli);
 static int	libpqrcv_receive(WalReceiverConn *conn, char **buffer,
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -387,7 +388,7 @@ libpqrcv_server_version(WalReceiverConn *conn)
  */
 static bool
 libpqrcv_startstreaming(WalReceiverConn *conn,
-						const WalRcvStreamOptions *options)
+						const WalRcvStreamOptions *options, bool two_phase)
 {
 	StringInfoData cmd;
 	PGresult   *res;
@@ -427,6 +428,11 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/* set the two_phase option only if the caller specifies it. */
+		if (options->proto.logical.twophase && two_phase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -827,7 +833,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -841,6 +847,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f59613..6a90b56 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -730,7 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..d944b02 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,21 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when
+	 * (a) the two_phase option is enabled at the time of slot creation,
+	 * or (b) the two_phase option is passed in at restart.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || MyReplicationSlot->data.two_phase);
+
+	/* Set two_phase_at LSN only if it hasn't already been set. */
+	if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
+	{
+		MyReplicationSlot->data.two_phase_at = restart_lsn;
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, restart_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -526,6 +537,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 		start_lsn = slot->data.confirmed_flush;
 	}
 
+
 	ctx = StartupDecodingContext(output_plugin_options,
 								 start_lsn, InvalidTransactionId, false,
 								 fast_forward, xl_routine, prepare_write,
@@ -538,10 +550,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when
+	 * (a) the two_phase option is enabled at the time of slot creation,
+	 * or (b) the two_phase option is passed in at restart.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || MyReplicationSlot->data.two_phase);
+
+	/* Set two_phase_at LSN only if it hasn't already been set. */
+	if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
+	{
+		MyReplicationSlot->data.two_phase_at = start_lsn;
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +625,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..488b2a2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,212 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
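A note on the wire layout of the new messages, since it is easy to lose in the
diff: each two-phase message is a type byte, a flags byte (currently always 0),
three 64-bit fields in network byte order, and the NUL-terminated gid. Below is
a minimal standalone sketch of that layout for PREPARE. It does not use the
pq_* routines; the names DemoPrepare, demo_write_prepare, demo_read_prepare and
the constants DEMO_GID_SIZE and DEMO_MSG_PREPARE are made up for illustration.
Unlike the strcpy() in the readers above, the sketch bounds the gid copy, which
might be worth considering for the real code as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_GID_SIZE 200		/* hypothetical size of the fixed gid buffer */
#define DEMO_MSG_PREPARE 'P'	/* hypothetical message type byte */

typedef struct DemoPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[DEMO_GID_SIZE];
} DemoPrepare;

/* Store a 64-bit value in network (big-endian) byte order. */
static size_t
put_u64(uint8_t *dst, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		dst[i] = (uint8_t) (v >> (56 - 8 * i));
	return 8;
}

/* Load a 64-bit value stored in network byte order. */
static uint64_t
get_u64(const uint8_t *src)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | src[i];
	return v;
}

/* Returns the number of bytes written, or 0 if the buffer is too small. */
size_t
demo_write_prepare(uint8_t *buf, size_t cap, const DemoPrepare *p)
{
	size_t		gidlen;
	size_t		off = 0;

	assert(buf != NULL && p != NULL);
	gidlen = strlen(p->gid) + 1;	/* include the terminating NUL */
	if (cap < 1 + 1 + 3 * 8 + gidlen)
		return 0;

	buf[off++] = DEMO_MSG_PREPARE;
	buf[off++] = 0;					/* flags, none defined yet */
	off += put_u64(buf + off, p->prepare_lsn);
	off += put_u64(buf + off, p->end_lsn);
	off += put_u64(buf + off, (uint64_t) p->prepare_time);
	memcpy(buf + off, p->gid, gidlen);
	return off + gidlen;
}

/* Returns 0 on success, -1 if the message is malformed. */
int
demo_read_prepare(const uint8_t *buf, size_t len, DemoPrepare *out)
{
	size_t		off = 2;
	const uint8_t *nul;
	size_t		gidlen;

	if (len < 1 + 1 + 3 * 8 + 1 || buf[0] != DEMO_MSG_PREPARE)
		return -1;
	if (buf[1] != 0)				/* unrecognized flags */
		return -1;

	out->prepare_lsn = get_u64(buf + off);
	off += 8;
	out->end_lsn = get_u64(buf + off);
	off += 8;
	out->prepare_time = (int64_t) get_u64(buf + off);
	off += 8;
	if (out->prepare_lsn == 0 || out->end_lsn == 0)
		return -1;					/* 0 stands in for an invalid LSN */

	/* The gid must be NUL-terminated and fit the fixed-size buffer. */
	nul = memchr(buf + off, 0, len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off));
	if (gidlen >= DEMO_GID_SIZE)
		return -1;
	memcpy(out->gid, buf + off, gidlen + 1);
	return 0;
}
```

In the patch itself the type byte is consumed by apply_dispatch before the
reader runs; the sketch keeps it in the buffer only so the example is
self-contained.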
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 91600ac..10ad8a7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2672,7 +2672,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2703,7 +2703,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * prepare if it was not decoded earlier. We don't need to decode the xact
 	 * for aborts if it is not done already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed3acad..fddcc64 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,12 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * Transactions that were prepared before this LSN
+	 * need to be sent later, along with commit prepared.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +278,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +306,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +367,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
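To spell out how two_phase_at is meant to be used across decode.c, logical.c
and reorderbuffer.c: the LSN is recorded once, the first time two-phase
decoding is enabled for the slot, and at COMMIT PREPARED time any transaction
whose prepare lies before that LSN has to be replayed in full because its
PREPARE was never sent. A minimal sketch of just those two decisions follows.
DemoLsn and both function names are hypothetical, 0 stands in for
InvalidXLogRecPtr, and it assumes final_lsn of a prepared transaction is the
LSN of its prepare record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t DemoLsn;		/* hypothetical stand-in for XLogRecPtr */

/*
 * Record the enabling LSN only if it has not been set before, so that a
 * later restart does not move the boundary forward.
 */
DemoLsn
demo_set_two_phase_at(DemoLsn current, DemoLsn start_lsn)
{
	assert(start_lsn != 0);
	return (current != 0) ? current : start_lsn;
}

/*
 * At COMMIT PREPARED: the prepare was skipped earlier if it happened before
 * two-phase decoding was enabled, so it must be sent now. Aborts need no
 * replay.
 */
bool
demo_must_replay_prepare(DemoLsn prepare_lsn, DemoLsn two_phase_at,
						 bool is_commit)
{
	return is_commit && prepare_lsn < two_phase_at;
}
```

With two_phase_at = 500, a transaction prepared at LSN 400 is replayed together
with its commit prepared, while one prepared at LSN 600 was already sent at
prepare time and only the commit prepared goes out.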
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6ed3181..f64a2fb 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,36 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly 'enabled'
+	 * at that time.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
+	{
+		if (AllTablesyncsREADY())
+		{
+			ereport(LOG,
+					(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+					MySubscription->name)));
+
+			proc_exit(0);
+		}
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1053,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1139,135 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are all tablesyncs READY?
+ */
+bool
+AllTablesyncsREADY(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables.
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		/*
+		 * When process_syncing_tables_for_apply changes the state from
+		 * SYNCDONE to READY, that change is actually written directly into
+		 * the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/* When no tablesyncs are busy, then all are READY */
+	return !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state to new_state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 21d304a..1a76c38 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,63 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two-phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will already be ahead of the apply worker's current location. This would
+ * lead to an "empty prepare", because later when the apply worker does the
+ * commit prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider: initially it was on and we
+ * received some prepare, then we turned it off; now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we can allow changing this option from off to on but not sure if
+ * that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted for
+ * two_phase = on, except when copy_data = false.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +116,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -720,6 +778,151 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("transaction identifier \"%s\" is already in use",
+						begin_data.gid)));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the
+	 * EndTransactionBlock called within the PrepareTransactionBlock
+	 * below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1954,6 +2157,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2430,6 +2655,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3085,9 +3313,42 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophasestate;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains as
+		 * the tri-state PENDING until all tablesyncs have reached READY state.
+		 * Only then, can it become properly ENABLED.
+		 */
+		bool all_tables_ready = AllTablesyncsREADY();
+
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING && all_tables_ready)
+		{
+			/* Start streaming with two_phase enabled */
+			walrcv_startstreaming(wrconn, &options, true);
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options, false);
+		}
+
+		ereport(LOG,
+			(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+			MySubscription->name,
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+			"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options, false);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
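For reviewers, the tri-state handling spread across tablesync.c and worker.c
can be summarised as two small decisions: when a running apply worker should
exit so that it gets restarted, and what a starting apply worker does with the
current state. The sketch below mirrors the conditions in this patch as I read
them; the enum, struct and function names are hypothetical and the real state
is a char column in pg_subscription rather than an enum.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum DemoTwoPhaseState
{
	DEMO_DISABLED,
	DEMO_PENDING,
	DEMO_ENABLED
} DemoTwoPhaseState;

typedef struct DemoStartup
{
	DemoTwoPhaseState next_state;	/* state to store in the catalog */
	bool		stream_two_phase;	/* pass two_phase 'on' to the publisher? */
} DemoStartup;

/*
 * Running apply worker: exit (and get restarted) once every tablesync is
 * READY while the subscription is still PENDING.
 */
bool
demo_should_restart(DemoTwoPhaseState state, bool all_tables_ready)
{
	return state == DEMO_PENDING && all_tables_ready;
}

/*
 * Starting apply worker: only the PENDING -> ENABLED transition asks the
 * publisher for two_phase. An already ENABLED subscription relies on the
 * flag persisted in the slot, so the option is not sent again.
 */
DemoStartup
demo_apply_worker_start(DemoTwoPhaseState state, bool all_tables_ready)
{
	DemoStartup result;

	assert(state == DEMO_DISABLED || state == DEMO_PENDING ||
		   state == DEMO_ENABLED);

	result.next_state = state;
	result.stream_two_phase = false;

	if (state == DEMO_PENDING && all_tables_ready)
	{
		result.stream_two_phase = true;
		result.next_state = DEMO_ENABLED;
	}
	return result;
}
```

So a subscription created with two_phase = on goes PENDING, stays PENDING while
any tablesync is not READY, and flips to ENABLED on the first apply worker
start after all of them are READY.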
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..9dc7fc2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,18 +171,22 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -232,9 +254,27 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
+	/*
+	 * Enabling both streaming and two-phase is not currently supported; the
+	 * combination could be supported in the future.
+	 */
+	if (*enable_twophase && *enable_streaming)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("unsupported combination of options")));
 }
 
 /*
@@ -245,6 +285,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -269,7 +310,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -310,6 +352,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * We don't decide if two-phase is disabled here. Just mark the option
+		 * as passed in or not. Two-phase could remain enabled because a previous
+		 * start-up enabled it. But we only allow the option to be passed in
+		 * with sufficient version of the protocol, and when the output plugin
+		 * supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -322,8 +385,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +405,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +426,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +886,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1293,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
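
For reviewers, the order of the two_phase option checks spread across
parse_output_parameters() and pgoutput_startup() can be summarised in one
place. The following is only a sketch under stated assumptions: the function
check_two_phase_option() and the TwoPhaseCheck enum are hypothetical and do
not exist in the patch; the protocol version constant uses the value 2 that
this patch defines for LOGICALREP_PROTO_TWOPHASE_VERSION_NUM.

```c
#include <stdbool.h>

/* Value of LOGICALREP_PROTO_TWOPHASE_VERSION_NUM in this patch version. */
#define PROTO_TWOPHASE_VERSION_NUM 2

/* Hypothetical result codes, one per outcome in the patch's checks. */
typedef enum
{
	TWOPHASE_OPT_NOT_GIVEN,			/* ctx->twophase_opt_given = false */
	TWOPHASE_OPT_GIVEN,				/* ctx->twophase_opt_given = true */
	TWOPHASE_ERR_WITH_STREAMING,	/* "unsupported combination of options" */
	TWOPHASE_ERR_PROTO_TOO_OLD,		/* proto_version below the minimum */
	TWOPHASE_ERR_PLUGIN_UNSUPPORTED	/* ctx->twophase is false */
} TwoPhaseCheck;

/*
 * Hypothetical helper that applies the checks in the same order as the
 * patch: the streaming conflict is detected while parsing the options, the
 * remaining checks happen afterwards during startup.
 */
static TwoPhaseCheck
check_two_phase_option(bool enable_twophase, bool enable_streaming,
					   unsigned protocol_version,
					   bool plugin_supports_twophase)
{
	if (enable_twophase && enable_streaming)
		return TWOPHASE_ERR_WITH_STREAMING;

	if (!enable_twophase)
		return TWOPHASE_OPT_NOT_GIVEN;

	if (protocol_version < PROTO_TWOPHASE_VERSION_NUM)
		return TWOPHASE_ERR_PROTO_TOO_OLD;

	if (!plugin_supports_twophase)
		return TWOPHASE_ERR_PLUGIN_UNSUPPORTED;

	return TWOPHASE_OPT_GIVEN;
}
```
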
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..27fdbd7 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -322,7 +325,7 @@ start_replication:
 				}
 			;
 
-/* START_REPLICATION SLOT slot LOGICAL %X/%X options */
+/* START_REPLICATION SLOT slot LOGICAL %X/%X options*/
 start_logical_replication:
 			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options
 				{
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c..91224e0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8532296..1325c29 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -364,7 +364,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
@@ -388,7 +388,7 @@ WalReceiverMain(void)
 		options.slotname = slotname[0] != '\0' ? slotname : NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		ThisTimeLineID = startpointTLI;
-		if (walrcv_startstreaming(wrconn, &options))
+		if (walrcv_startstreaming(wrconn, &options, false))
 		{
 			if (first_stream)
 				ereport(LOG,
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..eeafdf8 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h" /* For 2PC tri-state. */
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4228,6 +4229,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4273,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4303,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4329,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4346,6 +4358,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = { LOGICALREP_TWOPHASE_STATE_DISABLED, '\0' };
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4387,6 +4400,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..98776db 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..286f1c9 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index d3fb734..bff3306 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2764,7 +2764,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..3f06d1f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,11 @@
 
 #include "nodes/pg_list.h"
 
+/* two_phase tri-state values. */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +59,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +98,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c253403..6548b27 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Does the output plugin want to turn on two-phase?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..282cf61 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,16 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. This has the same protocol
+ * version requirement as LOGICALREP_PROTO_STREAM_VERSION_NUM because these
+ * features were both introduced in the same release (PG14).
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -54,10 +61,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +126,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +134,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure is used to
+ * hold the information for both prepare and commit prepared. For commit
+ * prepared, prepare_lsn and preparetime store the commit LSN and the
+ * commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +183,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
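
The comment on LogicalRepRollbackPreparedTxnData says the gid alone is not
enough to decide whether the downstream should apply a ROLLBACK PREPARED.
A minimal sketch of that matching rule follows. It is an illustration only:
PreparedXactInfo and prepared_xact_matches() are hypothetical names, the
struct layout is assumed, and this is not claimed to be how LookupGXact() is
implemented in the patch. uint64_t and int64_t stand in for XLogRecPtr and
TimestampTz.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of a prepared transaction known to the subscriber. */
typedef struct
{
	const char *gid;				/* global transaction identifier */
	uint64_t	prepare_end_lsn;	/* stand-in for XLogRecPtr */
	int64_t		prepare_time;		/* stand-in for TimestampTz */
} PreparedXactInfo;

/*
 * Hypothetical check: the rollback applies only if the gid, the prepare end
 * LSN, and the prepare timestamp all match.  A local transaction that merely
 * reuses the same gid does not match.
 */
static bool
prepared_xact_matches(const PreparedXactInfo *local, const char *gid,
					  uint64_t prepare_end_lsn, int64_t prepare_time)
{
	return strcmp(local->gid, gid) == 0 &&
		local->prepare_end_lsn == prepare_end_lsn &&
		local->prepare_time == prepare_time;
}
```
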
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6c9f2c6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -643,7 +643,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..1f4b253 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,9 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..e5b6329 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -301,7 +302,7 @@ typedef void (*walrcv_readtimelinehistoryfile_fn) (WalReceiverConn *conn,
  * didn't switch to copy-mode.
  */
 typedef bool (*walrcv_startstreaming_fn) (WalReceiverConn *conn,
-										  const WalRcvStreamOptions *options);
+										  const WalRcvStreamOptions *options, bool two_phase);
 
 /*
  * walrcv_endstreaming_fn
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -412,16 +414,16 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_server_version(conn)
 #define walrcv_readtimelinehistoryfile(conn, tli, filename, content, size) \
 	WalReceiverFunctions->walrcv_readtimelinehistoryfile(conn, tli, filename, content, size)
-#define walrcv_startstreaming(conn, options) \
-	WalReceiverFunctions->walrcv_startstreaming(conn, options)
+#define walrcv_startstreaming(conn, options, two_phase) \
+	WalReceiverFunctions->walrcv_startstreaming(conn, options, two_phase)
 #define walrcv_endstreaming(conn, next_tli) \
 	WalReceiverFunctions->walrcv_endstreaming(conn, next_tli)
 #define walrcv_receive(conn, buffer, wait_fd) \
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..efd4533 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsREADY(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..a9664e8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d1d5d2..1f3038f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1343,12 +1343,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
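
The regression output above shows the "Two phase commit" column holding 'd' for an ordinary subscription and 'p' right after CREATE SUBSCRIPTION ... WITH (two_phase = true), while the TAP tests in the next patch wait for 'e'. A minimal sketch of that tri-state follows. It is not the patch's code: the macro names, the SubOptions struct and all three functions are hypothetical, and the rule that 'p' becomes 'e' only once every tablesync is ready is an assumption inferred from the AllTablesyncsREADY() and UpdateTwoPhaseState() declarations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the state codes seen in the \dRs+ output. */
#define TWOPHASE_DISABLED 'd'
#define TWOPHASE_PENDING  'p'
#define TWOPHASE_ENABLED  'e'

/* Illustrative subset of the CREATE SUBSCRIPTION options. */
typedef struct SubOptions
{
	bool		streaming;
	bool		two_phase;
} SubOptions;

/*
 * Return NULL if the option combination is acceptable, otherwise the error
 * text that the regression test expects.
 */
const char *
validate_create_options(const SubOptions *opts)
{
	if (opts->streaming && opts->two_phase)
		return "two_phase = true and streaming = true are mutually exclusive options";
	return NULL;
}

/* A subscription created with two_phase = true starts out pending. */
char
initial_twophase_state(const SubOptions *opts)
{
	return opts->two_phase ? TWOPHASE_PENDING : TWOPHASE_DISABLED;
}

/*
 * Assumed transition rule: pending becomes enabled once all tablesyncs are
 * ready; every other state is left unchanged.
 */
char
next_twophase_state(char current, bool all_tablesyncs_ready)
{
	assert(current == TWOPHASE_DISABLED ||
		   current == TWOPHASE_PENDING ||
		   current == TWOPHASE_ENABLED);

	if (current == TWOPHASE_PENDING && all_tablesyncs_ready)
		return TWOPHASE_ENABLED;
	return current;
}
```

Under these assumptions, a subscription created with connect = false has no tablesync progress to report, so it would stay at 'p', which matches the expected output after the failed ALTER SUBSCRIPTION attempts.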

v62-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 6ef2952862219a522b218eee7372bdc4ad4e9f5d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 18 Mar 2021 07:22:07 -0400
Subject: [PATCH v62] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 337 ++++++++++++++++++++++++
 src/test/subscription/t/021_twophase_cascade.pl | 280 ++++++++++++++++++++
 2 files changed, 617 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..b135997
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,337 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_cascade.pl b/src/test/subscription/t/021_twophase_cascade.pl
new file mode 100644
index 0000000..8e9dfdc
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_cascade.pl
@@ -0,0 +1,280 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v62-0003-Fix-apply-worker-dev-logs.patch (application/octet-stream)
From b13d2aed04ddf6d71b57ce87ecd3c98a9b0f66e8 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 18 Mar 2021 07:23:48 -0400
Subject: [PATCH v62] Fix apply worker (dev logs)

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for
debugging/testing the patch.
---
 src/backend/replication/logical/tablesync.c | 38 +++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index f64a2fb..7505a90 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -407,6 +407,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 */
 	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
 	{
+		elog(LOG, "!!> two_phase enable is still pending");
 		if (AllTablesyncsREADY())
 		{
 			ereport(LOG,
@@ -1155,6 +1156,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1172,11 +1175,16 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
 			table_states_not_ready = lappend(table_states_not_ready, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1190,6 +1198,8 @@ AllTablesyncsREADY(void)
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AllTablesyncsREADY");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1201,6 +1211,12 @@ AllTablesyncsREADY(void)
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
+		elog(LOG,
+			 "!!> AllTablesyncsREADY: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
 		/*
 		 * When the process_syncing_tables_for_apply changes the state from
 		 * SYNCDONE to READY, that change is actually written directly into
@@ -1212,6 +1228,7 @@ AllTablesyncsREADY(void)
 		 */
 		if (rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AllTablesyncsREADY: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1223,6 +1240,11 @@ AllTablesyncsREADY(void)
 		pgstat_report_stat(false);
 	}
 
+	elog(LOG,
+		 "!!> AllTablesyncsREADY: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
 	/* When no tablesyncs are busy, then all are READY */
 	return !found_busy;
 }
@@ -1270,4 +1292,20 @@ UpdateTwoPhaseState(char new_state)
 	table_close(rel, RowExclusiveLock);
 
 	CommitTransactionCommand();
+
+#if 1
+	/* This is just debugging, for confirmation the update worked. */
+	{
+		Subscription *new_s;
+
+		StartTransactionCommand();
+		new_s = GetSubscription(MySubscription->oid, false);
+		elog(LOG,
+			 "!!> 2PC Tri-state for \"%s\": '%c' ==> '%c'",
+			 MySubscription->name,
+			 MySubscription->twophasestate,
+			 new_s->twophasestate);
+		CommitTransactionCommand();
+	}
+#endif
 }
-- 
1.8.3.1

#269osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#251)
1 attachment(s)
RE: [HACKERS] logical decoding of two-phase transactions

Hi

On Saturday, March 13, 2021 5:01 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:

On Friday, March 12, 2021 5:40 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v58*

Thank you for updating those. I'm testing the patchset and I think it would
be preferable to add two more simple types of tests to 020_twophase.pl,
because these cases aren't checked by v58.

(1) Execute a single PREPARE TRANSACTION that affects several tables
(each connected to its corresponding publication) at the same time,
and confirm they are synced correctly.

(2) Execute a single PREPARE TRANSACTION that affects multiple
subscribers, and confirm they are synced correctly.
This doesn't mean cascading standbys as in 022_twophase_cascade.pl.
Imagine that there is one publisher and two subscribers to it.

I have attached a patch for those two tests. The patch works with v62.
I ran it in a loop more than 100 times and saw no failures.

Best Regards,
Takamichi Osumi

Attachments:

0001-add-2-types-of-new-tests-for-2PC.patch (application/octet-stream)
From 3194ddea80a15b31427a63b18bc75ccb0c8aeb23 Mon Sep 17 00:00:00 2001
From: Osumi Takamichi <osumi.takamichi@fujitsu.com>
Date: Fri, 19 Mar 2021 05:35:46 +0000
Subject: [PATCH v01] add 2 types of new tests for 2PC

Add two scenarios to the existing TAP tests for logical decoding of 2PC.
The first test checks that the data is synced correctly
when two publications are affected within one prepared transaction.
The second test checks the case
where there are two subscribers (standbys) for one publication.

---
 src/test/subscription/t/020_twophase.pl | 98 ++++++++++++++++++++++++++++++++-
 1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index b135997..ac866b2 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 20;
+use Test::More tests => 24;
 
 ###############################
 # Setup
@@ -311,10 +311,105 @@ $result = $node_subscriber->safe_psql('postgres',
 is($result, qq(0), 'transaction is aborted on subscriber');
 
 ###############################
+# Test multiple publications by single 2PC execution.
+# Add one more set of publication and subscription
+# and change two tables within one PREPARED TRANSACTION,
+# to affect two corresponding publications at the same time.
+###############################
+$node_publisher->safe_psql('postgres',
+						   "CREATE TABLE new_tab (a int PRIMARY KEY);");
+$node_publisher->safe_psql('postgres',
+						   "CREATE PUBLICATION new_tap_pub FOR TABLE new_tab;");
+$node_subscriber->safe_psql('postgres',
+							"CREATE TABLE new_tab (a int PRIMARY KEY);");
+my $new_appname = 'new_tap_sub';
+$node_subscriber->safe_psql('postgres',"
+	CREATE SUBSCRIPTION new_tap_sub
+	CONNECTION '$publisher_connstr application_name=$new_appname'
+	PUBLICATION new_tap_pub
+	WITH (two_phase = on)");
+
+my $new_caughtup_query =
+  "SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$new_appname';";
+$node_publisher->poll_query_until('postgres', $new_caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# Table tab_full is connected to tap_pub publication,
+# while table new_tab is associated with new_tap_pub publication.
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (52);
+	INSERT INTO new_tab VALUES (1);
+	PREPARE TRANSACTION 'multiple_publications';
+	COMMIT PREPARED 'multiple_publications';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$node_publisher->poll_query_until('postgres', $new_caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that 2PC gets committed on subscriber for both subscriptions
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (52);");
+is($result, qq(1), 'first change got committed on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM new_tab where a IN (1);");
+is($result, qq(1), 'second change got committed on the subscriber');
+
+###############################
+# Test multiple standbys for single 2PC execution.
+# Add one more subscriber second_tap_sub, besides the existing subscriber tap_sub.
+# Then, run 2PC and check if all are synced correctly.
+###############################
+my $second_subscriber = get_new_node('second_subscriber');
+my $second_appname = 'second_app_sub';
+$second_subscriber->init(allows_streaming => 'logical');
+$second_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$second_subscriber->start;
+
+$second_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+$second_subscriber->safe_psql('postgres',"
+	CREATE SUBSCRIPTION second_tap_sub
+	CONNECTION '$publisher_connstr application_name=$second_appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on);");
+
+my $second_caughtup_query =
+  "SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$second_appname';";
+$node_publisher->poll_query_until('postgres', $second_caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (53);
+	PREPARE TRANSACTION 'multiple_standbys';
+	COMMIT PREPARED 'multiple_standbys';");
+
+$node_publisher->poll_query_until('postgres', $second_caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+$result = $second_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (53);");
+is($result, qq(1), 'the second subscriber got the change');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (53);");
+is($result, qq(1), 'the first subscriber got the change');
+
+###############################
 # check all the cleanup
 ###############################
 
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION new_tap_sub");
+$second_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION second_tap_sub");
 
 $result = $node_subscriber->safe_psql('postgres',
 	"SELECT count(*) FROM pg_subscription");
@@ -334,4 +429,5 @@ $result = $node_subscriber->safe_psql('postgres',
 is($result, qq(0), 'check replication origin was dropped on subscriber');
 
 $node_subscriber->stop('fast');
+$second_subscriber->stop('fast');
 $node_publisher->stop('fast');
-- 
2.2.0

#270Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#268)
3 attachment(s)

On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote:

Missed the patch - 0001, resending.

I have made miscellaneous changes to the patch, which include improved
comments and error messages and some coding improvements. The most
notable one is that we don't need an additional parameter in
walrcv_startstreaming if the two_phase option is set properly. My
changes are in v63-0002-Misc-changes-by-Amit; if you are fine with
those, then please merge them in the next version. I have omitted the
dev-logs patch, but feel free to submit it. I have one question:
@@ -538,10 +550,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
..
+    /* Set two_phase_at LSN only if it hasn't already been set. */
+    if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
+    {
+        MyReplicationSlot->data.two_phase_at = start_lsn;
+        slot->data.two_phase = true;
+        ReplicationSlotMarkDirty();
+        ReplicationSlotSave();
+        SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+    }

What if the walsender or apply worker restarts after setting
two_phase_at/two_phase here and updating the two_phase state in
pg_subscription? Won't we need to call SnapBuildSetTwoPhaseAt after
restart as well? If so, we probably need an else branch along these
lines:

    else if (ctx->twophase)
    {
        Assert(slot->data.two_phase_at);
        SnapBuildSetTwoPhaseAt(ctx->snapshot_builder,
                               slot->data.two_phase_at);
    }

Am I missing something?
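
To make the restart scenario concrete, here is a minimal stand-alone model in C of the two branches being discussed. It is only a sketch under stated assumptions: MockSlot, MockSnapBuild and setup_two_phase() are hypothetical stand-ins invented for illustration and are not PostgreSQL code; the real logic lives in CreateDecodingContext() and calls SnapBuildSetTwoPhaseAt(). The model assumes that the slot fields survive a restart while the snapshot builder state does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL types (hypothetical, illustration only). */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Models the persisted part of a replication slot. */
typedef struct MockSlot
{
	bool		two_phase;		/* slot decodes two-phase transactions */
	XLogRecPtr	two_phase_at;	/* LSN at which two-phase decoding was enabled */
} MockSlot;

/* Models the in-memory snapshot builder, which is rebuilt after a restart. */
typedef struct MockSnapBuild
{
	XLogRecPtr	two_phase_at;
} MockSnapBuild;

/*
 * Models the branch under discussion plus the proposed "else if".
 * ctx_twophase plays the role of ctx->twophase.
 */
void
setup_two_phase(MockSlot *slot, MockSnapBuild *builder,
				bool ctx_twophase, XLogRecPtr start_lsn)
{
	if (ctx_twophase && slot->two_phase_at == InvalidXLogRecPtr)
	{
		/* First enablement: record the LSN in the slot and the builder. */
		slot->two_phase_at = start_lsn;
		slot->two_phase = true;
		builder->two_phase_at = start_lsn;
	}
	else if (ctx_twophase)
	{
		/* After a restart: reuse the LSN already stored in the slot. */
		assert(slot->two_phase_at != InvalidXLogRecPtr);
		builder->two_phase_at = slot->two_phase_at;
	}
}
```

Calling setup_two_phase() a second time with a fresh MockSnapBuild and a later start_lsn mimics a restart: the builder should pick up the two_phase_at value stored in the slot rather than the new start_lsn. Without the second branch, the fresh builder would be left with no two_phase_at at all.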

--
With Regards,
Amit Kapila.

Attachments:

v63-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 1a93fb359481244c6b699bcb3270f05b2a941be1 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 18 Mar 2021 07:17:59 -0400
Subject: [PATCH v63 1/3] Add support for apply at prepare time to built-in
 logical replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle two-phase
transactions by replaying them on prepare.

* The prepare API for streaming transactions is not supported.

* Implement the new SUBSCRIPTION option "two_phase".

* Add a new option to enable two_phase while creating a slot.

* Introduce a tri-state for the twophase pg_subscription column.

* Restrict ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase is enabled.

* Restrict ALTER SUBSCRIPTION SET PUBLICATION WITH (refresh = true) when two_phase is enabled.

* Include documentation updates.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, C Vignesh and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         |  14 +-
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  36 +++
 src/backend/access/transam/twophase.c              |  68 ++++++
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 125 +++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  15 +-
 src/backend/replication/logical/decode.c           |   2 +-
 src/backend/replication/logical/logical.c          |  40 +++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 206 ++++++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |   4 +-
 src/backend/replication/logical/snapbuild.c        |  30 ++-
 src/backend/replication/logical/tablesync.c        | 195 ++++++++++++---
 src/backend/replication/logical/worker.c           | 265 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 204 +++++++++++++---
 src/backend/replication/repl_gram.y                |  18 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   4 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |   8 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  75 +++++-
 src/include/replication/reorderbuffer.h            |   2 +-
 src/include/replication/slot.h                     |   6 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |  12 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         |  93 +++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 38 files changed, 1362 insertions(+), 169 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5c9f4af..4b430ff 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7577,6 +7577,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..9694713 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,18 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase transactions.
+         Two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED
+         are also decoded and transmitted. In two-phase transactions, the transaction is
+         decoded and transmitted at PREPARE TRANSACTION time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..85cc8bb 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -64,7 +64,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
   <para>
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
-   option as true cannot be executed inside a transaction block.
+   option as true cannot be executed inside a transaction block. They also
+   cannot be executed with <literal>copy_data = true</literal> if the
+   subscription is using <literal>two_phase</literal> commit.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..a5c9158 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,42 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, the decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means even when
+          two_phase is enabled for the subscription, the internal two-phase state remains
+          temporarily "pending" until the initialization phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..c58c46d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if a prepared xact with the given GID, lsn and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match origin_lsn and origin_timestamp
+ * of the prepared xact, to avoid matching a prepared xact that came from a
+ * different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
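
To make the matching rule in LookupGXact easier to review, here is a minimal standalone sketch of it, outside the backend. DemoPreparedXact and demo_lookup_gxact are hypothetical names invented for this illustration; the real function reads the origin fields from the 2PC file or WAL while holding TwoPhaseStateLock, which the sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a prepared-xact entry; not GlobalTransaction. */
typedef struct DemoPreparedXact
{
	bool		valid;				/* false while the GID is not yet valid */
	const char *gid;
	uint64_t	origin_lsn;			/* end-of-prepare LSN on the origin */
	int64_t		origin_timestamp;	/* prepare time on the origin */
} DemoPreparedXact;

/*
 * Return true only if a valid entry matches on all three of GID, origin LSN
 * and origin timestamp. A GID match alone is not enough, because a different
 * prepared xact may reuse the same GID.
 */
bool
demo_lookup_gxact(const DemoPreparedXact *xacts, int nxacts,
				  const char *gid, uint64_t prepare_end_lsn,
				  int64_t origin_prepare_timestamp)
{
	assert(gid != NULL);

	for (int i = 0; i < nxacts; i++)
	{
		const DemoPreparedXact *x = &xacts[i];

		/* Ignore not-yet-valid entries and entries with another GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```

With one valid entry ("gid-1", LSN 100, time 5000), a lookup with the same GID but LSN 101 reports no match, which is the case the header comment describes.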
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..658d9f8 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..d453902 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..015b2dc 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,33 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing
+			 * so could cause transactions to be missed and lead to an
+			 * inconsistent replica.
+			 */
+			if (!twophase)
+			{
+				/*
+				 * This check is specifically for ALTER commands
+				 * with two_phase option.
+				 */
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			}
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +316,23 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase could theoretically be
+	 * supported together, allowing that needs more analysis. Hence, we
+	 * disallow this combination for now, until that analysis has been
+	 * done.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +408,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +434,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +503,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -527,8 +584,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			if (create_slot)
 			{
 				Assert(slotname);
-
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with two-phase
+				 * enabled. It will be enabled once all the tables are synced and
+				 * ready. This avoids race conditions such as prepared transactions
+				 * being skipped because their changes are not applied due to the
+				 * checks in should_apply_changes_for_rel() while tablesync for the
+				 * corresponding tables is in progress.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +899,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +934,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +963,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +1009,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -953,6 +1026,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/* See ALTER_SUBSCRIPTION_REFRESH for details on why this is not allowed. */
+					if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -982,7 +1063,35 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state we
+				 * must not allow any subsequent table initialization to occur.
+				 * So the ALTER SUBSCRIPTION ... REFRESH is disallowed when the
+				 * user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data = false,
+				 * because when copy_data is false the tablesync will start
+				 * already in READY state and will exit directly without doing
+				 * anything which could interfere with the apply worker's
+				 * message handling.
+				 *
+				 * For more details see the "TWO_PHASE COMMIT TRI-STATE LOGIC"
+				 * comment atop worker.c
+				 */
+				if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..c8f0539 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -63,7 +63,7 @@ static void libpqrcv_readtimelinehistoryfile(WalReceiverConn *conn,
 											 TimeLineID tli, char **filename,
 											 char **content, int *len);
 static bool libpqrcv_startstreaming(WalReceiverConn *conn,
-									const WalRcvStreamOptions *options);
+									const WalRcvStreamOptions *options, bool two_phase);
 static void libpqrcv_endstreaming(WalReceiverConn *conn,
 								  TimeLineID *next_tli);
 static int	libpqrcv_receive(WalReceiverConn *conn, char **buffer,
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -393,7 +394,7 @@ libpqrcv_server_version(WalReceiverConn *conn)
  */
 static bool
 libpqrcv_startstreaming(WalReceiverConn *conn,
-						const WalRcvStreamOptions *options)
+						const WalRcvStreamOptions *options, bool two_phase)
 {
 	StringInfoData cmd;
 	PGresult   *res;
@@ -433,6 +434,11 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/* set the two_phase option only if the caller specifies it. */
+		if (options->proto.logical.twophase && two_phase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +839,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +853,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f59613..6a90b56 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -730,7 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..d944b02 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,21 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when
+	 * (a) the two_phase option was enabled at the time of slot creation,
+	 * or (b) the two_phase option is passed in at restart.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || MyReplicationSlot->data.two_phase);
+
+	/* Set two_phase_at LSN only if it hasn't already been set. */
+	if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
+	{
+		MyReplicationSlot->data.two_phase_at = restart_lsn;
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, restart_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -526,6 +537,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 		start_lsn = slot->data.confirmed_flush;
 	}
 
+
 	ctx = StartupDecodingContext(output_plugin_options,
 								 start_lsn, InvalidTransactionId, false,
 								 fast_forward, xl_routine, prepare_write,
@@ -538,10 +550,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when
+	 * (a) the two_phase option was enabled at the time of slot creation,
+	 * or (b) the two_phase option is passed in at restart.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || MyReplicationSlot->data.two_phase);
+
+	/* Set two_phase_at LSN only if it hasn't already been set. */
+	if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
+	{
+		MyReplicationSlot->data.two_phase_at = start_lsn;
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +625,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and other
+	 * transactions commit between the prepare and the commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
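
For reviewers, the guard above can be modelled in a few lines. This is only a sketch with hypothetical names (demo_advance_remote_lsn, demo_replay_origin_lsns). It shows that an older LSN arriving late, as can happen when a prepare is sent along with its commit prepared, leaves the tracked position unchanged unless go_backward is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Models the guard: only move forward unless the caller asks otherwise. */
uint64_t
demo_advance_remote_lsn(uint64_t current, uint64_t incoming, bool go_backward)
{
	if (go_backward || current < incoming)
		return incoming;
	return current;				/* older value seen; don't overwrite */
}

/* Feed a sequence of reported LSNs through the guard, never going backward. */
uint64_t
demo_replay_origin_lsns(const uint64_t *lsns, size_t n)
{
	uint64_t	remote_lsn = 0;

	assert(lsns != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		remote_lsn = demo_advance_remote_lsn(remote_lsn, lsns[i], false);
	return remote_lsn;
}
```

For example, feeding the LSNs 150, 100, 180, 120 in that order ends at 180: the late 100 and 120 are ignored.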
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..488b2a2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,212 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
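
As a reading aid for the PREPARE message body written above (flags byte, prepare LSN, end LSN, timestamp, then the NUL-terminated gid), here is a rough standalone round-trip sketch. It does not use the pq_* routines: the big-endian helpers, DemoPrepareMsg, DEMO_GIDSIZE and the demo_* functions are all assumptions made up for this example, and the leading message-type byte is left out. The reader bounds the gid copy by the destination buffer size rather than using strcpy, which may be worth considering for the real readers too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_GIDSIZE 200		/* assumed gid buffer size, illustration only */
#define DEMO_FIXED_LEN ((size_t) (1 + 8 + 8 + 8))	/* flags, 2 LSNs, time */

typedef struct DemoPrepareMsg
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[DEMO_GIDSIZE];
} DemoPrepareMsg;

/* Store a 64-bit value in network byte order; returns the new offset. */
static size_t
put_u64(uint8_t *buf, size_t off, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[off + i] = (uint8_t) (v >> (56 - 8 * i));
	return off + 8;
}

/* Load a 64-bit value in network byte order and advance the offset. */
static uint64_t
get_u64(const uint8_t *buf, size_t *off)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[*off + i];
	*off += 8;
	return v;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
size_t
demo_write_prepare(uint8_t *buf, size_t buflen, const DemoPrepareMsg *msg)
{
	size_t		gidlen;
	size_t		off = 0;

	assert(buf != NULL && msg != NULL);

	gidlen = strlen(msg->gid) + 1;	/* include the terminating NUL */
	if (buflen < DEMO_FIXED_LEN + gidlen)
		return 0;

	buf[off++] = 0;				/* flags, currently always zero */
	off = put_u64(buf, off, msg->prepare_lsn);
	off = put_u64(buf, off, msg->end_lsn);
	off = put_u64(buf, off, (uint64_t) msg->prepare_time);
	memcpy(buf + off, msg->gid, gidlen);
	return off + gidlen;
}

/* Returns 0 on success, -1 if the message is malformed. */
int
demo_read_prepare(const uint8_t *buf, size_t len, DemoPrepareMsg *msg)
{
	size_t		off = 0;
	size_t		gidlen;
	const uint8_t *nul;

	assert(buf != NULL && msg != NULL);

	if (len < DEMO_FIXED_LEN + 1 || buf[off++] != 0)
		return -1;				/* too short, or unrecognized flags */

	msg->prepare_lsn = get_u64(buf, &off);
	msg->end_lsn = get_u64(buf, &off);
	msg->prepare_time = (int64_t) get_u64(buf, &off);
	if (msg->prepare_lsn == 0 || msg->end_lsn == 0)
		return -1;				/* LSNs must be set */

	/* The gid must be NUL-terminated and fit the destination buffer. */
	nul = memchr(buf + off, '\0', len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off));
	if (gidlen >= sizeof(msg->gid))
		return -1;
	memcpy(msg->gid, buf + off, gidlen + 1);
	return 0;
}
```

The COMMIT PREPARED and ROLLBACK PREPARED bodies in the patch follow the same pattern, with ROLLBACK PREPARED carrying one extra timestamp.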
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 91600ac..10ad8a7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2672,7 +2672,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2703,7 +2703,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * prepare if it was not decoded earlier. We don't need to decode the xact
 	 * for aborts if it is not done already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed3acad..fddcc64 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,12 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * Prepared transactions that were prepared before this LSN
+	 * need to be sent later, along with commit prepared.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +278,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +306,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +367,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6ed3181..f64a2fb 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,36 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * Once that has happened, we restart the apply worker and (if the
+	 * conditions are still ok) the two_phase tri-state will then become
+	 * properly 'enabled'.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
+	{
+		if (AllTablesyncsREADY())
+		{
+			ereport(LOG,
+					(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+							MySubscription->name)));
+
+			proc_exit(0);
+		}
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1053,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1139,135 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are all tablesyncs READY?
+ */
+bool
+AllTablesyncsREADY(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables.
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		/*
+		 * When process_syncing_tables_for_apply changes the state from
+		 * SYNCDONE to READY, that change is actually written directly into
+		 * the list element of table_states_not_ready.
+		 *
+		 * So the "table_states_not_ready" list might end up having a READY
+		 * state in it even though there was none when it was initially
+		 * created. This is the reason why we need to check for READY below.
+		 */
+		if (rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/* When no tablesyncs are busy, then all are READY */
+	return !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state to new_state. */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
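
A minimal, standalone sketch of the readiness check that AllTablesyncsREADY performs in the hunk above. The names `sketch_rel_state` and `sketch_all_ready` are hypothetical stand-ins, not part of the patch, and a plain array replaces the real List of SubscriptionRelState. It only illustrates two points from the hunk: an empty not-ready list counts as "all READY", and an entry that was flipped to READY in place is not treated as busy.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the SUBREL_STATE_* codes used by the patch. */
enum sketch_rel_state
{
	SKETCH_INIT,
	SKETCH_DATASYNC,
	SKETCH_SYNCDONE,
	SKETCH_READY
};

/*
 * Return true when no entry of the cached "not ready" array is still busy.
 * The array may contain READY entries because the apply worker updates the
 * cached element in place when it moves a table from SYNCDONE to READY.
 */
static bool
sketch_all_ready(const enum sketch_rel_state *states, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (states[i] != SKETCH_READY)
			return false;		/* found a busy tablesync */
	}

	/* No busy entries, including the case of an empty array. */
	return true;
}
```

As in the patch, the scan stops at the first busy entry, so the cost is bounded by the number of tables that have not finished their initial sync.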
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 21d304a..1a76c38 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,63 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two-phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will already be ahead of the apply worker's current location. This would
+ * lead to an "empty prepare", because later when the apply worker does the
+ * commit prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider: initially it was on and we
+ * received some prepare, then we turn it off; now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we could allow changing this option from off to on, but it is not
+ * clear that that alone would be useful.
+ *
+ * Finally, to avoid the problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (which would need to toggle the two_phase
+ * option from 'on' to 'off' and then back to 'on'), there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted for
+ * two_phase = on, except when copy_data = false.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +116,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -720,6 +778,151 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("transaction identifier \"%s\" is already in use",
+						begin_data.gid)));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the
+	 * EndTransactionBlock called within the PrepareTransactionBlock
+	 * below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1954,6 +2157,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2430,6 +2655,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3085,9 +3313,42 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = MySubscription->twophasestate;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains as
+		 * the tri-state PENDING until all tablesyncs have reached READY state.
+		 * Only then can it become properly ENABLED.
+		 */
+		bool all_tables_ready = AllTablesyncsREADY();
+
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING && all_tables_ready)
+		{
+			/* Start streaming with two_phase enabled */
+			walrcv_startstreaming(wrconn, &options, true);
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options, false);
+		}
+
+		ereport(LOG,
+			(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+			MySubscription->name,
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+			"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options, false);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
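
A small sketch of the start-up decision made in the ApplyWorkerMain hunk above, written as a pure function so the tri-state transition is easy to see. The `SKETCH_*` macros mirror the 'd'/'p'/'e' values of the LOGICALREP_TWOPHASE_STATE_* defines in this patch; the `SketchStartup` struct and `sketch_apply_worker_startup` function are hypothetical and exist only for illustration. It follows the hunk as posted: the third argument to walrcv_startstreaming is true only on the PENDING -> ENABLED transition.

```c
#include <stdbool.h>

/* Same character codes as LOGICALREP_TWOPHASE_STATE_* in the patch. */
#define SKETCH_TWOPHASE_DISABLED 'd'
#define SKETCH_TWOPHASE_PENDING  'p'
#define SKETCH_TWOPHASE_ENABLED  'e'

/* Hypothetical result type: what the apply worker does at start-up. */
typedef struct SketchStartup
{
	bool		request_two_phase;	/* third arg of walrcv_startstreaming */
	char		new_state;			/* tri-state value after start-up */
} SketchStartup;

/*
 * Decide how the apply worker starts streaming. Only a PENDING
 * subscription whose tablesyncs are all READY is promoted to ENABLED;
 * every other combination keeps its state and passes false.
 */
static SketchStartup
sketch_apply_worker_startup(char state, bool all_tables_ready)
{
	SketchStartup result = {false, state};

	if (state == SKETCH_TWOPHASE_PENDING && all_tables_ready)
	{
		result.request_two_phase = true;
		result.new_state = SKETCH_TWOPHASE_ENABLED;
	}

	return result;
}
```

A PENDING subscription with busy tablesyncs therefore keeps behaving as if two_phase = off until the worker restarts with all tables READY.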
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..9dc7fc2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,18 +171,22 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -232,9 +254,27 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
+	/*
+	 * Enabling both streaming and two-phase is not currently supported; the
+	 * combination could be supported in the future.
+	 */
+	if (*enable_twophase && *enable_streaming)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("unsupported combination of options")));
 }
 
 /*
@@ -245,6 +285,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -269,7 +310,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -310,6 +352,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * We don't decide if two-phase is disabled here. Just mark the option
+		 * as passed in or not. Two-phase could remain enabled because a previous
+		 * start-up enabled it. But we only allow the option to be passed in
+		 * with sufficient version of the protocol, and when the output plugin
+		 * supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -322,8 +385,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +405,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +426,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +886,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1293,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
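
The two_phase option checks in the pgoutput.c hunks above are spread over parse_output_parameters and pgoutput_startup. The sketch below condenses them into one hypothetical function so the order of the checks is visible. Everything named `Sketch*` or `sketch_*` is an illustration only, and `SKETCH_PROTO_TWOPHASE_VERSION` is a placeholder value, since LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is defined in a part of the patch not shown here. The unrelated streaming protocol checks are left out.

```c
#include <stdbool.h>

/* Placeholder for LOGICALREP_PROTO_TWOPHASE_VERSION_NUM (assumed value). */
#define SKETCH_PROTO_TWOPHASE_VERSION 3

/* Hypothetical outcomes of the option validation. */
typedef enum SketchCheck
{
	SKETCH_OPT_NOT_GIVEN,		/* ctx->twophase_opt_given = false */
	SKETCH_OPT_GIVEN,			/* ctx->twophase_opt_given = true */
	SKETCH_ERR_COMBINATION,		/* streaming and two_phase both requested */
	SKETCH_ERR_PROTO_TOO_OLD,	/* proto_version below the two-phase one */
	SKETCH_ERR_PLUGIN			/* output plugin lacks two-phase support */
} SketchCheck;

/*
 * Apply the checks in the same order as the patch: the option
 * combination first (parse_output_parameters), then the protocol
 * version and plugin capability (pgoutput_startup).
 */
static SketchCheck
sketch_check_two_phase(bool enable_twophase, bool enable_streaming,
					   unsigned int proto_version,
					   bool plugin_supports_twophase)
{
	if (enable_twophase && enable_streaming)
		return SKETCH_ERR_COMBINATION;
	if (!enable_twophase)
		return SKETCH_OPT_NOT_GIVEN;
	if (proto_version < SKETCH_PROTO_TWOPHASE_VERSION)
		return SKETCH_ERR_PROTO_TOO_OLD;
	if (!plugin_supports_twophase)
		return SKETCH_ERR_PLUGIN;
	return SKETCH_OPT_GIVEN;
}
```

Note that "not given" is not the same as "disabled": as the comment in pgoutput_startup says, two-phase may remain enabled on the slot from a previous start-up.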
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..27fdbd7 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -322,7 +325,7 @@ start_replication:
 				}
 			;
 
-/* START_REPLICATION SLOT slot LOGICAL %X/%X options */
+/* START_REPLICATION SLOT slot LOGICAL %X/%X options*/
 start_logical_replication:
 			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options
 				{
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c..91224e0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8532296..1325c29 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -364,7 +364,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
@@ -388,7 +388,7 @@ WalReceiverMain(void)
 		options.slotname = slotname[0] != '\0' ? slotname : NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		ThisTimeLineID = startpointTLI;
-		if (walrcv_startstreaming(wrconn, &options))
+		if (walrcv_startstreaming(wrconn, &options, false))
 		{
 			if (first_stream)
 				ereport(LOG,
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..eeafdf8 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h" /* For 2PC tri-state. */
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4228,6 +4229,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4273,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4303,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4329,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4346,6 +4358,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = { LOGICALREP_TWOPHASE_STATE_DISABLED, '\0' };
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4387,6 +4400,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..98776db 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..286f1c9 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 316bec8..df2d591 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2764,7 +2764,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..3f06d1f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,11 @@
 
 #include "nodes/pg_list.h"
 
+/* two_phase tri-state values. */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +59,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +98,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c253403..6548b27 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding.
 	 */
 	bool		twophase;
 
 	/*
+	 * Does the output plugin want to turn on two-phase?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..282cf61 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,16 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. This has the same protocol
+ * version requirement as LOGICALREP_PROTO_STREAM_VERSION_NUM because these
+ * features were both introduced in the same release (PG14).
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -54,10 +61,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +126,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +134,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for prepare, and commit prepared transaction.
+ * prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +183,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6c9f2c6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -643,7 +643,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..1f4b253 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,9 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..e5b6329 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -301,7 +302,7 @@ typedef void (*walrcv_readtimelinehistoryfile_fn) (WalReceiverConn *conn,
  * didn't switch to copy-mode.
  */
 typedef bool (*walrcv_startstreaming_fn) (WalReceiverConn *conn,
-										  const WalRcvStreamOptions *options);
+										  const WalRcvStreamOptions *options, bool two_phase);
 
 /*
  * walrcv_endstreaming_fn
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -412,16 +414,16 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_server_version(conn)
 #define walrcv_readtimelinehistoryfile(conn, tli, filename, content, size) \
 	WalReceiverFunctions->walrcv_readtimelinehistoryfile(conn, tli, filename, content, size)
-#define walrcv_startstreaming(conn, options) \
-	WalReceiverFunctions->walrcv_startstreaming(conn, options)
+#define walrcv_startstreaming(conn, options, two_phase) \
+	WalReceiverFunctions->walrcv_startstreaming(conn, options, two_phase)
 #define walrcv_endstreaming(conn, next_tli) \
 	WalReceiverFunctions->walrcv_endstreaming(conn, next_tli)
 #define walrcv_receive(conn, buffer, wait_fd) \
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..efd4533 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsREADY(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..a9664e8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d1d5d2..1f3038f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1343,12 +1343,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
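
For readers skimming the patch above, here is a minimal sketch of how the two_phase tri-state appears to behave, going by the pg_subscription.h defines, the pg_dump hunk, and the regression output (CREATE SUBSCRIPTION ... two_phase = true shows 'p', otherwise 'd'). The three helper functions are hypothetical names invented for illustration; they are not part of the patch, and the PENDING to ENABLED step is only my reading of the AllTablesyncsREADY()/UpdateTwoPhaseState() declarations, not the actual worker.c logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the pg_subscription.h hunk in the patch above. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'

/*
 * Hypothetical helper: the state recorded at CREATE SUBSCRIPTION time.
 * Requesting two_phase does not enable it right away; it starts as PENDING.
 */
char
initial_twophase_state(bool two_phase_requested)
{
	return two_phase_requested ? LOGICALREP_TWOPHASE_STATE_PENDING
		: LOGICALREP_TWOPHASE_STATE_DISABLED;
}

/*
 * Hypothetical helper: PENDING moves to ENABLED only once every tablesync
 * is ready (assumption based on the comments in the patch). DISABLED and
 * ENABLED are left unchanged.
 */
char
next_twophase_state(char state, bool all_tablesyncs_ready)
{
	assert(state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
		   state == LOGICALREP_TWOPHASE_STATE_PENDING ||
		   state == LOGICALREP_TWOPHASE_STATE_ENABLED);

	if (state == LOGICALREP_TWOPHASE_STATE_PENDING && all_tablesyncs_ready)
		return LOGICALREP_TWOPHASE_STATE_ENABLED;
	return state;
}

/*
 * Hypothetical helper mirroring the pg_dump hunk: ", two_phase = on" is
 * emitted for any state other than DISABLED.
 */
bool
dump_emits_two_phase(char state)
{
	return state != LOGICALREP_TWOPHASE_STATE_DISABLED;
}
```

Under these assumptions a dumped subscription in either 'p' or 'e' state is recreated with two_phase = on and then starts over from PENDING on the restored side.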

v63-0002-Misc-changes-by-Amit.patch (application/octet-stream)
From 74f703ced88de4b16a1d216b1fc42d2ac1f77e73 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Fri, 19 Mar 2021 19:46:32 +0530
Subject: [PATCH v63 2/3] Misc changes by Amit.

---
 src/backend/commands/subscriptioncmds.c            | 21 ++++++++--------
 .../libpqwalreceiver/libpqwalreceiver.c            |  6 ++---
 src/backend/replication/logical/logical.c          | 29 ++++++++++------------
 src/backend/replication/logical/snapbuild.c        |  8 +++---
 src/backend/replication/logical/worker.c           | 25 +++++++++++++++----
 src/backend/replication/pgoutput/pgoutput.c        | 19 ++++++++------
 src/backend/replication/repl_gram.y                |  2 +-
 src/backend/replication/walreceiver.c              |  2 +-
 src/include/replication/logical.h                  |  2 +-
 src/include/replication/slot.h                     |  3 ++-
 src/include/replication/walreceiver.h              |  6 ++---
 11 files changed, 70 insertions(+), 53 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 015b2dc..79178a9 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -224,7 +224,7 @@ parse_subscription_options(List *options,
 			/*
 			 * Do not allow toggling of two_phase option, this could
 			 * cause missing of transactions and lead to an inconsistent
-			 * replica.
+			 * replica. See comments atop worker.c.
 			 */
 			if (!twophase)
 			{
@@ -321,8 +321,6 @@ parse_subscription_options(List *options,
 	 * Do additional checking for the disallowed combination of two_phase and
 	 * streaming. While streaming and two_phase can theoretically be supported,
 	 * it needs more analysis to allow them together.
-	 * Hence, disabling this combination till that issue can
-	 * be addressed.
 	 */
 	if (twophase && *twophase_given && *twophase)
 	{
@@ -584,13 +582,15 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			if (create_slot)
 			{
 				Assert(slotname);
+
 				/*
-				 * Even if two_phase is set, don't create the slot with two-phase
-				 * enabled. Will enable it once all the tables are synced and ready.
-				 * This avoids race-conditions like prepared transactions being
-				 * skippped due to changes not being applied due to checks in
-				 * should_apply_changes_for_rel() when tablesync for the
-				 * corresponding tables are in progress.
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
 				 */
 				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
@@ -1083,8 +1083,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				 * anything which could interfere with the apply worker's
 				 * message handling.
 				 *
-				 * For more details see the "TWO_PHASE COMMIT TRI-STATE LOGIC"
-				 * comment atop worker.c
+				 * For more details see comments atop worker.c.
 				 */
 				if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
 					ereport(ERROR,
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index c8f0539..58c813f 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -63,7 +63,7 @@ static void libpqrcv_readtimelinehistoryfile(WalReceiverConn *conn,
 											 TimeLineID tli, char **filename,
 											 char **content, int *len);
 static bool libpqrcv_startstreaming(WalReceiverConn *conn,
-									const WalRcvStreamOptions *options, bool two_phase);
+									const WalRcvStreamOptions *options);
 static void libpqrcv_endstreaming(WalReceiverConn *conn,
 								  TimeLineID *next_tli);
 static int	libpqrcv_receive(WalReceiverConn *conn, char **buffer,
@@ -394,7 +394,7 @@ libpqrcv_server_version(WalReceiverConn *conn)
  */
 static bool
 libpqrcv_startstreaming(WalReceiverConn *conn,
-						const WalRcvStreamOptions *options, bool two_phase)
+						const WalRcvStreamOptions *options)
 {
 	StringInfoData cmd;
 	PGresult   *res;
@@ -435,7 +435,7 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
 		/* set the two_phase option only if the caller specifies it. */
-		if (options->proto.logical.twophase && two_phase &&
+		if (options->proto.logical.twophase &&
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", two_phase 'on'");
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d944b02..5143d8f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -432,20 +432,18 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions when
-	 * (a) the two_phase is enabled at the time of slot creation,
-	 * or (b) when the two_phase option is passed in at restart.
+	 * We allow decoding of prepared transactions when the two_phase is enabled
+	 * at the time of slot creation, or when the two_phase option is given at
+	 * the streaming start.
 	 */
-	ctx->twophase &= (ctx->twophase_opt_given || MyReplicationSlot->data.two_phase);
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
 
-	/* Set two_phase_at LSN only if it hasn't already been set. */
-	if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
 	{
-		MyReplicationSlot->data.two_phase_at = restart_lsn;
 		slot->data.two_phase = true;
 		ReplicationSlotMarkDirty();
 		ReplicationSlotSave();
-		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, restart_lsn);
 	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
@@ -537,7 +535,6 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 		start_lsn = slot->data.confirmed_flush;
 	}
 
-
 	ctx = StartupDecodingContext(output_plugin_options,
 								 start_lsn, InvalidTransactionId, false,
 								 fast_forward, xl_routine, prepare_write,
@@ -550,17 +547,17 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions when
-	 * (a) the two_phase is enabled at the time of slot creation,
-	 * or (b) when the two_phase option is passed in at restart.
+	 * We allow decoding of prepared transactions when the two_phase is enabled
+	 * at the time of slot creation, or when the two_phase option is given at
+	 * the streaming start.
 	 */
-	ctx->twophase &= (ctx->twophase_opt_given || MyReplicationSlot->data.two_phase);
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
 
-	/* Set two_phase_at LSN only if it hasn't already been set. */
-	if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
 	{
-		MyReplicationSlot->data.two_phase_at = start_lsn;
 		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
 		ReplicationSlotMarkDirty();
 		ReplicationSlotSave();
 		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index fddcc64..12f0cf9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,10 +165,12 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which two-phase decoding was enabled.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions from an LSN < this LSN
-	 * needs to be sent later along with commit prepared.
+	 * Prepared transactions skipped because two-phase was not yet enabled, or
+	 * because the initial snapshot does not cover them, lie before this
+	 * point and need to be sent later along with commit prepared.
 	 */
 	XLogRecPtr	two_phase_at;
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1a76c38..26f8069 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -816,6 +816,19 @@ apply_handle_prepare(StringInfo s)
 
 	Assert(prepare_data.prepare_lsn == remote_final_lsn);
 
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because
+	 * at commit prepared time, we won't know whether we have skipped
+	 * preparing a transaction because of no change.
+	 *
+	 * XXX: We could optimize this so that at commit prepared time we first
+	 * check whether we have prepared the transaction, but that doesn't seem
+	 * worthwhile because such cases shouldn't be common. Also, as of now,
+	 * two different subscriptions can receive the same prepared transaction
+	 * GID, which can cause confusion at commit prepared time if we skip
+	 * preparing the transaction.
+	 */
 	ensure_transaction();
 
 	/*
@@ -3313,7 +3326,7 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
-	options.proto.logical.twophase = MySubscription->twophasestate;
+	options.proto.logical.twophase = false;
 
 	if (!am_tablesync_worker())
 	{
@@ -3324,16 +3337,18 @@ ApplyWorkerMain(Datum main_arg)
 		 */
 		bool all_tables_ready = AllTablesyncsREADY();
 
-		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING && all_tables_ready)
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			all_tables_ready)
 		{
 			/* Start streaming with two_phase enabled */
-			walrcv_startstreaming(wrconn, &options, true);
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
 			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
 			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
 		}
 		else
 		{
-			walrcv_startstreaming(wrconn, &options, false);
+			walrcv_startstreaming(wrconn, &options);
 		}
 
 		ereport(LOG,
@@ -3347,7 +3362,7 @@ ApplyWorkerMain(Datum main_arg)
 	else
 	{
 		/* Start normal logical streaming replication. */
-		walrcv_startstreaming(wrconn, &options, false);
+		walrcv_startstreaming(wrconn, &options);
 	}
 
 	/* Run the main loop. */
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9dc7fc2..fc4b1ad 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -267,14 +267,17 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
+
 	/*
-	 * Enabling both streaming and two-phase is not a currently supported
-	 * combination, could be supported in future.
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * it needs more analysis to allow them together.
 	 */
 	if (*enable_twophase && *enable_streaming)
 		ereport(ERROR,
 				(errcode(ERRCODE_SYNTAX_ERROR),
-				 errmsg("unsupported combination of options")));
+				 errmsg("%s and %s are mutually exclusive options",
+						"two_phase", "streaming")));
 }
 
 /*
@@ -353,11 +356,11 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		in_streaming = false;
 
 		/*
-		 * We don't decide if two-phae is disabled here. Just mark the option
-		 * as passed in or not. Two-phase could remain enabled because a previous
-		 * start-up enabled in. But we only allow the option to be passed in
-		 * with sufficient version of the protocol, and when the output plugin
-		 * supports it.
+		 * Here, we just check whether the two-phase option was passed in;
+		 * whether to enable it is decided at a later point. It remains
+		 * enabled if a previous start-up has done so. But we only allow the
+		 * option to be passed in with a sufficient version of the protocol,
+		 * and when the output plugin supports it.
 		 */
 		if (!enable_twophase)
 			ctx->twophase_opt_given = false;
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 27fdbd7..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -325,7 +325,7 @@ start_replication:
 				}
 			;
 
-/* START_REPLICATION SLOT slot LOGICAL %X/%X options*/
+/* START_REPLICATION SLOT slot LOGICAL %X/%X options */
 start_logical_replication:
 			K_START_REPLICATION K_SLOT IDENT K_LOGICAL RECPTR plugin_options
 				{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 1325c29..8e7edae 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -388,7 +388,7 @@ WalReceiverMain(void)
 		options.slotname = slotname[0] != '\0' ? slotname : NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		ThisTimeLineID = startpointTLI;
-		if (walrcv_startstreaming(wrconn, &options, false))
+		if (walrcv_startstreaming(wrconn, &options))
 		{
 			if (first_stream)
 				ereport(LOG,
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 6548b27..5c1ce7e 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,7 +90,7 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
-	 * Does output plugin wants to turn on two-phase?
+	 * Is two-phase option given by output plugin?
 	 */
 	bool		twophase_opt_given;
 
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1f4b253..db68551 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,7 +92,8 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we enabled two_phase commit for this slot.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
 	XLogRecPtr	two_phase_at;
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e5b6329..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -302,7 +302,7 @@ typedef void (*walrcv_readtimelinehistoryfile_fn) (WalReceiverConn *conn,
  * didn't switch to copy-mode.
  */
 typedef bool (*walrcv_startstreaming_fn) (WalReceiverConn *conn,
-										  const WalRcvStreamOptions *options, bool two_phase);
+										  const WalRcvStreamOptions *options);
 
 /*
  * walrcv_endstreaming_fn
@@ -414,8 +414,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_server_version(conn)
 #define walrcv_readtimelinehistoryfile(conn, tli, filename, content, size) \
 	WalReceiverFunctions->walrcv_readtimelinehistoryfile(conn, tli, filename, content, size)
-#define walrcv_startstreaming(conn, options, two_phase) \
-	WalReceiverFunctions->walrcv_startstreaming(conn, options, two_phase)
+#define walrcv_startstreaming(conn, options) \
+	WalReceiverFunctions->walrcv_startstreaming(conn, options)
 #define walrcv_endstreaming(conn, next_tli) \
 	WalReceiverFunctions->walrcv_endstreaming(conn, next_tli)
 #define walrcv_receive(conn, buffer, wait_fd) \
-- 
1.8.3.1
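
For readers following the logical.c hunks in the patch above, the rule applied in CreateDecodingContext() can be sketched in isolation. This is only an illustrative model, not the patch's code: SlotSketch, CtxSketch and resolve_two_phase() are hypothetical names, LSNs are plain integers, and the real function additionally marks the slot dirty, saves it, and calls SnapBuildSetTwoPhaseAt().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the persistent slot fields touched by the patch. */
typedef struct SlotSketch
{
	bool		two_phase;		/* slot already allows two-phase decoding */
	unsigned long two_phase_at;	/* LSN at which two-phase was enabled */
} SlotSketch;

/* Hypothetical stand-in for the relevant LogicalDecodingContext fields. */
typedef struct CtxSketch
{
	bool		twophase;			/* plugin provides two-phase callbacks */
	bool		twophase_opt_given; /* option given at streaming start */
} CtxSketch;

/*
 * Decide whether prepared transactions are decoded, and mark the slot the
 * first time two-phase is turned on. Returns true if the slot was changed
 * (the real code would then mark it dirty and save it).
 */
static bool
resolve_two_phase(CtxSketch *ctx, SlotSketch *slot, unsigned long start_lsn)
{
	/* Enabled at slot creation, or requested at streaming start. */
	ctx->twophase &= (ctx->twophase_opt_given || slot->two_phase);

	/* Mark the slot only once; later restarts keep the original LSN. */
	if (ctx->twophase && !slot->two_phase)
	{
		slot->two_phase = true;
		slot->two_phase_at = start_lsn;
		return true;
	}
	return false;
}
```

Under these assumptions, a slot that was marked once stays two-phase on later restarts even when the option is not passed again, and its two_phase_at value is not overwritten.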

v63-0003-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 8ebfcfa36f4b21452f3f357b3029cc010c4eec87 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 18 Mar 2021 07:22:07 -0400
Subject: [PATCH v63 3/3] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 337 ++++++++++++++++++++++++
 src/test/subscription/t/021_twophase_cascade.pl | 280 ++++++++++++++++++++
 2 files changed, 617 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_cascade.pl
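
Both test files wait for subtwophasestate to reach 'e' before preparing anything. As a rough model of why that wait is needed, the apply worker's start-up decision in this version of the patch (the ApplyWorkerMain hunk in worker.c) can be sketched as below. The type and function names are hypothetical, and the 'd' and 'p' state characters are assumptions; only 'e' appears in the tests.

```c
#include <stdbool.h>

/* Tri-state of a subscription's two_phase option ('d' and 'p' are assumed). */
typedef enum TwoPhaseStateSketch
{
	TWOPHASE_DISABLED = 'd',
	TWOPHASE_PENDING = 'p',
	TWOPHASE_ENABLED = 'e'
} TwoPhaseStateSketch;

/* What the apply worker asks for when it starts streaming. */
typedef struct StartDecision
{
	bool		request_two_phase;	/* send two_phase 'on' to the publisher */
	TwoPhaseStateSketch next_state; /* state recorded after start-up */
} StartDecision;

/*
 * The option is requested only on the PENDING -> ENABLED transition, and
 * only once every tablesync has reached READY. In all other cases the
 * option is left off and the state is unchanged; an already enabled slot
 * remembers its own two_phase flag.
 */
static StartDecision
decide_stream_start(TwoPhaseStateSketch state, bool all_tables_ready)
{
	StartDecision d = { false, state };

	if (state == TWOPHASE_PENDING && all_tables_ready)
	{
		d.request_two_phase = true;
		d.next_state = TWOPHASE_ENABLED;
	}
	return d;
}
```

In this model a subscription created with two_phase = on stays in the pending state until the initial table sync finishes, which is why a PREPARE issued earlier would not be replicated as a prepared transaction.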

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..b135997
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,337 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_cascade.pl b/src/test/subscription/t/021_twophase_cascade.pl
new file mode 100644
index 0000000..8e9dfdc
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_cascade.pl
@@ -0,0 +1,280 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#271Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#270)

On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote:

Missed the patch - 0001, resending.

@@ -538,10 +550,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
..
+ /* Set two_phase_at LSN only if it hasn't already been set. */
+ if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
+ {
+ MyReplicationSlot->data.two_phase_at = start_lsn;
+ slot->data.two_phase = true;
+ ReplicationSlotMarkDirty();
+ ReplicationSlotSave();
+ SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+ }

What if the walsender or apply worker restarts after setting
two_phase_at/two_phase here and updating the two_phase state in
pg_subscription? Won't we need to set SnapBuildSetTwoPhaseAt after
restart as well?

After a restart, two_phase_at will be set by calling
AllocateSnapshotBuilder with the two_phase_at value stored in the slot:

@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
  ctx->reorder = ReorderBufferAllocate();
  ctx->snapshot_builder =
  AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
- need_full_snapshot, slot->data.initial_consistent_point);
+ need_full_snapshot, slot->data.two_phase_at);

and then in AllocateSnapshotBuilder:

@@ -309,7 +306,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
  builder->initial_xmin_horizon = xmin_horizon;
  builder->start_decoding_at = start_lsn;
  builder->building_full_snapshot = need_full_snapshot;
- builder->initial_consistent_point = initial_consistent_point;
+ builder->two_phase_at = two_phase_at;
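
The restart path described above can be sketched in isolation. This is only a minimal model under stated assumptions, not PostgreSQL code: `Lsn`, `SlotData`, `Builder`, `allocate_builder` and `create_decoding_context` are hypothetical stand-ins for XLogRecPtr, the slot's persistent data, the snapshot builder, AllocateSnapshotBuilder() and the CreateDecodingContext() hunk, and an LSN of 0 is assumed to mean "not set".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr; 0 means "not set" */

/* Hypothetical model of the persisted part of a replication slot. */
typedef struct SlotData
{
	bool		two_phase;
	Lsn			two_phase_at;
} SlotData;

/* Hypothetical model of the snapshot builder state. */
typedef struct Builder
{
	Lsn			start_decoding_at;
	Lsn			two_phase_at;
} Builder;

/* Models AllocateSnapshotBuilder(): the builder is seeded from the slot. */
static Builder
allocate_builder(Lsn start_lsn, Lsn two_phase_at)
{
	Builder		b;

	assert(start_lsn != 0);
	b.start_decoding_at = start_lsn;
	b.two_phase_at = two_phase_at;
	return b;
}

/*
 * Models the CreateDecodingContext() hunk: record two_phase_at only the
 * first time two-phase decoding is requested for this slot.
 */
static Builder
create_decoding_context(SlotData *slot, Lsn start_lsn, bool want_two_phase)
{
	Builder		b = allocate_builder(start_lsn, slot->two_phase_at);

	if (want_two_phase && slot->two_phase_at == 0)
	{
		slot->two_phase_at = start_lsn;	/* would be saved with the slot */
		slot->two_phase = true;
		b.two_phase_at = start_lsn;		/* models SnapBuildSetTwoPhaseAt() */
	}
	return b;
}
```

Calling `create_decoding_context` twice on the same slot, first with start LSN 100 and then with 250, leaves two_phase_at at 100 in both the slot and the second builder, which is the property the restart question is about.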

regards,
Ajin Cherian
Fujitsu Australia

#272Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#267)
3 attachment(s)

Please find attached the latest patch set v64*

Differences from v62* are:

* Rebased to HEAD @ yesterday 19/Mar.

* Addresses the following feedback issues:

----

From Osumi-san 19/Mar -
/messages/by-id/OSBPR01MB4888930C23E17AF29EDB9D82ED689@OSBPR01MB4888.jpnprd01.prod.outlook.com

(64) Done. New tests added. Supplied patch by Osumi-san.

(65) Done. New tests added. Supplied patch by Osumi-san.

----

From Amit 16/Mar -
/messages/by-id/CAA4eK1Kwah+MimFMR3jPY5cSqpGFVh5zfV2g4=gTphaPsacoLw@mail.gmail.com

(89) Done. Added more comments explaining the AllTablesReady() implementation.

----

From Peter 17/Mar (internal)

(94) Done. Improved comment to two_phase option parsing code

----

From Amit 17/Mar -
/messages/by-id/CAA4eK1LNLA20ci3_qqNQv7BYRTy3HqiAsOfuieqo6tJ2GeYuJw@mail.gmail.com

(97) Done. Improved comment to two_phase option parsing code

----

From Amit 18/Mar -
/messages/by-id/CAA4eK1J9A_9hsxE6m_1c6CsrMsBeeaRbaLX2P16ucJrpN25-EQ@mail.gmail.com

(101) Done. Improved comment for worker.c. Apply supplied patch from
Amit. No equivalent text was put in the PG docs at this time because we
are still awaiting responses on another thread [1] that might impact
what we may want to write. Please raise a new feedback comment
if/when you decide the PG docs should be updated.

(102) Fixed. Use different log level for subscription starting message

----

From Amit 19/Mar (internal)

(104) Done. Rename function AllTablesyncsREADY to AllTablesyncsReady

----

From Amit 19/Mar -
/messages/by-id/CAA4eK1JLz7ypPdbkPjHQW5c9vOZO5onOwb+fSLsArHQjg6dNhQ@mail.gmail.com

(105) Done. Miscellaneous fixes. Apply supplied patch from Amit.

-----
[1]: /messages/by-id/CALDaNm06R_ppr5ibwS1-FLDKGqUjHr-1VPdk-yJWU1TP_zLLig@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v64-0001-Add-support-for-apply-at-prepare-time-to-built-i.patch (application/octet-stream)
From 44e006aafef3d393a745c7eb96ce41d1572f70ee Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 20 Mar 2021 11:49:15 +1100
Subject: [PATCH v64] Add support for apply at prepare time to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* prepare API for streaming transactions is not supported.

* implement new SUBSCRIPTION option "two_phase".

* add new option to enable two_phase while creating a slot.

* introduction of tri-state for twophase pg_subscription column.

* restrict ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* restrict ALTER SUBSCRIPTION SET PUBLICATION WITH (refresh = true) when two_phase enabled.

* include documentation update.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.
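
The subscription-side rules listed above can be sketched as a few small predicates. This is a simplified illustration, not the patch's code: the enum and all four function names are hypothetical, the state letters follow the documented subtwophasestate codes, and the promotion from pending to enabled is assumed to happen once all tables have finished their initial synchronization.

```c
#include <assert.h>
#include <stdbool.h>

/* State codes as documented for pg_subscription.subtwophasestate. */
enum
{
	TWOPHASE_DISABLED = 'd',
	TWOPHASE_PENDING = 'p',
	TWOPHASE_ENABLED = 'e'
};

/* CREATE SUBSCRIPTION: a requested two_phase starts as pending, not enabled. */
static char
initial_twophase_state(bool two_phase_requested)
{
	return two_phase_requested ? TWOPHASE_PENDING : TWOPHASE_DISABLED;
}

/* Assumed promotion rule: pending becomes enabled once all tables are ready. */
static char
advance_twophase_state(char state, bool all_tables_ready)
{
	assert(state == TWOPHASE_DISABLED ||
		   state == TWOPHASE_PENDING ||
		   state == TWOPHASE_ENABLED);

	if (state == TWOPHASE_PENDING && all_tables_ready)
		return TWOPHASE_ENABLED;
	return state;
}

/* two_phase = true and streaming = true are rejected together. */
static bool
options_compatible(bool two_phase, bool streaming)
{
	return !(two_phase && streaming);
}

/* REFRESH with copy_data is only allowed while two_phase is disabled. */
static bool
refresh_allowed(char state, bool copy_data)
{
	return state == TWOPHASE_DISABLED || !copy_data;
}
```

Note that in this sketch a subscription in the pending state is already treated as two-phase for the refresh restriction, matching the `!= LOGICALREP_TWOPHASE_STATE_DISABLED` checks in the patch.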

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, C Vignesh and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         |  14 +-
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  36 +++
 src/backend/access/transam/twophase.c              |  68 +++++
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 125 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  11 +-
 src/backend/replication/logical/decode.c           |   2 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 206 +++++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |   4 +-
 src/backend/replication/logical/snapbuild.c        |  32 ++-
 src/backend/replication/logical/tablesync.c        | 204 ++++++++++++---
 src/backend/replication/logical/worker.c           | 291 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 207 ++++++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |   8 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  75 +++++-
 src/include/replication/reorderbuffer.h            |   2 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         |  93 +++++--
 src/test/regress/sql/subscription.sql              |  25 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 38 files changed, 1394 insertions(+), 161 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5c9f4af..4b430ff 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7577,6 +7577,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..9694713 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,18 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase transactions.
+         Two-phase commands like PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED
+         are also decoded and transmitted. In two-phase transactions, the transaction is
+         decoded and transmitted at PREPARE TRANSACTION time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..85cc8bb 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -64,7 +64,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
   <para>
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
-   option as true cannot be executed inside a transaction block.
+   option as true cannot be executed inside a transaction block. They also
+   cannot be executed with <literal>copy_data = true</literal> if the
+   subscription is using <literal>two_phase</literal> commit.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..a5c9158 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,42 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, the decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means even when
+          two_phase is enabled for the subscription, the internal two-phase state remains
+          temporarily "pending" until the initialization phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..c58c46d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..658d9f8 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..d453902 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..79f46d0 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,36 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could
+			 * cause missing of transactions and lead to an inconsistent
+			 * replica. See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated
+			 * from AlterSubscription.
+			 */
+			if (!twophase)
+			{
+				/*
+				 * This check is specifically for ALTER commands
+				 * with two_phase option.
+				 */
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			}
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +319,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +409,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +435,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +504,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +586,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables are in progress. See
+				 * comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +902,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +937,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +966,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +1012,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -953,6 +1029,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/* See ALTER_SUBSCRIPTION_REFRESH for details why this is not allowed. */
+					if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -982,7 +1066,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state we
+				 * must not allow any subsequent table initialization to occur.
+				 * So the ALTER SUBSCRIPTION ... REFRESH is disallowed when the
+				 * user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data = false,
+				 * because when copy_data is false the tablesync will start
+				 * already in READY state and will exit directly without doing
+				 * anything which could interfere with the apply worker's
+				 * message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..58c813f 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,11 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/* set the two_phase option only if the caller specifies it. */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +839,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +853,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f59613..6a90b56 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -730,7 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..5143d8f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..488b2a2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,212 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 91600ac..10ad8a7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2672,7 +2672,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2703,7 +2703,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * prepare if it was not decoded earlier. We don't need to decode the xact
 	 * for aborts if it is not done already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed3acad..12f0cf9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,14 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because previously two-phase
+	 * was not enabled or are not covered by initial snapshot need to be sent
+	 * later along with commit prepared and they must be before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +280,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +308,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +369,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6ed3181..233649e 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,36 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly 'enabled'
+	 * at that time.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
+	{
+		if (AllTablesyncsReady())
+		{
+			ereport(LOG,
+					(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+					MySubscription->name)));
+
+			proc_exit(0);
+		}
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1053,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1139,144 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are all tablesyncs READY?
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables.
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		/*
+		 * When the process_syncing_tables_for_apply changes the state from
+		 * SYNCDONE to READY, that change is actually written directly into
+		 * the list element of table_states_not_ready.
+		 *
+		 * The table_states_valid flag is not immediately updated, so
+		 * FetchTableStates does not rebuild the "table_states_not_ready" list
+		 * because it is unaware that it needs to.
+		 *
+		 * It means the "table_states_not_ready" list might end up having
+		 * a READY state in it even though there was none when it was
+		 * initially created.
+		 *
+		 * This is why the code is testing for READY. And because a READY in the
+		 * table_states_not_ready list is the exception rather than the rule it
+		 * means we will nearly always break from this loop at the first
+		 * iteration.
+		 */
+		if (rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/* When no tablesyncs are busy, then all are READY */
+	return !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 21d304a..cf5ad27 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,73 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions the subscription two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider, initially, it was on and we
+ * have received some prepare then we turn it off, now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we can allow changing this option from off to on but
+ * not sure if that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted for
+ * two_phase = on, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, one of the prepares will be successful and
+ * others will fail, in which case the server will send them again. Once the
+ * commit prepared is done for the first one, the next prepare will be
+ * successful. We have thought of appending some unique identifier (like subid)
+ * to the GID but that won't work for cascaded standby setup as the GID can
+ * become too long.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +126,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -720,6 +788,164 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	if (LookupGXact(begin_data.gid, begin_data.end_lsn, begin_data.committime))
+		ereport(ERROR,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("transaction identifier \"%s\" is already in use",
+						begin_data.gid)));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because
+	 * at commit prepared time, we won't know whether we have skipped
+	 * preparing a transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common. Also, as of now, the
+	 * two different subscriptions can receive the same prepared transaction
+	 * GID and can cause confusion at the time of commit prepared if we skip
+	 * preparing the transaction.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the
+	 * EndTransactionBlock called within the PrepareTransactionBlock
+	 * below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(prepare_data.gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(prepare_data.gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(rollback_data.gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1954,6 +2180,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2430,6 +2678,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3085,9 +3336,45 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when two_phase mode is requested by the user, the state
+		 * remains PENDING until all tablesyncs have reached the READY state.
+		 * Only then can it become ENABLED.
+		 */
+		bool all_tables_ready = AllTablesyncsReady();
+
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			all_tables_ready)
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options);
+		}
+
+		ereport(DEBUG1,
+			(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+			MySubscription->name,
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+			"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
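
The tri-state handling in the ApplyWorkerMain hunk above can be read as a small state machine. What follows is a minimal standalone sketch, not code from the patch: the helper name twophase_next_state and the local TWOPHASE_* constants are hypothetical stand-ins that mirror the 'd'/'p'/'e' values added to pg_subscription.h, and the catalog update and the walrcv_startstreaming() call are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the LOGICALREP_TWOPHASE_STATE_* values in the patch. */
#define TWOPHASE_DISABLED 'd'
#define TWOPHASE_PENDING  'p'
#define TWOPHASE_ENABLED  'e'

/*
 * Hypothetical helper: given the stored state and whether every tablesync
 * has reached READY, return the state after apply worker start-up and report
 * in *request_twophase whether the two_phase option would be sent when
 * streaming starts.
 */
char
twophase_next_state(char state, bool all_tables_ready, bool *request_twophase)
{
	assert(state == TWOPHASE_DISABLED ||
		   state == TWOPHASE_PENDING ||
		   state == TWOPHASE_ENABLED);

	/* As in the hunk, the option is only requested on PENDING -> ENABLED. */
	*request_twophase = (state == TWOPHASE_PENDING && all_tables_ready);

	return *request_twophase ? TWOPHASE_ENABLED : state;
}
```

Note that, as in the hunk, a subscription that is already ENABLED does not send the option again. The pgoutput_startup comment in this patch says two-phase stays enabled once a previous start-up has enabled it.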
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..fc4b1ad 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,18 +171,22 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -232,9 +254,30 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * it needs more analysis to allow them together.
+	 */
+	if (*enable_twophase && *enable_streaming)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("%s and %s are mutually exclusive options",
+						"two_phase", "streaming")));
 }
 
 /*
@@ -245,6 +288,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -269,7 +313,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -310,6 +355,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option was passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient protocol version,
+		 * and when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -322,8 +388,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +408,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +429,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +889,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1296,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
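
The option checks added to parse_output_parameters (reject a repeated two_phase option, then reject two_phase together with streaming) can be sketched in isolation. This is an illustrative sketch under stated assumptions: SketchOption and sketch_check_options are hypothetical names, a plain array replaces the List of DefElem nodes, and errors are returned as strings instead of being raised with ereport().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a DefElem: an option name and a boolean value. */
typedef struct SketchOption
{
	const char *name;
	bool		value;
} SketchOption;

/*
 * Returns NULL on success, or a static error string modelled on the messages
 * in the hunk.  Unknown option names are ignored here, whereas pgoutput
 * raises an error for them.
 */
const char *
sketch_check_options(const SketchOption *opts, size_t nopts,
					 bool *enable_streaming, bool *enable_twophase)
{
	bool		streaming_given = false;
	bool		twophase_given = false;

	assert(opts != NULL || nopts == 0);

	*enable_streaming = false;
	*enable_twophase = false;

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "streaming") == 0)
		{
			if (streaming_given)
				return "conflicting or redundant options";
			streaming_given = true;
			*enable_streaming = opts[i].value;
		}
		else if (strcmp(opts[i].name, "two_phase") == 0)
		{
			if (twophase_given)
				return "conflicting or redundant options";
			twophase_given = true;
			*enable_twophase = opts[i].value;
		}
	}

	/* The combination is checked once, after all options are parsed. */
	if (*enable_twophase && *enable_streaming)
		return "two_phase and streaming are mutually exclusive options";

	return NULL;
}
```

Checking the combination after the loop, as the patch does, keeps the result independent of the order in which the two options appear.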
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c..91224e0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8532296..8e7edae 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -364,7 +364,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index eb988d7..eeafdf8 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h" /* For 2PC tri-state. */
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4228,6 +4229,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4271,9 +4273,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4294,6 +4303,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4319,6 +4329,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4346,6 +4358,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = { LOGICALREP_TWOPHASE_STATE_DISABLED, '\0' };
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4387,6 +4400,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
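
The dumpSubscription hunk compares the catalog value, which arrives as a string, against a one-character state by building a small NUL-terminated array. A minimal sketch of that comparison follows, assuming a hypothetical helper name sketch_needs_two_phase_clause and a local copy of the 'd' constant:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Local stand-in for LOGICALREP_TWOPHASE_STATE_DISABLED from the patch. */
#define SKETCH_TWOPHASE_DISABLED 'd'

/*
 * Hypothetical helper: true when the dumped CREATE SUBSCRIPTION command
 * should carry "two_phase = on".
 */
bool
sketch_needs_two_phase_clause(const char *subtwophasestate)
{
	/* Turn the single state character into a string usable with strcmp(). */
	char		two_phase_disabled[] = {SKETCH_TWOPHASE_DISABLED, '\0'};

	assert(subtwophasestate != NULL);

	return strcmp(subtwophasestate, two_phase_disabled) != 0;
}
```

As in the hunk, both the pending ('p') and enabled ('e') states are dumped as two_phase = on.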
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 0a2213f..98776db 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -638,6 +638,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 20af5a9..286f1c9 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6071,7 +6071,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6097,13 +6097,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 316bec8..df2d591 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2764,7 +2764,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..3f06d1f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,11 @@
 
 #include "nodes/pg_list.h"
 
+/* two_phase tri-state values. */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +59,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +98,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c253403..5c1ce7e 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..282cf61 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,16 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. This has the same protocol
+ * version requirement as LOGICALREP_PROTO_STREAM_VERSION_NUM because these
+ * features were both introduced in the same release (PG14).
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -54,10 +61,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +126,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +134,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. The same structure is used to
+ * hold the information for both prepare and commit prepared. For commit
+ * prepared, prepare_lsn and preparetime store the commit LSN and commit
+ * time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +183,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
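
The comment on LogicalRepRollbackPreparedTxnData explains why the gid alone cannot identify the prepared transaction on the subscriber. The body of LookupGXact is not part of this section, so the following is only a sketch of the matching idea under assumed types: SketchPreparedXact and sketch_lookup_gxact are hypothetical, and plain integers stand in for XLogRecPtr and TimestampTz.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of a prepared transaction known to the subscriber. */
typedef struct SketchPreparedXact
{
	const char *gid;
	uint64_t	prepare_end_lsn;	/* stand-in for XLogRecPtr */
	int64_t		prepare_time;		/* stand-in for TimestampTz */
} SketchPreparedXact;

/*
 * Returns true only when the gid, the prepare end LSN and the prepare time
 * all match, so a local transaction that merely reuses the gid is not
 * mistaken for the replicated one.
 */
bool
sketch_lookup_gxact(const SketchPreparedXact *xacts, size_t nxacts,
					const char *gid, uint64_t prepare_end_lsn,
					int64_t prepare_time)
{
	assert(gid != NULL);

	for (size_t i = 0; i < nxacts; i++)
	{
		if (strcmp(xacts[i].gid, gid) == 0 &&
			xacts[i].prepare_end_lsn == prepare_end_lsn &&
			xacts[i].prepare_time == prepare_time)
			return true;
	}

	return false;
}
```

When the lookup fails, apply_handle_rollback_prepared in this patch skips FinishPreparedTransaction() but still advances the flush position.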
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6c9f2c6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -643,7 +643,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..db68551 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..daf6ad4 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..a9664e8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d1d5d2..1f3038f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1343,12 +1343,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
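
Note on the option rules exercised by the regression tests above: the expected output shows three rejections (two_phase and streaming together at CREATE time, any ALTER of two_phase, and setting streaming = true on a two-phase subscription). The following is only a standalone sketch of those rules, not the code from the patch. The struct and function names (SubOptsSketch, check_create_options, check_alter_options) are hypothetical; the error strings are copied from subscription.out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option set for illustration; not the struct used by the patch. */
typedef struct SubOptsSketch
{
	bool		streaming;
	bool		two_phase;
} SubOptsSketch;

/*
 * CREATE SUBSCRIPTION check: returns NULL when the combination is allowed,
 * otherwise the error text seen in the expected regression output.
 */
const char *
check_create_options(const SubOptsSketch *opts)
{
	if (opts->streaming && opts->two_phase)
		return "two_phase = true and streaming = true are mutually exclusive options";
	return NULL;
}

/*
 * ALTER SUBSCRIPTION ... SET (...) check.
 *   sub_two_phase       - subscription was created with two_phase = true
 *   sets_two_phase      - the ALTER mentions two_phase at all
 *   sets_streaming_true - the ALTER asks for streaming = true
 */
const char *
check_alter_options(bool sub_two_phase, bool sets_two_phase,
					bool sets_streaming_true)
{
	if (sets_two_phase)
		return "cannot alter two_phase option";
	if (sets_streaming_true && sub_two_phase)
		return "cannot set streaming = true for two-phase enabled subscription";
	return NULL;
}
```

In this sketch, ALTER SUBSCRIPTION ... SET (streaming = false) on a two-phase subscription passes both checks, which matches the tests above where only streaming = true is rejected.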

v64-0003-Fix-apply-worker-dev-logs.patch (application/octet-stream)
From 27605a3a4e85f7f293186a46208e7c28549ec59b Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 20 Mar 2021 12:18:15 +1100
Subject: [PATCH v64] Fix apply worker (dev logs)

NOT TO BE COMMITTED.

This patch only adds developer logging that may help when
debugging/testing the main patch.
---
 src/backend/replication/logical/tablesync.c | 38 +++++++++++++++++++++++++++++
 src/backend/replication/logical/worker.c    |  4 +--
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 233649e..6858dc3 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -407,6 +407,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 */
 	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
 	{
+		elog(LOG, "!!> two_phase enable is still pending");
 		if (AllTablesyncsReady())
 		{
 			ereport(LOG,
@@ -1155,6 +1156,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1172,11 +1175,16 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
 			table_states_not_ready = lappend(table_states_not_ready, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1190,6 +1198,8 @@ AllTablesyncsReady(void)
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AllTablesyncsReady");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1201,6 +1211,12 @@ AllTablesyncsReady(void)
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
+		elog(LOG,
+			 "!!> AllTablesyncsReady: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
 		/*
 		 * When the process_syncing_tables_for_apply changes the state from
 		 * SYNCDONE to READY, that change is actually written directly into
@@ -1221,6 +1237,7 @@ AllTablesyncsReady(void)
 		 */
 		if (rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AllTablesyncsReady: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1232,6 +1249,11 @@ AllTablesyncsReady(void)
 		pgstat_report_stat(false);
 	}
 
+	elog(LOG,
+		 "!!> AllTablesyncsReady: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
 	/* When no tablesyncs are busy, then all are READY */
 	return !found_busy;
 }
@@ -1279,4 +1301,20 @@ UpdateTwoPhaseState(char new_state)
 	table_close(rel, RowExclusiveLock);
 
 	CommitTransactionCommand();
+
+#if 1
+	/* This is just debugging, for confirmation the update worked. */
+	{
+		Subscription *new_s;
+
+		StartTransactionCommand();
+		new_s = GetSubscription(MySubscription->oid, false);
+		elog(LOG,
+			 "!!> 2PC Tri-state for \"%s\": '%c' ==> '%c'",
+			 MySubscription->name,
+			 MySubscription->twophasestate,
+			 new_s->twophasestate);
+		CommitTransactionCommand();
+	}
+#endif
 }
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index cf5ad27..0ddfd10 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3362,8 +3362,8 @@ ApplyWorkerMain(Datum main_arg)
 			walrcv_startstreaming(wrconn, &options);
 		}
 
-		ereport(DEBUG1,
-			(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s.",
+		ereport(LOG,
+			(errmsg("!!> logical replication apply worker for subscription \"%s\" two_phase is %s.",
 			MySubscription->name,
 			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
 			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
-- 
1.8.3.1
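
Note on the tri-state that the logging above prints: the regression output shows 'd' and 'p' in the "Two phase commit" column, and the TAP test in the next attachment polls subtwophasestate for 'e'. The following is a standalone sketch of that state mapping and of the PENDING check guarded by AllTablesyncsReady(). The macro definitions are local stand-ins so the snippet compiles outside the PostgreSQL tree, and both helper functions (twophase_state_name, can_enable_twophase) are hypothetical names, not functions from the patch.

```c
/* Local stand-ins for the catalog constants; values as seen in the test output. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED	'd'
#define LOGICALREP_TWOPHASE_STATE_PENDING	'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED	'e'

/* Hypothetical helper: map the tri-state char to the name used in the log line. */
const char *
twophase_state_name(char state)
{
	switch (state)
	{
		case LOGICALREP_TWOPHASE_STATE_DISABLED:
			return "DISABLED";
		case LOGICALREP_TWOPHASE_STATE_PENDING:
			return "PENDING";
		case LOGICALREP_TWOPHASE_STATE_ENABLED:
			return "ENABLED";
		default:
			return "?";
	}
}

/*
 * Hypothetical helper: the apply worker only acts on a subscription whose
 * state is still PENDING, and only once every tablesync has reached READY.
 */
int
can_enable_twophase(char state, int all_tablesyncs_ready)
{
	return state == LOGICALREP_TWOPHASE_STATE_PENDING && all_tablesyncs_ready;
}
```

Under this reading, a subscription created with two_phase = true stays at 'p' until the initial table synchronization has finished, which is why the TAP test waits for 'e' before issuing its first PREPARE TRANSACTION.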

v64-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From c8f61ed496692b63dd5c84c70db9fcad14807ca9 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 20 Mar 2021 11:57:43 +1100
Subject: [PATCH v64] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 433 ++++++++++++++++++++++++
 src/test/subscription/t/021_twophase_cascade.pl | 280 +++++++++++++++
 2 files changed, 713 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..ac866b2
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,433 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = '';");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test multiple publications by single 2PC execution.
+# Add one more set of publication and subscription
+# and change two tables within one PREPARED TRANSACTION,
+# to affect two corresponding publications at the same time.
+###############################
+$node_publisher->safe_psql('postgres',
+						   "CREATE TABLE new_tab (a int PRIMARY KEY);");
+$node_publisher->safe_psql('postgres',
+						   "CREATE PUBLICATION new_tap_pub FOR TABLE new_tab;");
+$node_subscriber->safe_psql('postgres',
+							"CREATE TABLE new_tab (a int PRIMARY KEY);");
+my $new_appname = 'new_tap_sub';
+$node_subscriber->safe_psql('postgres',"
+	CREATE SUBSCRIPTION new_tap_sub
+	CONNECTION '$publisher_connstr application_name=$new_appname'
+	PUBLICATION new_tap_pub
+	WITH (two_phase = on)");
+
+my $new_caughtup_query =
+  "SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$new_appname';";
+$node_publisher->poll_query_until('postgres', $new_caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# Table tab_full is connected to tap_pub publication,
+# while table new_tab is associated with new_tap_pub publication.
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (52);
+	INSERT INTO new_tab VALUES (1);
+	PREPARE TRANSACTION 'multiple_publications';
+	COMMIT PREPARED 'multiple_publications';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$node_publisher->poll_query_until('postgres', $new_caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that 2PC gets committed on subscriber for both subscriptions
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (52);");
+is($result, qq(1), 'first change got committed on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM new_tab where a IN (1);");
+is($result, qq(1), 'second change got committed on the subscriber');
+
+###############################
+# Test multiple standbys for single 2PC execution.
+# Add one more subscriber second_tap_sub, besides the existing subscriber tap_sub.
+# Then, run 2PC and check if all are synced correctly.
+###############################
+my $second_subscriber = get_new_node('second_subscriber');
+my $second_appname = 'second_app_sub';
+$second_subscriber->init(allows_streaming => 'logical');
+$second_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$second_subscriber->start;
+
+$second_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+$second_subscriber->safe_psql('postgres',"
+	CREATE SUBSCRIPTION second_tap_sub
+	CONNECTION '$publisher_connstr application_name=$second_appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on);");
+
+my $second_caughtup_query =
+  "SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$second_appname';";
+$node_publisher->poll_query_until('postgres', $second_caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (53);
+	PREPARE TRANSACTION 'multiple_standbys';
+	COMMIT PREPARED 'multiple_standbys';");
+
+$node_publisher->poll_query_until('postgres', $second_caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+$result = $second_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (53);");
+is($result, qq(1), 'the second subscriber got the change');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (53);");
+is($result, qq(1), 'the first subscriber got the change');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION new_tap_sub");
+$second_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION second_tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$second_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_cascade.pl b/src/test/subscription/t/021_twophase_cascade.pl
new file mode 100644
index 0000000..8e9dfdc
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_cascade.pl
@@ -0,0 +1,280 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'test_prepared_tab_full';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_prepared_xacts where gid = 'outer';");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#273Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#271)

On Sat, Mar 20, 2021 at 7:07 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote:

Missed the patch - 0001, resending.

@@ -538,10 +550,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
..
+ /* Set two_phase_at LSN only if it hasn't already been set. */
+ if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
+ {
+ MyReplicationSlot->data.two_phase_at = start_lsn;
+ slot->data.two_phase = true;
+ ReplicationSlotMarkDirty();
+ ReplicationSlotSave();
+ SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+ }

What if the walsender or apply worker restarts after setting
two_phase_at/two_phase here and updating the two_phase state in
pg_subscription? Won't we need to set SnapBuildSetTwoPhaseAt after
restart as well?

After a restart, two_phase_at will be set by calling AllocateSnapshotBuilder with two_phase_at.

Okay, that makes sense.
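
For readers following along, the exchange above can be modelled with a minimal, self-contained sketch. It is not the patch's code: SlotData, Builder, builder_alloc and create_context are hypothetical stand-ins, and the only behaviour taken from the thread is that two_phase_at is recorded once and that the snapshot builder is seeded from the stored value when it is allocated again after a restart.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;            /* stand-in for XLogRecPtr (assumption) */

/* Hypothetical model of the persisted slot data; 0 means "not set". */
typedef struct SlotData
{
    bool two_phase;
    Lsn  two_phase_at;
} SlotData;

/* Hypothetical model of the in-memory snapshot builder, rebuilt on each start. */
typedef struct Builder
{
    Lsn two_phase_at;
} Builder;

/* Models allocating the builder: it starts from the persisted slot value. */
static Builder
builder_alloc(const SlotData *slot)
{
    Builder b = { slot->two_phase_at };
    return b;
}

/*
 * Models creating a decoding context. two_phase_at is recorded only the
 * first time two_phase is requested; later calls (for example after a
 * restart) keep the stored LSN and only re-seed the builder from it.
 */
static Builder
create_context(SlotData *slot, bool want_two_phase, Lsn start_lsn)
{
    Builder b = builder_alloc(slot);

    if (want_two_phase && slot->two_phase_at == 0)
    {
        slot->two_phase_at = start_lsn;
        slot->two_phase = true;
        b.two_phase_at = start_lsn;
    }
    return b;
}
```

In this model, calling create_context() a second time with a later start_lsn leaves two_phase_at at the originally recorded LSN, which is the property the question was about.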

--
With Regards,
Amit Kapila.

#274Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#270)

On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote:

Missed the patch - 0001, resending.

I have made miscellaneous changes in the patch which includes
improving comments, error messages, and miscellaneous coding
improvements. The most notable one is that we don't need an additional
parameter in walrcv_startstreaming, if the two_phase option is set
properly. My changes are in v63-0002-Misc-changes-by-Amit, if you are
fine with those, then please merge them in the next version. I have
omitted the dev-logs patch but feel free to submit it. I have one
question:

I am fine with these changes. I see that Peter has already merged in these
changes.

thanks,
Ajin Cherian
Fujitsu Australia

#275Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#274)
3 attachment(s)

On Sat, Mar 20, 2021 at 10:09 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote:

Missed the patch - 0001, resending.

I have made miscellaneous changes in the patch which includes
improving comments, error messages, and miscellaneous coding
improvements. The most notable one is that we don't need an additional
parameter in walrcv_startstreaming, if the two_phase option is set
properly. My changes are in v63-0002-Misc-changes-by-Amit, if you are
fine with those, then please merge them in the next version. I have
omitted the dev-logs patch but feel free to submit it. I have one
question:

I am fine with these changes. I see that Peter has already merged in these changes.

I have further updated the patch to implement unique GID on the
subscriber-side as discussed in the nearby thread [1]. That requires
some changes in the test. Additionally, I have updated some comments
and docs. Let me know what you think about the changes.

[1]: /messages/by-id/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
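
As a rough illustration of what a subscriber-side unique GID could look like, here is a minimal sketch. It assumes the GID is derived from the subscription's OID and the remote transaction ID; the helper name make_subscriber_gid and the "pg_gid_%u_%u" format string are assumptions made for this example and are not stated in this message.

```c
#include <stdio.h>
#include <stdint.h>

/*
 * Hypothetical helper: build a GID that is unique per (subscription,
 * remote xid) pair, so that two subscriptions replaying the same
 * upstream GID do not collide on one node.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int
make_subscriber_gid(uint32_t subid, uint32_t remote_xid,
                    char *gid, size_t len)
{
    int n = snprintf(gid, len, "pg_gid_%u_%u",
                     (unsigned) subid, (unsigned) remote_xid);

    return (n > 0 && (size_t) n < len) ? 0 : -1;
}
```

With a scheme like this, a test can no longer look up the publisher's GID string in pg_prepared_xacts on the subscriber, which would explain why the tests need adjusting.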

--
With Regards,
Amit Kapila.

Attachments:

v65-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From b1cd3e9ab91203bbc5ce77e314ea646ee590a9ed Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 20 Mar 2021 11:49:15 +1100
Subject: [PATCH v65 1/3] Add support for prepared transactions to built-in
 logical replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION SET PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, C Vignesh, Dilip Kumar, Takamichi Osumi
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         |  16 +-
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  36 +++
 src/backend/access/transam/twophase.c              |  68 +++++
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 121 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  11 +-
 src/backend/replication/logical/decode.c           |   2 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 212 +++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |   4 +-
 src/backend/replication/logical/snapbuild.c        |  32 +-
 src/backend/replication/logical/tablesync.c        | 204 ++++++++++---
 src/backend/replication/logical/worker.c           | 328 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 207 ++++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |   8 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  77 ++++-
 src/include/replication/reorderbuffer.h            |   2 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         |  93 ++++--
 src/test/regress/sql/subscription.sql              |  25 ++
 src/tools/pgindent/typedefs.list                   |   3 +
 38 files changed, 1437 insertions(+), 161 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 68d1960..00d43d9 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7589,6 +7589,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..c285ef7 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..85cc8bb 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -64,7 +64,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
   <para>
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
-   option as true cannot be executed inside a transaction block.
+   option as true cannot be executed inside a transaction block. They also
+   cannot be executed with <literal>copy_data = true</literal> if the
+   subscription is using <literal>two_phase</literal> commit.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..a5c9158 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,42 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that replication has
+          successfully passed the initial table synchronization phase. This means that even
+          when two_phase is enabled for the subscription, the internal two-phase state remains
+          temporarily "pending" until the initialization phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          for the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6023e7c..c58c46d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2445,3 +2445,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..658d9f8 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..d453902 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..f1856d7 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,32 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica. See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated
+			 * from AlterSubscription.
+			 */
+			if (!twophase)
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			}
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +315,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +405,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +431,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +500,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +582,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +898,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL	/* no "two_phase" */);
 
 				if (slotname_given)
 				{
@@ -869,6 +933,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +962,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +1008,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -953,6 +1025,14 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/* See ALTER_SUBSCRIPTION_REFRESH for details on why this is not allowed. */
+					if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -982,7 +1062,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL);	/* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state we
+				 * must not allow any subsequent table initialization to occur.
+				 * So the ALTER SUBSCRIPTION ... REFRESH is disallowed when the
+				 * user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data = false,
+				 * because when copy_data is false the tablesync will start
+				 * already in READY state and will exit directly without doing
+				 * anything which could interfere with the apply worker's
+				 * message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..58c813f 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,11 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/* set the two_phase option only if the caller specifies it. */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +839,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +853,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f59613..6a90b56 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -730,7 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..5143d8f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is enabled
+	 * at the time of slot creation, or when the two_phase option is given at
+	 * the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is enabled
+	 * at the time of slot creation, or when the two_phase option is given at
+	 * the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..21487ec 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,218 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepBeginPrepareData *begin_data)
+{
+	/* read fields */
+	begin_data->final_lsn = pq_getmsgint64(in);
+	if (begin_data->final_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "final_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->committime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c291b05..ccb786e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2672,7 +2672,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2703,7 +2703,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * prepare if it was not decoded earlier. We don't need to decode the xact
 	 * for aborts if it is not done already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed3acad..12f0cf9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,14 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because previously two-phase
+	 * was not enabled or are not covered by initial snapshot need to be sent
+	 * later along with commit prepared and they must be before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +280,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +308,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +369,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6ed3181..233649e 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,36 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly 'enabled'
+	 * at that time.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
+	{
+		if (AllTablesyncsReady())
+		{
+			ereport(LOG,
+					(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+					MySubscription->name)));
+
+			proc_exit(0);
+		}
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1053,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1139,144 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are all tablesyncs READY?
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	int			count = 0;
+	ListCell   *lc;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	/*
+	 * Process all not-READY tables.
+	 */
+	foreach(lc, table_states_not_ready)
+	{
+		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+
+		count++;
+		/*
+		 * When the process_syncing_tables_for_apply changes the state from
+		 * SYNCDONE to READY, that change is actually written directly into
+		 * the list element of table_states_not_ready.
+		 *
+		 * The table_states_valid flag is not immediately updated, so
+		 * FetchTableStates does not rebuild the "table_states_not_ready" list
+		 * because it is unaware that it needs to.
+		 *
+		 * It means the "table_states_not_ready" list might end up having
+		 * a READY state in it even though there was none when it was
+		 * initially created.
+		 *
+		 * This is why the code is testing for READY. And because a READY in the
+		 * table_states_not_ready list is the exception rather than the rule it
+		 * means we will nearly always break from this loop at the first
+		 * iteration.
+		 */
+		if (rstate->state != SUBREL_STATE_READY)
+		{
+			found_busy = true;
+			break;
+		}
+	}
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/* When no tablesyncs are busy, then all are READY */
+	return !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state to new_state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
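
The PENDING to ENABLED hand-off implemented by AllTablesyncsReady() and UpdateTwoPhaseState() above can be summarised as a small state function. This is only a standalone sketch, not code from the patch: the enum and next_twophase_state() are hypothetical stand-ins for MySubscription->twophasestate and the check the restarted apply worker performs.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the LOGICALREP_TWOPHASE_STATE_* tri-state. */
typedef enum
{
	TWOPHASE_DISABLED,
	TWOPHASE_PENDING,
	TWOPHASE_ENABLED
} TwoPhaseState;

/*
 * Return the state a (re)started apply worker would end up in, given
 * whether every tablesync has reached READY. Only PENDING can change;
 * DISABLED and ENABLED are left as they are.
 */
static TwoPhaseState
next_twophase_state(TwoPhaseState current, bool all_tablesyncs_ready)
{
	if (current == TWOPHASE_PENDING && all_tablesyncs_ready)
		return TWOPHASE_ENABLED;
	return current;
}
```

While any tablesync is still busy the function returns PENDING unchanged, which mirrors the subscription behaving as if two_phase = off until the initial sync is over.
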
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 21d304a..1768268 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,74 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider the case where it was
+ * initially on and we have received some prepare; if we then turn it off, at
+ * commit time the server will send the entire transaction data along with the
+ * commit. With some more analysis, we can allow changing this option from off
+ * to on, but it is not clear that that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted for
+ * two_phase = on, except when copy_data = false.
+ *
+ * We can get a prepare of the same GID more than once in the genuine case
+ * where multiple subscriptions are defined for publications on the same
+ * server and the prepared transaction has operations on tables subscribed to
+ * by those subscriptions. In such cases, if we used the GID sent by the
+ * publisher, one of the prepares would succeed and the others would fail, in
+ * which case the server would send them again. This can lead to a deadlock if
+ * the user has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting of
+ * the origin_id of the subscription and the xid of the prepared transaction)
+ * for each prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +127,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +315,10 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(RepOriginId originid, TransactionId xid,
+								   char* gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -720,6 +793,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepBeginPrepareData begin_data;
+	char	gid[GIDSIZE] PG_USED_FOR_ASSERTS_ONLY;
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(replorigin_session_origin, begin_data.xid, gid,
+						   sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.committime));
+
+	remote_final_lsn = begin_data.final_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on one node point to publications
+	 * on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(replorigin_session_origin, prepare_data.xid, gid,
+						   sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because
+	 * at commit prepared time, we won't know whether we have skipped
+	 * preparing a transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the
+	 * EndTransactionBlock called within the PrepareTransactionBlock
+	 * below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct
+	 * position in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(replorigin_session_origin, prepare_data.xid, gid,
+						   sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char	gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(replorigin_session_origin, rollback_data.xid, gid,
+						   sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point in which case we need to
+	 * skip rollback prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1954,6 +2201,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2430,6 +2699,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2913,6 +3185,22 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(RepOriginId originid, TransactionId xid,
+					   char *gid, int szgid)
+{
+	/* Origin and Transaction ids must be valid */
+	Assert(originid != InvalidRepOriginId);
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_%u_%u", originid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3085,9 +3373,45 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains as
+		 * the tri-state PENDING until all tablesyncs have reached READY state.
+		 * Only then, can it become properly ENABLED.
+		 */
+		bool all_tables_ready = AllTablesyncsReady();
+
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			all_tables_ready)
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options);
+		}
+
+		ereport(DEBUG1,
+			(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s.",
+			MySubscription->name,
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+			"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
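
To make the subscriber-side GID scheme above concrete, here is a minimal standalone sketch of the "pg_<origin>_<xid>" format that TwoPhaseTransactionGid() builds. The typedefs and the name two_phase_gid() are hypothetical stand-ins so the snippet compiles outside the backend; it assumes RepOriginId is a 16-bit and TransactionId a 32-bit unsigned integer, and that 0 is the invalid value for both.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for RepOriginId and TransactionId. */
typedef uint16_t origin_id_t;
typedef uint32_t xid_t;

/*
 * Build the subscriber-local GID from the replication origin id and the
 * remote xid, using the same "pg_%u_%u" format as the patch. The result
 * is written into the caller-supplied buffer.
 */
static void
two_phase_gid(origin_id_t originid, xid_t xid, char *gid, size_t szgid)
{
	/* Both ids must be valid (assumed: 0 means invalid). */
	assert(originid != 0);
	assert(xid != 0);

	snprintf(gid, szgid, "pg_%u_%u", (unsigned) originid, (unsigned) xid);
}
```

Because the origin id differs per subscription, two subscriptions receiving the same remote xid produce different GIDs, which is what avoids the duplicate-prepare failure described in the worker.c header comment.
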
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..fc4b1ad 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,18 +171,22 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -232,9 +254,30 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be supported,
+	 * it needs more analysis to allow them together.
+	 */
+	if (*enable_twophase && *enable_streaming)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("%s and %s are mutually exclusive options",
+						"two_phase", "streaming")));
 }
 
 /*
@@ -245,6 +288,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -269,7 +313,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -310,6 +355,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by plugin
+		 * and decide whether to enable it at later point of time. It remains
+		 * enabled if the previous start-up has done so. But we only allow the
+		 * option to be passed in with sufficient version of the protocol, and
+		 * when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -322,8 +388,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +408,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +429,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +889,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1296,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c..91224e0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8532296..8e7edae 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -364,7 +364,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index da6cc05..dddd5a3 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h" /* For 2PC tri-state. */
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4278,6 +4279,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4321,9 +4323,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4344,6 +4353,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4369,6 +4379,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4396,6 +4408,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = { LOGICALREP_TWOPHASE_STATE_DISABLED, '\0' };
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4437,6 +4450,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5340843..70c072d 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index eeac0ef..fd0d90e 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6098,7 +6098,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6124,13 +6124,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index b67f4ea..7957e2c 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2764,7 +2764,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..3f06d1f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,11 @@
 
 #include "nodes/pg_list.h"
 
+/* two_phase tri-state values. */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +59,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +98,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c253403..5c1ce7e 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..eedfc3c 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,16 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. This has the same protocol
+ * version requirement as LOGICALREP_PROTO_STREAM_VERSION_NUM because these
+ * features were both introduced in the same release (PG14).
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -54,10 +61,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +126,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +134,50 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+
+/* Begin Prepare information */
+typedef struct LogicalRepBeginPrepareData
+{
+	XLogRecPtr	final_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz committime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepBeginPrepareData;
+
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for prepare, and commit prepared transaction.
+ * prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +185,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepBeginPrepareData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6c9f2c6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -643,7 +643,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..db68551 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..daf6ad4 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..a9664e8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d1d5d2..1f3038f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1343,12 +1343,15 @@ LogicalOutputPluginWriterPrepareWrite
 LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
+LogicalRepBeginPrepareData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

v65-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 742253d202a81ed9a925cd208315adc7a645f541 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sun, 21 Mar 2021 12:49:34 +0530
Subject: [PATCH v65 2/3] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 425 ++++++++++++++++++++++++
 src/test/subscription/t/021_twophase_cascade.pl | 268 +++++++++++++++
 2 files changed, 693 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..4acc9f6
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,425 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+my $caughtup_query =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname';";
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres',
+	"ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Test multiple publications by single 2PC execution.
+# Add one more set of publication and subscription
+# and change two tables within one PREPARED TRANSACTION,
+# to affect two corresponding publications at the same time.
+###############################
+$node_publisher->safe_psql('postgres',
+						   "CREATE TABLE new_tab (a int PRIMARY KEY);");
+$node_publisher->safe_psql('postgres',
+						   "CREATE PUBLICATION new_tap_pub FOR TABLE new_tab;");
+$node_subscriber->safe_psql('postgres',
+							"CREATE TABLE new_tab (a int PRIMARY KEY);");
+my $new_appname = 'new_tap_sub';
+$node_subscriber->safe_psql('postgres',"
+	CREATE SUBSCRIPTION new_tap_sub
+	CONNECTION '$publisher_connstr application_name=$new_appname'
+	PUBLICATION new_tap_pub
+	WITH (two_phase = on)");
+
+my $new_caughtup_query =
+  "SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$new_appname';";
+$node_publisher->poll_query_until('postgres', $new_caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+# Table tab_full is connected to tap_pub publication,
+# while table new_tab is associated with new_tap_pub publication.
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (52);
+	INSERT INTO new_tab VALUES (1);
+	PREPARE TRANSACTION 'multiple_publications';
+	COMMIT PREPARED 'multiple_publications';");
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$node_publisher->poll_query_until('postgres', $new_caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+# check that 2PC gets committed on subscriber for both subscriptions
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (52);");
+is($result, qq(1), 'first change got committed on the subscriber');
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM new_tab where a IN (1);");
+is($result, qq(1), 'second change got committed on the subscriber');
+
+###############################
+# Test multiple standbys for single 2PC execution.
+# Add one more subscriber second_tap_sub, besides the existing subscriber tap_sub.
+# Then, run 2PC and check if all are synced correctly.
+###############################
+my $second_subscriber = get_new_node('second_subscriber');
+my $second_appname = 'second_app_sub';
+$second_subscriber->init(allows_streaming => 'logical');
+$second_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$second_subscriber->start;
+
+$second_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+$second_subscriber->safe_psql('postgres',"
+	CREATE SUBSCRIPTION second_tap_sub
+	CONNECTION '$publisher_connstr application_name=$second_appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on);");
+
+my $second_caughtup_query =
+  "SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$second_appname';";
+$node_publisher->poll_query_until('postgres', $second_caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (53);
+	PREPARE TRANSACTION 'multiple_standbys';
+	COMMIT PREPARED 'multiple_standbys';");
+
+$node_publisher->poll_query_until('postgres', $second_caughtup_query)
+	or die "Timed out while waiting for subscriber to catch up";
+
+$node_publisher->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for subscriber to catch up";
+
+$result = $second_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (53);");
+is($result, qq(1), 'the second subscriber got the change');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (53);");
+is($result, qq(1), 'the first subscriber got the change');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION new_tap_sub");
+$second_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION second_tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$second_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_cascade.pl b/src/test/subscription/t/021_twophase_cascade.pl
new file mode 100644
index 0000000..2e7ab52
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_cascade.pl
@@ -0,0 +1,268 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+my $caughtup_query_B =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_B';";
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+my $caughtup_query_C =
+	"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE application_name = '$appname_C';";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres',
+	"COMMIT PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres',
+	"ROLLBACK PREPARED 'test_prepared_tab_full';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "
+	COMMIT PREPARED 'outer';");
+$node_A->poll_query_until('postgres', $caughtup_query_B)
+	or die "Timed out while waiting for subscriber B to catch up";
+$node_B->poll_query_until('postgres', $caughtup_query_C)
+	or die "Timed out while waiting for subscriber C to catch up";
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
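
For readers skimming the tests above: nearly every step is followed by a poll_query_until() call that re-runs a "caught up" query (pg_current_wal_lsn() <= replay_lsn) until it returns true or the harness gives up. The following is only a rough, self-contained sketch of that retry pattern in C. The names poll_until, poll_check_fn, fake_replication and caught_up are hypothetical and are not part of the patch or of the TAP framework, and the "replication" here is a toy counter rather than a real server.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: returns true once the awaited condition holds. */
typedef bool (*poll_check_fn) (void *arg);

/*
 * Call check(arg) up to max_attempts times and stop at the first success.
 * Returns true if the condition became true, false if we "timed out".
 * A real harness would sleep between attempts; that is omitted here.
 */
static bool
poll_until(poll_check_fn check, void *arg, int max_attempts)
{
	assert(check != NULL);

	for (int i = 0; i < max_attempts; i++)
	{
		if (check(arg))
			return true;
	}
	return false;
}

/* Toy stand-in for the publisher's view of one subscriber (assumption). */
typedef struct
{
	long		current_lsn;	/* plays the role of pg_current_wal_lsn() */
	long		replay_lsn;		/* plays the role of replay_lsn */
} fake_replication;

/*
 * Toy condition: each poll lets the subscriber replay one more unit, then
 * evaluates the same comparison the tests use: current_lsn <= replay_lsn.
 */
static bool
caught_up(void *arg)
{
	fake_replication *r = arg;

	if (r->replay_lsn < r->current_lsn)
		r->replay_lsn++;
	return r->current_lsn <= r->replay_lsn;
}
```

With a fake_replication of {5, 2}, poll_until(caught_up, &r, 10) succeeds on the third attempt; with {100, 0} and only 3 attempts it returns false, which corresponds to the tests' "Timed out while waiting for subscriber to catch up" path.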

v65-0003-Fix-apply-worker-dev-logs.patch (application/octet-stream)
From 6f1bb6bd1512c119a02880eca0b3c3ee47e7847a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 20 Mar 2021 12:18:15 +1100
Subject: [PATCH v65 3/3] Fix apply worker (dev logs)

NOT TO BE COMMITTED.

This patch is only for adding some developer logging which may help for
debugging/testing the patch.
---
 src/backend/replication/logical/tablesync.c | 38 +++++++++++++++++++++++++++++
 src/backend/replication/logical/worker.c    |  4 +--
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 233649e..6858dc3 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -407,6 +407,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 */
 	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
 	{
+		elog(LOG, "!!> two_phase enable is still pending");
 		if (AllTablesyncsReady())
 		{
 			ereport(LOG,
@@ -1155,6 +1156,8 @@ FetchTableStates(bool *started_tx)
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		elog(LOG, "!!> FetchTableStates: Re-fetching the state list caches");
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1172,11 +1175,16 @@ FetchTableStates(bool *started_tx)
 			rstate = palloc(sizeof(SubscriptionRelState));
 			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
 			table_states_not_ready = lappend(table_states_not_ready, rstate);
+			elog(LOG, "!!> FetchTableStates: table_states_not_ready - added Table relid %u with state '%c'", rstate->relid, rstate->state);
 		}
 		MemoryContextSwitchTo(oldctx);
 
 		table_states_valid = true;
 	}
+	else
+	{
+		elog(LOG, "!!> FetchTableStates: Already up-to-date");
+	}
 }
 
 /*
@@ -1190,6 +1198,8 @@ AllTablesyncsReady(void)
 	int			count = 0;
 	ListCell   *lc;
 
+	elog(LOG, "!!> AllTablesyncsReady");
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -1201,6 +1211,12 @@ AllTablesyncsReady(void)
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
 		count++;
+		elog(LOG,
+			 "!!> AllTablesyncsReady: #%d. Table relid %u has state '%c'",
+			 count,
+			 rstate->relid,
+			 rstate->state);
+
 		/*
 		 * When the process_syncing_tables_for_apply changes the state from
 		 * SYNCDONE to READY, that change is actually written directly into
@@ -1221,6 +1237,7 @@ AllTablesyncsReady(void)
 		 */
 		if (rstate->state != SUBREL_STATE_READY)
 		{
+			elog(LOG, "!!> AllTablesyncsReady: Table relid %u is busy!", rstate->relid);
 			found_busy = true;
 			break;
 		}
@@ -1232,6 +1249,11 @@ AllTablesyncsReady(void)
 		pgstat_report_stat(false);
 	}
 
+	elog(LOG,
+		 "!!> AllTablesyncsReady: Scanned %d tables, and found busy = %s",
+		 count,
+		 found_busy ? "true" : "false");
+
 	/* When no tablesyncs are busy, then all are READY */
 	return !found_busy;
 }
@@ -1279,4 +1301,20 @@ UpdateTwoPhaseState(char new_state)
 	table_close(rel, RowExclusiveLock);
 
 	CommitTransactionCommand();
+
+#if 1
+	/* This is just debugging, for confirmation the update worked. */
+	{
+		Subscription *new_s;
+
+		StartTransactionCommand();
+		new_s = GetSubscription(MySubscription->oid, false);
+		elog(LOG,
+			 "!!> 2PC Tri-state for \"%s\": '%c' ==> '%c'",
+			 MySubscription->name,
+			 MySubscription->twophasestate,
+			 new_s->twophasestate);
+		CommitTransactionCommand();
+	}
+#endif
 }
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1768268..5e6bbc7 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3399,8 +3399,8 @@ ApplyWorkerMain(Datum main_arg)
 			walrcv_startstreaming(wrconn, &options);
 		}
 
-		ereport(DEBUG1,
-			(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s.",
+		ereport(LOG,
+			(errmsg("!!> logical replication apply worker for subscription \"%s\" two_phase is %s.",
 			MySubscription->name,
 			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
 			MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
-- 
1.8.3.1

#276osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Amit Kapila (#275)
1 attachment(s)
RE: [HACKERS] logical decoding of two-phase transactions

Hello

On Sunday, March 21, 2021 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Mar 20, 2021 at 10:09 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote:

Missed the patch - 0001, resending.

I have made miscellaneous changes in the patch which includes
improving comments, error messages, and miscellaneous coding
improvements. The most notable one is that we don't need an
additional parameter in walrcv_startstreaming, if the two_phase
option is set properly. My changes are in
v63-0002-Misc-changes-by-Amit, if you are fine with those, then
please merge them in the next version. I have omitted the dev-logs
patch but feel free to submit it. I have one
question:

I am fine with these changes. I see that Peter has already merged in these
changes.

I have further updated the patch to implement unique GID on the
subscriber-side as discussed in the nearby thread [1]. That requires some
changes in the test.

Thank you for your update. v65 did not cause any failures during make check-world.

I've written additional tests for ALTER SUBSCRIPTION using refresh
on an enabled subscription with two_phase = on.
I wrote them as TAP tests because refresh requires an enabled subscription,
and to get a subscription enabled we also need to set connect = true.

TAP tests run with a connection between the subscriber and the publisher,
whereas the tests in subscription.sql all use connect = false.

Just in case, I also ran 020_twophase.pl 100 times with this patch applied on top of v65,
and it did not fail once. Please have a look at the attached patch.
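
As a reading aid for the attached tests, the rule they exercise can be sketched in standalone C. This is only a model of the expected outcomes, not the code in subscriptioncmds.c: the enum, the function name and the "no error when two_phase is off" branch are assumptions made for illustration, while the two message strings are the ones the tests match against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which ALTER SUBSCRIPTION form is being checked (hypothetical names). */
typedef enum
{
	ALTER_REFRESH_PUBLICATION,	/* ALTER SUBSCRIPTION ... REFRESH PUBLICATION */
	ALTER_SET_PUBLICATION		/* ALTER SUBSCRIPTION ... SET PUBLICATION */
} AlterKind;

/*
 * Return the error text the tests expect, or NULL when the command is
 * expected to succeed. The caller passes the effective option values; the
 * tests treat copy_data as true when it is not given.
 */
const char *
expected_error(AlterKind kind, bool two_phase, bool refresh, bool copy_data)
{
	/* Assumption: without two_phase the restriction does not apply. */
	if (!two_phase)
		return NULL;

	if (kind == ALTER_REFRESH_PUBLICATION)
		return copy_data ?
			"ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled" :
			NULL;

	assert(kind == ALTER_SET_PUBLICATION);
	return (refresh && copy_data) ?
		"ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled" :
		NULL;
}
```

In this model only the combinations that would copy table data are rejected, which matches the three statements in the patch that are run through safe_psql and expected to succeed.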

Best Regards,
Takamichi Osumi

Attachments:

0001-additional-tests-for-ALTER-SUBSCRIPTION.patch (application/octet-stream)
From 8b13584e44acd2b4ddc47c420893960fed8d54ff Mon Sep 17 00:00:00 2001
From: Osumi Takamichi <osumi.takamichi@fujitsu.com>
Date: Sun, 21 Mar 2021 07:48:13 +0000
Subject: [PATCH v01] additional tests for ALTER SUBSCRIPTION

Add new tests related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION
and ALTER SUBSCRIPTION ... SET PUBLICATION with refresh option for enabled subscription.

---
 src/test/subscription/t/020_twophase.pl | 52 ++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
index ac866b2..cca4f23 100644
--- a/src/test/subscription/t/020_twophase.pl
+++ b/src/test/subscription/t/020_twophase.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 28;
 
 ###############################
 # Setup
@@ -404,6 +404,37 @@ $result = $node_subscriber->safe_psql('postgres',
 is($result, qq(1), 'the first subscriber got the change');
 
 ###############################
+# Test ALTER SUBSCRIPTION ... REFRESH PUBLICATION
+# and ALTER SUBSCRIPTION ... SET PUBLICATION with refresh option
+# for enabled subscription.
+###############################
+
+regexp_query_stderrout('ALTER SUBSCRIPTION tap_sub REFRESH PUBLICATION;',
+	qr/ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled/,
+	'test REFRESH PUBLICATION using default value of copy_data = true');
+
+regexp_query_stderrout('ALTER SUBSCRIPTION tap_sub REFRESH PUBLICATION WITH (copy_data = true);',
+	qr/ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled/,
+	'test REFRESH PUBLICATION using copy_data = true');
+
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION tap_sub REFRESH PUBLICATION WITH (copy_data = false);');
+
+regexp_query_stderrout('ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub WITH (refresh = true);',
+	qr/ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled/,
+	'test SET PUBLICATION using refresh = true and default copy_data = true ');
+
+regexp_query_stderrout('ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub WITH (refresh = true, copy_data = true);',
+	qr/ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled/,
+	'test SET PUBLICATION using refresh = true and copy_data = true ');
+
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub WITH (refresh = true, copy_data = false);');
+
+$node_subscriber->safe_psql('postgres',
+	'ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub WITH (refresh = false);');
+
+###############################
 # check all the cleanup
 ###############################
 
@@ -431,3 +462,22 @@ is($result, qq(0), 'check replication origin was dropped on subscriber');
 $node_subscriber->stop('fast');
 $second_subscriber->stop('fast');
 $node_publisher->stop('fast');
+
+sub regexp_query_stderrout
+{
+	my ($query, $expected_reg, $explanation) = @_;
+	my $stderr;
+
+	my $ret = $node_subscriber->psql('postgres',
+									 $query,
+									 stderr => \$stderr);
+	# get 3 when the passed SQL results in an error
+	if ($ret == 3)
+	{
+		ok ($stderr =~ $expected_reg, $explanation);
+	}
+	else
+	{
+		die 'the input query should cause an error to be checked by regular expression';
+	}
+}
-- 
2.2.0

#277Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#275)
4 attachment(s)

On Sun, Mar 21, 2021 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have further updated the patch to implement unique GID on the
subscriber-side as discussed in the nearby thread [1]. That requires
some changes in the test. Additionally, I have updated some comments
and docs. Let me know what you think about the changes.

Hi Amit.

PSA a small collection of feedback patches you can apply on top of the
patch v65-0001 if you decide they are OK.

(These are all I have found after a first pass over all the recent changes).

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v65-0001-Feedback-parse_subscription_options-parens-not-n.patch (application/octet-stream)
From aedc71df35b23f31952f914796a5f82fb0f0be53 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 22 Mar 2021 14:29:37 +1100
Subject: [PATCH v65] Feedback - parse_subscription_options - parens not needed

---
 src/backend/commands/subscriptioncmds.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f1856d7..f2730ec 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -230,13 +230,10 @@ parse_subscription_options(List *options,
 			 * from AlterSubscription.
 			 */
 			if (!twophase)
-			{
 				ereport(ERROR,
 						(errcode(ERRCODE_SYNTAX_ERROR),
 						 errmsg("cannot alter two_phase option")));
 
-			}
-
 			if (*twophase_given)
 				ereport(ERROR,
 						(errcode(ERRCODE_SYNTAX_ERROR),
-- 
1.8.3.1

v65-0002-Feedback-apply_handle_prepare-comment-typo.patch (application/octet-stream)
From 47c0d708af3cef6e490d7093aaa0a52db97d3465 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 22 Mar 2021 14:47:14 +1100
Subject: [PATCH v65] Feedback - apply_handle_prepare - comment typo

---
 src/backend/replication/logical/worker.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1768268..ef1c828 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -848,7 +848,7 @@ apply_handle_prepare(StringInfo s)
 	 *
 	 * XXX, We can optimize such that at commit prepared time, we first check
 	 * whether we have prepared the transaction or not but that doesn't seem
-	 * worth because such cases shouldn't be common.
+	 * worthwhile because such cases shouldn't be common.
 	 */
 	ensure_transaction();
 
-- 
1.8.3.1

v65-0003-Feedback-apply_handle_begin_prepare-ineffective-.patch (application/octet-stream)
From 009133fd2558a00a6ac8d59c2032f1c1b1cbac49 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 22 Mar 2021 16:15:35 +1100
Subject: [PATCH v65] Feedback - apply_handle_begin_prepare - ineffective
 PG_USED_FOR_ASSERTS_ONLY

---
 src/backend/replication/logical/worker.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ef1c828..62afa95 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -799,7 +799,7 @@ static void
 apply_handle_begin_prepare(StringInfo s)
 {
 	LogicalRepBeginPrepareData begin_data;
-	char	gid[GIDSIZE] PG_USED_FOR_ASSERTS_ONLY;
+	char	gid[GIDSIZE];
 
 	/* Tablesync should never receive prepare. */
 	Assert(!am_tablesync_worker());
-- 
1.8.3.1

v65-0004-Feedback-AllTablesyncsReady-function-simplified.patch (application/octet-stream)
From ff8ef7eefb443ad44f2e5e78c6e8b88d6174c4f9 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 22 Mar 2021 18:14:54 +1100
Subject: [PATCH v65] Feedback - AllTablesyncsReady function simplified

---
 src/backend/replication/logical/tablesync.c | 37 ++---------------------------
 1 file changed, 2 insertions(+), 35 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 233649e..307d858 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1185,46 +1185,13 @@ FetchTableStates(bool *started_tx)
 bool
 AllTablesyncsReady(void)
 {
-	bool		found_busy = false;
+	bool		found_busy;
 	bool		started_tx = false;
-	int			count = 0;
-	ListCell   *lc;
 
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
-	/*
-	 * Process all not-READY tables.
-	 */
-	foreach(lc, table_states_not_ready)
-	{
-		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
-
-		count++;
-		/*
-		 * When the process_syncing_tables_for_apply changes the state from
-		 * SYNCDONE to READY, that change is actually written directly into
-		 * the list element of table_states_not_ready.
-		 *
-		 * The table_states_valid flag is not immediately updated, so
-		 * FetchTableStates does not rebuild the "table_states_not_ready" list
-		 * because it is unaware that it needs to.
-		 *
-		 * It means the "table_states_not_ready" list might end up having
-		 * a READY state in it even though there was none when it was
-		 * initially created.
-		 *
-		 * This is why the code is testing for READY. And because a READY in the
-		 * table_states_not_ready list is the exception rather than the rule it
-		 * means we will nearly always break from this loop at the first
-		 * iteration.
-		 */
-		if (rstate->state != SUBREL_STATE_READY)
-		{
-			found_busy = true;
-			break;
-		}
-	}
+	found_busy = list_length(table_states_not_ready) > 0;
 
 	if (started_tx)
 	{
-- 
1.8.3.1
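
For readers comparing the two shapes of AllTablesyncsReady in v65-0004 above, here is a standalone sketch of both checks. It is not the PostgreSQL code: the struct and function names are invented, a plain array stands in for the table_states_not_ready list, and 'r' is used for the READY state code. The one case where the two shapes disagree is the one described in the removed comment, a READY entry still sitting in the not-ready list; whether that case can still arise is not something this sketch decides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for SubscriptionRelState (hypothetical). */
typedef struct
{
	unsigned int relid;
	char		state;			/* 'r' plays the role of SUBREL_STATE_READY */
} RelStateSketch;

/* Original shape: scan the list and stop at the first non-READY entry. */
bool
all_ready_by_scan(const RelStateSketch *not_ready, size_t n)
{
	assert(not_ready != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (not_ready[i].state != 'r')
			return false;
	}
	return true;
}

/* Simplified shape from the feedback patch: ready only if the list is empty. */
bool
all_ready_by_length(const RelStateSketch *not_ready, size_t n)
{
	(void) not_ready;			/* contents are not inspected */
	return n == 0;
}
```

With an empty list, or a list holding any non-READY entry, both functions agree. They differ only for a list whose entries are all READY, where the scan reports ready and the length check reports busy.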

#278Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#277)
2 attachment(s)

On Mon, Mar 22, 2021 at 6:27 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Sun, Mar 21, 2021 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have further updated the patch to implement unique GID on the
subscriber-side as discussed in the nearby thread [1]. That requires
some changes in the test. Additionally, I have updated some comments
and docs. Let me know what you think about the changes.

Hi Amit.

PSA a small collection of feedback patches you can apply on top of the
patch v65-0001 if you decide they are OK.

(These are all I have found after a first pass over all the recent changes).

I have spell-checked the content of v65-0001.

PSA a couple more feedback patches to apply on top of v65-0001 if you
decide they are ok.

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v65-0006-Feedback-worker.c-comment-wording.patch (application/octet-stream)
From 5811f61ffe8830895bcab2e5c9d3b78a9546fbf9 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 22 Mar 2021 19:44:53 +1100
Subject: [PATCH v65] Feedback - worker.c comment - wording

---
 src/backend/replication/logical/worker.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 62afa95..2665a8d 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -53,7 +53,7 @@
  * TWO_PHASE TRANSACTIONS
  * ----------------------
  * Two phase transactions are replayed at prepare and then committed or
- * rollbacked at commit prepared and rollback prepared respectively. It is
+ * rolled back at commit prepared and rollback prepared respectively. It is
  * possible to have a prepared transaction that arrives at the apply worker
  * when the tablesync is busy doing the initial copy. In this case, the apply
  * worker skips all the prepared operations [e.g. inserts] while the tablesync
-- 
1.8.3.1

v65-0005-Feedback-create_subscription-docs-typo.patch (application/octet-stream)
From 63de05644bd954af7691c9a386e1debdc50ee654 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 22 Mar 2021 19:42:12 +1100
Subject: [PATCH v65] Feedback - create_subscription docs - typo

---
 doc/src/sgml/ref/create_subscription.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index a5c9158..9a83fec 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -261,7 +261,7 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
 
          <para>
           The two-phase commit implementation requires that the replication has
-          successfully passed the intial table synchronization phase. This means even when
+          successfully passed the initial table synchronization phase. This means even when
           two_phase is enabled for the subscription, the internal two-phase state remains
           temporarily "pending" until the initialization phase is completed. See column
           <literal>subtwophase</literal> of <xref linkend="catalog-pg-subscription"/>
-- 
1.8.3.1

#279tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
In reply to: Amit Kapila (#275)
RE: [HACKERS] logical decoding of two-phase transactions

On Sunday, March 21, 2021 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have further updated the patch to implement unique GID on the
subscriber-side as discussed in the nearby thread [1].

I ran some tests (cross-version and synchronous) on the latest patch set v65*, and all of them passed. The details are below for reference.

 Case | publisher version | subscriber version | two_phase option | synchronous | expected result | result
------+-------------------+--------------------+------------------+-------------+-----------------+--------
    1 | 13                | 14 (patched)       | on               | no          | same as case 3  | ok
    2 | 13                | 14 (patched)       | off              | no          | same as case 3  | ok
    3 | 13                | 14 (unpatched)     | not supported    | no          | -               | -
    4 | 14 (patched)      | 13                 | not supported    | no          | same as case 5  | ok
    5 | 14 (unpatched)    | 13                 | not supported    | no          | -               | -
    6 | 13                | 14 (patched)       | on               | yes         | same as case 8  | ok
    7 | 13                | 14 (patched)       | off              | yes         | same as case 8  | ok
    8 | 13                | 14 (unpatched)     | not supported    | yes         | -               | -
    9 | 14 (patched)      | 13                 | not supported    | yes         | same as case 10 | ok
   10 | 14 (unpatched)    | 13                 | not supported    | yes         | -               | -

Remarks:
(1) Cases 3, 5, 8 and 10 were run only as reference baselines.
(2) SQL executed in each case:
    scenario 1: begin … commit
    scenario 2: begin … prepare … commit

Regards,
Tang

#280Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#278)
2 attachment(s)

On Mon, Mar 22, 2021 at 2:41 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, Mar 22, 2021 at 6:27 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Amit.

PSA a small collection of feedback patches you can apply on top of the
patch v65-0001 if you decide they are OK.

(These are all I have found after a first pass over all the recent changes).

I have spell-checked the content of v65-0001.

PSA a couple more feedback patches to apply on top of v65-0001 if you
decide they are ok.

I have incorporated all your changes and additionally made a few more
changes: (a) got rid of LogicalRepBeginPrepareData and instead used
LogicalRepPreparedTxnData, (b) made a number of changes in comments
and docs, (c) ran pgindent, (d) modified the tests to use the standard
wait_for_catchup function and removed a few tests to reduce the run time
and to keep the regression tests reliable.
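
For anyone reading the attached v66-0001 for the first time, the subscription's two-phase state as the patch's documentation describes it (codes 'd', 'p' and 'e' in subtwophasestate) can be sketched as a tiny state machine. This is a hedged illustration only: the enum and function names are made up, and the rule that an enabled subscription never drops back to pending is an assumption of the sketch rather than something taken from the patch text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the documented subtwophasestate codes. */
typedef enum
{
	TWOPHASE_SKETCH_DISABLED = 'd',	/* two_phase was not requested */
	TWOPHASE_SKETCH_PENDING = 'p',	/* requested, waiting for initial sync */
	TWOPHASE_SKETCH_ENABLED = 'e'	/* requested and enabled */
} TwoPhaseStateSketch;

/* State recorded when the subscription is created. */
TwoPhaseStateSketch
initial_twophase_state(bool two_phase_requested)
{
	return two_phase_requested ?
		TWOPHASE_SKETCH_PENDING : TWOPHASE_SKETCH_DISABLED;
}

/* State after checking whether the initial table synchronization is over. */
TwoPhaseStateSketch
next_twophase_state(TwoPhaseStateSketch cur, bool all_tablesyncs_ready)
{
	assert(cur == TWOPHASE_SKETCH_DISABLED ||
		   cur == TWOPHASE_SKETCH_PENDING ||
		   cur == TWOPHASE_SKETCH_ENABLED);

	if (cur == TWOPHASE_SKETCH_PENDING && all_tablesyncs_ready)
		return TWOPHASE_SKETCH_ENABLED;

	/* Assumption: disabled stays disabled and enabled stays enabled. */
	return cur;
}
```

A subscription created with two_phase = on would start at 'p' in this model and move to 'e' only once every table has finished its initial sync; one created without the option stays at 'd'.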

--
With Regards,
Amit Kapila.

Attachments:

v66-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From c8d0d039662f03c605edc10dda1036f486b98b84 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 20 Mar 2021 11:49:15 +1100
Subject: [PATCH v66 1/2] Add support for prepared transactions to built-in
 logical replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION SET PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         |  16 +-
 doc/src/sgml/ref/alter_subscription.sgml           |   4 +-
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 +++++
 src/backend/catalog/pg_subscription.c              |   1 +
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 121 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |   2 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 212 +++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |  13 +-
 src/backend/replication/logical/snapbuild.c        |  33 ++-
 src/backend/replication/logical/tablesync.c        | 171 ++++++++---
 src/backend/replication/logical/worker.c           | 328 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 207 ++++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  66 ++++-
 src/include/replication/reorderbuffer.h            |   2 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         |  93 ++++--
 src/test/regress/sql/subscription.sql              |  25 ++
 src/tools/pgindent/typedefs.list                   |   2 +
 38 files changed, 1401 insertions(+), 165 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 68d1960..00d43d9 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7589,6 +7589,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..c285ef7 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decode of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be  decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..85cc8bb 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -64,7 +64,9 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
   <para>
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
-   option as true cannot be executed inside a transaction block.
+   option as true cannot be executed inside a transaction block. They also
+   cannot be executed with <literal>copy_data = true</literal> if the
+   subscription is using <literal>two_phase</literal> commit.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on publisher is decoded as normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 89335b6..d75f052 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is	around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..658d9f8 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..d453902 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..bac1ded 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +312,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +579,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. It will be enabled once all the tables are
+				 * synced and ready. This avoids race conditions such as prepared
+				 * transactions being skipped because their changes are not
+				 * applied, owing to the checks in should_apply_changes_for_rel(),
+				 * while tablesync for the corresponding tables is in progress.
+				 * See comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +895,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -869,6 +930,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +959,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +1005,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -953,6 +1022,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details on why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -982,7 +1062,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f59613..6a90b56 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -730,7 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..7e68299 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when it is given at the
+	 * start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when it is given at the
+	 * start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and other
+	 * transactions have committed between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..daeef83 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,218 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->preparetime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c291b05..d312702 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2672,7 +2672,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2698,12 +2698,13 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed3acad..b80ed41 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because two-phase was not
+	 * previously enabled, or that are not covered by the initial snapshot,
+	 * need to be sent later along with commit prepared, and they must be
+	 * before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6ed3181..b4f454f 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static void FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,36 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
+	{
+		if (AllTablesyncsReady())
+		{
+			ereport(LOG,
+					(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+							MySubscription->name)));
+
+			proc_exit(0);
+		}
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1053,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1139,111 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ */
+static void
+FetchTableStates(bool *started_tx)
+{
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		StartTransactionCommand();
+		*started_tx = true;
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		table_states_valid = true;
+	}
+}
+
+/*
+ * Are all tablesyncs READY?
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/* When no tablesyncs are busy, then all are READY */
+	return !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state. */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
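A note on how AllTablesyncsReady() and UpdateTwoPhaseState() are meant to combine at apply-worker start-up, since the tri-state is easy to misread. What follows is a standalone sketch, not code from the patch. The Demo* names are hypothetical, and the 'p' and 'e' values are assumptions (only 'd' is visible in this part of the patch). It models the decision made in ApplyWorkerMain as posted: stream with two_phase and record ENABLED only when the state is PENDING and every tablesync is READY.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for LOGICALREP_TWOPHASE_STATE_*; 'p'/'e' are assumed. */
typedef enum
{
	DEMO_TWOPHASE_DISABLED = 'd',
	DEMO_TWOPHASE_PENDING = 'p',
	DEMO_TWOPHASE_ENABLED = 'e'
} DemoTwoPhaseState;

/* Outcome of the start-up decision (illustrative only). */
typedef struct
{
	bool		stream_two_phase;	/* pass two_phase when starting streaming? */
	DemoTwoPhaseState next_state;	/* state to record in the catalog */
} DemoStartupDecision;

/*
 * Mirror the start-up logic of the posted hunk: the two_phase option is only
 * requested on the PENDING -> ENABLED transition, and that transition only
 * happens once all tablesyncs are READY. Every other combination keeps the
 * current state and starts streaming without the option.
 */
static DemoStartupDecision
demo_decide_two_phase(DemoTwoPhaseState current, bool all_tables_ready)
{
	DemoStartupDecision d = {false, current};

	if (current == DEMO_TWOPHASE_PENDING && all_tables_ready)
	{
		d.stream_two_phase = true;
		d.next_state = DEMO_TWOPHASE_ENABLED;
	}

	return d;
}
```

With PENDING and all tables READY the sketch yields stream_two_phase = true and next_state = ENABLED. A PENDING subscription with a busy tablesync stays PENDING and streams as if two_phase = off, which matches the wording of the worker.c header comment.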
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 21d304a..bee4c2f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,74 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because, say, it
+ * was prior to the initial consistent point, but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will already be ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that initially it was on and
+ * we have received some prepare, then we turn it off; now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we can allow changing this option from off to on, but
+ * it is not clear that that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted for
+ * two_phase = on, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and the prepared transaction has operations on tables subscribed to
+ * those subscriptions. For such cases, if we use the GID sent by the
+ * publisher, one of the prepares will be successful and others will fail, in
+ * which case the server will send them again. Now, this can lead to a deadlock
+ * if the user has set synchronous_standby_names for all the subscriptions on
+ * the subscriber. To avoid such deadlocks, we generate a unique GID
+ * (consisting of the origin_id of the subscription and the xid of the prepared
+ * transaction) for each prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +127,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +315,10 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(RepOriginId originid, TransactionId xid,
+								   char *gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -720,6 +793,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(replorigin_session_origin, begin_data.xid, gid,
+						   sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.preparetime));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(replorigin_session_origin, prepare_data.xid, gid,
+						   sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(replorigin_session_origin, prepare_data.xid, gid,
+						   sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(replorigin_session_origin, rollback_data.xid, gid,
+						   sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was
+	 * still not enabled by that time, so in such cases, we need to skip
+	 * rollback prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1954,6 +2201,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2430,6 +2699,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2913,6 +3185,22 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(RepOriginId originid, TransactionId xid,
+					   char *gid, int szgid)
+{
+	/* Origin and Transaction ids must be valid */
+	Assert(originid != InvalidRepOriginId);
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_%u_%u", originid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3085,9 +3373,45 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 */
+		bool		all_tables_ready = AllTablesyncsReady();
+
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			all_tables_ready)
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
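To make the GID scheme from the worker.c header comment concrete, here is a small standalone sketch of what TwoPhaseTransactionGid produces. It is not the patch code: the Demo* typedefs and DEMO_GIDSIZE are hypothetical stand-ins for RepOriginId, TransactionId, and GIDSIZE, and their widths and the buffer size are assumptions. Only the "pg_%u_%u" format is taken from the hunk above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEMO_GIDSIZE 200			/* assumed buffer size, stand-in for GIDSIZE */

typedef uint16_t DemoRepOriginId;	/* assumed width of RepOriginId */
typedef uint32_t DemoTransactionId; /* assumed width of TransactionId */

/*
 * Build the subscriber-side GID from the subscription's origin id and the
 * publisher's xid, using the same "pg_%u_%u" format as the patch. Two
 * subscriptions replaying the same remote xid have different origin ids and
 * therefore get different GIDs.
 */
static void
demo_two_phase_gid(DemoRepOriginId originid, DemoTransactionId xid,
				   char *gid, size_t szgid)
{
	snprintf(gid, szgid, "pg_%u_%u", (unsigned) originid, (unsigned) xid);
}
```

For example, origin ids 1 and 2 applying remote xid 742 give "pg_1_742" and "pg_2_742", so neither PREPARE collides with the other on the subscriber. That collision is the one the header comment identifies as the source of the synchronous_standby_names deadlock.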
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..565f92b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,18 +171,22 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -232,9 +254,30 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (*enable_twophase && *enable_streaming)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("%s and %s are mutually exclusive options",
+						"two_phase", "streaming")));
 }
 
 /*
@@ -245,6 +288,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -269,7 +313,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -310,6 +355,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point of time.
+		 * It remains enabled if the previous start-up has done so. But we
+		 * only allow the option to be passed in with a sufficient version of
+		 * the protocol, and when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -322,8 +388,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +408,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +429,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +889,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1296,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
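For reviewers following the option handling in pgoutput.c, the order of the two_phase checks can be summarised with a standalone sketch. This is a model under stated assumptions, not the patch code. The Demo* names are hypothetical, the minimum protocol version is passed in as a parameter instead of assuming a value for LOGICALREP_PROTO_TWOPHASE_VERSION_NUM, and the real code raises errors via ereport where the sketch returns a status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result codes; the patch reports these cases via ereport(ERROR). */
typedef enum
{
	DEMO_CHECK_OK,
	DEMO_ERR_MUTUALLY_EXCLUSIVE,	/* two_phase and streaming both requested */
	DEMO_ERR_PROTO_TOO_OLD,		/* proto_version below two-phase minimum */
	DEMO_ERR_NOT_SUPPORTED		/* decoding context lacks two-phase support */
} DemoCheckResult;

/*
 * Apply the checks in the order the hunks above do: first the combination
 * check from parse_output_parameters, then the start-up checks from
 * pgoutput_startup. On success, *opt_given receives the value the patch
 * stores in ctx->twophase_opt_given.
 */
static DemoCheckResult
demo_check_two_phase(bool enable_twophase, bool enable_streaming,
					 uint32_t proto_version, uint32_t min_twophase_version,
					 bool ctx_twophase, bool *opt_given)
{
	if (enable_twophase && enable_streaming)
		return DEMO_ERR_MUTUALLY_EXCLUSIVE;

	if (!enable_twophase)
	{
		*opt_given = false;
		return DEMO_CHECK_OK;
	}

	if (proto_version < min_twophase_version)
		return DEMO_ERR_PROTO_TOO_OLD;

	if (!ctx_twophase)
		return DEMO_ERR_NOT_SUPPORTED;

	*opt_given = true;
	return DEMO_CHECK_OK;
}
```

One consequence of this ordering is that a client that never passes two_phase is always accepted, whatever its protocol version. The version and plugin-support checks only apply once the option is actually requested.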
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c..91224e0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8532296..8e7edae 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -364,7 +364,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index da6cc05..24436ed 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4278,6 +4279,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4321,9 +4323,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4344,6 +4353,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4369,6 +4379,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4396,6 +4408,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4437,6 +4450,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5340843..70c072d 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index eeac0ef..fd0d90e 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6098,7 +6098,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6124,13 +6124,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index b67f4ea..7957e2c 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2764,7 +2764,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c253403..5c1ce7e 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..7426993 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,16 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. This has the same protocol
+ * version requirement as LOGICALREP_PROTO_STREAM_VERSION_NUM because these
+ * features were both introduced in the same release (PG14).
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -54,10 +61,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +126,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +134,39 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for begin_prepare, prepare, and commit prepared
+ * transaction. prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +174,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6c9f2c6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -643,7 +643,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..db68551 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..daf6ad4 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..a9664e8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d1d5d2..abb418a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1346,9 +1346,11 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
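A note on the rollback-prepared check in the patch above: the comment on LogicalRepRollbackPreparedTxnData says the gid alone is not sufficient, because the subscriber can already hold an unrelated prepared transaction with the same identifier. The following is only a rough sketch of that matching rule, not the patch's LookupGXact() implementation. The struct, the function name, the array-based lookup, and the plain integer types standing in for XLogRecPtr and TimestampTz are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* stand-in for GIDSIZE; value assumed */

/* Hypothetical, simplified view of one prepared transaction on the subscriber. */
typedef struct SketchPreparedXact
{
	char		gid[SKETCH_GIDSIZE];
	uint64_t	prepare_end_lsn;	/* stand-in for XLogRecPtr */
	int64_t		prepare_time;		/* stand-in for TimestampTz */
} SketchPreparedXact;

/*
 * Return true only if some entry matches on gid AND prepare end LSN AND
 * prepare time. A gid-only match is deliberately not enough, since a local
 * prepared transaction may reuse the same identifier.
 */
static bool
sketch_lookup_gxact(const SketchPreparedXact *xacts, int nxacts,
					const char *gid, uint64_t prepare_end_lsn,
					int64_t prepare_time)
{
	assert(gid != NULL);

	for (int i = 0; i < nxacts; i++)
	{
		if (strcmp(xacts[i].gid, gid) == 0 &&
			xacts[i].prepare_end_lsn == prepare_end_lsn &&
			xacts[i].prepare_time == prepare_time)
			return true;
	}
	return false;
}
```

With this rule, a ROLLBACK PREPARED message whose gid matches but whose prepare LSN or prepare time differs is treated as "not received here" and can be skipped, which is the behaviour the struct comment describes.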

Attachment: v66-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From c65d7b06312dbfe7cfea4997e49541d202904099 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sun, 21 Mar 2021 12:49:34 +0530
Subject: [PATCH v66 2/2] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 293 ++++++++++++++++++++++++
 src/test/subscription/t/021_twophase_cascade.pl | 236 +++++++++++++++++++
 2 files changed, 529 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_cascade.pl
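For readers of the tests in this patch: they wait for subtwophasestate to become 'e' before preparing anything, and the regression output in the first patch shows 'p' right after CREATE SUBSCRIPTION ... WITH (connect = false, two_phase = true). Below is a small model of the tri-state as it can be read from those two places plus the pg_dump hunk. It is a sketch only. The function names are made up, and the rule that pending becomes enabled once all tablesyncs are ready is an assumption inferred from the AllTablesyncsReady() and UpdateTwoPhaseState() declarations, not something shown in this part of the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Same characters as the LOGICALREP_TWOPHASE_STATE_* defines in the patch. */
#define SKETCH_TWOPHASE_DISABLED 'd'
#define SKETCH_TWOPHASE_PENDING  'p'
#define SKETCH_TWOPHASE_ENABLED  'e'

/* State stored at CREATE SUBSCRIPTION time: two_phase = true starts as pending. */
static char
sketch_initial_state(bool two_phase_requested)
{
	return two_phase_requested ? SKETCH_TWOPHASE_PENDING
		: SKETCH_TWOPHASE_DISABLED;
}

/*
 * Assumed transition: a pending subscription becomes enabled once every
 * table sync has finished. Disabled and enabled states stay as they are.
 */
static char
sketch_next_state(char state, bool all_tablesyncs_ready)
{
	assert(state == SKETCH_TWOPHASE_DISABLED ||
		   state == SKETCH_TWOPHASE_PENDING ||
		   state == SKETCH_TWOPHASE_ENABLED);

	if (state == SKETCH_TWOPHASE_PENDING && all_tablesyncs_ready)
		return SKETCH_TWOPHASE_ENABLED;
	return state;
}

/* Mirrors the pg_dump hunk: anything other than disabled dumps as two_phase = on. */
static bool
sketch_dump_two_phase_on(char state)
{
	return state != SKETCH_TWOPHASE_DISABLED;
}
```

Under this model both 'p' and 'e' are dumped as two_phase = on, so a restored subscription would start again from pending.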

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..364e6eb
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,293 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_cascade.pl b/src/test/subscription/t/021_twophase_cascade.pl
new file mode 100644
index 0000000..76b224a
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_cascade.pl
@@ -0,0 +1,236 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#281Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#280)
2 attachment(s)

On Mon, Mar 22, 2021 at 11:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have incorporated all your changes and additionally made a few more
changes: (a) got rid of LogicalRepBeginPrepareData and instead used
LogicalRepPreparedTxnData, (b) made a number of changes in comments
and docs, (c) ran pgindent, (d) modified tests to use the standard
wait_for_catchup function and removed a few tests to reduce the time and
to keep regression tests reliable.

I checked all v65* / v66* differences and found only two trivial comment typos.

PSA patches to fix those.

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v66-0001-Feedback-apply_handle_rollback_prepared-typo-in-.patch (application/octet-stream)
From fe5d4f3f81afa15ccef62f7c23677e16578fcd52 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 23 Mar 2021 10:04:28 +1100
Subject: [PATCH v66] Feedback - apply_handle_rollback_prepared - typo in
 comment

---
 src/backend/replication/logical/worker.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index bee4c2f..43304e0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -936,8 +936,8 @@ apply_handle_rollback_prepared(StringInfo s)
 	/*
 	 * It is possible that we haven't received prepare because it occurred
 	 * before walsender reached a consistent point or the two_phase was still
-	 * still not enabled by that time, so in such cases, we need to skip
-	 * rollback prepared.
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
 	 */
 	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
 					rollback_data.preparetime))
-- 
1.8.3.1

v66-0002-Feedback-AlterSubscription-typo-in-comment.patch (application/octet-stream)
From fa42e49757b56e8a8d29afe6ed7fe1c0a7688e39 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 23 Mar 2021 10:08:57 +1100
Subject: [PATCH v66] Feedback - AlterSubscription - typo in comment

---
 src/backend/commands/subscriptioncmds.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bac1ded..38863bd 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -1024,7 +1024,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 					/*
 					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
-					 * not allow.
+					 * not allowed.
 					 */
 					if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
 						ereport(ERROR,
-- 
1.8.3.1

#282Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#281)
1 attachment(s)

On Tue, Mar 23, 2021 at 10:44 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, Mar 22, 2021 at 11:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have incorporated all your changes and additionally made a few more
changes: (a) got rid of LogicalRepBeginPrepareData and instead used
LogicalRepPreparedTxnData, (b) made a number of changes in comments
and docs, (c) ran pgindent, (d) modified tests to use the standard
wait_for_catchup function and removed a few tests to reduce the time and
to keep regression tests reliable.

I checked all v65* / v66* differences and found only two trivial comment typos.

PSA patches to fix those.

Hi Amit.

PSA a patch to allow the ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
work when two-phase tristate is PENDING.

This is necessary for the pg_dump/pg_restore scenario, or for any
other use-case where the subscription might start off having no tables.
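
The rule described above can be sketched as a small standalone C function. This is only an illustration, not the patch code: the enum and the function name should_enable_two_phase() are hypothetical stand-ins for the LOGICALREP_TWOPHASE_STATE_* values and for the checks done in the apply worker, and the table count is passed in directly instead of being read from the catalog.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum SketchTwoPhaseState
{
	SKETCH_TWOPHASE_DISABLED,
	SKETCH_TWOPHASE_PENDING,
	SKETCH_TWOPHASE_ENABLED
} SketchTwoPhaseState;

/*
 * Decide whether a subscription should move from PENDING to ENABLED.
 *
 * Assumption for this sketch: the caller already knows whether all
 * tablesyncs are READY and how many tables the subscription has.
 */
bool
should_enable_two_phase(SketchTwoPhaseState state,
						bool all_tablesyncs_ready,
						int n_tables)
{
	assert(n_tables >= 0);

	/* Only a PENDING subscription can become ENABLED. */
	if (state != SKETCH_TWOPHASE_PENDING)
		return false;

	/* Wait until every tablesync has reached READY. */
	if (!all_tablesyncs_ready)
		return false;

	/*
	 * With no tables at all, stay PENDING so that a later
	 * ALTER SUBSCRIPTION ... REFRESH PUBLICATION can still add tables.
	 */
	return n_tables > 0;
}
```

For a freshly restored subscription (PENDING, nothing to sync, zero tables) the function returns false, so the state is left as PENDING.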

Please apply this on top of your v66-0001 (after applying the other
Feedback patches I posted earlier today).

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v66-0003-Fix-to-allow-REFRESH-PUBLICATION-for-two_phase-P.patch (application/octet-stream)
From a6fd142f9d7b8331c93e1cfdc2ec3a1de4c1ad36 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 23 Mar 2021 20:27:19 +1100
Subject: [PATCH v66] Fix to allow REFRESH PUBLICATION for two_phase PENDING.

* A pg_dump of a subscription with two_phase restores the subscription
  with tri-state PENDING. REFRESH PUBLICATION still needs to be allowed
  in that state because the subscription is restored with no tables.

* Also, if there are 0 tables in the subscription do not transition from
  PENDING to ENABLED because that will prevent using ALTER SUBSCRIPTION
  REFRESH PUBLICATION with no way to add tables.

* Includes minor updates to PG docs.
---
 doc/src/sgml/ref/alter_subscription.sgml    | 12 ++++++----
 src/backend/commands/subscriptioncmds.c     |  4 ++--
 src/backend/replication/logical/tablesync.c | 37 ++++++++++++++++++++++++-----
 src/backend/replication/logical/worker.c    | 24 ++++++++++++++++---
 4 files changed, 61 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 85cc8bb..d5e891b 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -62,11 +62,13 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
   </para>
 
   <para>
-   Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
-   <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
-   option as true cannot be executed inside a transaction block. They also
-   cannot be executed with <literal>copy_data = true</literal> if the
-   subscription is using <literal>two_phase</literal> commit.
+   Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command>, and
+   <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with
+   <literal>refresh = true</literal> cannot be executed inside a transaction
+   block. They also cannot be executed with <literal>copy_data = true</literal>
+   when the subscription has <literal>two_phase</literal> commit enabled. See
+   column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 38863bd..89abc81 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -1026,7 +1026,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
 					 * not allowed.
 					 */
-					if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
 						ereport(ERROR,
 								(errcode(ERRCODE_SYNTAX_ERROR),
 								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
@@ -1084,7 +1084,7 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 				 *
 				 * For more details see comments atop worker.c.
 				 */
-				if (sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED && copy_data)
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
 					ereport(ERROR,
 							(errcode(ERRCODE_SYNTAX_ERROR),
 							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index b4f454f..679b73f 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -409,11 +409,30 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	{
 		if (AllTablesyncsReady())
 		{
-			ereport(LOG,
-					(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
-							MySubscription->name)));
+			List *subrels = NIL;
+			bool become_two_phase_enabled = false;
 
-			proc_exit(0);
+			if (!started_tx)
+			{
+				StartTransactionCommand();
+				started_tx = true;
+			}
+			subrels = GetSubscriptionRelations(MySubscription->oid);
+
+			/*
+			 * If there are no tables then leave the state as PENDING, which
+			 * allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+			 */
+			become_two_phase_enabled = list_length(subrels) > 0;
+
+			if (become_two_phase_enabled)
+			{
+				ereport(LOG,
+						(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+								MySubscription->name)));
+
+				proc_exit(0);
+			}
 		}
 	}
 
@@ -1159,8 +1178,11 @@ FetchTableStates(bool *started_tx)
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
 
-		StartTransactionCommand();
-		*started_tx = true;
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
 
 		/* Fetch all non-ready tables. */
 		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
@@ -1181,6 +1203,9 @@ FetchTableStates(bool *started_tx)
 
 /*
  * Are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
  */
 bool
 AllTablesyncsReady(void)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 43304e0..c27db32 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3377,15 +3377,33 @@ ApplyWorkerMain(Datum main_arg)
 
 	if (!am_tablesync_worker())
 	{
+		bool	become_two_phase_enabled = false;
+
 		/*
 		 * Even when the two_phase mode is requested by the user, it remains
 		 * as the tri-state PENDING until all tablesyncs have reached READY
 		 * state. Only then, can it become properly ENABLED.
 		 */
-		bool		all_tables_ready = AllTablesyncsReady();
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
+		{
+			if (AllTablesyncsReady())
+			{
+				List *subrels = NIL;
+
+				StartTransactionCommand();
+				subrels = GetSubscriptionRelations(MySubscription->oid);
+				CommitTransactionCommand();
+
+				/*
+				 * If there are no tables then leave the state as PENDING,
+				 * otherwise ALTER SUBSCRIPTION ... REFRESH PUBLICATION is
+				 * not allowed.
+				 */
+				become_two_phase_enabled = list_length(subrels) > 0;
+			}
+		}
 
-		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
-			all_tables_ready)
+		if (become_two_phase_enabled)
 		{
 			/* Start streaming with two_phase enabled */
 			options.proto.logical.twophase = true;
-- 
1.8.3.1

#283Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#282)
1 attachment(s)

On Tue, Mar 23, 2021 at 9:01 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please apply this on top of your v66-0001 (after applying the other
Feedback patches I posted earlier today).

Applied all the above patches and ran a 5-node cascade test setup; all the
instances synced correctly. Test log attached.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

5 cascade setup (application/octet-stream)
#284Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#283)
1 attachment(s)

On Tue, Mar 23, 2021 at 9:49 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Mar 23, 2021 at 9:01 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please apply this on top of your v66-0001 (after applying the other
Feedback patches I posted earlier today).

Applied all the above patches and ran a 5-node cascade test setup; all the instances synced correctly. Test log attached.

FYI - Using the same v66* patch set (including yesterday's additional
patches) I have run the subscription TAP tests 020 and 021 in a loop
150 times.

All passed ok. PSA the results file as evidence.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

results_020_021_x150.txt (text/plain; charset=US-ASCII)
#285Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#282)
1 attachment(s)

On Tue, Mar 23, 2021 at 9:01 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Mar 23, 2021 at 10:44 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, Mar 22, 2021 at 11:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have incorporated all your changes and additionally made a few more
changes: (a) got rid of LogicalRepBeginPrepareData and instead used
LogicalRepPreparedTxnData, (b) made a number of changes in comments
and docs, (c) ran pgindent, (d) modified tests to use the standard
wait_for_catchup function and removed a few tests to reduce the time and
to keep regression tests reliable.

I checked all v65* / v66* differences and found only two trivial comment typos.

PSA patches to fix those.

Hi Amit.

PSA a patch to allow the ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
work when two-phase tristate is PENDING.

This is necessary for the pg_dump/pg_restore scenario, or for any
other use-case where the subscription might
start off having no tables.

Please apply this on top of your v66-0001 (after applying the other
Feedback patches I posted earlier today).

PSA a small addition to the v66-0003 "Fix to allow REFRESH PUBLICATION"
patch posted yesterday.

This just updates the worker.c comment.
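
As a rough illustration of the restriction that the updated comment describes, here is a standalone C sketch. It is not the code in subscriptioncmds.c: the enum and check_refresh_publication() are hypothetical names, and returning a message string stands in for the ereport(ERROR, ...) call in the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum SketchTwoPhaseState
{
	SKETCH_TWOPHASE_DISABLED,
	SKETCH_TWOPHASE_PENDING,
	SKETCH_TWOPHASE_ENABLED
} SketchTwoPhaseState;

/*
 * Returns NULL when ALTER SUBSCRIPTION ... REFRESH PUBLICATION may proceed,
 * otherwise the text of the error that would be raised.
 */
const char *
check_refresh_publication(SketchTwoPhaseState state, bool copy_data)
{
	assert(state >= SKETCH_TWOPHASE_DISABLED &&
		   state <= SKETCH_TWOPHASE_ENABLED);

	/*
	 * Only the ENABLED tri-state blocks a refresh that copies data.
	 * DISABLED and PENDING are both allowed, and so is ENABLED with
	 * copy_data = false.
	 */
	if (state == SKETCH_TWOPHASE_ENABLED && copy_data)
		return "ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled";

	return NULL;
}
```

The difference from the earlier check is that PENDING no longer produces an error, which is what lets a restored subscription with no tables be refreshed.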

------
Kind Regards,
Peter Smith.
Fujitsu Australia.

Attachments:

v66-0004-Updated-worker.c-comment.patch (application/octet-stream)
From e59e2b34f337e01e65fc7bb967ee8d899bb81649 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 24 Mar 2021 18:30:06 +1100
Subject: [PATCH v66] Updated worker.c comment

---
 src/backend/replication/logical/worker.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c27db32..634fd92 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -91,6 +91,10 @@
  * prepares but we ensure that such prepares are sent along with commit
  * prepare, see ReorderBufferFinishPrepared.
  *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
  * If ever a user needs to be aware of the tri-state value, they can fetch it
  * from the pg_subscription catalog (see column subtwophasestate).
  *
@@ -104,8 +108,8 @@
  * Finally, to avoid problems mentioned in previous paragraphs from any
  * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
  * to 'off' and then again back to 'on') there is a restriction for
- * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted for
- * two_phase = on, except when copy_data = false.
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
  *
  * We can get prepare of the same GID more than once for the genuine cases
  * where we have defined multiple subscriptions for publications on the same
-- 
1.8.3.1

#286houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#280)
RE: [HACKERS] logical decoding of two-phase transactions

I have incorporated all your changes and additionally made few more changes
(a) got rid of LogicalRepBeginPrepareData and instead used
LogicalRepPreparedTxnData, (b) made a number of changes in comments and
docs, (c) ran pgindent, (d) modified tests to use standard wait_for_catch
function and removed few tests to reduce the time and to keep regression
tests reliable.

Hi,

When reading the code, I found some comments related to the patch here.

* XXX Now, this can even lead to a deadlock if the prepare
* transaction is waiting to get it logically replicated for
* distributed 2PC. Currently, we don't have an in-core
* implementation of prepares for distributed 2PC but some
* out-of-core logical replication solution can have such an
* implementation. They need to inform users to not have locks
* on catalog tables in such transactions.
*/

Since we will now have an in-core implementation of prepares, should we update this comment?

Best regards,
houzj

#287Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#282)

On Tue, Mar 23, 2021 at 3:31 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Mar 23, 2021 at 10:44 AM Peter Smith <smithpb2250@gmail.com> wrote:

PSA patches to fix those.

Hi Amit.

PSA a patch to allow the ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
work when two-phase tristate is PENDING.

This is necessary for the pg_dump/pg_restore scenario, or for any
other use-case where the subscription might
start off having no tables.

+ subrels = GetSubscriptionRelations(MySubscription->oid);
+
+ /*
+ * If there are no tables then leave the state as PENDING, which
+ * allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+ */
+ become_two_phase_enabled = list_length(subrels) > 0;

This code is similar in both places where it is used. Isn't it better to
move it inside AllTablesyncsReady and, if required, change the name of
the function?

--
With Regards,
Amit Kapila.

#288Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#287)
1 attachment(s)

On Wed, Mar 24, 2021 at 11:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 23, 2021 at 3:31 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Mar 23, 2021 at 10:44 AM Peter Smith <smithpb2250@gmail.com> wrote:

PSA patches to fix those.

Hi Amit.

PSA a patch to allow the ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
work when two-phase tristate is PENDING.

This is necessary for the pg_dump/pg_restore scenario, or for any
other use-case where the subscription might
start off having no tables.

+ subrels = GetSubscriptionRelations(MySubscription->oid);
+
+ /*
+ * If there are no tables then leave the state as PENDING, which
+ * allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+ */
+ become_two_phase_enabled = list_length(subrels) > 0;

This code is similar at both the places it is used. Isn't it better to
move this inside AllTablesyncsReady and if required then we can change
the name of the function.

I agree. That way is better.

PSA a patch which changes the AllTablesyncsReady function to now
include the zero tables check.

(This patch is to be applied on top of all previous patches)
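
In case it helps review, the resulting decision can be modelled as
below. This is a minimal sketch, not the tablesync.c code: plain counts
stand in for the catalog lookups, and both function names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the changed check: a subscription with zero
 * tables is never reported as "all ready".
 */
bool
all_tablesyncs_ready_model(int n_subrels, int n_not_ready)
{
	assert(n_subrels >= 0);
	assert(n_not_ready >= 0 && n_not_ready <= n_subrels);

	return n_subrels > 0 && n_not_ready == 0;
}

/*
 * Hypothetical model of the caller: the apply worker only restarts to
 * enable two_phase when the state is PENDING and the check passes.
 */
bool
should_restart_to_enable_two_phase(bool state_is_pending,
								   int n_subrels, int n_not_ready)
{
	return state_is_pending &&
		all_tablesyncs_ready_model(n_subrels, n_not_ready);
}
```

With this shape the callers no longer need their own "are there any
tables" test, which is what removes the duplicated code.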

------
Kind Regards,
Peter Smith.
Fujitsu Australia.

Attachments:

v66-0005-Change-AllTablesyncsReady-to-return-false-when-0.patch (application/octet-stream)
From 981c2a2d39307a42caf921e257d0a0058b104de8 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 25 Mar 2021 13:27:12 +1100
Subject: [PATCH v66] Change AllTablesyncsReady to return false when 0 tables.

---
 src/backend/replication/logical/tablesync.c | 65 ++++++++++++++---------------
 src/backend/replication/logical/worker.c    | 28 +++----------
 2 files changed, 38 insertions(+), 55 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 679b73f..a1c9949 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -119,7 +119,7 @@
 
 static bool table_states_valid = false;
 static List *table_states_not_ready = NIL;
-static void FetchTableStates(bool *started_tx);
+static int FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -404,36 +404,18 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * When this happens, we restart the apply worker and (if the conditions
 	 * are still ok) then the two_phase tri-state will become properly
 	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
 	 */
-	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
 	{
-		if (AllTablesyncsReady())
-		{
-			List *subrels = NIL;
-			bool become_two_phase_enabled = false;
-
-			if (!started_tx)
-			{
-				StartTransactionCommand();
-				started_tx = true;
-			}
-			subrels = GetSubscriptionRelations(MySubscription->oid);
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
 
-			/*
-			 * If there are no tables then leave the state as PENDING, which
-			 * allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
-			 */
-			become_two_phase_enabled = list_length(subrels) > 0;
-
-			if (become_two_phase_enabled)
-			{
-				ereport(LOG,
-						(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
-								MySubscription->name)));
-
-				proc_exit(0);
-			}
-		}
+		proc_exit(0);
 	}
 
 	/*
@@ -1161,15 +1143,20 @@ copy_table_done:
 
 /*
  * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns how many tables belong to the subscription.
  */
-static void
+static int
 FetchTableStates(bool *started_tx)
 {
+	static int n_subrels = 0;
+
 	*started_tx = false;
 
 	if (!table_states_valid)
 	{
 		MemoryContext oldctx;
+		List	   *subrels = NIL;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
@@ -1184,6 +1171,10 @@ FetchTableStates(bool *started_tx)
 			*started_tx = true;
 		}
 
+		/* count all subscription tables. */
+		subrels = GetSubscriptionRelations(MySubscription->oid);
+		n_subrels = list_length(subrels);
+
 		/* Fetch all non-ready tables. */
 		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
 
@@ -1199,10 +1190,14 @@ FetchTableStates(bool *started_tx)
 
 		table_states_valid = true;
 	}
+
+	return n_subrels;
 }
 
 /*
- * Are all tablesyncs READY?
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
  *
  * Note: This function is not suitable to be called from outside of apply or
  * tablesync workers because MySubscription needs to be already initialized.
@@ -1212,9 +1207,10 @@ AllTablesyncsReady(void)
 {
 	bool		found_busy = false;
 	bool		started_tx = false;
+	int			n_subrels = 0;
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	FetchTableStates(&started_tx);
+	n_subrels = FetchTableStates(&started_tx);
 
 	found_busy = list_length(table_states_not_ready) > 0;
 
@@ -1224,8 +1220,11 @@ AllTablesyncsReady(void)
 		pgstat_report_stat(false);
 	}
 
-	/* When no tablesyncs are busy, then all are READY */
-	return !found_busy;
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return n_subrels > 0 && !found_busy;
 }
 
 /*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 634fd92..2cbc70e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3381,33 +3381,17 @@ ApplyWorkerMain(Datum main_arg)
 
 	if (!am_tablesync_worker())
 	{
-		bool	become_two_phase_enabled = false;
-
 		/*
 		 * Even when the two_phase mode is requested by the user, it remains
 		 * as the tri-state PENDING until all tablesyncs have reached READY
 		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
 		 */
-		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
-		{
-			if (AllTablesyncsReady())
-			{
-				List *subrels = NIL;
-
-				StartTransactionCommand();
-				subrels = GetSubscriptionRelations(MySubscription->oid);
-				CommitTransactionCommand();
-
-				/*
-				 * If there are no tables then leave the state as PENDING,
-				 * otherwise ALTER SUBSCRIPTION ... REFRESH PUBLICATION is
-				 * not allowed.
-				 */
-				become_two_phase_enabled = list_length(subrels) > 0;
-			}
-		}
-
-		if (become_two_phase_enabled)
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
 		{
 			/* Start streaming with two_phase enabled */
 			options.proto.logical.twophase = true;
-- 
1.8.3.1

#289Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#288)
1 attachment(s)

On Thu, Mar 25, 2021 at 1:40 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Wed, Mar 24, 2021 at 11:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 23, 2021 at 3:31 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Mar 23, 2021 at 10:44 AM Peter Smith <smithpb2250@gmail.com> wrote:

PSA patches to fix those.

Hi Amit.

PSA a patch to allow the ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
work when two-phase tristate is PENDING.

This is necessary for the pg_dump/pg_restore scenario, or for any
other use-case where the subscription might
start off having no tables.

+ subrels = GetSubscriptionRelations(MySubscription->oid);
+
+ /*
+ * If there are no tables then leave the state as PENDING, which
+ * allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+ */
+ become_two_phase_enabled = list_length(subrels) > 0;

This code is similar at both the places it is used. Isn't it better to
move this inside AllTablesyncsReady and if required then we can change
the name of the function.

I agree. That way is better.

PSA a patch which changes the AllTablesyncsReady function to now
include the zero tables check.

(This patch is to be applied on top of all previous patches)

------

PSA a patch which modifies the FetchTableStates function to use a more
efficient way of testing if the subscription has any tables or not.

(This patch is to be applied on top of all previous v66* patches posted)
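
The idea is a simple short-circuit, sketched below. This is only a
model and not the patch itself: probe_has_relations() is a hypothetical
stand-in for a catalog probe such as HasSubscriptionRelations, and the
probe_calls counter exists only so the effect can be observed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counter: how many times the catalog probe was needed. */
int			probe_calls = 0;

/* Hypothetical stand-in for a "does any row exist" catalog probe. */
bool
probe_has_relations(int n_subrels)
{
	probe_calls++;
	return n_subrels > 0;
}

/*
 * If any not-READY relation was found, the subscription certainly has
 * tables, so the probe is skipped. Only an empty not-ready list leaves
 * the answer unknown and makes the probe necessary.
 */
bool
has_subrels_model(int n_not_ready, int n_subrels)
{
	assert(n_not_ready >= 0 && n_not_ready <= n_subrels);

	return n_not_ready > 0 || probe_has_relations(n_subrels);
}
```

The probe itself only has to find one matching row, so it is cheaper
than building the full list just to take its length.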

------
Kind Regards,
Peter Smith.
Fujitsu Australia.

Attachments:

v66-0006-FetchTableStates-performance-improvements.patch (application/octet-stream)
From 5276031713c3dfb98d31d159d8933b0499839126 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 25 Mar 2021 17:55:58 +1100
Subject: [PATCH v66] FetchTableStates performance improvements.

Instead of calling the GetSubscriptionRelations every time to find
if the subscription has any tables, now we call a new function
HasSubscriptionRelations.

Even that is called only when we cannot otherwise tell whether any tables exist.
---
 src/backend/catalog/pg_subscription.c       | 34 +++++++++++++++++++++++++++++
 src/backend/replication/logical/tablesync.c | 31 +++++++++++++++-----------
 src/include/catalog/pg_subscription_rel.h   |  1 +
 3 files changed, 53 insertions(+), 13 deletions(-)

diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 658d9f8..0f725fb 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -450,6 +450,40 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	int			nkeys = 0;
+	ScanKeyData skey[2];
+	SysScanDesc scan;
+	bool		has_subrels = false;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[nkeys++],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, nkeys, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index a1c9949..14c52e3 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -119,7 +119,7 @@
 
 static bool table_states_valid = false;
 static List *table_states_not_ready = NIL;
-static int FetchTableStates(bool *started_tx);
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -1144,19 +1144,18 @@ copy_table_done:
 /*
  * Common code to fetch the up-to-date sync state info into the static lists.
  *
- * Returns how many tables belong to the subscription.
+ * Returns true if subscription has 1 or more tables, else false.
  */
-static int
+static bool
 FetchTableStates(bool *started_tx)
 {
-	static int n_subrels = 0;
+	static bool has_subrels = false;
 
 	*started_tx = false;
 
 	if (!table_states_valid)
 	{
 		MemoryContext oldctx;
-		List	   *subrels = NIL;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
@@ -1171,10 +1170,6 @@ FetchTableStates(bool *started_tx)
 			*started_tx = true;
 		}
 
-		/* count all subscription tables. */
-		subrels = GetSubscriptionRelations(MySubscription->oid);
-		n_subrels = list_length(subrels);
-
 		/* Fetch all non-ready tables. */
 		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
 
@@ -1188,10 +1183,20 @@ FetchTableStates(bool *started_tx)
 		}
 		MemoryContextSwitchTo(oldctx);
 
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
 		table_states_valid = true;
 	}
 
-	return n_subrels;
+	return has_subrels;
 }
 
 /*
@@ -1207,10 +1212,10 @@ AllTablesyncsReady(void)
 {
 	bool		found_busy = false;
 	bool		started_tx = false;
-	int			n_subrels = 0;
+	bool		has_subrels = false;
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	n_subrels = FetchTableStates(&started_tx);
+	has_subrels = FetchTableStates(&started_tx);
 
 	found_busy = list_length(table_states_not_ready) > 0;
 
@@ -1224,7 +1229,7 @@ AllTablesyncsReady(void)
 	 * When there are no tables, then return false.
 	 * When no tablesyncs are busy, then all are READY
 	 */
-	return n_subrels > 0 && !found_busy;
+	return has_subrels && !found_busy;
 }
 
 /*
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
-- 
1.8.3.1

#290Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#289)
2 attachment(s)

On Thu, Mar 25, 2021 at 12:39 PM Peter Smith <smithpb2250@gmail.com> wrote:

PSA a patch which modifies the FetchTableStates function to use a more
efficient way of testing if the subscription has any tables or not.

(This patch is to be applied on top of all previous v66* patches posted)

I have incorporated all your incremental patches and fixed comments
raised by Hou-San in the attached patch.

--
With Regards,
Amit Kapila.

Attachments:

v67-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 2362c01c54280ecab44457424c72101ffe917cbb Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Sat, 20 Mar 2021 11:49:15 +1100
Subject: [PATCH v67 1/2] Add support for prepared transactions to built-in
 logical replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION SET PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         |  16 +-
 doc/src/sgml/ref/alter_subscription.sgml           |  10 +-
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 +++++
 src/backend/catalog/pg_subscription.c              |  35 +++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 121 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 212 +++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |  13 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 200 +++++++++---
 src/backend/replication/logical/worker.c           | 334 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 207 ++++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  66 +++-
 src/include/replication/reorderbuffer.h            |   2 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         |  93 ++++--
 src/test/regress/sql/subscription.sql              |  25 ++
 src/tools/pgindent/typedefs.list                   |   2 +
 40 files changed, 1481 insertions(+), 175 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index bae4d8c..6752cd7 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7589,6 +7589,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 80eb96d..bd0a353 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1240,9 +1240,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..c285ef7 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be  decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..d5e891b 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -62,9 +62,13 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
   </para>
 
   <para>
-   Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
-   <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
-   option as true cannot be executed inside a transaction block.
+   Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command>, and
+   <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with
+   <literal>refresh = true</literal> cannot be executed inside a transaction
+   block. They also cannot be executed with <literal>copy_data = true</literal>
+   when the subscription has <literal>two_phase</literal> commit enabled. See
+   column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 89335b6..d75f052 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..2b4b699 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -449,6 +450,40 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	int			nkeys = 0;
+	ScanKeyData skey[2];
+	SysScanDesc scan;
+	bool		has_subrels = false;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[nkeys++],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, nkeys, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65d..d453902 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..89abc81 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica. See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +312,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +579,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables are in progress. See
+				 * comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +895,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -869,6 +930,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +959,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +1005,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -953,6 +1022,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -982,7 +1062,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f59613..73038d4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -370,11 +370,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -730,7 +728,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..7e68299 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..daeef83 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,218 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->preparetime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 127f2c4..7e8c8d3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2672,7 +2672,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2698,12 +2698,13 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c5a8125..0daff13 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped, either because two-phase
+	 * was not previously enabled or because they are not covered by the
+	 * initial snapshot, need to be sent later along with commit prepared,
+	 * and they must be before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 8494db8..2937f41 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1054,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1140,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, return false.
+	 * When no tablesyncs are busy, all are READY.
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 354fbe4..833b3c8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that it was initially on and
+ * we have received some prepare, then we turn it off; now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we can allow changing this option from off to on but
+ * not sure if that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of origin_id of
+ * subscription and xid of prepared transaction) for each prepare transaction
+ * on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,10 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(RepOriginId originid, TransactionId xid,
+								   char *gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -720,6 +797,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(replorigin_session_origin, begin_data.xid, gid,
+						   sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.preparetime));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions on the same node pointing to publications
+	 * on the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(replorigin_session_origin, prepare_data.xid, gid,
+						   sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(replorigin_session_origin, prepare_data.xid, gid,
+						   sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(replorigin_session_origin, rollback_data.xid, gid,
+						   sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1954,6 +2205,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2430,6 +2703,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2916,6 +3192,22 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(RepOriginId originid, TransactionId xid,
+					   char *gid, int szgid)
+{
+	/* Origin and Transaction ids must be valid */
+	Assert(originid != InvalidRepOriginId);
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_%u_%u", originid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3088,9 +3380,47 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..565f92b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,18 +171,22 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -232,9 +254,30 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (*enable_twophase && *enable_streaming)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("%s and %s are mutually exclusive options",
+						"two_phase", "streaming")));
 }
 
 /*
@@ -245,6 +288,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -269,7 +313,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -310,6 +355,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -322,8 +388,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +408,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +429,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +889,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1296,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c..91224e0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8532296..8e7edae 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -364,7 +364,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index da6cc05..24436ed 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4278,6 +4279,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4321,9 +4323,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4344,6 +4353,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4369,6 +4379,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4396,6 +4408,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4437,6 +4450,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5340843..70c072d 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index eeac0ef..fd0d90e 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6098,7 +6098,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6124,13 +6124,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index a053bc1..ff7ddc3 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2763,7 +2763,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c253403..5c1ce7e 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding.
 	 */
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..7426993 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,16 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. This has the same protocol
+ * version requirement as LOGICALREP_PROTO_STREAM_VERSION_NUM because these
+ * features were both introduced in the same release (PG14).
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -54,10 +61,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +126,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +134,39 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for begin_prepare, prepare, and commit prepared
+ * transaction. prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +174,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6c9f2c6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -643,7 +643,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..db68551 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..daf6ad4 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..a9664e8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e6777e..1874ec7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1346,9 +1346,11 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

v67-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From 44d53827ef0829cda950fa6de3afbdb230eaaec1 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Sun, 21 Mar 2021 12:49:34 +0530
Subject: [PATCH v67 2/2] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 293 ++++++++++++++++++++++++
 src/test/subscription/t/021_twophase_cascade.pl | 236 +++++++++++++++++++
 2 files changed, 529 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..364e6eb
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,293 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_cascade.pl b/src/test/subscription/t/021_twophase_cascade.pl
new file mode 100644
index 0000000..76b224a
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_cascade.pl
@@ -0,0 +1,236 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#291Amit Kapila
amit.kapila16@gmail.com
In reply to: houzj.fnst@fujitsu.com (#286)

On Wed, Mar 24, 2021 at 3:59 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

I have incorporated all your changes and additionally made a few more changes:
(a) got rid of LogicalRepBeginPrepareData and instead used
LogicalRepPreparedTxnData, (b) made a number of changes in comments and
docs, (c) ran pgindent, (d) modified tests to use the standard wait_for_catchup
function and removed a few tests to reduce the run time and to keep the
regression tests reliable.

Hi,

When reading the code, I found some comments related to the patch here.

* XXX Now, this can even lead to a deadlock if the prepare
* transaction is waiting to get it logically replicated for
* distributed 2PC. Currently, we don't have an in-core
* implementation of prepares for distributed 2PC but some
* out-of-core logical replication solution can have such an
* implementation. They need to inform users to not have locks
* on catalog tables in such transactions.
*/

Since we will have an in-core implementation of prepares, should we update the comments here?

Fixed this in the latest patch posted by me. I have additionally
updated the docs to reflect the same.

--
With Regards,
Amit Kapila.

#292vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#275)
1 attachment(s)

On Sun, Mar 21, 2021 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Mar 20, 2021 at 10:09 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote:

Missed the patch - 0001, resending.

I have made miscellaneous changes in the patch which includes
improving comments, error messages, and miscellaneous coding
improvements. The most notable one is that we don't need an additional
parameter in walrcv_startstreaming, if the two_phase option is set
properly. My changes are in v63-0002-Misc-changes-by-Amit, if you are
fine with those, then please merge them in the next version. I have
omitted the dev-logs patch but feel free to submit it. I have one
question:

I am fine with these changes. I see that Peter has already merged in these changes.

I have further updated the patch to implement unique GID on the
subscriber-side as discussed in the nearby thread [1]. That requires
some changes in the test. Additionally, I have updated some comments
and docs. Let me know what you think about the changes.

+static void
+TwoPhaseTransactionGid(RepOriginId originid, TransactionId xid,
+                                          char *gid, int szgid)
+{
+       /* Origin and Transaction ids must be valid */
+       Assert(originid != InvalidRepOriginId);
+       Assert(TransactionIdIsValid(xid));
+
+       snprintf(gid, szgid, "pg_%u_%u", originid, xid);
+}

I found one issue in the current mechanism that we use to generate the
GIDs. In one scenario it generates the same GID twice. The steps to
reproduce it are:
1. Set up two publishers and one subscriber with synchronous_standby_names.
2. Prepare txn 't1' on publisher1 (this is prepared as pg_1_542 on the
   subscriber).
3. Drop the subscription to publisher1.
4. Create a subscription to publisher2 (the subscriber that earlier
   subscribed to publisher1 now subscribes to publisher2).
5. Prepare txn 't2' on publisher2 (this also uses pg_1_542 on the
   subscriber, even though the user has given a different gid).

This prepared txn keeps waiting to complete on the subscriber, but never
does. The user gave the two prepared transactions different gids, but
they end up with the same gid on the subscriber. The subscriber keeps
failing with:
2021-03-22 10:14:57.859 IST [73959] ERROR: transaction identifier
"pg_1_542" is already in use
2021-03-22 10:14:57.860 IST [73868] LOG: background worker "logical
replication worker" (PID 73959) exited with exit code 1

The attached file has the steps for it.
This might be a rare scenario, and may or may not be a realistic user
scenario. Should we handle it?
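
The clash can be illustrated outside the server with a minimal C sketch. It reuses the "pg_%u_%u" format string from the quoted TwoPhaseTransactionGid(); the typedefs, the buffer size, and the helper names old_style_gid and gids_clash are stand-ins invented for illustration, and the sketch assumes (as the pg_1_542 error above suggests) that the recreated subscription gets the same origin id and that both publishers happen to use the same xid.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in typedefs; the real ones come from the PostgreSQL headers. */
typedef unsigned int RepOriginId;
typedef unsigned int TransactionId;

/* Hypothetical buffer size, chosen only for this sketch. */
#define SKETCH_GID_SIZE 64

/* Hypothetical helper that mirrors the format of the quoted function. */
void
old_style_gid(RepOriginId originid, TransactionId xid, char *gid, int szgid)
{
    /* Origin and transaction ids must be valid (non-zero in this sketch). */
    assert(originid != 0);
    assert(xid != 0);

    snprintf(gid, szgid, "pg_%u_%u", originid, xid);
}

/* Returns 1 when two (origin id, xid) pairs produce the same GID. */
int
gids_clash(RepOriginId o1, TransactionId x1, RepOriginId o2, TransactionId x2)
{
    char a[SKETCH_GID_SIZE];
    char b[SKETCH_GID_SIZE];

    old_style_gid(o1, x1, a, (int) sizeof(a));
    old_style_gid(o2, x2, b, (int) sizeof(b));
    return strcmp(a, b) == 0;
}
```

Calling gids_clash(1, 542, 1, 542) models 't1' from publisher1 and 't2' from publisher2 arriving with the same origin id and xid: nothing in the generated string distinguishes the two publishers, so the second PREPARE hits the "already in use" error.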

Regards,
Vignesh

Attachments:

possible_bug.sh (application/x-shellscript)

#293Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#292)
2 attachment(s)

Please find attached the latest patch set v68*

Differences from v67* are:

* Rebased to HEAD @ today.

* v68 fixes an issue reported by Vignesh [1] where a scenario was
found which still was able to cause a generated GID clash. Using
Vignesh's test script I could reproduce the problem exactly as
described. The fix makes the GID unique by including the subid. Now
the same script runs to normal completion and produces good/expected
output:

 transaction |       gid        |           prepared            |  owner   | database
-------------+------------------+-------------------------------+----------+----------
         547 | pg_gid_16389_543 | 2021-03-30 10:32:36.87207+11  | postgres | postgres
         555 | pg_gid_16390_543 | 2021-03-30 10:32:48.087771+11 | postgres | postgres
(2 rows)
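
The two gid values in this output suggest the form pg_gid_<subid>_<xid>. A minimal C sketch of that scheme follows; the format string "pg_gid_%u_%u" is inferred from the output rather than copied from the patch, and the typedefs and the function name subid_based_gid are stand-ins for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in typedefs; the real ones come from the PostgreSQL headers. */
typedef unsigned int Oid;
typedef unsigned int TransactionId;

/*
 * Hypothetical helper: build a GID from the subscription OID and the
 * publisher-side xid. The format is an assumption based on the output
 * shown in this message.
 */
void
subid_based_gid(Oid subid, TransactionId xid, char *gid, int szgid)
{
    /* Subscription and transaction ids must be valid (non-zero here). */
    assert(subid != 0);
    assert(xid != 0);

    snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}
```

With subscription OIDs 16389 and 16390 and the same xid 543, the helper yields two different strings, which is why the dropped-and-recreated subscription no longer collides with the earlier prepared transaction.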

----
[1]: /messages/by-id/CALDaNm2ZnJeG23bE+gEOQEmXo8N+fs2g4=xuH2u6nNcX0s9Jjg@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v68-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 1c27f4a3b5b87195a4852fcef37d89ab2e3722aa Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 30 Mar 2021 10:41:52 +1100
Subject: [PATCH v68] Add support for prepared transactions to built-in 
 logical replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION SET PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         |  16 +-
 doc/src/sgml/ref/alter_subscription.sgml           |  10 +-
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 +++++
 src/backend/catalog/pg_subscription.c              |  35 +++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 121 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 212 +++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |  13 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 200 ++++++++++---
 src/backend/replication/logical/worker.c           | 331 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 207 ++++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  66 +++-
 src/include/replication/reorderbuffer.h            |   2 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         |  93 ++++--
 src/test/regress/sql/subscription.sql              |  25 ++
 src/tools/pgindent/typedefs.list                   |   2 +
 40 files changed, 1478 insertions(+), 175 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f103d91..ef97f40 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7622,6 +7622,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 80eb96d..bd0a353 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1240,9 +1240,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this, users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 43092fe..c285ef7 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 0adf68e..d5e891b 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -62,9 +62,13 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
   </para>
 
   <para>
-   Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
-   <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
-   option as true cannot be executed inside a transaction block.
+   Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
+   <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with
+   <literal>refresh = true</literal> cannot be executed inside a transaction
+   block. They also cannot be executed with <literal>copy_data = true</literal>
+   when the subscription has <literal>two_phase</literal> commit enabled. See
+   column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
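As a reading aid for the documentation change above, the three subtwophasestate codes and the rule about refreshing with copy_data can be sketched in standalone C. This is only an illustration and not code from the patch: the helper names initial_twophase_state, next_twophase_state and refresh_with_copy_allowed are hypothetical, and in the patch the pending-to-enabled transition is done by the apply worker, not by a function like this.

```c
#include <assert.h>
#include <stdbool.h>

/* State codes as documented for pg_subscription.subtwophasestate. */
#define TWOPHASE_STATE_DISABLED 'd'
#define TWOPHASE_STATE_PENDING  'p'
#define TWOPHASE_STATE_ENABLED  'e'

/* Hypothetical helper: the state stored at CREATE SUBSCRIPTION time. */
static char
initial_twophase_state(bool two_phase_requested)
{
	return two_phase_requested ? TWOPHASE_STATE_PENDING
		: TWOPHASE_STATE_DISABLED;
}

/*
 * Hypothetical helper: a pending subscription becomes enabled only once
 * the initial table synchronization has completed for all tables.
 */
static char
next_twophase_state(char current, bool all_tables_ready)
{
	assert(current == TWOPHASE_STATE_DISABLED ||
		   current == TWOPHASE_STATE_PENDING ||
		   current == TWOPHASE_STATE_ENABLED);

	if (current == TWOPHASE_STATE_PENDING && all_tables_ready)
		return TWOPHASE_STATE_ENABLED;
	return current;
}

/*
 * Hypothetical helper mirroring the ALTER SUBSCRIPTION ... REFRESH check:
 * once two-phase is enabled, a refresh that copies data is rejected.
 */
static bool
refresh_with_copy_allowed(char state, bool copy_data)
{
	return !(state == TWOPHASE_STATE_ENABLED && copy_data);
}
```

Under these assumptions, a subscription created with two_phase = true starts in 'p', moves to 'e' after table sync, and from then on only accepts a refresh with copy_data = false.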
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 89335b6..d75f052 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if a prepared transaction with the given GID, lsn and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
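The matching rule in LookupGXact (same GID, plus matching origin LSN and origin timestamp) can be illustrated without any backend infrastructure. The sketch below is an assumption-laden stand-in: PreparedXactSketch and lookup_prepared_sketch are hypothetical names, the entries live in a plain array instead of TwoPhaseState, and the locking and 2PC file I/O of the real function are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a prepared xact entry; not GlobalTransaction. */
typedef struct PreparedXactSketch
{
	bool		valid;				/* entry is fully set up */
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* LSN where the remote PREPARE ended */
	int64_t		origin_timestamp;	/* remote prepare time */
} PreparedXactSketch;

/* Return true only if GID, origin LSN and origin timestamp all match. */
static bool
lookup_prepared_sketch(const PreparedXactSketch *xacts, size_t n,
					   const char *gid, uint64_t prepare_end_lsn,
					   int64_t origin_prepare_timestamp)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		const PreparedXactSketch *x = &xacts[i];

		/* Skip entries that are not yet valid or carry a different GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		/* The GID alone is not enough: the origin must match as well. */
		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```

With this shape, a locally prepared transaction that happens to reuse a GID from the publisher is not mistaken for the replicated one, because its origin fields differ.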
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..2b4b699 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -449,6 +450,40 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	int			nkeys = 0;
+	ScanKeyData skey[2];
+	SysScanDesc scan;
+	bool		has_subrels = false;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[nkeys++],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, nkeys, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5f2541d..f134ff9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1249,7 +1249,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index bfd3514..89abc81 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -283,6 +312,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -358,6 +402,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +428,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +497,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +579,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +895,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -869,6 +930,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -892,7 +959,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +1005,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -953,6 +1022,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details on why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -982,7 +1062,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f59613..73038d4 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -370,11 +370,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -730,7 +728,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 37b75de..7e68299 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transactions' commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f2c85ca..daeef83 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -106,6 +106,218 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->preparetime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 127f2c4..7e8c8d3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2672,7 +2672,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2698,12 +2698,13 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c5a8125..0daff13 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped, either because two-phase
+	 * was not previously enabled or because they are not covered by the
+	 * initial snapshot, need to be sent later along with commit prepared, and
+	 * they must be before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 8494db8..2937f41 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1054,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1140,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * Return false when the subscription has no tables. Otherwise, all
+	 * tablesyncs are READY exactly when none of them is still busy.
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state. */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 354fbe4..afdd257 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ...
+ * REFRESH PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that it was initially on and
+ * we have received some prepare, and then we turn it off: now at commit time
+ * the server will send the entire transaction data along with the commit.
+ * With some more analysis, we can allow changing this option from off to on,
+ * but it is not clear that that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get a prepare of the same GID more than once in the genuine case
+ * where we have defined multiple subscriptions for publications on the same
+ * server and the prepared transaction has operations on tables subscribed to
+ * by those subscriptions. In such cases, if we use the GID sent by the
+ * publisher, one of the prepares will be successful and the others will fail,
+ * in which case the server will send them again. Now, this can lead to a
+ * deadlock if the user has set synchronous_standby_names for all the
+ * subscriptions on the subscriber. To avoid such deadlocks, we generate a
+ * unique GID (consisting of the subscription oid and the xid of the prepared
+ * transaction) for each prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,9 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -720,6 +796,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.preparetime));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server as that can lead to deadlock
+	 * when we have multiple subscriptions from the same node pointing to
+	 * publications on the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1954,6 +2204,28 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			/* Streaming with two-phase is not supported */
+			ereport(ERROR,
+					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+					 errmsg("invalid logical replication message type \"%c\"", action)));
 	}
 
 	ereport(ERROR,
@@ -2430,6 +2702,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2916,6 +3191,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3088,9 +3377,47 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 1b993fb..565f92b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -47,6 +47,16 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx,
 							  ReorderBufferChange *change);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -66,6 +76,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -143,6 +156,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->change_cb = pgoutput_change;
 	cb->truncate_cb = pgoutput_truncate;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -153,18 +171,22 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_commit_cb = pgoutput_stream_commit;
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
 parse_output_parameters(List *options, uint32 *protocol_version,
 						List **publication_names, bool *binary,
-						bool *enable_streaming)
+						bool *enable_streaming,
+						bool *enable_twophase)
 {
 	ListCell   *lc;
 	bool		protocol_version_given = false;
 	bool		publication_names_given = false;
 	bool		binary_option_given = false;
 	bool		streaming_given = false;
+	bool		twophase_given = false;
 
 	*binary = false;
 
@@ -232,9 +254,30 @@ parse_output_parameters(List *options, uint32 *protocol_version,
 
 			*enable_streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			twophase_given = true;
+
+			*enable_twophase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (*enable_twophase && *enable_streaming)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("%s and %s are mutually exclusive options",
+						"two_phase", "streaming")));
 }
 
 /*
@@ -245,6 +288,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 				 bool is_init)
 {
 	bool		enable_streaming = false;
+	bool		enable_twophase = false;
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 
 	/* Create our memory context for private allocations. */
@@ -269,7 +313,8 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								&data->protocol_version,
 								&data->publication_names,
 								&data->binary,
-								&enable_streaming);
+								&enable_streaming,
+								&enable_twophase);
 
 		/* Check if we support requested protocol */
 		if (data->protocol_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
@@ -310,6 +355,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we only check whether the two-phase option was passed by
+		 * the plugin; whether to enable it is decided at a later point. It
+		 * remains enabled if a previous start-up has done so. We only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!enable_twophase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -322,8 +388,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable streaming and decoding of prepared transactions during
+		 * slot initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -338,29 +408,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -380,6 +429,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -778,18 +889,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1195,3 +1296,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send the replication origin, if requested and its name is found */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c..91224e0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8532296..8e7edae 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -364,7 +364,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index da6cc05..24436ed 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4278,6 +4279,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4321,9 +4323,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4344,6 +4353,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4369,6 +4379,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4396,6 +4408,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4437,6 +4450,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5340843..70c072d 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 440249f..7b56b3c 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6227,7 +6227,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6253,13 +6253,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index a053bc1..ff7ddc3 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2763,7 +2763,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See the comments atop worker.c for more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index c253403..5c1ce7e 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index fa4c372..7426993 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,16 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. This has the same protocol
+ * version requirement as LOGICALREP_PROTO_STREAM_VERSION_NUM because these
+ * features were both introduced in the same release (PG14).
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -54,10 +61,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_TRUNCATE = 'T',
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -114,6 +126,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -121,6 +134,39 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for begin_prepare, prepare, and commit prepared
+ * messages. For commit prepared, prepare_lsn and preparetime store the
+ * commit LSN and commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -128,6 +174,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6c9f2c6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -643,7 +643,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..db68551 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..daf6ad4 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..a9664e8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..13e0c20 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e6777e..1874ec7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1346,9 +1346,11 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
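
The expected regression output in the patch above pins down two rules for the new option: `two_phase = true` cannot be combined with `streaming = true`, and a subscription created with `two_phase = true` first shows `p` in the "Two phase commit" column, while one created without it shows `d`. A minimal C sketch of that check follows. The function and macro names are hypothetical and are not taken from the patch; only the error text and the state letters come from the regression output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State letters as shown by \dRs+ (macro names are hypothetical). */
#define TWOPHASE_STATE_DISABLED 'd'
#define TWOPHASE_STATE_PENDING  'p'

/*
 * Validate the streaming/two_phase pair and pick the initial state.
 * Returns NULL on success, or the error text on failure.
 */
static const char *
check_twophase_options(bool streaming, bool two_phase, char *initial_state)
{
    assert(initial_state != NULL);

    if (streaming && two_phase)
        return "two_phase = true and streaming = true are mutually exclusive options";

    *initial_state = two_phase ? TWOPHASE_STATE_PENDING : TWOPHASE_STATE_DISABLED;
    return NULL;
}
```

The subscriber TAP tests in the next patch wait for the state to reach `e` before preparing any transactions, which suggests that `p` is only a transitional value.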

v68-0002-Support-2PC-txn-subscriber-tests.patch (application/octet-stream)
From c69e442bece2c0473e7dd05b64e8654ef99be0c7 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 30 Mar 2021 10:53:45 +1100
Subject: [PATCH v68] Support 2PC txn - subscriber tests.

This patch adds the two-phase commit subscriber test code.
---
 src/test/subscription/t/020_twophase.pl         | 293 ++++++++++++++++++++++++
 src/test/subscription/t/021_twophase_cascade.pl | 236 +++++++++++++++++++
 2 files changed, 529 insertions(+)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_cascade.pl

diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..364e6eb
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,293 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_cascade.pl b/src/test/subscription/t/021_twophase_cascade.pl
new file mode 100644
index 0000000..76b224a
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_cascade.pl
@@ -0,0 +1,236 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#294vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#293)

On Tue, Mar 30, 2021 at 5:34 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v68*

Differences from v67* are:

* Rebased to HEAD @ today.

* v68 fixes an issue reported by Vignesh [1]: one scenario could still
cause a clash between generated GIDs. Using Vignesh's test script I
could reproduce the problem exactly as described. The fix makes the
GID unique by including the subid. Now the same script runs to normal
completion and produces the expected output:

 transaction |       gid        |           prepared            |  owner   | database
-------------+------------------+-------------------------------+----------+----------
         547 | pg_gid_16389_543 | 2021-03-30 10:32:36.87207+11  | postgres | postgres
         555 | pg_gid_16390_543 | 2021-03-30 10:32:48.087771+11 | postgres | postgres
(2 rows)
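
The GIDs in the output above have the shape `pg_gid_<number>_<number>`. Going by the description of the fix, the first number is presumably the subscription OID and the second the publisher-side transaction ID, so two subscriptions applying the same remote transaction no longer collide. A minimal C sketch of building such a string, where the function name and parameter types are assumptions rather than the patch's actual code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Build a prepared-transaction GID of the form pg_gid_<subid>_<xid>.
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the buffer is too small.
 */
static int
make_twophase_gid(char *buf, size_t buflen, uint32_t subid, uint32_t xid)
{
    int n;

    assert(buf != NULL && buflen > 0);

    n = snprintf(buf, buflen, "pg_gid_%u_%u", (unsigned) subid, (unsigned) xid);
    if (n < 0 || (size_t) n >= buflen)
        return -1;
    return n;
}
```

With the values from the output, subscription 16389 and subscription 16390 would each get a distinct GID for remote transaction 543.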

Thanks for the patch with the fix, the fix solves the issue reported.

Regards,
Vignesh

#295Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#293)

On Tue, Mar 30, 2021 at 5:34 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v68*

I think this patch is in much better shape than it was a few versions
earlier, but I feel some more work and testing is still required. We
can try to make it work with the streaming option and do something
about empty prepare transactions to reduce the need for users to set a
much higher value for max_prepared_xacts on subscribers. So, I propose
to move it to the next CF. What do you think?

--
With Regards,
Amit Kapila.

#296Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#295)

On Thu, Apr 1, 2021 at 2:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 30, 2021 at 5:34 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v68*

I think this patch is in much better shape than it was a few versions
earlier, but I feel some more work and testing is still required. We
can try to make it work with the streaming option and do something
about empty prepare transactions to reduce the need for users to set a
much higher value for max_prepared_xacts on subscribers. So, I propose
to move it to the next CF. What do you think?

I agree.

regards,
Ajin Cherian
Fujitsu Australia

#297vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#295)

On Thu, Apr 1, 2021 at 8:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 30, 2021 at 5:34 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v68*

I think this patch is in much better shape than it was a few versions
earlier, but I feel some more work and testing is still required. We
can try to make it work with the streaming option and do something
about empty prepare transactions to reduce the need for users to set a
much higher value for max_prepared_xacts on subscribers. So, I propose
to move it to the next CF. What do you think?

+1 for moving it to the next PG version.

Regards,
Vignesh

#298Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#296)

On Thu, Apr 1, 2021 at 4:58 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, Apr 1, 2021 at 2:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 30, 2021 at 5:34 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v68*

I think this patch is in much better shape than it was a few versions
earlier, but I feel some more work and testing is still required. We
can try to make it work with the streaming option and do something
about empty prepare transactions to reduce the need for users to set a
much higher value for max_prepared_xacts on subscribers. So, I propose
to move it to the next CF. What do you think?

I agree.

OK, done. Moved to next CF here: https://commitfest.postgresql.org/33/2914/

------
Kind Regards,
Peter Smith.
Fujitsu Australia.

#299Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#293)
1 attachment(s)

Please find attached the latest patch set v69*

Differences from v68* are:

* Rebased to HEAD @ yesterday.
There were some impacts caused by recently pushed patches [1] [2].

* The stream/prepare functionality and tests have been restored to be
the same as they were in v48 [3].
Previously, this code had been removed back in v49 [4] due to
incompatibilities with the (now obsolete) psf design.

* TAP tests are now co-located in the same patch as the code they are testing.

----
[1]: https://github.com/postgres/postgres/commit/531737ddad214cb8a675953208e2f3a6b1be122b
[2]: https://github.com/postgres/postgres/commit/ac4645c0157fc5fcef0af8ff571512aa284a2cec
[3]: /messages/by-id/CAHut+Psr8f1tUttndgnkK_=a7w=hsomw16SEOn6U68jSBKL9SQ@mail.gmail.com
[4]: /messages/by-id/CAFPTHDZduc2fDzqd_L4vPmA2R+-e8nEbau9HseHHi82w=p-uvQ@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia


On Tue, Mar 30, 2021 at 11:03 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v68*

Differences from v67* are:

* Rebased to HEAD @ today.

* v68 fixes an issue reported by Vignesh [1] where a scenario was
found which still was able to cause a generated GID clash. Using
Vignesh's test script I could reproduce the problem exactly as
described. The fix makes the GID unique by including the subid. Now
the same script runs to normal completion and produces good/expected
output:

 transaction |       gid        |           prepared            |  owner   | database
-------------+------------------+-------------------------------+----------+----------
         547 | pg_gid_16389_543 | 2021-03-30 10:32:36.87207+11  | postgres | postgres
         555 | pg_gid_16390_543 | 2021-03-30 10:32:48.087771+11 | postgres | postgres
(2 rows)

----
[1] /messages/by-id/CALDaNm2ZnJeG23bE+gEOQEmXo8N+fs2g4=xuH2u6nNcX0s9Jjg@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v69-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 6c356d8ff0ca0623d33a4b4ee6a1b71f5786f3c9 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 7 Apr 2021 09:53:29 +1000
Subject: [PATCH v69] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION SET PUBLICATION WITH (refresh = true) when two_phase enabled.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.
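
As a reading aid, the "enable two_phase only once the initial data sync is over" rule above can be sketched as a small state machine over the documented subtwophasestate codes ('d', 'p', 'e'). This is a minimal sketch only; the names TWOPHASE_*, initial_twophase_state and next_twophase_state are hypothetical and are not the patch's identifiers:

```c
#include <assert.h>
#include <stdbool.h>

/* State codes as documented for pg_subscription.subtwophasestate. */
enum
{
	TWOPHASE_DISABLED = 'd',	/* two_phase was not requested */
	TWOPHASE_PENDING = 'p',		/* requested, waiting for table sync */
	TWOPHASE_ENABLED = 'e'		/* requested and enabled */
};

/* State stored when the subscription is created (hypothetical helper). */
static char
initial_twophase_state(bool two_phase_requested)
{
	return two_phase_requested ? TWOPHASE_PENDING : TWOPHASE_DISABLED;
}

/*
 * State after checking table synchronization (hypothetical helper).
 * Only a pending subscription can become enabled, and only once every
 * table has finished its initial sync. A disabled subscription stays
 * disabled, since the option cannot be toggled later.
 */
static char
next_twophase_state(char state, bool all_tables_ready)
{
	if (state == TWOPHASE_PENDING && all_tables_ready)
		return TWOPHASE_ENABLED;
	return state;
}
```

The sketch deliberately has no transition out of 'e' or 'd', mirroring the restriction that ALTER SUBSCRIPTION cannot change two_phase.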

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         |  16 +-
 doc/src/sgml/ref/alter_subscription.sgml           |  10 +-
 doc/src/sgml/ref/create_subscription.sgml          |  26 ++
 src/backend/access/transam/twophase.c              |  68 +++
 src/backend/catalog/pg_subscription.c              |  35 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 100 ++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 277 +++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |  13 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 200 +++++++--
 src/backend/replication/logical/worker.c           | 457 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 210 ++++++++--
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  72 +++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   2 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         |  95 +++--
 src/test/regress/sql/subscription.sql              |  27 ++
 src/test/subscription/t/020_twophase.pl            | 293 +++++++++++++
 src/test/subscription/t/021_twophase_cascade.pl    | 236 +++++++++++
 src/test/subscription/t/022_twophase_stream.pl     | 450 ++++++++++++++++++++
 .../subscription/t/023_twophase_cascade_stream.pl  | 268 ++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 45 files changed, 2885 insertions(+), 188 deletions(-)
 create mode 100644 src/test/subscription/t/020_twophase.pl
 create mode 100644 src/test/subscription/t/021_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/022_twophase_stream.pl
 create mode 100644 src/test/subscription/t/023_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f103d91..ef97f40 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7622,6 +7622,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 5d049cd..97ac503 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 380be5f..addda1c 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decode of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 5aed269..b7d99c0 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -62,9 +62,13 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
   </para>
 
   <para>
-   Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
-   <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with refresh
-   option as true cannot be executed inside a transaction block.
+   Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command>, and
+   <command>ALTER SUBSCRIPTION ... SET PUBLICATION ...</command> with
+   <literal>refresh = true</literal> cannot be executed inside a transaction
+   block. They also cannot be executed with <literal>copy_data = true</literal>
+   when the subscription has <literal>two_phase</literal> commit enabled. See
+   column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,32 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 89335b6..d75f052 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid the possibility of matching
+ * prepared xacts from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
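
Reviewer aid, not part of the patch: the matching rule in LookupGXact above (same GID, and also the same origin LSN and origin timestamp) can be tried out in isolation with the sketch below. It assumes a simplified in-memory array in place of TwoPhaseState, and it omits the TwoPhaseStateLock handling and the reading of the 2PC file or WAL record; SketchPreparedXact and sketch_lookup_gxact are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for a prepared transaction entry (hypothetical). */
typedef struct SketchPreparedXact
{
	bool		valid;				/* false while the entry is not yet valid */
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* end LSN of the remote PREPARE */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
} SketchPreparedXact;

/*
 * Return true only if a valid entry matches the GID and also the origin
 * LSN and origin timestamp. A GID match alone is not enough, because a
 * different prepared transaction with the same GID can exist locally.
 */
static bool
sketch_lookup_gxact(const SketchPreparedXact *xacts, int nxacts,
					const char *gid, uint64_t prepare_end_lsn,
					int64_t origin_prepare_timestamp)
{
	for (int i = 0; i < nxacts; i++)
	{
		const SketchPreparedXact *x = &xacts[i];

		/* Skip not-yet-valid entries and entries with a different GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```

Note that a GID match with a different LSN or timestamp does not end the loop; the scan continues, as in the patch, in case another entry with the same GID matches fully.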
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..2b4b699 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -449,6 +450,40 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	int			nkeys = 0;
+	ScanKeyData skey[2];
+	SysScanDesc scan;
+	bool		has_subrels = false;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[nkeys++],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, nkeys, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5f2541d..f134ff9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1249,7 +1249,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5282b79..49db2c2 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -67,7 +67,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -108,6 +109,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -213,6 +219,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -358,6 +387,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -382,7 +413,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -450,6 +482,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -528,7 +564,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables are in progress. See
+				 * comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -835,7 +880,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -892,7 +938,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -937,7 +984,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -953,6 +1001,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -982,7 +1041,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 97be4b0..5bf2e70 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2f68036..e7939e5 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2a1f983..428f7cb 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -107,6 +107,283 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->preparetime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
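The PREPARE writer and reader in the proto.c hunk above agree on a fixed field order: a message type byte, a flags byte, two 64-bit LSNs, a 64-bit timestamp, a 32-bit xid, and a NUL-terminated gid. What follows is a minimal standalone sketch of that layout, not PostgreSQL code. `SketchBuf`, `SketchPrepare`, `sketch_write_prepare`, `sketch_read_prepare`, the `'P'` type byte and the 200-byte gid limit are all assumptions made for illustration. Integers are written in network byte order, as the `pq_send*` routines do, and the reader bounds the gid copy instead of trusting the sender.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumed stand-in for GIDSIZE */
#define SKETCH_MSG_PREPARE 'P'	/* assumed message type byte */

typedef struct
{
	uint8_t		data[512];		/* large enough for one PREPARE message */
	size_t		len;			/* bytes written so far */
	size_t		pos;			/* read cursor */
} SketchBuf;

typedef struct
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		preparetime;
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Append the low nbytes of v in big-endian (network) order. */
static void
put_be(SketchBuf *b, uint64_t v, int nbytes)
{
	for (int i = nbytes - 1; i >= 0; i--)
		b->data[b->len++] = (uint8_t) (v >> (8 * i));
}

/* Read nbytes as a big-endian integer and advance the cursor. */
static uint64_t
get_be(SketchBuf *b, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | b->data[b->pos++];
	return v;
}

/* Type byte, flags, prepare_lsn, end_lsn, timestamp, xid, then the gid. */
static void
sketch_write_prepare(SketchBuf *out, const SketchPrepare *p)
{
	size_t		n = strlen(p->gid) + 1;	/* include the trailing NUL */

	put_be(out, SKETCH_MSG_PREPARE, 1);
	put_be(out, 0, 1);			/* flags, currently always zero */
	put_be(out, p->prepare_lsn, 8);
	put_be(out, p->end_lsn, 8);
	put_be(out, (uint64_t) p->preparetime, 8);
	put_be(out, p->xid, 4);
	memcpy(out->data + out->len, p->gid, n);
	out->len += n;
}

/* Returns 0 on success, -1 on a malformed or oversized message. */
static int
sketch_read_prepare(SketchBuf *in, SketchPrepare *p)
{
	const uint8_t *s;
	const uint8_t *nul;

	if (in->len - in->pos < 30)	/* fixed part: 1 + 1 + 8 + 8 + 8 + 4 */
		return -1;
	if (get_be(in, 1) != SKETCH_MSG_PREPARE)
		return -1;
	if (get_be(in, 1) != 0)		/* unrecognized flags */
		return -1;
	p->prepare_lsn = get_be(in, 8);
	p->end_lsn = get_be(in, 8);
	if (p->prepare_lsn == 0 || p->end_lsn == 0)	/* 0 plays InvalidXLogRecPtr */
		return -1;
	p->preparetime = (int64_t) get_be(in, 8);
	p->xid = (uint32_t) get_be(in, 4);

	/* Bounded gid copy: the NUL must exist and the string must fit. */
	s = in->data + in->pos;
	nul = memchr(s, '\0', in->len - in->pos);
	if (nul == NULL || (size_t) (nul - s) >= sizeof(p->gid))
		return -1;
	memcpy(p->gid, s, (size_t) (nul - s) + 1);
	in->pos += (size_t) (nul - s) + 1;
	return 0;
}
```

A round trip through `sketch_write_prepare` and `sketch_read_prepare` should return the same field values, and a message with a nonzero flags byte should be rejected, which matches the error path in the reader functions of the patch.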
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 52d0628..92c7fa7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2680,7 +2680,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2706,12 +2706,13 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c5a8125..0daff13 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because two-phase was not
+	 * previously enabled, or that are not covered by the initial snapshot,
+	 * need to be sent later along with commit prepared, and they must be
+	 * before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
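The logical.c, reorderbuffer.c and snapbuild.c hunks above cooperate: two_phase_at records the LSN from which prepares are decoded, and at COMMIT PREPARED any transaction whose prepare lies before that LSN is sent in full. The standalone sketch below models only that decision, under stated assumptions: `SketchSlot`, `sketch_start_decoding` and `sketch_send_prepare_with_commit` are hypothetical names, and the `twophase_capable` parameter stands in for whatever value ctx->twophase held before the `&=` in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLsn;

typedef struct
{
	bool		two_phase;		/* slot already marked for two-phase? */
	SketchLsn	two_phase_at;	/* LSN from which prepares are decoded */
} SketchSlot;

/*
 * Models the CreateDecodingContext logic: two-phase decoding is active when
 * it is possible at all and either the slot was created with it or the
 * option is given when streaming starts. The first time it is switched on
 * for a slot, the start LSN is remembered.
 */
static bool
sketch_start_decoding(SketchSlot *slot, bool twophase_capable,
					  bool twophase_opt_given, SketchLsn start_lsn)
{
	bool		twophase = twophase_capable &&
		(twophase_opt_given || slot->two_phase);

	if (twophase && !slot->two_phase)
	{
		slot->two_phase = true;
		slot->two_phase_at = start_lsn;
	}
	return twophase;
}

/*
 * Models the ReorderBufferFinishPrepared check: a prepare that happened
 * before two_phase_at was never sent, so it must accompany COMMIT PREPARED.
 * Nothing needs to be replayed for a rollback.
 */
static bool
sketch_send_prepare_with_commit(SketchLsn prepare_lsn, SketchLsn two_phase_at,
								bool is_commit)
{
	return prepare_lsn < two_phase_at && is_commit;
}
```

For instance, if a transaction was prepared at LSN 100 and two-phase decoding was enabled at LSN 150, the later COMMIT PREPARED carries the whole transaction; a transaction prepared at LSN 200 was already sent at prepare time and is only committed.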
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 0638f5c..a1bae34 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1054,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1140,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state to new_state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
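The tablesync.c changes above reduce to one small rule: the apply worker restarts to enable two_phase only while the state is PENDING, the subscription has at least one table, and no table is still syncing. The sketch below restates that rule as standalone C under stated assumptions: the enum and both function names are hypothetical, and plain table counts replace the catalog lookups done by FetchTableStates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the DISABLED / PENDING / ENABLED tri-state. */
typedef enum
{
	SKETCH_TWOPHASE_DISABLED,
	SKETCH_TWOPHASE_PENDING,
	SKETCH_TWOPHASE_ENABLED
} SketchTwoPhaseState;

/*
 * Models AllTablesyncsReady: false when the subscription has no tables,
 * otherwise true only when no table is still in a not-READY state.
 */
static bool
sketch_all_tablesyncs_ready(int total_tables, int not_ready_tables)
{
	bool		has_subrels = total_tables > 0;
	bool		found_busy = not_ready_tables > 0;

	return has_subrels && !found_busy;
}

/*
 * Models the check added to process_syncing_tables_for_apply: restart the
 * apply worker only for a PENDING subscription whose tables are all READY.
 */
static bool
sketch_should_restart_apply_worker(SketchTwoPhaseState state,
								   int total_tables, int not_ready_tables)
{
	return state == SKETCH_TWOPHASE_PENDING &&
		sketch_all_tablesyncs_ready(total_tables, not_ready_tables);
}
```

Keeping a subscription with zero tables in PENDING is what leaves ALTER SUBSCRIPTION ... REFRESH PUBLICATION usable later, as the comment in the patch notes.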
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 74d538b..8364005 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider: initially it was on and we
+ * received some prepare, then we turn it off; now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we can allow changing this option from off to on but not sure if
+ * that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and the prepared transaction has operations on tables subscribed to
+ * those subscriptions. For such cases, if we use the GID sent by the publisher,
+ * one of the prepares will be successful and others will fail, in which case
+ * the server will send them again. Now, this can lead to a deadlock if the user
+ * has set synchronous_standby_names for all the subscriptions on the subscriber.
+ * To avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,11 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -720,6 +798,260 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.preparetime));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from the same node pointing to
+	 * publications on the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received the prepare, either because it
+	 * occurred before the walsender reached a consistent point or because
+	 * two_phase was not yet enabled at that time. In such cases, we need to
+	 * skip the rollback prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -917,30 +1249,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +1271,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +1286,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1361,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -1963,6 +2311,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2439,6 +2807,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2925,6 +3296,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(subid != InvalidOid);
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3097,9 +3482,47 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * in the tri-state PENDING until all tablesyncs have reached the READY
+		 * state. Only then can it become ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
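Note on the GID scheme in TwoPhaseTransactionGid() above: the naming can be exercised outside the backend. What follows is only a minimal standalone sketch, not part of the patch. The name format_twophase_gid, the uint32_t stand-ins for Oid and TransactionId, and the GID_BUF_SIZE value are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed buffer size for illustration; the patch itself uses GIDSIZE. */
#define GID_BUF_SIZE 200

/*
 * Hypothetical standalone analogue of TwoPhaseTransactionGid().
 *
 * Writes "pg_gid_<subid>_<xid>" into gid. Returns false if snprintf failed
 * or the result did not fit into szgid bytes (i.e. it was truncated).
 */
bool
format_twophase_gid(uint32_t subid, uint32_t xid, char *gid, size_t szgid)
{
	int			len;

	assert(gid != NULL && szgid > 0);

	len = snprintf(gid, szgid, "pg_gid_%u_%u",
				   (unsigned) subid, (unsigned) xid);

	return len >= 0 && (size_t) len < szgid;
}
```

Because the subscription OID is part of the name, two subscriptions on one node that receive the same remote xid end up with different GIDs, which is the property the comment in apply_handle_prepare() relies on.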
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -61,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -70,6 +82,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +163,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +179,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -170,10 +192,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,6 +273,16 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -322,6 +356,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option was passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +389,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +409,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +430,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +919,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -911,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1236,3 +1344,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
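Note on the two_phase option check in pgoutput_startup() above: the order of the tests matters, since an old protocol version is reported before missing plugin support. The sketch below restates that decision as a pure function so the order is easy to see. It is an illustration only, not part of the patch. The enum, its members, and the name check_two_phase_option are hypothetical; the version constant is copied from the logicalproto.h hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Value copied from the patch's logicalproto.h hunk. */
#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2

/* Hypothetical result codes; the patch raises ereport(ERROR) instead. */
typedef enum TwoPhaseCheck
{
	TWOPHASE_NOT_REQUESTED,		/* twophase_opt_given = false */
	TWOPHASE_PROTO_TOO_OLD,		/* error: proto_version too low */
	TWOPHASE_PLUGIN_UNSUPPORTED,	/* error: ctx->twophase is false */
	TWOPHASE_OK					/* twophase_opt_given = true */
} TwoPhaseCheck;

/*
 * Mirrors the if / else-if chain in pgoutput_startup(): the option is
 * checked first, then the protocol version, then plugin support.
 */
TwoPhaseCheck
check_two_phase_option(bool requested, int proto_version, bool ctx_twophase)
{
	if (!requested)
		return TWOPHASE_NOT_REQUESTED;
	if (proto_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
		return TWOPHASE_PROTO_TOO_OLD;
	if (!ctx_twophase)
		return TWOPHASE_PLUGIN_UNSUPPORTED;
	return TWOPHASE_OK;
}
```

A client that passes two_phase with proto_version 1 therefore gets the protocol error even when the plugin context also lacks two-phase support.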
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c..91224e0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9a0e380..92f3373 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 25717ce..2efedbf 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4368,6 +4369,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4411,9 +4413,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4434,6 +4443,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4459,6 +4469,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4486,6 +4498,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4527,6 +4540,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5340843..70c072d 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
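Note on the pg_dump change above: dumpSubscription() compares the catalog value against a one-character string built from LOGICALREP_TWOPHASE_STATE_DISABLED, so both the pending and the enabled state are dumped as "two_phase = on". A minimal standalone sketch of that comparison follows. It is not part of the patch, and the helper name two_phase_dump_clause is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Value copied from the patch's pg_subscription.h hunk. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'

/*
 * Hypothetical helper mirroring the strcmp() test in dumpSubscription().
 * Returns the text appended to the WITH (...) list, or "" when two_phase
 * is disabled.
 */
const char *
two_phase_dump_clause(const char *subtwophasestate)
{
	const char	two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};

	assert(subtwophasestate != NULL);

	if (strcmp(subtwophasestate, two_phase_disabled) != 0)
		return ", two_phase = on";
	return "";
}
```

Under this logic a restored subscription starts again from the requested state, whether the original was still pending or already enabled.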
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 440249f..7b56b3c 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6227,7 +6227,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6253,13 +6253,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index a053bc1..ff7ddc3 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2763,7 +2763,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
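Note on the tri-state above: the only transition made in ApplyWorkerMain() is PENDING to ENABLED, and only once all tablesyncs are ready. The sketch below models that transition as a pure function. It is an illustration, not part of the patch. The helper names twophase_state_after_start and twophase_state_name are hypothetical, and the boolean argument stands in for the result of AllTablesyncsReady().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Values copied from the patch's pg_subscription.h hunk. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'

/*
 * Hypothetical model of the state after the apply worker starts streaming.
 * Only PENDING moves, and only when every tablesync has reached READY.
 */
char
twophase_state_after_start(char state, bool all_tablesyncs_ready)
{
	if (state == LOGICALREP_TWOPHASE_STATE_PENDING && all_tablesyncs_ready)
		return LOGICALREP_TWOPHASE_STATE_ENABLED;
	return state;
}

/* Same labels as the DEBUG1 message in ApplyWorkerMain(). */
const char *
twophase_state_name(char state)
{
	switch (state)
	{
		case LOGICALREP_TWOPHASE_STATE_DISABLED:
			return "DISABLED";
		case LOGICALREP_TWOPHASE_STATE_PENDING:
			return "PENDING";
		case LOGICALREP_TWOPHASE_STATE_ENABLED:
			return "ENABLED";
		default:
			return "?";
	}
}
```

DISABLED and ENABLED are fixed points of this function, which matches the Assert in maybe_reread_subscription() that the state is not altered by a subscription reread.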
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..076d8c5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,16 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. This has the same protocol
+ * version requirement as LOGICALREP_PROTO_STREAM_VERSION_NUM because these
+ * features were both introduced in the same release (PG14).
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -55,10 +62,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -115,6 +127,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +135,39 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for begin_prepare, prepare, and commit prepared
+ * transaction. prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +175,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -174,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6c9f2c6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -643,7 +643,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..db68551 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..daf6ad4 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 14a4302..838decc 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -224,6 +224,45 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+-- but can alter streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 81e65e5..7a9c3be 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -171,6 +171,33 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+-- but can alter streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/020_twophase.pl b/src/test/subscription/t/020_twophase.pl
new file mode 100644
index 0000000..364e6eb
--- /dev/null
+++ b/src/test/subscription/t/020_twophase.pl
@@ -0,0 +1,293 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/021_twophase_cascade.pl b/src/test/subscription/t/021_twophase_cascade.pl
new file mode 100644
index 0000000..76b224a
--- /dev/null
+++ b/src/test/subscription/t/021_twophase_cascade.pl
@@ -0,0 +1,236 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_stream.pl b/src/test/subscription/t/022_twophase_stream.pl
new file mode 100644
index 0000000..89f3eb7
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_stream.pl
@@ -0,0 +1,450 @@
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (the delete at step 3 does nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_cascade_stream.pl b/src/test/subscription/t/023_twophase_cascade_stream.pl
new file mode 100644
index 0000000..eba1523
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_cascade_stream.pl
@@ -0,0 +1,268 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6a98064..93c4162 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1346,9 +1346,11 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

#300Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#299)
1 attachment(s)

Please find attached the latest patch set v70*

Differences from v69* are:

* Rebased to HEAD @ today
Unfortunately, the v69 patch was broken due to a recent push [1]

----
[1]: https://github.com/postgres/postgres/commit/82ed7748b710e3ddce3f7ebc74af80fe4869492f

Kind Regards,
Peter Smith.
Fujitsu Australia


On Wed, Apr 7, 2021 at 10:25 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v69*

Differences from v68* are:

* Rebased to HEAD @ yesterday.
There were some impacts caused by recently pushed patches [1] [2]

* The stream/prepare functionality and tests have been restored to be
the same as they were in v48 [3].
Previously, this code had been removed back in v49 [4] due to
incompatibilities with the (now obsolete) psf design.

* TAP tests are now co-located in the same patch as the code they are testing.

----
[1] https://github.com/postgres/postgres/commit/531737ddad214cb8a675953208e2f3a6b1be122b
[2] https://github.com/postgres/postgres/commit/ac4645c0157fc5fcef0af8ff571512aa284a2cec
[3] /messages/by-id/CAHut+Psr8f1tUttndgnkK_=a7w=hsomw16SEOn6U68jSBKL9SQ@mail.gmail.com
[4] /messages/by-id/CAFPTHDZduc2fDzqd_L4vPmA2R+-e8nEbau9HseHHi82w=p-uvQ@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

On Tue, Mar 30, 2021 at 11:03 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v68*

Differences from v67* are:

* Rebased to HEAD @ today.

* v68 fixes an issue reported by Vignesh [1] where a scenario was
found which still was able to cause a generated GID clash. Using
Vignesh's test script I could reproduce the problem exactly as
described. The fix makes the GID unique by including the subid. Now
the same script runs to normal completion and produces good/expected
output:

 transaction |       gid        |           prepared            |  owner   | database
-------------+------------------+-------------------------------+----------+----------
         547 | pg_gid_16389_543 | 2021-03-30 10:32:36.87207+11  | postgres | postgres
         555 | pg_gid_16390_543 | 2021-03-30 10:32:48.087771+11 | postgres | postgres
(2 rows)
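
As a rough illustration of the fix described above, a GID of the shape
pg_gid_<subid>_<xid> could be built as sketched below. This is only a
minimal sketch under assumptions, not the code from the patch: the names
make_gid, gids_clash and GID_BUF_SIZE are hypothetical, and only the
"pg_gid_%u_%u" shape is taken from the output shown above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define GID_BUF_SIZE 64		/* hypothetical size, chosen for this sketch only */

/*
 * Hypothetical helper: format a GID from the subscription OID and a
 * transaction id, following the pg_gid_<subid>_<xid> shape shown above.
 * Returns the number of characters written (excluding the terminator).
 */
static int
make_gid(char *buf, size_t buflen, uint32_t subid, uint32_t xid)
{
	int			n;

	assert(buf != NULL && buflen > 0);
	n = snprintf(buf, buflen, "pg_gid_%u_%u",
				 (unsigned) subid, (unsigned) xid);
	assert(n > 0 && (size_t) n < buflen);	/* fail loudly on truncation */
	return n;
}

/* Hypothetical helper: true when two generated GIDs would collide. */
static int
gids_clash(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}
```

With two subscriptions (OIDs 16389 and 16390) that both see the value 543
(presumably the same remote xid), the helper yields pg_gid_16389_543 and
pg_gid_16390_543, so the two prepared transactions no longer clash.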

----
[1] /messages/by-id/CALDaNm2ZnJeG23bE+gEOQEmXo8N+fs2g4=xuH2u6nNcX0s9Jjg@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v70-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 0c9313563bc871cd941658f573f7694bf5d2143f Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 7 Apr 2021 15:15:15 +1000
Subject: [PATCH v70] Add support for prepared transactions to built-in logical
  replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase is enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase is enabled.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         |  16 +-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  26 ++
 src/backend/access/transam/twophase.c              |  68 +++
 src/backend/catalog/pg_subscription.c              |  35 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 114 ++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 277 +++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |  13 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 200 +++++++--
 src/backend/replication/logical/worker.c           | 457 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 210 ++++++++--
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  72 +++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   2 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 111 +++--
 src/test/regress/sql/subscription.sql              |  27 ++
 src/test/subscription/t/021_twophase.pl            | 293 +++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 236 +++++++++++
 src/test/subscription/t/023_twophase_stream.pl     | 450 ++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 268 ++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 45 files changed, 2904 insertions(+), 194 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f103d91..ef97f40 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7622,6 +7622,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 5d049cd..97ac503 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this, users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 380be5f..addda1c 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specifies that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..2408e10 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed with <literal>copy_data = true</literal>
+   when the subscription has <literal>two_phase</literal> commit enabled. See
+   column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -239,6 +239,32 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, the decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+        </listitem>
+       </varlistentry>
       </variablelist></para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 89335b6..d75f052 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
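The matching rule used by LookupGXact above can be illustrated outside the backend. The following is only a minimal sketch under stated assumptions: SketchGXact and sketch_lookup_gxact are hypothetical names, the array stands in for TwoPhaseState->prepXacts, and the origin fields are assumed to be already in memory rather than read from the 2PC file or WAL under TwoPhaseStateLock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one entry of the prepared-transaction table. */
typedef struct SketchGXact
{
	bool		valid;				/* entry fully set up? */
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* end of PREPARE on the origin node */
	int64_t		origin_timestamp;	/* prepare time on the origin node */
} SketchGXact;

/*
 * Return true only if some valid entry matches the GID and also the origin
 * LSN and the origin timestamp. A GID match alone is not enough, because an
 * unrelated local prepared transaction may use the same GID.
 */
bool
sketch_lookup_gxact(const SketchGXact *xacts, size_t n, const char *gid,
					uint64_t prepare_end_lsn,
					int64_t origin_prepare_timestamp)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		const SketchGXact *x = &xacts[i];

		/* Ignore not-yet-valid entries and entries with another GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == origin_prepare_timestamp)
			return true;
	}

	return false;
}
```

Note that the loop keeps scanning after a GID-only match, mirroring the patch, so a later entry with the same GID and the right origin data is still found.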
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..2b4b699 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -449,6 +450,40 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	int			nkeys = 0;
+	ScanKeyData skey[2];
+	SysScanDesc scan;
+	bool		has_subrels = false;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[nkeys++],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, nkeys, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5f2541d..f134ff9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1249,7 +1249,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 517c8ed..1e4ccff 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica. See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -337,6 +366,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +392,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +461,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -507,7 +543,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. It is enabled once all the tables are
+				 * synced and ready. This avoids race conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -814,7 +859,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -871,7 +917,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +963,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +980,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1022,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1040,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1080,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 9aab713..262f1da 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
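The LSN handed to ReorderBufferFinishPrepared above decides whether a transaction's changes and its PREPARE still have to be sent together with COMMIT PREPARED. A minimal sketch of that decision follows, assuming (as the comment in ReorderBufferFinishPrepared suggests) that the transaction's recorded LSN at that point is the one of its PREPARE. SketchLsn and sketch_must_send_prepare_at_finish are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLsn;		/* stand-in for XLogRecPtr */

/*
 * Decide, at COMMIT PREPARED / ROLLBACK PREPARED time, whether the prepared
 * transaction still has to be replayed and its PREPARE sent.
 *
 * A prepare located before two_phase_at was skipped when it was first seen
 * (no consistent snapshot yet, or two-phase decoding not yet enabled), so it
 * has to be sent now. For a rollback nothing needs to be decoded.
 */
bool
sketch_must_send_prepare_at_finish(SketchLsn prepare_lsn,
								   SketchLsn two_phase_at,
								   bool is_commit)
{
	/* A prepared transaction is expected to carry a valid (non-zero) LSN. */
	assert(prepare_lsn != 0);

	return is_commit && prepare_lsn < two_phase_at;
}
```

With two_phase_at = 100, a prepare at LSN 90 is sent along with its commit, while a prepare at LSN 150 was already decoded at prepare time and only the COMMIT PREPARED is sent.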
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2f68036..e7939e5 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
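The guard in replorigin_advance shown above can be reduced to a few lines. This is a minimal sketch, not the backend code: SketchOriginState and sketch_origin_advance are hypothetical names, and locking and WAL logging are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the in-memory replication origin state. */
typedef struct SketchOriginState
{
	uint64_t	remote_lsn;		/* last remote LSN known to be applied */
} SketchOriginState;

/*
 * Advance the remembered remote LSN. An older value is ignored unless the
 * caller explicitly asks to go backward. Older values can show up, for
 * example, when a PREPARE is sent late, together with COMMIT PREPARED, after
 * other transactions have already committed.
 */
void
sketch_origin_advance(SketchOriginState *state, uint64_t remote_commit,
					  bool go_backward)
{
	assert(state != NULL);

	if (go_backward || state->remote_lsn < remote_commit)
		state->remote_lsn = remote_commit;
}
```

Keeping the progress value monotonic by default means a late, older LSN cannot move the origin's recorded position back to a point that has already been applied.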
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2a1f983..428f7cb 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -107,6 +107,283 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->preparetime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
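The PREPARE message layout written by logicalrep_write_prepare above (flags byte, three 64-bit fields, a 32-bit xid, then the NUL-terminated GID) can be exercised with a standalone round trip. This is only a sketch under stated assumptions: all names are hypothetical, SKETCH_MSG_PREPARE is a placeholder type byte, SKETCH_GIDSIZE is an assumed buffer size, and integers are written in network byte order as the pq_sendint* routines do. Unlike the patch, the reader here also consumes the type byte and bounds-checks the GID before copying it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE		200		/* assumed size of the GID buffer */
#define SKETCH_MSG_PREPARE	'P'		/* placeholder message-type byte */
#define SKETCH_FIXED_LEN	(2 + 8 + 8 + 8 + 4) /* type, flags, fixed fields */

typedef struct SketchPrepareData
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepareData;

/* Append the low n bytes of v in big-endian order at buf[*pos]. */
static void
put_be(uint8_t *buf, size_t *pos, uint64_t v, int n)
{
	for (int i = n - 1; i >= 0; i--)
		buf[(*pos)++] = (uint8_t) (v >> (8 * i));
}

/* Read an n-byte big-endian integer starting at buf[*pos]. */
static uint64_t
get_be(const uint8_t *buf, size_t *pos, int n)
{
	uint64_t	v = 0;

	for (int i = 0; i < n; i++)
		v = (v << 8) | buf[(*pos)++];
	return v;
}

/* Serialize d into buf; returns bytes written, or 0 if buf is too small. */
size_t
sketch_write_prepare(uint8_t *buf, size_t cap, const SketchPrepareData *d)
{
	size_t		gidlen = strlen(d->gid) + 1;	/* include the NUL */
	size_t		pos = 0;

	assert(d->gid[0] != '\0');	/* a prepared xact must have a GID */
	if (cap < SKETCH_FIXED_LEN + gidlen)
		return 0;

	buf[pos++] = SKETCH_MSG_PREPARE;
	buf[pos++] = 0;				/* flags, currently always zero */
	put_be(buf, &pos, d->prepare_lsn, 8);
	put_be(buf, &pos, d->end_lsn, 8);
	put_be(buf, &pos, (uint64_t) d->prepare_time, 8);
	put_be(buf, &pos, d->xid, 4);
	memcpy(buf + pos, d->gid, gidlen);
	return pos + gidlen;
}

/* Parse a message; returns 0 on success, -1 if it is malformed. */
int
sketch_read_prepare(const uint8_t *buf, size_t len, SketchPrepareData *d)
{
	size_t		pos = 0;
	size_t		gidlen;
	const uint8_t *nul;

	if (len < SKETCH_FIXED_LEN + 1 || buf[pos++] != SKETCH_MSG_PREPARE)
		return -1;
	if (buf[pos++] != 0)		/* unrecognized flags */
		return -1;

	d->prepare_lsn = get_be(buf, &pos, 8);
	d->end_lsn = get_be(buf, &pos, 8);
	d->prepare_time = (int64_t) get_be(buf, &pos, 8);
	d->xid = (uint32_t) get_be(buf, &pos, 4);
	if (d->prepare_lsn == 0 || d->end_lsn == 0)
		return -1;				/* LSNs must be set */

	/* Copy the GID only if it is terminated and fits the buffer. */
	nul = memchr(buf + pos, 0, len - pos);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + pos)) + 1;
	if (gidlen > SKETCH_GIDSIZE)
		return -1;
	memcpy(d->gid, buf + pos, gidlen);
	return 0;
}
```

The readers in the patch use strcpy into a pre-allocated buffer, which relies on the sender never producing a GID longer than that buffer. The explicit length check here is one way to make that assumption visible; whether the patch needs it is a judgment call for review.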
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 52d0628..92c7fa7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2680,7 +2680,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2706,12 +2706,13 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c5a8125..0daff13 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 0638f5c..a1bae34 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1054,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1140,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
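
A note on the tablesync.c part above: the rule that AllTablesyncsReady() and the PENDING check implement can be modelled outside the server. The following is only a minimal sketch under stated assumptions. The enum, the function names and the integer table counts are hypothetical stand-ins, not the catalog values or the real worker code. It mirrors just the return expression `has_subrels && !found_busy` and the PENDING -> ENABLED rule described in the patch comments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for LOGICALREP_TWOPHASE_STATE_*; not the catalog values. */
typedef enum
{
	TP_DISABLED,
	TP_PENDING,
	TP_ENABLED
} TwoPhaseStateSketch;

/*
 * Mirrors the return expression of AllTablesyncsReady():
 * false when the subscription has no tables, otherwise true only when
 * no table is still in a not-READY state.
 */
static bool
all_tablesyncs_ready_sketch(int total_tables, int not_ready_tables)
{
	bool		has_subrels = total_tables > 0;
	bool		found_busy = not_ready_tables > 0;

	return has_subrels && !found_busy;
}

/*
 * State seen after an apply worker (re)start: PENDING becomes ENABLED only
 * when every tablesync is READY; all other states are left untouched.
 */
static TwoPhaseStateSketch
next_two_phase_state_sketch(TwoPhaseStateSketch cur,
							int total_tables, int not_ready_tables)
{
	assert(not_ready_tables >= 0 && not_ready_tables <= total_tables);

	if (cur == TP_PENDING &&
		all_tablesyncs_ready_sketch(total_tables, not_ready_tables))
		return TP_ENABLED;

	return cur;
}
```

With this model, a subscription with zero tables stays PENDING, which is the case the patch keeps open so that ALTER SUBSCRIPTION ... REFRESH PUBLICATION still works.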
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 8da602d..b7075cf 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two-phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because, say, it
+ * was prior to the initial consistent point, but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ...
+ * REFRESH PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider: initially it was on and we
+ * received some prepare, then we turned it off; now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we could allow changing this option from off to on, but it is not
+ * clear that that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get a prepare of the same GID more than once in genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and the prepared transaction has operations on tables subscribed to
+ * those subscriptions. For such cases, if we use the GID sent by the publisher,
+ * one of the prepares will be successful and the others will fail, in which
+ * case the server will send them again. Now, this can lead to a deadlock if the
+ * user has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting of
+ * the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,11 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -720,6 +798,260 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.preparetime));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from the same node pointing to
+	 * publications on the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from the same node pointing to
+	 * publications on the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -917,30 +1249,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -948,7 +1271,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -963,7 +1286,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1038,6 +1361,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -1963,6 +2311,26 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2439,6 +2807,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2925,6 +3296,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(subid != InvalidOid);
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3097,9 +3482,47 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s.",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
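
A note on the GID scheme in the worker.c part above: TwoPhaseTransactionGid() builds the subscriber-side GID from the subscription oid and the remote xid with the format "pg_gid_%u_%u". The standalone sketch below reproduces only that formatting step so the effect is easy to see. The function name, the plain unsigned int parameters and the buffer-size macro are hypothetical stand-ins for Oid, TransactionId and GIDSIZE (the value 200 is an assumption about GIDSIZE, not taken from this patch).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed buffer size for the sketch; stands in for PostgreSQL's GIDSIZE. */
#define SKETCH_GIDSIZE 200

/*
 * Form a per-subscription GID in the supplied buffer, using the same format
 * string as the patch. Zero is treated as the invalid value for both inputs,
 * matching the two Asserts in TwoPhaseTransactionGid().
 */
static void
two_phase_gid_sketch(unsigned int subid, unsigned int xid,
					 char *gid, size_t szgid)
{
	assert(subid != 0);
	assert(xid != 0);

	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}
```

Two subscriptions on the same subscriber that receive the same remote xid therefore get different GIDs, because the subscription oid differs. That is the property the comment atop worker.c relies on to avoid the deadlock case.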
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -61,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -70,6 +82,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +163,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +179,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -170,10 +192,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,6 +273,16 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
 	}
@@ -322,6 +356,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +389,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +409,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +430,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +919,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -911,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
@@ -1236,3 +1344,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c..91224e0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9a0e380..92f3373 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 25717ce..2efedbf 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4368,6 +4369,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4411,9 +4413,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4434,6 +4443,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4459,6 +4469,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4486,6 +4498,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4527,6 +4540,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5340843..70c072d 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 440249f..7b56b3c 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6227,7 +6227,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6253,13 +6253,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 891997c..7b8ebba 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2765,7 +2765,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..076d8c5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,16 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. This has the same protocol
+ * version requirement as LOGICALREP_PROTO_STREAM_VERSION_NUM because these
+ * features were both introduced in the same release (PG14).
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -55,10 +62,15 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -115,6 +127,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +135,39 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information. The same structure holds the
+ * information for the begin_prepare, prepare, and commit prepared messages.
+ * For commit prepared, prepare_lsn and preparetime store the commit LSN and
+ * commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +175,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
@@ -174,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6c9f2c6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -643,7 +643,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..db68551 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..daf6ad4 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f2c8cc3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,45 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+-- but can alter streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,33 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+-- but can alter streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..364e6eb
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,293 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..76b224a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,236 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..89f3eb7
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,450 @@
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works when using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..eba1523
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,268 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b26e81d..5b30242 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1346,9 +1346,11 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

#301Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#300)
2 attachment(s)

Please find attached the latest patch set v71*

Differences from v70* are:

* Rebased to HEAD @ yesterday.

* Functionality of v71 is identical to v70, but the patch has been
split into two parts
0001 - 2PC core patch
0002 - adds 2PC support for "streaming" transactions

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v71-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 8baff96209b2907c330d508f2f8662f083cd8487 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 8 Apr 2021 16:53:15 +1000
Subject: [PATCH v71] Add support for prepared transactions to built-in logical
   replication.

To add support for streaming transactions at prepare time to the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.
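
The option rules listed above can be sketched as a small state model. This is
an illustrative sketch only, not code from the patch: the enum and all helper
names are hypothetical, and only the 'd'/'p'/'e' state codes and the three
restrictions (two_phase with streaming, streaming on a two-phase subscription,
refresh with copy_data once enabled) are taken from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/* State codes as documented for pg_subscription.subtwophasestate. */
typedef enum
{
	TWOPHASE_DISABLED = 'd',	/* two_phase was not requested */
	TWOPHASE_PENDING = 'p',		/* requested, waiting for initial table sync */
	TWOPHASE_ENABLED = 'e'		/* requested and active */
} TwoPhaseState;

/* Hypothetical helper: the state stored by CREATE SUBSCRIPTION. */
TwoPhaseState
initial_twophase_state(bool two_phase_requested)
{
	return two_phase_requested ? TWOPHASE_PENDING : TWOPHASE_DISABLED;
}

/* Hypothetical helper: pending becomes enabled once all tables are synced. */
TwoPhaseState
after_initial_sync(TwoPhaseState state)
{
	assert(state == TWOPHASE_DISABLED ||
		   state == TWOPHASE_PENDING ||
		   state == TWOPHASE_ENABLED);
	return (state == TWOPHASE_PENDING) ? TWOPHASE_ENABLED : state;
}

/* CREATE SUBSCRIPTION rejects two_phase = true with streaming = true. */
bool
create_options_ok(bool two_phase, bool streaming)
{
	return !(two_phase && streaming);
}

/* ALTER SUBSCRIPTION ... SET (streaming = true) needs two_phase disabled. */
bool
can_set_streaming(TwoPhaseState state, bool streaming)
{
	return !(streaming && state != TWOPHASE_DISABLED);
}

/* REFRESH with copy_data is rejected once two_phase is actually enabled. */
bool
can_refresh(TwoPhaseState state, bool copy_data)
{
	return !(state == TWOPHASE_ENABLED && copy_data);
}
```

Note that in this model a refresh with copy_data is still accepted while the
state is pending, which mirrors the checks in subscriptioncmds.c that test
only for the enabled state.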

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         |  16 +-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 +++++
 src/backend/catalog/pg_subscription.c              |  35 +++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 135 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 212 ++++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |  13 +-
 src/backend/replication/logical/snapbuild.c        |  33 ++-
 src/backend/replication/logical/tablesync.c        | 200 ++++++++++---
 src/backend/replication/logical/worker.c           | 325 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 ++++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  63 ++++
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   2 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 293 +++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 236 +++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 43 files changed, 2015 insertions(+), 178 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 2656786..cd6f064 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7632,6 +7632,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 5d049cd..97ac503 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this, users must refrain from
+       holding locks on catalog tables (via an explicit <command>LOCK</command>
+       command) in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 380be5f..addda1c 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..2408e10 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed with <literal>copy_data = true</literal>
+   when the subscription has <literal>two_phase</literal> commit enabled. See
+   column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, the decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 89335b6..d75f052 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
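
The matching rule used by LookupGXact above can be sketched in isolation. This
is a simplified model under stated assumptions, not the patch's code: the
struct below is a hypothetical stand-in for GlobalTransaction plus the 2PC
file header, and it omits locking and reading the 2PC state from disk or WAL.
It shows only that a GID match alone is not enough; origin_lsn and
origin_timestamp must match as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a prepared-transaction entry. */
typedef struct
{
	bool		valid;				/* entry is fully set up */
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* end LSN of the remote PREPARE */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
} PreparedXactSketch;

/*
 * Return true only if a valid entry matches on GID, origin LSN and origin
 * timestamp together. Entries with the same GID but a different origin are
 * skipped, and the scan continues.
 */
bool
lookup_prepared_sketch(const PreparedXactSketch *xacts, size_t n,
					   const char *gid, uint64_t prepare_end_lsn,
					   int64_t origin_prepare_timestamp)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		const PreparedXactSketch *x = &xacts[i];

		/* Ignore not-yet-valid entries and entries with another GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```
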
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..2b4b699 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -449,6 +450,40 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	int			nkeys = 0;
+	ScanKeyData skey[2];
+	SysScanDesc scan;
+	bool		has_subrels = false;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[nkeys++],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, nkeys, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 4d6b232..76824a7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1250,7 +1250,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 517c8ed..55deef8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables are in progress. See
+				 * comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -814,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +909,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +938,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +984,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1001,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1043,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1061,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1101,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 9aab713..262f1da 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparing of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2f68036..e7939e5 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2a1f983..9977eae 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -107,6 +107,218 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->preparetime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 52d0628..92c7fa7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2680,7 +2680,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2706,12 +2706,13 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c5a8125..0daff13 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because two-phase was not
+	 * previously enabled, or that are not covered by the initial snapshot,
+	 * need to be sent later along with commit prepared and they must be
+	 * before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 0638f5c..a1bae34 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * Once they all have, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1054,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1140,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY.
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 8da602d..45cfd62 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it can
+ * lead to an inconsistent replica. Consider, initially, it was on and we have
+ * received some prepare then we turn it off, now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we can allow changing this option from off to on but not sure if
+ * that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if the user has
+ * set synchronous_standby_names for all the subscriptions on the subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,9 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -720,6 +796,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.preparetime));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from the same node pointing to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1963,6 +2213,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2439,6 +2705,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2925,6 +3194,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3097,9 +3380,47 @@ ApplyWorkerMain(Datum main_arg)
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * in the tri-state value PENDING until all tablesyncs have reached
+		 * READY state. Only then can it become ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s.",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
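
The GID that the apply worker derives for a replicated prepared transaction
can be illustrated outside the backend. The following is only a minimal
sketch: form_two_phase_gid() is a hypothetical stand-in for
TwoPhaseTransactionGid(), and plain unsigned ints are assumed in place of Oid
and TransactionId. Only the "pg_gid_%u_%u" format string is taken from the
patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for TwoPhaseTransactionGid(). Plain unsigned ints
 * replace Oid and TransactionId so the sketch builds without PostgreSQL
 * headers.
 *
 * Returns what snprintf() returns, so a caller can detect truncation by
 * comparing the result against szgid.
 */
static int
form_two_phase_gid(unsigned int subid, unsigned int xid,
				   char *gid, size_t szgid)
{
	assert(gid != NULL && szgid > 0);

	/* Same format as the patch: pg_gid_<subscription oid>_<remote xid> */
	return snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}
```

Because the subscription OID is part of the identifier, two subscriptions
that replay the same remote xid would get distinct GIDs under this scheme.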
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option was passed in by
+		 * the plugin; whether to enable it is decided at a later point. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
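
The option checks that pgoutput performs for two_phase are spread across
parse_output_parameters() and pgoutput_startup(). As a rough sketch, they can
be collapsed into one pure function. Everything named below
(check_two_phase_option, TwoPhaseOptResult, PROTO_TWOPHASE_VERSION) is
hypothetical and exists only for illustration; the order of the checks and
the version value 2 follow the patch as posted.

```c
#include <stdbool.h>

/* Assumed to mirror LOGICALREP_PROTO_TWOPHASE_VERSION_NUM in this patch. */
#define PROTO_TWOPHASE_VERSION 2

/* Hypothetical result codes; the patch reports these cases via ereport(). */
typedef enum
{
	TWOPHASE_OPT_OFF,				/* option not requested */
	TWOPHASE_OPT_ON,				/* option accepted */
	TWOPHASE_ERR_WITH_STREAMING,	/* two_phase and streaming both set */
	TWOPHASE_ERR_PROTO_TOO_OLD,		/* proto_version below the minimum */
	TWOPHASE_ERR_PLUGIN_UNSUPPORTED /* decoding context lacks 2PC support */
} TwoPhaseOptResult;

static TwoPhaseOptResult
check_two_phase_option(bool two_phase, bool streaming,
					   int proto_version, bool ctx_supports_twophase)
{
	/* Checked first, while the options are being parsed. */
	if (two_phase && streaming)
		return TWOPHASE_ERR_WITH_STREAMING;

	/* The remaining checks correspond to the startup callback. */
	if (!two_phase)
		return TWOPHASE_OPT_OFF;
	if (proto_version < PROTO_TWOPHASE_VERSION)
		return TWOPHASE_ERR_PROTO_TOO_OLD;
	if (!ctx_supports_twophase)
		return TWOPHASE_ERR_PLUGIN_UNSUPPORTED;

	return TWOPHASE_OPT_ON;
}
```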
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
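
With the grammar change above, TWO_PHASE is accepted between the optional
TEMPORARY keyword and LOGICAL. A client could assemble such a command roughly
as sketched below. build_create_slot_cmd() is a hypothetical helper, not part
of the patch; it does no identifier quoting and leaves out the trailing
create_slot_opt_list, so treat it purely as an illustration of keyword order.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: formats a CREATE_REPLICATION_SLOT command for a
 * logical slot in the keyword order the grammar expects:
 *   CREATE_REPLICATION_SLOT slot [TEMPORARY] [TWO_PHASE] LOGICAL plugin
 *
 * No quoting or escaping is done here; real callers would need it.
 */
static int
build_create_slot_cmd(char *buf, size_t buflen, const char *slotname,
					  bool temporary, bool two_phase, const char *plugin)
{
	return snprintf(buf, buflen,
					"CREATE_REPLICATION_SLOT %s%s%s LOGICAL %s",
					slotname,
					temporary ? " TEMPORARY" : "",
					two_phase ? " TWO_PHASE" : "",
					plugin);
}
```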
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c..91224e0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9a0e380..92f3373 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d0ea489..5941868 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4368,6 +4369,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4411,9 +4413,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4434,6 +4443,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4459,6 +4469,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4486,6 +4498,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4527,6 +4540,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5340843..70c072d 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 52f7b2c..a39caac 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6234,7 +6234,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6260,13 +6260,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index d79d7b8..c5e86e3 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2765,7 +2765,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See the comments atop worker.c for more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
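
The tri-state handling at apply worker start can be summarised as a small
transition function. This is a sketch under assumptions: the function name
and both boolean parameters are hypothetical, and the "no tables means stay
PENDING" rule is modelled from the comment in ApplyWorkerMain() rather than
from the body of AllTablesyncsReady(), which is not shown in this part of
the patch. The state characters are the ones defined above.

```c
#include <stdbool.h>

/* Same characters as the LOGICALREP_TWOPHASE_STATE_* macros in the patch. */
#define TWOPHASE_STATE_DISABLED 'd'
#define TWOPHASE_STATE_PENDING	'p'
#define TWOPHASE_STATE_ENABLED	'e'

/*
 * Hypothetical model of the decision made when the apply worker starts
 * streaming: PENDING is promoted to ENABLED only once the subscription has
 * tables and every tablesync has reached READY. All other states are left
 * unchanged.
 */
static char
twophase_state_at_worker_start(char state, bool has_tables,
							   bool all_tablesyncs_ready)
{
	if (state == TWOPHASE_STATE_PENDING && has_tables && all_tablesyncs_ready)
		return TWOPHASE_STATE_ENABLED;

	return state;
}
```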
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..de4dc1d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -27,10 +28,16 @@
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
  * support for streaming large transactions.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. This has the same protocol
+ * version requirement as LOGICALREP_PROTO_STREAM_VERSION_NUM because these
+ * features were both introduced in the same release (PG14).
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 2
 #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
 
 /*
@@ -55,6 +62,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +126,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +134,39 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for begin_prepare, prepare, and commit prepared
+ * messages. For commit prepared, prepare_lsn and preparetime store the commit
+ * LSN and commit time.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +174,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
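
The four new message type bytes added to LogicalRepMsgType can be mapped back
to readable names when inspecting a protocol stream. A minimal sketch follows.
twophase_msg_name() is a hypothetical helper and not part of the patch; only
the byte values 'b', 'P', 'K' and 'r' are taken from the enum above.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns a readable name for the two-phase message
 * type bytes introduced by this patch, or NULL for any other byte.
 */
static const char *
twophase_msg_name(char msgtype)
{
	switch (msgtype)
	{
		case 'b':				/* LOGICAL_REP_MSG_BEGIN_PREPARE */
			return "BEGIN PREPARE";
		case 'P':				/* LOGICAL_REP_MSG_PREPARE */
			return "PREPARE";
		case 'K':				/* LOGICAL_REP_MSG_COMMIT_PREPARED */
			return "COMMIT PREPARED";
		case 'r':				/* LOGICAL_REP_MSG_ROLLBACK_PREPARED */
			return "ROLLBACK PREPARED";
		default:
			return NULL;
	}
}
```

Note that the bytes are case-sensitive: 'c' is already taken by
LOGICAL_REP_MSG_STREAM_COMMIT, so a lookup of that byte falls through to the
default branch here.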
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6c9f2c6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -643,7 +643,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..db68551 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..daf6ad4 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
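Reviewer note (not part of the patch): the tests further down show three letters for the subscription's two-phase state: 'd' in the \dRs+ output when two_phase is not requested, 'p' right after CREATE SUBSCRIPTION ... WITH (two_phase = true), and 'e', which the TAP tests poll for in subtwophasestate. A minimal sketch of how those states appear to relate, assuming that the pending state is promoted to enabled only once every tablesync is ready (which is what the pairing of AllTablesyncsReady() and UpdateTwoPhaseState() suggests). The macro and function names below are hypothetical and are not the patch's code.

```c
#include <assert.h>
#include <stdbool.h>

/* State letters as seen in the \dRs+ output and the TAP polling query.
 * The macro names are illustrative only. */
#define TWOPHASE_DISABLED 'd'
#define TWOPHASE_PENDING  'p'
#define TWOPHASE_ENABLED  'e'

/* Hypothetical helper: state recorded when the subscription is created. */
char
twophase_initial_state(bool two_phase_requested)
{
	return two_phase_requested ? TWOPHASE_PENDING : TWOPHASE_DISABLED;
}

/* Hypothetical helper: the state after one check by the apply worker.
 * Assumption: only a pending subscription can change, and only once all
 * tablesyncs are ready. */
char
twophase_next_state(char current, bool all_tablesyncs_ready)
{
	assert(current == TWOPHASE_DISABLED ||
		   current == TWOPHASE_PENDING ||
		   current == TWOPHASE_ENABLED);

	if (current == TWOPHASE_PENDING && all_tablesyncs_ready)
		return TWOPHASE_ENABLED;

	return current;
}
```

Under this reading, a subscription created with connect = false stays at 'p' because no tablesync can ever complete, which matches the expected output in subscription.out.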
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..a10231e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
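Reviewer note (not part of the patch): the expected output above exercises two option checks, one at CREATE SUBSCRIPTION time and one at ALTER SUBSCRIPTION time. A minimal sketch of the rules those error messages imply, assuming the ALTER check fires for any state other than 'd' (the test only demonstrates it for 'p'). The function names are hypothetical; only the message strings are taken from the expected output.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check for CREATE SUBSCRIPTION: returns the error text, or
 * NULL when the option combination is acceptable. */
const char *
check_create_options(bool streaming, bool two_phase)
{
	if (streaming && two_phase)
		return "two_phase = true and streaming = true are mutually exclusive options";
	return NULL;
}

/* Hypothetical check for ALTER SUBSCRIPTION ... SET (streaming = ...).
 * Assumption: any state other than 'd' counts as two-phase enabled. */
const char *
check_alter_streaming(char twophase_state, bool new_streaming)
{
	if (twophase_state != 'd' && new_streaming)
		return "cannot set streaming = true for two-phase enabled subscription";
	return NULL;
}
```

Setting streaming = false stays legal in every state under this sketch, so only turning streaming on is blocked.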
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..364e6eb
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,293 @@
+# Test logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..76b224a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,236 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index efb9811..0e1d74a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1346,9 +1346,11 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

Attachment: v71-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 0941f8439cae67d7b7d296b1c33fdcf68f376e8a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 9 Apr 2021 10:10:04 +1000
Subject: [PATCH v71] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
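
The byte layout that logicalrep_write_stream_prepare() and
logicalrep_read_stream_prepare() in this patch agree on can be sketched
outside the server as below. This is only an illustration under stated
assumptions, not the patch's code: the demo_* names, DemoStreamPrepare and
DEMO_GIDSIZE are hypothetical stand-ins, the fields are assumed to go out in
network byte order as the pq_sendint* routines do, and, unlike the server
(where apply_dispatch consumes the message type before calling the reader),
the demo reader also checks the leading 'p' byte itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_GIDSIZE 200		/* assumption: stand-in for GIDSIZE */
#define DEMO_FIXED_LEN (1 + 4 + 1 + 8 + 8 + 8 + 4)	/* bytes before the gid */

/* Hypothetical mirror of the fields carried by a STREAM PREPARE message. */
typedef struct DemoStreamPrepare
{
	uint32_t	xid;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	char		gid[DEMO_GIDSIZE];
} DemoStreamPrepare;

/* Store the low n bytes of v at buf, most significant byte first. */
static size_t
put_be(uint8_t *buf, uint64_t v, size_t n)
{
	for (size_t i = 0; i < n; i++)
		buf[i] = (uint8_t) (v >> (8 * (n - 1 - i)));
	return n;
}

/* Load an n-byte big-endian integer from buf. */
static uint64_t
get_be(const uint8_t *buf, size_t n)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < n; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
size_t
demo_write_stream_prepare(uint8_t *buf, size_t cap, const DemoStreamPrepare *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		need = DEMO_FIXED_LEN + gidlen;
	size_t		off = 0;

	if (cap < need)
		return 0;

	buf[off++] = 'p';			/* LOGICAL_REP_MSG_STREAM_PREPARE */
	off += put_be(buf + off, m->xid, 4);	/* transaction ID */
	buf[off++] = 0;				/* flags, currently always zero */
	off += put_be(buf + off, m->prepare_lsn, 8);
	off += put_be(buf + off, m->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += put_be(buf + off, m->xid, 4);	/* xid is sent a second time */
	memcpy(buf + off, m->gid, gidlen);
	off += gidlen;

	assert(off == need);
	return off;
}

/* Returns 0 on success, -1 if the message is malformed or truncated. */
int
demo_read_stream_prepare(const uint8_t *buf, size_t len,
						 uint32_t *stream_xid, DemoStreamPrepare *m)
{
	size_t		off = 1;
	size_t		gidlen;
	const uint8_t *nul;

	if (len <= DEMO_FIXED_LEN || buf[0] != 'p')
		return -1;

	*stream_xid = (uint32_t) get_be(buf + off, 4);
	off += 4;
	if (buf[off++] != 0)		/* unrecognized flags */
		return -1;
	m->prepare_lsn = get_be(buf + off, 8);
	off += 8;
	m->end_lsn = get_be(buf + off, 8);
	off += 8;
	m->prepare_time = (int64_t) get_be(buf + off, 8);
	off += 8;
	m->xid = (uint32_t) get_be(buf + off, 4);
	off += 4;

	/* The gid must be NUL-terminated and fit the fixed-size buffer. */
	nul = memchr(buf + off, 0, len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off)) + 1;
	if (gidlen > sizeof(m->gid))
		return -1;
	memcpy(m->gid, buf + off, gidlen);
	return 0;
}
```

The bounded gid copy in the demo reader is a choice made for the sketch; it
keeps an over-long or unterminated gid from overrunning the buffer.
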
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  65 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 450 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 268 ++++++++++++
 10 files changed, 949 insertions(+), 76 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
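
The new 023_twophase_stream.pl test expects 3334 rows on the subscriber
after COMMIT PREPARED. That figure follows from the test's own statements:
keys 1 and 2 already exist, keys 3..5000 are inserted, the UPDATE leaves the
row count unchanged, and the DELETE removes every key divisible by 3. A
small sketch of that arithmetic, with demo_rows_after_delete as a
hypothetical helper name used only here:

```c
#include <assert.h>

/*
 * Count the keys in [lo, hi] that survive "DELETE ... WHERE mod(a,3) = 0",
 * i.e. the keys that are not divisible by 3.
 */
long
demo_rows_after_delete(long lo, long hi)
{
	long		kept = 0;

	assert(lo <= hi);

	for (long a = lo; a <= hi; a++)
	{
		if (a % 3 != 0)
			kept++;
	}
	return kept;
}
```

Calling demo_rows_after_delete(1, 5000) gives 5000 - 1666 = 3334, where
1666 is the number of multiples of 3 up to 5000.
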

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 55deef8..1e4ccff 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -909,12 +894,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9977eae..428f7cb 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -319,6 +319,71 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (bounded copy into the pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 45cfd62..b7075cf 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -322,6 +322,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -970,6 +972,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1167,30 +1249,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1198,7 +1271,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1213,7 +1286,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1288,6 +1361,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2229,6 +2327,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index de4dc1d..076d8c5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -69,7 +69,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -237,4 +238,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index a10231e..f2c8cc3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  cannot alter two_phase option
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..89f3eb7
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,450 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..eba1523
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,268 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#302Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#150)

On Mon, Dec 14, 2020 at 8:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

2.
+ /*
+ * Flags are determined from the state of the transaction. We know we
+ * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+ * it's already marked as committed then it has to be COMMIT PREPARED (and
+ * likewise for abort / ROLLBACK PREPARED).
+ */
+ if (rbtxn_commit_prepared(txn))
+ flags = LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+ else
+ flags = LOGICALREP_IS_PREPARE;

I don't like clubbing three different operations under one message
LOGICAL_REP_MSG_PREPARE. It looks awkward to use new flags
RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED in ReorderBuffer so
that we can recognize these operations in corresponding callbacks. I
think setting any flag in ReorderBuffer should not dictate the
behavior in callbacks. Then also there are few things that are not
common to those APIs like the patch has an Assert to say that the txn
is marked with prepare flag for all three operations which I think is
not true for Rollback Prepared after the restart. We don't ensure to
set the Prepare flag if the Rollback Prepare happens after the
restart. Then, we have to introduce separate flags to distinguish
prepare/commit prepared/rollback prepared to distinguish multiple
operations sent as protocol messages. Also, all these operations are
mutually exclusive so it will be better to send separate messages for
each of these and I have changed it accordingly in the attached patch.
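
As a rough illustration of the point about mutually exclusive
operations, here is a minimal standalone C sketch in which each
operation gets its own message type and its own handler branch, rather
than one message carrying a flag. The names SketchMsgType,
SketchTxnState and sketch_apply_message are hypothetical and exist only
for this example; they are not the identifiers or wire values used by
the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message types for this sketch (not the real protocol values). */
typedef enum SketchMsgType
{
	SKETCH_MSG_PREPARE = 1,
	SKETCH_MSG_COMMIT_PREPARED,
	SKETCH_MSG_ROLLBACK_PREPARED
} SketchMsgType;

/* Hypothetical subscriber-side view of one two-phase transaction. */
typedef struct SketchTxnState
{
	bool		prepared;
	bool		committed;
	bool		rolled_back;
} SketchTxnState;

/*
 * Dispatch on the message type alone. Returns true if the type was
 * recognized, false otherwise.
 */
static bool
sketch_apply_message(SketchMsgType type, SketchTxnState *txn)
{
	assert(txn != NULL);

	switch (type)
	{
		case SKETCH_MSG_PREPARE:
			txn->prepared = true;
			return true;

		case SKETCH_MSG_COMMIT_PREPARED:
			txn->prepared = false;
			txn->committed = true;
			return true;

		case SKETCH_MSG_ROLLBACK_PREPARED:
			/*
			 * Deliberately no assertion that txn->prepared is set: as
			 * noted above, the prepare may not have been seen if the
			 * rollback arrives after a restart.
			 */
			txn->prepared = false;
			txn->rolled_back = true;
			return true;
	}

	return false;				/* unrecognized message type */
}
```

With this shape, a handler only needs the checks that apply to its own
operation, which is the property being argued for here.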

While looking at the two-phase protocol messages (with a view to
documenting them) I noticed that the messages for
LOGICAL_REP_MSG_PREPARE, LOGICAL_REP_MSG_COMMIT_PREPARED,
LOGICAL_REP_MSG_ROLLBACK_PREPARED all send and receive a flags byte
which *always* has the value 0.

----------
e.g.
uint8 flags = 0;
pq_sendbyte(out, flags);

and
/* read flags */
uint8 flags = pq_getmsgbyte(in);
if (flags != 0)
elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
----------

I think this patch version v31 is where the flags became redundant.

Is there some reason why these unused flags still remain in the protocol code?

Do you have any objection to me removing them?
Otherwise, it might seem strange to document a flag that has no function.
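
For readers without the source at hand, the pattern in question can be
reproduced in a small standalone C sketch: the writer always emits a
zero flags byte and the reader rejects anything else. SketchBuf,
sketch_sendbyte and sketch_getbyte are hypothetical stand-ins for the
real buffer type and the pq_sendbyte/pq_getmsgbyte calls quoted above,
and the reader returns -1 where the real code would raise an ERROR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size message buffer used only in this sketch. */
typedef struct SketchBuf
{
	uint8_t		data[16];
	size_t		len;			/* number of bytes written */
	size_t		cursor;			/* index of the next byte to read */
} SketchBuf;

/* Append one byte to the buffer. */
static void
sketch_sendbyte(SketchBuf *buf, uint8_t b)
{
	assert(buf->len < sizeof(buf->data));
	buf->data[buf->len++] = b;
}

/* Consume one byte from the buffer. */
static uint8_t
sketch_getbyte(SketchBuf *buf)
{
	assert(buf->cursor < buf->len);
	return buf->data[buf->cursor++];
}

/* Writer side: the flags byte is reserved and currently always 0. */
static void
sketch_write_commit_prepared(SketchBuf *out)
{
	uint8_t		flags = 0;

	sketch_sendbyte(out, flags);
}

/*
 * Reader side: returns 0 if the flags byte is 0, or -1 if it carries
 * any value the reader does not recognize.
 */
static int
sketch_read_commit_prepared(SketchBuf *in)
{
	uint8_t		flags = sketch_getbyte(in);

	return (flags != 0) ? -1 : 0;
}
```

The byte carries no information today, which is why documenting it
looks odd; its only observable effect is that a nonzero value is
rejected.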

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#303Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#302)

On Fri, Apr 9, 2021 at 12:33 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, Dec 14, 2020 at 8:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

2.
+ /*
+ * Flags are determined from the state of the transaction. We know we
+ * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+ * it's already marked as committed then it has to be COMMIT PREPARED (and
+ * likewise for abort / ROLLBACK PREPARED).
+ */
+ if (rbtxn_commit_prepared(txn))
+ flags = LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+ else
+ flags = LOGICALREP_IS_PREPARE;

I don't like clubbing three different operations under one message
LOGICAL_REP_MSG_PREPARE. It looks awkward to use new flags
RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED in ReorderBuffer so
that we can recognize these operations in corresponding callbacks. I
think setting any flag in ReorderBuffer should not dictate the
behavior in callbacks. Then also there are few things that are not
common to those APIs like the patch has an Assert to say that the txn
is marked with prepare flag for all three operations which I think is
not true for Rollback Prepared after the restart. We don't ensure to
set the Prepare flag if the Rollback Prepare happens after the
restart. Then, we have to introduce separate flags to distinguish
prepare/commit prepared/rollback prepared to distinguish multiple
operations sent as protocol messages. Also, all these operations are
mutually exclusive so it will be better to send separate messages for
each of these and I have changed it accordingly in the attached patch.

While looking at the two-phase protocol messages (with a view to
documenting them) I noticed that the messages for
LOGICAL_REP_MSG_PREPARE, LOGICAL_REP_MSG_COMMIT_PREPARED,
LOGICAL_REP_MSG_ROLLBACK_PREPARED are all sending and receiving flag
bytes which *always* has a value 0.

----------
e.g.
uint8 flags = 0;
pq_sendbyte(out, flags);

and
/* read flags */
uint8 flags = pq_getmsgbyte(in);
if (flags != 0)
elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
----------

I think this patch version v31 is where the flags became redundant.

I think this has been kept for future use similar to how we have in
logicalrep_write_commit. So, I think we can keep them unused for now.
We can document it similarly to the commit message ('C') [1].

[1]: https://www.postgresql.org/docs/devel/protocol-logicalrep-message-formats.html
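
To make the "reserved for future use" pattern concrete, here is a minimal
standalone sketch of it. This is not the actual proto.c code: the MsgBuf
type and the write_flags()/read_flags() helpers are hypothetical stand-ins
for StringInfo and the pq_sendbyte()/pq_getmsgbyte() calls quoted above.
The writer always emits a zero flags byte and the reader rejects anything
else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for StringInfo: a tiny fixed-size message buffer. */
typedef struct MsgBuf
{
	uint8_t		data[16];
	size_t		len;			/* bytes written so far */
	size_t		cursor;			/* next byte to read */
} MsgBuf;

/* Writer side: the flags byte is reserved, so it is always sent as 0. */
static void
write_flags(MsgBuf *out)
{
	uint8_t		flags = 0;

	assert(out->len < sizeof(out->data));
	out->data[out->len++] = flags;
}

/*
 * Reader side: returns 0 on success, -1 if the flags byte is missing or
 * carries a value this reader does not understand (anything non-zero).
 */
static int
read_flags(MsgBuf *in, uint8_t *flags)
{
	if (in->cursor >= in->len)
		return -1;
	*flags = in->data[in->cursor++];
	return (*flags == 0) ? 0 : -1;
}
```

With this shape, a later release could start assigning meaning to
individual bits, but any reader built from the current code would reject
them until it is updated.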

--
With Regards,
Amit Kapila.

#304Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#303)

On Fri, Apr 9, 2021 at 6:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Apr 9, 2021 at 12:33 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, Dec 14, 2020 at 8:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

2.
+ /*
+ * Flags are determined from the state of the transaction. We know we
+ * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+ * it's already marked as committed then it has to be COMMIT PREPARED (and
+ * likewise for abort / ROLLBACK PREPARED).
+ */
+ if (rbtxn_commit_prepared(txn))
+ flags = LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+ else
+ flags = LOGICALREP_IS_PREPARE;

I don't like clubbing three different operations under one message
LOGICAL_REP_MSG_PREPARE. It looks awkward to use new flags
RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED in ReorderBuffer so
that we can recognize these operations in corresponding callbacks. I
think setting any flag in ReorderBuffer should not dictate the
behavior in callbacks. Then also there are few things that are not
common to those APIs like the patch has an Assert to say that the txn
is marked with prepare flag for all three operations which I think is
not true for Rollback Prepared after the restart. We don't ensure to
set the Prepare flag if the Rollback Prepare happens after the
restart. Then, we have to introduce separate flags to distinguish
prepare/commit prepared/rollback prepared to distinguish multiple
operations sent as protocol messages. Also, all these operations are
mutually exclusive so it will be better to send separate messages for
each of these and I have changed it accordingly in the attached patch.

While looking at the two-phase protocol messages (with a view to
documenting them) I noticed that the messages for
LOGICAL_REP_MSG_PREPARE, LOGICAL_REP_MSG_COMMIT_PREPARED,
LOGICAL_REP_MSG_ROLLBACK_PREPARED are all sending and receiving flag
bytes which *always* has a value 0.

----------
e.g.
uint8 flags = 0;
pq_sendbyte(out, flags);

and
/* read flags */
uint8 flags = pq_getmsgbyte(in);
if (flags != 0)
elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
----------

I think this patch version v31 is where the flags became redundant.

I think this has been kept for future use similar to how we have in
logicalrep_write_commit. So, I think we can keep them unused for now.
We can document it similarly to the commit message ('C') [1].

[1] - https://www.postgresql.org/docs/devel/protocol-logicalrep-message-formats.html

Yeah, we can do that. And if nobody else gives feedback about this
then I will do exactly like you suggested.

But I don't understand why we are even trying to "future proof" the
protocol by keeping redundant flags lying around on the off-chance
that maybe one day they could be useful.

Isn't that what the protocol version number is for? e.g. If there did
become some future need for some flags then just add them at that time
and bump the protocol version.

And, even if we wanted to, I think we cannot use these existing flags
in future without bumping the protocol version, because the current
protocol docs say that flag value must be zero!
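
As a rough illustration of the alternative, here is a standalone sketch in
which the flags byte only exists from some later protocol version onwards.
Everything named here is an assumption made up for the example:
PROTO_VERSION_WITH_FLAGS, HYPOTHETICAL_FLAG_X and both helper functions
are not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical version number and flag bit, for illustration only. */
#define PROTO_VERSION_WITH_FLAGS	4	/* imagined future version */
#define HYPOTHETICAL_FLAG_X			0x01

/*
 * Does a message sent under the negotiated protocol version carry a flags
 * byte at all?  Older versions simply omit it.
 */
static bool
message_has_flags(int proto_version)
{
	return proto_version >= PROTO_VERSION_WITH_FLAGS;
}

/*
 * Validate a flags value for the negotiated version.  Only bits that are
 * defined for that version are accepted.
 */
static bool
flags_are_valid(int proto_version, uint8_t flags)
{
	uint8_t		known = 0;

	assert(proto_version >= 1);
	if (message_has_flags(proto_version))
		known |= HYPOTHETICAL_FLAG_X;

	/* Any bit outside the known set makes the value invalid. */
	return (flags & (uint8_t) ~known) == 0;
}
```

Under this scheme nothing is reserved in advance; the version bump is the
only signal a peer needs to know whether flags are present.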

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#305Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#301)
2 attachment(s)

Please find attached the latest patch set v72*

Differences from v71* are:

* Rebased to HEAD @ yesterday.

* The Replication protocol version requirement for two-phase message
support is bumped to version 3

* Documentation of protocol messages has been updated for two-phase
messages similar to [1]

----
[1]: https://github.com/postgres/postgres/commit/15c1a9d9cb7604472d4823f48b64cdc02c441194
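
For readers who want to see the documented layout in action, below is a
minimal standalone sketch of decoding the Prepare message as described in
the updated protocol.sgml: Byte1('P'), Int8 flags, Int64 prepare LSN, Int64
end LSN, Int64 timestamp, Int32 xid, then the null-terminated GID, with
integers in network byte order. The PrepareMsg struct and the read_be() and
parse_prepare() functions are hypothetical names for this example only;
the real reader lives in proto.c and works on a StringInfo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory form of a decoded Prepare message. */
typedef struct PrepareMsg
{
	uint8_t		flags;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	const char *gid;			/* points into the input buffer */
} PrepareMsg;

/* Read an n-byte big-endian (network order) unsigned integer. */
static uint64_t
read_be(const uint8_t *p, size_t n)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < n; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Parse a Prepare message.  Returns 0 on success, -1 if it is malformed. */
static int
parse_prepare(const uint8_t *buf, size_t len, PrepareMsg *msg)
{
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;	/* bytes before the GID */

	assert(buf != NULL && msg != NULL);
	if (len <= fixed || buf[0] != 'P')
		return -1;
	if (memchr(buf + fixed, '\0', len - fixed) == NULL)
		return -1;				/* GID is not null-terminated */

	msg->flags = buf[1];
	if (msg->flags != 0)
		return -1;				/* documented as "must be 0" */
	msg->prepare_lsn = read_be(buf + 2, 8);
	msg->end_lsn = read_be(buf + 10, 8);
	msg->prepare_time = (int64_t) read_be(buf + 18, 8);
	msg->xid = (uint32_t) read_be(buf + 26, 4);
	msg->gid = (const char *) (buf + fixed);
	return 0;
}
```

Commit Prepared ('K') uses the same field order, while Rollback Prepared
('A') carries one extra Int64 timestamp, so a sketch for those would differ
only in the type byte and offsets.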

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v72-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From aaeb397c352267dbeddcb8e98f71d827cc5c1027 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 12 Apr 2021 19:58:47 +1000
Subject: [PATCH v72] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 313 ++++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  35 +++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 135 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 212 +++++++++++++
 src/backend/replication/logical/reorderbuffer.c    |  13 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 200 ++++++++++--
 src/backend/replication/logical/worker.c           | 341 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  65 +++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   2 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 293 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 236 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 43 files changed, 2318 insertions(+), 190 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 2656786..cd6f064 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7632,6 +7632,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 5d049cd..97ac503 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..48044dd 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decode of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be  decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Commit Prepared messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction 
+   contains Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7364,6 +7385,278 @@ Stream Abort
 
 </variablelist>
 
+<!-- ==================== TWO_PHASE Messages ==================== -->
+
+<para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies this message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies this message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies this message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the commit transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('A')</term>
+<listitem><para>
+                Identifies this message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the rollback transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
 <para>
 
 The following message parts are shared by the above messages.
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..2408e10 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed with <literal>copy_data = true</literal>
+   when the subscription has <literal>two_phase</literal> commit enabled. See
+   column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on publisher is decoded as normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b658134..a8725e6 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2460,3 +2460,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is	around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..2b4b699 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -449,6 +450,40 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	int			nkeys = 0;
+	ScanKeyData skey[2];
+	SysScanDesc scan;
+	bool		has_subrels = false;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[nkeys++],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, nkeys, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 451db2e..9abe43d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1264,7 +1264,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
 
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 517c8ed..55deef8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * missing of transactions and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables are in progress. See
+				 * comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -814,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +909,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +938,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +984,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1001,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1043,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1061,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1101,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7924581..99f2afc 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * prepare of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4f6e87f..954dbb5 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -209,7 +209,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -435,10 +435,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -542,10 +551,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -612,7 +632,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2a1f983..9977eae 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -107,6 +107,218 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->preparetime = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions. In
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->preparetime = pq_getmsgint64(in);
+	rollback_data->rollbacktime = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 52d0628..92c7fa7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2680,7 +2680,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2706,12 +2706,13 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c5a8125..0daff13 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because two-phase was not
+	 * previously enabled, or that are not covered by the initial snapshot,
+	 * need to be sent later along with commit prepared, and they must be
+	 * before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 0638f5c..a1bae34 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1054,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1140,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY.
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fb3ba5c..5c1ae6b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider, initially, it was on and we
+ * have received some prepare then we turn it off, now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we can allow changing this option from off to on but
+ * not sure if that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepared
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,9 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -720,6 +796,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.preparetime));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions on the same node pointing to publications
+	 * on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.preparetime))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollbacktime;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1970,6 +2220,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2446,6 +2712,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2932,6 +3201,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3098,15 +3381,67 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(wrconn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c..91224e0 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9a0e380..92f3373 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d0ea489..5941868 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4368,6 +4369,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4411,9 +4413,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4434,6 +4443,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4459,6 +4469,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4486,6 +4498,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4527,6 +4540,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5340843..70c072d 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index fdc2a89..682b5ff 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 26ac786..8bab45c 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2765,7 +2765,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7dfcb7b..72f049b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -88,11 +88,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..540d8ee 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,39 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information. This same structure is used to
+ * hold the information for begin_prepare, prepare, and commit prepared
+ * transaction. prepare_lsn and preparetime are used to store commit lsn and
+ * commit time for commit prepared.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz preparetime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and preparetime are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz preparetime;
+	TimestampTz rollbacktime;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +172,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 565a961..6c9f2c6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -643,7 +643,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..db68551 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..daf6ad4 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..a10231e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..364e6eb
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,293 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..76b224a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,236 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c7aff67..2b89efc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1347,9 +1347,11 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

Attachment: v72-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From a9f45cce14a34b571830f0312b35657678a034b3 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 13 Apr 2021 12:40:19 +1000
Subject: [PATCH v72] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  70 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  65 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 450 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 268 ++++++++++++
 11 files changed, 1017 insertions(+), 78 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
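
For anyone reading the protocol.sgml hunk in this patch, here is a rough standalone
sketch of the Stream Prepare field layout as that hunk documents it: Byte1('p'),
Int8 flags, three Int64 values (prepare LSN, end LSN, timestamp), Int32 xid, then a
NUL-terminated GID, all in network byte order. Everything named here
(StreamPrepareMsg, SKETCH_GIDSIZE, put_be, get_be, stream_prepare_encode,
stream_prepare_decode) is hypothetical and for illustration only; the patch itself
goes through StringInfo and the pq_send*/pq_getmsg* helpers. Also note that
logicalrep_write_stream_prepare() in this patch sends the xid once more, directly
after the message type byte, so this sketch follows the documented layout and is
not wire-compatible with that function as posted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed GID buffer size for this sketch; not taken from the patch. */
#define SKETCH_GIDSIZE 200

/* Hypothetical in-memory form of a Stream Prepare message. */
typedef struct StreamPrepareMsg
{
	uint8_t		flags;			/* currently unused, must be 0 */
	uint64_t	prepare_lsn;	/* LSN of the stream prepare */
	uint64_t	end_lsn;		/* end LSN of the transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* transaction id */
	char		gid[SKETCH_GIDSIZE];	/* user-defined GID */
} StreamPrepareMsg;

/* Store the low nbytes of v at buf, most significant byte first. */
static size_t
put_be(uint8_t *buf, uint64_t v, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return nbytes;
}

/* Read nbytes from buf as a big-endian unsigned integer. */
static uint64_t
get_be(const uint8_t *buf, size_t nbytes)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
static size_t
stream_prepare_encode(const StreamPrepareMsg *m, uint8_t *buf, size_t buflen)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		need = 1 + 1 + 8 + 8 + 8 + 4 + gidlen;
	size_t		off = 0;

	assert(m != NULL && buf != NULL);
	if (buflen < need)
		return 0;

	buf[off++] = 'p';			/* message type */
	buf[off++] = m->flags;
	off += put_be(buf + off, m->prepare_lsn, 8);
	off += put_be(buf + off, m->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += put_be(buf + off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Returns 0 on success, -1 if the input is malformed. */
static int
stream_prepare_decode(const uint8_t *buf, size_t len, StreamPrepareMsg *m)
{
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;	/* bytes before the GID */
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	assert(buf != NULL && m != NULL);
	if (len <= fixed || buf[0] != 'p')
		return -1;

	m->flags = buf[1];
	if (m->flags != 0)			/* mirrors the "unrecognized flags" check */
		return -1;

	m->prepare_lsn = get_be(buf + 2, 8);
	m->end_lsn = get_be(buf + 10, 8);
	m->prepare_time = (int64_t) get_be(buf + 18, 8);
	m->xid = (uint32_t) get_be(buf + 26, 4);

	/* Unlike a bare strcpy, refuse a GID that is unterminated or too long. */
	gid = buf + fixed;
	nul = memchr(gid, '\0', len - fixed);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= sizeof(m->gid))
		return -1;
	memcpy(m->gid, gid, gidlen + 1);
	return 0;
}
```

A caller would fill a StreamPrepareMsg, pass it to stream_prepare_encode() with a
buffer of at least 30 + strlen(gid) + 1 bytes, and hand the result to
stream_prepare_decode() on the other side. The bounds check on the GID is a choice
made for this sketch, not something the patch does.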

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 48044dd..6a6a6d8 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Commit Prepared messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7388,7 +7388,8 @@ Stream Abort
 <!-- ==================== TWO_PHASE Messages ==================== -->
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared,
+Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7655,6 +7656,71 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the stream prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the stream prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 55deef8..1e4ccff 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -909,12 +894,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 9977eae..428f7cb 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -319,6 +319,71 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->preparetime = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 5c1ae6b..f3534bb 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -322,6 +322,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -970,6 +972,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.preparetime;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1167,30 +1249,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1198,7 +1271,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1213,7 +1286,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1288,6 +1361,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2236,6 +2334,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 540d8ee..97a60d1 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -235,4 +236,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index a10231e..f2c8cc3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  cannot alter two_phase option
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..89f3eb7
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,450 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (the delete at step 3 does nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..eba1523
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,268 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#306Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#305)
2 attachment(s)

Please find attached the latest patch set v73*

Differences from v72* are:

* Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly)

* Minor documentation correction for protocol messages for Commit Prepared ('K')

* Non-functional code tidy (mostly proto.c) to reduce the overloading of
the same member names with different meanings for prepare/commit times.
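
For anyone who wants to check the Stream Prepare ('p') layout without
applying the patch, below is a rough standalone sketch of the field order
that logicalrep_write_stream_prepare() in v73-0002 emits. It is not code
from the patch: the struct, the helper names (put_be, get_be, sketch_*),
and the 200-byte GID buffer are made up for illustration, and unlike the
patch's reader it bounds-checks the GID. Note that the xid is sent twice,
once right after the message byte and once just before the GID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for LogicalRepPreparedTxnData; not the real struct. */
typedef struct StreamPrepareSketch
{
	uint32_t	xid;
	uint8_t		flags;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	char		gid[200];		/* buffer size is an assumption of this sketch */
} StreamPrepareSketch;

/* type byte + xid + flags + 3 x int64 + xid, not counting the GID string */
#define SKETCH_FIXED_LEN (1 + 4 + 1 + 8 + 8 + 8 + 4)

/* Store the low nbytes of v in network (big-endian) byte order. */
static size_t
put_be(uint8_t *buf, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return (size_t) nbytes;
}

/* Load an nbytes-wide big-endian integer. */
static uint64_t
get_be(const uint8_t *buf, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Encode in the order the patch's writer uses; returns bytes written. */
static size_t
sketch_write_stream_prepare(uint8_t *buf, const StreamPrepareSketch *m)
{
	size_t		off = 0;
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */

	assert(gidlen <= sizeof(m->gid));

	buf[off++] = 'p';			/* LOGICAL_REP_MSG_STREAM_PREPARE */
	off += put_be(buf + off, m->xid, 4);
	buf[off++] = m->flags;
	off += put_be(buf + off, m->prepare_lsn, 8);
	off += put_be(buf + off, m->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += put_be(buf + off, m->xid, 4);	/* xid again, as in the patch */
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Decode; returns 0 on success, -1 on a malformed message. */
static int
sketch_read_stream_prepare(const uint8_t *buf, size_t len,
						   StreamPrepareSketch *m)
{
	const uint8_t *nul;
	size_t		off = 0;
	size_t		gidlen;

	if (len < SKETCH_FIXED_LEN + 1 || buf[0] != 'p')
		return -1;
	off = 1;
	m->xid = (uint32_t) get_be(buf + off, 4);
	off += 4;
	m->flags = buf[off++];
	if (m->flags != 0)			/* the patch raises an ERROR here */
		return -1;
	m->prepare_lsn = get_be(buf + off, 8);
	off += 8;
	m->end_lsn = get_be(buf + off, 8);
	off += 8;
	m->prepare_time = (int64_t) get_be(buf + off, 8);
	off += 8;
	m->xid = (uint32_t) get_be(buf + off, 4);	/* second copy overwrites */
	off += 4;

	/* Bounds-checked GID copy (the patch uses strcpy into a fixed buffer). */
	nul = memchr(buf + off, '\0', len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off)) + 1;
	if (gidlen > sizeof(m->gid))
		return -1;
	memcpy(m->gid, buf + off, gidlen);
	return 0;
}
```

With the GID 'test_prepared_tab' used by the TAP tests, the encoded message
is 34 fixed bytes plus the 18-byte NUL-terminated GID. The reader returns -1
for a nonzero flags byte, which loosely mirrors the elog(ERROR) in
logicalrep_read_stream_prepare().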

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v73-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 3523b0afc122c0d64cc8123494802de98a18e540 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 20 Apr 2021 15:25:25 +1000
Subject: [PATCH v73] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  70 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  65 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 450 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 268 ++++++++++++
 11 files changed, 1017 insertions(+), 78 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 248ed37..bf45b44 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Commit Prepared messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7388,7 +7388,8 @@ Stream Abort
 <!-- ==================== TWO_PHASE Messages ==================== -->
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared,
+Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7655,6 +7656,71 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare of a large in-progress (streamed) transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the stream prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the stream prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 55deef8..1e4ccff 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -909,12 +894,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index d0e0d19..1798266 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -319,6 +319,71 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9a2157a..fedf2a6 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -322,6 +322,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -970,6 +972,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to deadlock
+	 * when multiple subscriptions from the same node point to publications on
+	 * the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1167,30 +1249,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1198,7 +1271,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1213,7 +1286,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1288,6 +1361,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2236,6 +2334,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index a10231e..f2c8cc3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  cannot alter two_phase option
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..89f3eb7
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,450 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..eba1523
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,268 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v73-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 3e9c13b1452d0367e48043afda3cf66e1b196a98 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 20 Apr 2021 14:20:58 +1000
Subject: [PATCH v73] Add support for prepared transactions to built-in logical
  replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update the PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
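Reviewer note (not part of the commit message): the Begin Prepare layout
documented in protocol.sgml below can be read back with a few lines of C.
This is only a minimal sketch for following the field order. The names
BeginPrepareMsg and parse_begin_prepare are hypothetical and do not appear
in the patch, the 200-byte GID buffer is an arbitrary bound chosen for the
illustration, and the sketch assumes the usual protocol conventions of
network byte order integers and a NUL-terminated String.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the documented Begin Prepare fields. */
typedef struct BeginPrepareMsg
{
	uint64_t	prepare_lsn;	/* Int64: LSN of the prepare */
	uint64_t	end_lsn;		/* Int64: end LSN of the transaction */
	int64_t		prepare_time;	/* Int64: microseconds since 2000-01-01 */
	uint32_t	xid;			/* Int32: transaction id */
	char		gid[200];		/* String: illustrative size bound only */
} BeginPrepareMsg;

/* Read a big-endian (network order) 64-bit integer. */
static uint64_t
read_be64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian (network order) 32-bit integer. */
static uint32_t
read_be32(const unsigned char *p)
{
	uint32_t	v = 0;

	for (int i = 0; i < 4; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Parse one Begin Prepare message: Byte1('b'), Int64, Int64, Int64, Int32,
 * String. Returns 0 on success and -1 if the buffer is malformed.
 */
static int
parse_begin_prepare(const unsigned char *buf, size_t len, BeginPrepareMsg *out)
{
	const size_t fixed = 1 + 8 + 8 + 8 + 4; /* bytes before the GID */
	const unsigned char *gid;
	const unsigned char *nul;
	size_t		gidlen;

	assert(buf != NULL && out != NULL);

	if (len <= fixed || buf[0] != 'b')
		return -1;

	out->prepare_lsn = read_be64(buf + 1);
	out->end_lsn = read_be64(buf + 9);
	out->prepare_time = (int64_t) read_be64(buf + 17);
	out->xid = read_be32(buf + 25);

	/* The GID must be NUL-terminated within the remaining bytes. */
	gid = buf + fixed;
	nul = memchr(gid, '\0', len - fixed);
	if (nul == NULL)
		return -1;

	gidlen = (size_t) (nul - gid);
	if (gidlen >= sizeof(out->gid))
		return -1;

	memcpy(out->gid, gid, gidlen + 1);
	return 0;
}
```

A caller would pass the payload of one replication message and check the
return value before using the struct. The Prepare, Commit Prepared and
Rollback Prepared messages carry an extra Int8 flags byte after the type
byte, so they would need their own offsets.
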
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 313 ++++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  35 +++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 135 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 218 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 200 ++++++++++--
 src/backend/replication/logical/worker.c           | 341 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 293 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 236 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 43 files changed, 2341 insertions(+), 200 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 1345791..2f8eb17 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7640,6 +7640,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index f61bcfc..4ca0b96 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this, users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..248ed37 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Commit Prepared messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7364,6 +7385,278 @@ Stream Abort
 
 </variablelist>
 
+<!-- ==================== TWO_PHASE Messages ==================== -->
+
+<para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies this message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies this message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies this message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the commit transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('A')</term>
+<listitem><para>
+                Identifies this message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the rollback transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
 <para>
 
 The following message parts are shared by the above messages.
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..2408e10 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed with <literal>copy_data = true</literal>
+   when the subscription has <literal>two_phase</literal> commit enabled. See
+   column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b658134..a8725e6 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2460,3 +2460,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..2b4b699 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -449,6 +450,40 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	int			nkeys = 0;
+	ScanKeyData skey[2];
+	SysScanDesc scan;
+	bool		has_subrels = false;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[nkeys++],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, nkeys, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e96dd73..289b4b2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1266,5 +1266,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 517c8ed..55deef8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. It is enabled once all the tables are
+				 * synced and ready. This avoids race conditions such as
+				 * prepared transactions being skipped because their changes
+				 * are filtered out by should_apply_changes_for_rel() while
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -814,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +909,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +938,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +984,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1001,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1043,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1061,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1101,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
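For reference, the command string that the hunk above ends up building can be sketched in standalone C. This is only an illustration under assumptions, not the walreceiver code: sketch_build_create_slot() is a hypothetical helper, it uses snprintf() instead of StringInfo, always emits the logical variant, and leaves out the snapshot-action clause. The leading CREATE_REPLICATION_SLOT "name" part is not shown in the hunk and is written here from the replication protocol syntax.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the slot-creation command in the same order the
 * patch appends the keywords (TEMPORARY, then TWO_PHASE, then LOGICAL).
 * Returns 0 on success, -1 if the buffer is too small.
 */
int
sketch_build_create_slot(char *buf, size_t buflen, const char *slotname,
						 bool temporary, bool two_phase)
{
	int			n;

	assert(buf != NULL && slotname != NULL);

	n = snprintf(buf, buflen,
				 "CREATE_REPLICATION_SLOT \"%s\"%s%s LOGICAL pgoutput",
				 slotname,
				 temporary ? " TEMPORARY" : "",
				 two_phase ? " TWO_PHASE" : "");

	/* snprintf reports the length it wanted; reject truncated output */
	return (n >= 0 && (size_t) n < buflen) ? 0 : -1;
}
```

Note that with this patch both callers pass two_phase = false, so the TWO_PHASE keyword would only show up once a caller asks for it explicitly.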
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7924581..99f2afc 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 35b0c67..54a2cac 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -209,7 +209,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -435,10 +435,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is enabled
+	 * at the time of slot creation, or when the two_phase option is given
+	 * at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -542,10 +551,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is enabled
+	 * at the time of slot creation, or when the two_phase option is given
+	 * at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -612,7 +632,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2a1f983..d0e0d19 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 }
 
 /*
@@ -107,6 +107,218 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -842,7 +1054,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 }
 
 /*
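To make the PREPARE message layout from logicalrep_write_prepare() and logicalrep_read_prepare() easier to review, here is a standalone round-trip sketch. It is not PostgreSQL code: the Sketch* names, SKETCH_GIDSIZE and the put_be()/get_be() helpers are hypothetical, the message-type byte is left out, and an invalid LSN is assumed to be 0. Unlike the strcpy() calls in the patch, the reader below checks the gid length against the destination buffer before copying, which may be worth considering for the real read functions as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* hypothetical size of the gid buffer */

typedef struct SketchPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Append an nbytes-wide big-endian integer (like pq_sendint64/pq_sendint32). */
size_t
put_be(unsigned char *buf, size_t off, uint64_t v, int nbytes)
{
	for (int i = nbytes - 1; i >= 0; i--)
		buf[off++] = (unsigned char) (v >> (8 * i));
	return off;
}

/* Read an nbytes-wide big-endian integer and advance the offset. */
uint64_t
get_be(const unsigned char *buf, size_t *off, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[(*off)++];
	return v;
}

/* Body layout: flags, prepare_lsn, end_lsn, prepare_time, xid, gid + NUL. */
size_t
sketch_write_prepare(unsigned char *buf, const SketchPrepare *p)
{
	size_t		off = 0;
	size_t		len;

	assert(buf != NULL && p != NULL);
	len = strlen(p->gid) + 1;

	buf[off++] = 0;				/* flags, currently always 0 */
	off = put_be(buf, off, p->prepare_lsn, 8);
	off = put_be(buf, off, p->end_lsn, 8);
	off = put_be(buf, off, (uint64_t) p->prepare_time, 8);
	off = put_be(buf, off, p->xid, 4);
	memcpy(buf + off, p->gid, len);
	return off + len;
}

/* Returns 0 on success, -1 on a malformed or oversized message. */
int
sketch_read_prepare(const unsigned char *buf, size_t buflen, SketchPrepare *p)
{
	size_t		off = 0;
	size_t		gidlen;
	const unsigned char *nul;

	assert(buf != NULL && p != NULL);

	/* fixed part is 1 + 8 + 8 + 8 + 4 bytes, plus at least the gid's NUL */
	if (buflen < 30 || buf[off++] != 0)
		return -1;

	p->prepare_lsn = get_be(buf, &off, 8);
	p->end_lsn = get_be(buf, &off, 8);
	p->prepare_time = (int64_t) get_be(buf, &off, 8);
	p->xid = (uint32_t) get_be(buf, &off, 4);
	if (p->prepare_lsn == 0 || p->end_lsn == 0)
		return -1;				/* LSNs must be set */

	/* bounded copy: the gid must be terminated and must fit the buffer */
	nul = memchr(buf + off, '\0', buflen - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off));
	if (gidlen >= sizeof(p->gid))
		return -1;
	memcpy(p->gid, buf + off, gidlen + 1);
	return 0;
}
```

The caller of sketch_write_prepare() is expected to supply a buffer of at least 29 + SKETCH_GIDSIZE bytes.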
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5cb484f..341bef5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2546,7 +2546,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->u_op_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2637,7 +2637,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->u_op_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2684,7 +2684,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->u_op_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2701,7 +2701,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2720,19 +2720,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->u_op_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2750,12 +2751,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->u_op_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->u_op_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c5a8125..0daff13 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because previously
+	 * two-phase was not enabled or are not covered by the initial snapshot need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
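The way two_phase_at is set in CreateDecodingContext() and later consulted in ReorderBufferFinishPrepared() can be summarised with a small standalone sketch. It is a simplification under assumptions: SketchSlot, SketchLsn and both functions are hypothetical stand-ins, "plugin_supports" stands for the initial value of ctx->twophase, and persisting the slot (ReplicationSlotMarkDirty/ReplicationSlotSave) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t SketchLsn;

typedef struct SketchSlot
{
	bool		two_phase;		/* slot already marked for two-phase decoding */
	SketchLsn	two_phase_at;	/* LSN at which two-phase decoding was enabled */
} SketchSlot;

/*
 * At streaming start: two-phase decoding is used if the plugin supports it
 * and either the option was given now or the slot was already marked. The
 * first time it is enabled on an existing slot, remember start_lsn.
 */
bool
sketch_enable_two_phase(SketchSlot *slot, bool plugin_supports,
						bool opt_given, SketchLsn start_lsn)
{
	bool		twophase;

	assert(slot != NULL);
	twophase = plugin_supports && (opt_given || slot->two_phase);

	if (twophase && !slot->two_phase)
	{
		slot->two_phase = true;
		slot->two_phase_at = start_lsn;
	}
	return twophase;
}

/*
 * At COMMIT PREPARED: the prepare has to be replayed together with the
 * commit if it lies before the point where two-phase decoding was enabled.
 * Rollbacks of such transactions need no replay.
 */
bool
sketch_replay_prepare_at_commit(SketchLsn prepare_final_lsn,
								SketchLsn two_phase_at, bool is_commit)
{
	return prepare_final_lsn < two_phase_at && is_commit;
}
```

So a transaction prepared at LSN 100 on a slot where two-phase was enabled at LSN 200 would be sent in full at COMMIT PREPARED time, while one prepared at LSN 300 was already sent at prepare time.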
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 0638f5c..a1bae34 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1054,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1140,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY.
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
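
For readers following the tri-state logic, here is a minimal standalone
sketch (not part of the patch) of how the return value of
AllTablesyncsReady feeds the PENDING -> ENABLED transition done at apply
worker start-up. The enum, the function names and the integer table counts
are hypothetical stand-ins; the real code works on table_states_not_ready
and the pg_subscription catalog.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum
{
	SKETCH_2PC_DISABLED,
	SKETCH_2PC_PENDING,
	SKETCH_2PC_ENABLED
} Sketch2pcState;

/*
 * Mirrors the return expression of AllTablesyncsReady: false when the
 * subscription has no tables, otherwise true only if nothing is not-READY.
 */
static bool
sketch_all_tablesyncs_ready(int n_tables, int n_not_ready)
{
	bool		has_subrels = n_tables > 0;
	bool		found_busy = n_not_ready > 0;

	return has_subrels && !found_busy;
}

/*
 * State of the subscription after apply worker start-up. Only a PENDING
 * subscription whose tables are all READY is promoted to ENABLED.
 */
static Sketch2pcState
sketch_state_after_startup(Sketch2pcState cur, int n_tables, int n_not_ready)
{
	assert(n_not_ready >= 0 && n_not_ready <= n_tables);

	if (cur == SKETCH_2PC_PENDING &&
		sketch_all_tablesyncs_ready(n_tables, n_not_ready))
		return SKETCH_2PC_ENABLED;

	return cur;
}
```

Note that a subscription with zero tables stays PENDING in this sketch,
which matches the "no tables" rule stated in the function comment above.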
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fb3ba5c..9a2157a 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because, say, it
+ * was prior to the initial consistent point, but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker, as the sync location for it
+ * will already be ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions, the subscription two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_sync_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider: initially it was on and we
+ * received some prepare, then we turned it off; now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we can allow changing this option from off to on but not sure if
+ * that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can legitimately get a prepare of the same GID more than once when
+ * multiple subscriptions are defined for publications on the same server and
+ * the prepared transaction has operations on tables subscribed to by those
+ * subscriptions. For such cases, if we use the GID sent by the publisher, one
+ * of the prepares will be successful and the others will fail, in which case
+ * the server will send them again. Now, this can lead to a deadlock if the user
+ * has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting of
+ * the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,9 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -720,6 +796,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server as that can lead to deadlock
+	 * when multiple subscriptions on the same node point to publications on
+	 * the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1970,6 +2220,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2446,6 +2712,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2932,6 +3201,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3098,15 +3381,67 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(wrconn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
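
To make the subscriber-side GID scheme from the worker.c header comment
concrete, here is a small standalone sketch (not part of the patch). The
typedefs and SKETCH_GIDSIZE are assumed stand-ins for PostgreSQL's Oid,
TransactionId and GIDSIZE, and sketch_twophase_gid is a hypothetical name;
only the "pg_gid_%u_%u" format string is taken from the patch itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed stand-ins for PostgreSQL's Oid and TransactionId. */
typedef uint32_t SketchOid;
typedef uint32_t SketchXid;

/* Assumed buffer size for the sketch; the patch uses GIDSIZE. */
#define SKETCH_GIDSIZE 200

/*
 * Build "pg_gid_<subid>_<xid>" in the caller-supplied buffer. Because the
 * subscription oid is part of the name, two subscriptions applying the same
 * remote xid end up with different GIDs.
 */
static void
sketch_twophase_gid(SketchOid subid, SketchXid xid, char *gid, size_t szgid)
{
	assert(subid != 0);			/* 0 plays the role of an invalid oid */
	assert(xid != 0);			/* 0 plays the role of an invalid xid */

	snprintf(gid, szgid, "pg_gid_%u_%u", (unsigned) subid, (unsigned) xid);
}
```

With this scheme, two subscriptions (say oids 16394 and 16395, values picked
only for illustration) that both receive the prepare for remote xid 731 would
prepare under "pg_gid_16394_731" and "pg_gid_16395_731" respectively, so
neither prepare collides with the other.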
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point of time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
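
The option checks that pgoutput performs for two_phase can be summarised in
a standalone sketch (not part of the patch). The enum, the function name and
the boolean plugin_supports_2pc parameter are hypothetical; the value 3 for
the minimum protocol version is taken from the FIXME comment in worker.c,
and the order of checks follows parse_output_parameters first, then
pgoutput_startup.

```c
#include <assert.h>
#include <stdbool.h>

/* Protocol version that carries two-phase messages, per the worker.c FIXME. */
#define SKETCH_PROTO_TWOPHASE_VERSION 3

/* Hypothetical result codes standing in for the ereport(ERROR, ...) calls. */
typedef enum
{
	SKETCH_OPT_OK,
	SKETCH_OPT_ERR_EXCLUSIVE,	/* two_phase and streaming both requested */
	SKETCH_OPT_ERR_PROTO_TOO_OLD,	/* proto_version below the required one */
	SKETCH_OPT_ERR_NO_PLUGIN_SUPPORT	/* decoding context lacks two-phase */
} SketchOptResult;

static SketchOptResult
sketch_check_two_phase(bool two_phase, bool streaming,
					   int proto_version, bool plugin_supports_2pc)
{
	assert(proto_version > 0);

	/* Checked while parsing the options. */
	if (two_phase && streaming)
		return SKETCH_OPT_ERR_EXCLUSIVE;

	/* Nothing more to validate when two_phase was not requested. */
	if (!two_phase)
		return SKETCH_OPT_OK;

	/* Checked at start-up, once the options are known. */
	if (proto_version < SKETCH_PROTO_TWOPHASE_VERSION)
		return SKETCH_OPT_ERR_PROTO_TOO_OLD;
	if (!plugin_supports_2pc)
		return SKETCH_OPT_ERR_NO_PLUGIN_SUPPORT;

	return SKETCH_OPT_OK;
}
```

The sketch returns a code instead of raising an error purely so that it can
be exercised outside the server; the patch itself reports each failure with
ereport(ERROR, ...).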
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index f61b163..fb33a77 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9a0e380..92f3373 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e397b76..afa5622 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4368,6 +4369,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4411,9 +4413,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4434,6 +4443,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4459,6 +4469,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4486,6 +4498,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4527,6 +4540,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5340843..70c072d 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3e39fdb..920f083 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index cfd0a84..111e90e 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2777,7 +2777,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
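Reviewer note on the tri-state above: a minimal sketch, assuming only what this patch shows, namely the three state characters and the pg_dump rule that any state other than 'd' is dumped as "two_phase = on". The helper names dump_wants_two_phase() and demo_two_phase_dump() are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <string.h>

/* State characters as defined in pg_subscription.h by this patch. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'

/*
 * Hypothetical helper (not in the patch): returns 1 when a dumped
 * subscription should carry "two_phase = on". It mirrors the strcmp()
 * against a one-character "disabled" string used in dumpSubscription().
 */
static int
dump_wants_two_phase(const char *subtwophasestate)
{
	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};

	return strcmp(subtwophasestate, two_phase_disabled) != 0;
}

/* Usage: only the disabled state suppresses the option in the dump. */
static void
demo_two_phase_dump(void)
{
	assert(!dump_wants_two_phase("d"));
	assert(dump_wants_two_phase("p"));	/* pending is dumped as "on" */
	assert(dump_wants_two_phase("e"));
}
```

Under this rule a subscription still in the pending state is dumped the same way as an enabled one, which matches the regression output where two_phase = true with connect = false shows 'p'.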
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7dfcb7b..72f049b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -88,11 +88,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
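Reviewer note on LogicalRepRollbackPreparedTxnData: the comment explains that the gid alone cannot identify the prepared transaction downstream, so the prepare end LSN and prepare time are compared as well. A minimal sketch of that check, assuming simplified stand-in types (uint64_t for XLogRecPtr, int64_t for TimestampTz) and a hypothetical PreparedEntry record with a matches_prepared() helper. None of these names come from the patch, and the LookupGXact() declared in twophase.h may work differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the PostgreSQL type */
typedef int64_t TimestampTz;	/* stand-in for the PostgreSQL type */

/* Hypothetical record of a prepared transaction held on the subscriber. */
typedef struct PreparedEntry
{
	const char *gid;
	XLogRecPtr	prepare_end_lsn;
	TimestampTz prepare_time;
} PreparedEntry;

/*
 * Hypothetical helper: the rollback applies only if the gid, the prepare
 * end LSN and the prepare timestamp all match the local entry.
 */
static int
matches_prepared(const PreparedEntry *entry, const char *gid,
				 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time)
{
	return strcmp(entry->gid, gid) == 0 &&
		entry->prepare_end_lsn == prepare_end_lsn &&
		entry->prepare_time == prepare_time;
}

/* Usage: a local transaction that merely reuses the gid is not matched. */
static void
demo_matches_prepared(void)
{
	PreparedEntry local = {"gid1", 100, 5000};

	assert(matches_prepared(&local, "gid1", 100, 5000));
	assert(!matches_prepared(&local, "gid1", 200, 5000));
}
```

Comparing all three fields is what lets the subscriber skip a rollback for a prepare it never received, even when an unrelated local transaction happens to use the same gid.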
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bfab830..bfd3423 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -311,7 +311,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} u_op_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -650,7 +654,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..db68551 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..daf6ad4 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..a10231e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..364e6eb
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,293 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..76b224a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,236 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c7aff67..2b89efc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1347,9 +1347,11 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

#307Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#306)
2 attachment(s)

On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v73*

Differences from v72* are:

* Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly)

* Minor documentation correction for protocol messages for Commit Prepared ('K')

* Non-functional code tidy (mostly proto.c) to avoid overloading the
same member names with different meanings for prepare/commit times.

Please find attached a re-posting of patch set v73*

This is the same as yesterday's v73 but with a contrib module compile
error fixed.

(I have confirmed make check-world is OK for this patch set)

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v73-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 7ad53bc50111d5e8812ce8582e867af6af71bbcc Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 21 Apr 2021 15:33:37 +1000
Subject: [PATCH v73] Add support for prepared transactions to built-in logical
   replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.
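
As a rough illustration of the "two_phase" subscription option described
above, the three subtwophasestate codes documented in this patch ('d', 'p',
'e') can be read as a small state function. This is only a sketch under the
assumption that "enabled once the initial data sync is over" is the whole
rule; the type and function names below are hypothetical and are not taken
from the patch.

```c
#include <stdbool.h>

/*
 * Hypothetical names. Only the letters come from the patch: they match the
 * subtwophasestate codes described in the catalogs.sgml change.
 */
typedef enum TwoPhaseState
{
	TWOPHASE_DISABLED = 'd',	/* two_phase was not requested */
	TWOPHASE_PENDING = 'p',		/* requested, but pending enablement */
	TWOPHASE_ENABLED = 'e'		/* requested, and enabled */
} TwoPhaseState;

/*
 * Assumed rule, paraphrasing the commit message: a subscription that asked
 * for two_phase stays pending until the initial data sync is over.
 */
static TwoPhaseState
next_twophase_state(bool requested, bool initial_sync_done)
{
	if (!requested)
		return TWOPHASE_DISABLED;
	return initial_sync_done ? TWOPHASE_ENABLED : TWOPHASE_PENDING;
}
```

The new TAP tests rely on the same progression: they poll pg_subscription
until no subscription has a subtwophasestate other than 'e' before issuing
any PREPARE TRANSACTION.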

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 313 ++++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  35 +++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 135 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 218 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 200 ++++++++++--
 src/backend/replication/logical/worker.c           | 341 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 293 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 236 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 44 files changed, 2347 insertions(+), 206 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..9393c85 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 1345791..2f8eb17 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7640,6 +7640,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index f61bcfc..4ca0b96 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this, users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..248ed37 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Commit Prepared messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7364,6 +7385,278 @@ Stream Abort
 
 </variablelist>
 
+<!-- ==================== TWO_PHASE Messages ==================== -->
+
+<para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies this message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies this message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies this message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the commit transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('A')</term>
+<listitem><para>
+                Identifies this message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the rollback transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
 <para>
 
 The following message parts are shared by the above messages.
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..2408e10 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed with <literal>copy_data = true</literal>
+   when the subscription has <literal>two_phase</literal> commit enabled. See
+   column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b658134..a8725e6 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2460,3 +2460,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..2b4b699 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -449,6 +450,40 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	int			nkeys = 0;
+	ScanKeyData skey[2];
+	SysScanDesc scan;
+	bool		has_subrels = false;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[nkeys++],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, nkeys, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e96dd73..289b4b2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1266,5 +1266,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 517c8ed..55deef8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("cannot alter two_phase option")));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race conditions such as prepared
+				 * transactions being skipped because their changes are not
+				 * applied, owing to checks in should_apply_changes_for_rel()
+				 * while tablesync for the corresponding tables is in progress.
+				 * See comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -814,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +909,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +938,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +984,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1001,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1043,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1061,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1101,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 150000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7924581..99f2afc 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 35b0c67..54a2cac 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -209,7 +209,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -435,10 +435,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when it is given at the
+	 * start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -542,10 +551,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -612,7 +632,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2a1f983..d0e0d19 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 }
 
 /*
@@ -107,6 +107,218 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -842,7 +1054,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 }
 
 /*
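
A note on the PREPARE message layout in the proto.c hunk above, since the writer and reader have to agree on it byte for byte. The following is only a standalone sketch, not code from the patch: it assumes the usual pqformat behaviour (integers in network byte order, strings NUL-terminated), and SketchPrepare, SKETCH_GIDSIZE, put_be, get_be and the sketch_* functions are hypothetical names invented for illustration. It covers the message body after the message-type byte: flags, prepare_lsn, end_lsn, prepare_time, xid, gid. Unlike the strcpy in the patch, the sketch checks the gid length before copying.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical buffer size for this sketch; not taken from the patch. */
#define SKETCH_GIDSIZE 200

/* Fixed part: flags(1) + prepare_lsn(8) + end_lsn(8) + time(8) + xid(4). */
#define SKETCH_FIXED_LEN 29

/* Hypothetical stand-in for LogicalRepPreparedTxnData. */
typedef struct SketchPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Store the low nbytes of v at buf, most significant byte first. */
static void
put_be(uint8_t *buf, uint64_t v, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
}

/* Read nbytes from buf as a big-endian unsigned integer. */
static uint64_t
get_be(const uint8_t *buf, size_t nbytes)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Encode the message body; returns bytes written, or 0 if buf is too small. */
size_t
sketch_write_prepare(uint8_t *buf, size_t cap, const SketchPrepare *p)
{
	size_t		gidlen = strlen(p->gid) + 1;	/* include the NUL */

	if (cap < SKETCH_FIXED_LEN + gidlen)
		return 0;
	buf[0] = 0;					/* flags, currently always zero */
	put_be(buf + 1, p->prepare_lsn, 8);
	put_be(buf + 9, p->end_lsn, 8);
	put_be(buf + 17, (uint64_t) p->prepare_time, 8);
	put_be(buf + 25, p->xid, 4);
	memcpy(buf + SKETCH_FIXED_LEN, p->gid, gidlen);
	return SKETCH_FIXED_LEN + gidlen;
}

/* Decode the message body; returns 0 on success, -1 if it is malformed. */
int
sketch_read_prepare(const uint8_t *buf, size_t len, SketchPrepare *p)
{
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	if (len <= SKETCH_FIXED_LEN || buf[0] != 0)
		return -1;				/* too short, or unrecognized flags */
	p->prepare_lsn = get_be(buf + 1, 8);
	p->end_lsn = get_be(buf + 9, 8);
	p->prepare_time = (int64_t) get_be(buf + 17, 8);
	p->xid = (uint32_t) get_be(buf + 25, 4);
	if (p->prepare_lsn == 0 || p->end_lsn == 0)
		return -1;				/* mirrors the InvalidXLogRecPtr checks */

	/* The gid must be NUL-terminated inside the message and fit the buffer. */
	gid = buf + SKETCH_FIXED_LEN;
	nul = memchr(gid, 0, len - SKETCH_FIXED_LEN);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid) + 1;
	if (gidlen > sizeof(p->gid))
		return -1;
	memcpy(p->gid, gid, gidlen);
	return 0;
}
```

A caller would fill a SketchPrepare, pass it to sketch_write_prepare, and hand the resulting bytes to sketch_read_prepare. The explicit length check on the gid is a design choice of the sketch, made because the reader cannot otherwise trust that the sender stayed within the pre-allocated buffer.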
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5cb484f..341bef5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2546,7 +2546,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->u_op_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2637,7 +2637,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->u_op_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2684,7 +2684,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->u_op_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2701,7 +2701,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2720,19 +2720,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->u_op_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2750,12 +2751,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->u_op_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->u_op_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c5a8125..0daff13 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 0638f5c..a1bae34 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * Once all tablesyncs are READY, we restart the apply worker and (if the
+	 * conditions are still ok) the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1054,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1140,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
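
The restart condition in process_syncing_tables_for_apply, combined with the return value of AllTablesyncsReady, can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the patch's code: the enum and both function names are hypothetical, and plain table counts stand in for the catalog lookups that FetchTableStates performs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the subscription's two_phase tri-state. */
typedef enum SketchTwoPhaseState
{
	SKETCH_TWOPHASE_DISABLED,
	SKETCH_TWOPHASE_PENDING,
	SKETCH_TWOPHASE_ENABLED
} SketchTwoPhaseState;

/*
 * Mirrors "has_subrels && !found_busy": a subscription with no tables is
 * never reported as ready, so its state stays PENDING.
 */
bool
sketch_all_tablesyncs_ready(int total_tables, int not_ready_tables)
{
	bool		has_subrels = total_tables > 0;
	bool		found_busy = not_ready_tables > 0;

	return has_subrels && !found_busy;
}

/*
 * The apply worker exits (to be restarted) only while the state is PENDING
 * and every tablesync has reached READY.
 */
bool
sketch_apply_worker_should_restart(SketchTwoPhaseState state,
								   int total_tables, int not_ready_tables)
{
	return state == SKETCH_TWOPHASE_PENDING &&
		sketch_all_tablesyncs_ready(total_tables, not_ready_tables);
}
```

With three tables all READY and a PENDING state the sketch asks for a restart. With zero tables it does not, which corresponds to the note above about leaving the state as PENDING so that ALTER SUBSCRIPTION ... REFRESH PUBLICATION keeps working.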
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fb3ba5c..9a2157a 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two-phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because, say, it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that it was initially on and
+ * we have received some prepare, then we turn it off; now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we could allow changing this option from off to on, but
+ * it is not clear that that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and the prepared transaction has operations on tables subscribed to
+ * those subscriptions. For such cases, if we use the GID sent by the publisher,
+ * one of the prepares will be successful and others will fail, in which case
+ * the server will send them again. Now, this can lead to a deadlock if the user
+ * has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting of
+ * the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,9 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -720,6 +796,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from the same node pointing to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received the prepare because it occurred
+	 * before the walsender reached a consistent point, or because two_phase
+	 * was not yet enabled at that time. In such cases, we need to skip the
+	 * rollback prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1970,6 +2220,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2446,6 +2712,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2932,6 +3201,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3098,15 +3381,67 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(wrconn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * in the PENDING state until all tablesyncs have reached READY
+		 * state. Only then can it become ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
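For reference, the version choice that the FIXME comment asks for, and the GID format produced by TwoPhaseTransactionGid(), can be sketched outside the server as below. This is only an illustrative sketch under stated assumptions: the helper names choose_proto_version() and build_twophase_gid() are made up for this note, the Oid and TransactionId typedefs are stand-ins for the server types, and the version constants simply repeat the values this patch puts in logicalproto.h.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the server typedefs (assumption: 32-bit unsigned). */
typedef uint32_t Oid;
typedef uint32_t TransactionId;

/* Values as defined in logicalproto.h by this patch. */
#define LOGICALREP_PROTO_VERSION_NUM 1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3

/*
 * Hypothetical helper: the selection the FIXME comment says should replace
 * the temporary hack once PG_VERSION_NUM becomes 150000.
 */
static int
choose_proto_version(int server_version)
{
	if (server_version >= 150000)
		return LOGICALREP_PROTO_TWOPHASE_VERSION_NUM;
	if (server_version >= 140000)
		return LOGICALREP_PROTO_STREAM_VERSION_NUM;
	return LOGICALREP_PROTO_VERSION_NUM;
}

/*
 * Hypothetical helper mirroring TwoPhaseTransactionGid(): writes
 * "pg_gid_<subid>_<xid>" into gid and returns what snprintf() returns.
 */
static int
build_twophase_gid(Oid subid, TransactionId xid, char *gid, size_t szgid)
{
	assert(gid != NULL && szgid > 0);
	return snprintf(gid, szgid, "pg_gid_%u_%u",
					(unsigned) subid, (unsigned) xid);
}
```

With a subscription OID of 16394 and a remote xid of 735 (both arbitrary example values), build_twophase_gid() fills the buffer with pg_gid_16394_735.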
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
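The two_phase check added to pgoutput_startup() in this file boils down to a small decision table. The sketch below restates it as a pure function so the order of the checks is easy to see. It is not server code: the enum, its values, and check_two_phase_option() are hypothetical names invented for this note, and the two error results stand in for the ereport(ERROR) calls in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Value as defined in logicalproto.h by this patch. */
#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3

/* Hypothetical result codes; the patch raises errors instead. */
typedef enum
{
	TWOPHASE_NOT_REQUESTED,			/* twophase_opt_given = false */
	TWOPHASE_ERR_PROTO_TOO_OLD,		/* proto_version below 3 */
	TWOPHASE_ERR_PLUGIN_UNSUPPORTED,	/* ctx->twophase is false */
	TWOPHASE_OPT_GIVEN				/* twophase_opt_given = true */
} TwoPhaseStartupResult;

/*
 * Hypothetical helper: evaluates the checks in the same order as the
 * if / else if chain in pgoutput_startup().
 */
static TwoPhaseStartupResult
check_two_phase_option(bool two_phase_requested, int protocol_version,
					   bool plugin_supports_twophase)
{
	assert(protocol_version >= 1);

	if (!two_phase_requested)
		return TWOPHASE_NOT_REQUESTED;
	if (protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
		return TWOPHASE_ERR_PROTO_TOO_OLD;
	if (!plugin_supports_twophase)
		return TWOPHASE_ERR_PLUGIN_UNSUPPORTED;
	return TWOPHASE_OPT_GIVEN;
}
```

Note that the protocol version is tested before plugin support, so a client asking for two_phase with proto_version 2 gets the version error even when the plugin could not support two-phase anyway.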
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index f61b163..fb33a77 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9a0e380..92f3373 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e397b76..afa5622 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4368,6 +4369,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4411,9 +4413,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4434,6 +4443,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4459,6 +4469,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4486,6 +4498,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4527,6 +4540,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5340843..70c072d 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3e39fdb..920f083 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index cfd0a84..111e90e 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2777,7 +2777,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
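As a reading aid for the tri-state defined in this header, the sketch below shows the only transition the apply worker performs at start-up in this patch (PENDING to ENABLED once all tablesyncs are READY) together with the state-to-name mapping used in the DEBUG1 message. The function names next_twophase_state() and twophase_state_name() are hypothetical and exist only in this note; the three state characters are the ones defined above.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as defined in pg_subscription.h by this patch. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'

/* Hypothetical helper: human-readable name for a state character. */
static const char *
twophase_state_name(char state)
{
	switch (state)
	{
		case LOGICALREP_TWOPHASE_STATE_DISABLED:
			return "DISABLED";
		case LOGICALREP_TWOPHASE_STATE_PENDING:
			return "PENDING";
		case LOGICALREP_TWOPHASE_STATE_ENABLED:
			return "ENABLED";
		default:
			return "?";
	}
}

/*
 * Hypothetical helper: the state after apply worker start-up. Only a
 * PENDING subscription whose tablesyncs are all READY moves to ENABLED;
 * every other combination keeps its current state.
 */
static char
next_twophase_state(char state, bool all_tablesyncs_ready)
{
	if (state == LOGICALREP_TWOPHASE_STATE_PENDING && all_tablesyncs_ready)
		return LOGICALREP_TWOPHASE_STATE_ENABLED;
	return state;
}
```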
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7dfcb7b..72f049b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -88,11 +88,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare, and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
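The comment on LogicalRepRollbackPreparedTxnData explains why the rollback message carries prepare_end_lsn and prepare_time in addition to the gid. A minimal sketch of that matching rule follows. It assumes a simple in-memory array of locally prepared transactions, which is not how the server stores them; LocalPreparedXact and prepared_xact_matches() are hypothetical names, and the XLogRecPtr and TimestampTz typedefs are stand-ins for the server types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the server typedefs (assumption: 64-bit integers). */
typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;

/* Hypothetical record of a transaction prepared on the subscriber. */
typedef struct
{
	const char *gid;
	XLogRecPtr	prepare_end_lsn;
	TimestampTz prepare_time;
} LocalPreparedXact;

/*
 * Hypothetical helper: true only if some local prepared transaction
 * matches on all three of gid, prepare end LSN, and prepare time. A gid
 * match alone is not enough, because an unrelated local transaction
 * could have been prepared under the same identifier.
 */
static bool
prepared_xact_matches(const LocalPreparedXact *xacts, size_t nxacts,
					  const char *gid, XLogRecPtr prepare_end_lsn,
					  TimestampTz prepare_time)
{
	for (size_t i = 0; i < nxacts; i++)
	{
		if (strcmp(xacts[i].gid, gid) == 0 &&
			xacts[i].prepare_end_lsn == prepare_end_lsn &&
			xacts[i].prepare_time == prepare_time)
			return true;
	}
	return false;
}
```

When the helper returns false, the subscriber in this sketch would skip the rollback, which corresponds to the branch in apply_handle_rollback_prepared() where LookupGXact() fails.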
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bfab830..bfd3423 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -311,7 +311,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} u_op_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -650,7 +654,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 1ad5e6c..db68551 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..daf6ad4 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..a10231e 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  cannot alter two_phase option
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..364e6eb
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,293 @@
+# Test logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC with a rolled-back savepoint gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the tx state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC with an empty GID gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..76b224a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,236 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 29;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c7aff67..2b89efc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1347,9 +1347,11 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

Attachment: v73-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From f9768c322dae6963c5d6f6cd9718ce6db4339aeb Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 21 Apr 2021 16:35:24 +1000
Subject: [PATCH v73] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  70 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  65 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 450 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 268 ++++++++++++
 11 files changed, 1017 insertions(+), 78 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
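
For readers who want to see the Stream Prepare wire layout in isolation, here is a
minimal standalone sketch. It follows the field order documented in the
protocol.sgml hunk of this patch: Byte1('p'), Int8 flags, Int64 prepare LSN,
Int64 end LSN, Int64 timestamp, Int32 xid, then the NUL-terminated GID, with
integers in network byte order. Everything named sketch_* or Sketch*, and the
200-byte GID buffer, is an assumption made for illustration and is not code from
the patch. Note that logicalrep_write_stream_prepare as posted also sends the
xid once more ahead of the flags byte; that extra field is deliberately left out
here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GID_MAX 200		/* assumed buffer size for this sketch */

/* Hypothetical container; not the patch's LogicalRepPreparedTxnData. */
typedef struct SketchStreamPrepare
{
	uint8_t		flags;			/* currently unused, must be 0 */
	uint64_t	prepare_lsn;	/* LSN of the stream prepare */
	uint64_t	end_lsn;		/* end LSN of the transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	char		gid[SKETCH_GID_MAX];
} SketchStreamPrepare;

/* Store the low nbytes of v at p, most significant byte first. */
static void
put_be(unsigned char *p, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		p[i] = (unsigned char) (v >> (8 * (nbytes - 1 - i)));
}

/* Load an nbytes-wide big-endian integer from p. */
static uint64_t
get_be(const unsigned char *p, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Encode m into buf; returns bytes written, or 0 if buf is too small. */
size_t
sketch_write_stream_prepare(unsigned char *buf, size_t buflen,
							const SketchStreamPrepare *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		need = 1 + 1 + 8 + 8 + 8 + 4 + gidlen;

	assert(m->flags == 0);
	if (buflen < need)
		return 0;

	buf[0] = 'p';				/* message type byte */
	buf[1] = m->flags;
	put_be(buf + 2, m->prepare_lsn, 8);
	put_be(buf + 10, m->end_lsn, 8);
	put_be(buf + 18, (uint64_t) m->prepare_time, 8);
	put_be(buf + 26, m->xid, 4);
	memcpy(buf + 30, m->gid, gidlen);
	return need;
}

/* Decode buf into m; returns 0 on success, -1 on a malformed message. */
int
sketch_read_stream_prepare(const unsigned char *buf, size_t len,
						   SketchStreamPrepare *m)
{
	const unsigned char *nul;
	size_t		gidlen;

	/* fixed part is 30 bytes, plus at least the GID terminator */
	if (len < 31 || buf[0] != 'p' || buf[1] != 0)
		return -1;

	/* bounded search and copy, so an unterminated GID is rejected */
	nul = memchr(buf + 30, '\0', len - 30);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + 30)) + 1;
	if (gidlen > sizeof(m->gid))
		return -1;

	m->flags = buf[1];
	m->prepare_lsn = get_be(buf + 2, 8);
	m->end_lsn = get_be(buf + 10, 8);
	m->prepare_time = (int64_t) get_be(buf + 18, 8);
	m->xid = (uint32_t) get_be(buf + 26, 4);
	memcpy(m->gid, buf + 30, gidlen);
	return 0;
}

/* Round-trip one message; returns 0 if every field survives. */
int
sketch_roundtrip(void)
{
	SketchStreamPrepare in = {.prepare_lsn = 0x16B3748, .end_lsn = 0x16B3780,
		.prepare_time = 672000000000000LL, .xid = 731,
		.gid = "test_prepared_tab"};
	SketchStreamPrepare out;
	unsigned char buf[256];
	size_t		n = sketch_write_stream_prepare(buf, sizeof(buf), &in);

	if (n == 0 || sketch_read_stream_prepare(buf, n, &out) != 0)
		return -1;
	return (out.prepare_lsn == in.prepare_lsn && out.end_lsn == in.end_lsn &&
			out.prepare_time == in.prepare_time && out.xid == in.xid &&
			strcmp(out.gid, in.gid) == 0) ? 0 : -1;
}
```

The reader in this sketch bounds the GID copy by the destination buffer, whereas
the posted logicalrep_read_stream_prepare uses strcpy into a pre-allocated
buffer; whether the real code needs such a check depends on guarantees made
elsewhere in the apply worker, which this sketch does not assume.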

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 248ed37..bf45b44 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Commit Prepared messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7388,7 +7388,8 @@ Stream Abort
 <!-- ==================== TWO_PHASE Messages ==================== -->
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared,
+Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7655,6 +7656,71 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the stream prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the stream prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 55deef8..1e4ccff 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -909,12 +894,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index d0e0d19..1798266 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -319,6 +319,71 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* transaction ID */
+	Assert(TransactionIdIsValid(txn->xid));
+	pq_sendint32(out, txn->xid);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	TransactionId xid;
+	uint8		flags;
+
+	xid = pq_getmsgint(in, 4);
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9a2157a..fedf2a6 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -322,6 +322,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -970,6 +972,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1167,30 +1249,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1198,7 +1271,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1213,7 +1286,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1288,6 +1361,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2236,6 +2334,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index a10231e..f2c8cc3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  cannot alter two_phase option
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..89f3eb7
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,450 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC tx)
+# Note: the 2PC tx still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC tx)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC tx works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC tx gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
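
A note on the expected counts in 023_twophase_stream.pl: the value 3334 follows from the workload itself (insert a = 3..5000 on top of the 2 original rows, update even rows, delete every multiple of 3). The sketch below is only a standalone sanity check of that arithmetic, written in Python rather than as part of the Perl test; the dict-based model and the placeholder strings are assumptions made for illustration and say nothing about how replication applies the changes.

```python
# Model of the workload used throughout 023_twophase_stream.pl.
# Starting state: rows a = 1 and a = 2 already exist on the publisher.
rows = {1: "foo", 2: "bar"}

# INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000)
for i in range(3, 5001):
    rows[i] = f"md5({i})"  # placeholder value; only the keys matter for counts

# UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0
# (changes values only, never the number of rows)
for a in rows:
    if a % 2 == 0:
        rows[a] = f"md5({rows[a]})"

# DELETE FROM test_tab WHERE mod(a,3) = 0
rows = {a: b for a, b in rows.items() if a % 3 != 0}

committed_count = len(rows)
print(committed_count)  # 3334 = 5000 - 1666 multiples of 3
```

The same model shows why the "DELETE ... WHERE a = 5" case expects the row to survive only because of visibility, not arithmetic: a = 5 is not a multiple of 3, so it is part of the prepared data, and the outside DELETE cannot see it before COMMIT PREPARED.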
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..eba1523
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,268 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the tx state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the tx state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#308vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#307)

On Wed, Apr 21, 2021 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v73*

Differences from v72* are:

* Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly)

* Minor documentation correction for protocol messages for Commit Prepared ('K')

* Non-functional code tidy (mostly proto.c) to reduce overloading
different meanings to same member names for prepare/commit times.

Please find attached a re-posting of patch set v73*

This is the same as yesterday's v73 but with a contrib module compile
error fixed.

Thanks for the updated patch. A few comments:
1) Should "final_lsn not set in begin message" be "prepare_lsn not set
in begin message"?
+logicalrep_read_begin_prepare(StringInfo in,
LogicalRepPreparedTxnData *begin_data)
+{
+       /* read fields */
+       begin_data->prepare_lsn = pq_getmsgint64(in);
+       if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "final_lsn not set in begin message");
2) Should "These commands" be "ALTER SUBSCRIPTION ... REFRESH
PUBLICATION and ALTER SUBSCRIPTION ... SET/ADD PUBLICATION ..." as
copy_data cannot be specified with alter subscription .. drop
publication.
+   These commands also cannot be executed with <literal>copy_data =
true</literal>
+   when the subscription has <literal>two_phase</literal> commit enabled. See
+   column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual
two-phase state.
3) <term>Byte1('A')</term> should be <term>Byte1('r')</term> as we
have defined LOGICAL_REP_MSG_ROLLBACK_PREPARED as r.
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('A')</term>
+<listitem><para>
+                Identifies this message as the rollback of a
two-phase transaction message.
+</para></listitem>
+</varlistentry>
4) Should "Check if the prepared transaction with the given GID and
lsn is around." be
"Check if the prepared transaction with the given GID, lsn & timestamp
is around."
+/*
+ * LookupGXact
+ *             Check if the prepared transaction with the given GID
and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
5) Should we change "The LSN of the prepare." to "The LSN of the begin prepare."
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies this message as the beginning of a
two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>

6) Similarly in cases of "Commit Prepared" and "Rollback Prepared"

7) No need to initialize has_subrels as we will always assign the
value returned by HeapTupleIsValid
+HasSubscriptionRelations(Oid subid)
+{
+       Relation        rel;
+       int                     nkeys = 0;
+       ScanKeyData skey[2];
+       SysScanDesc scan;
+       bool            has_subrels = false;
+
+       rel = table_open(SubscriptionRelRelationId, AccessShareLock);
8) We could include errhint, like errhint("Option \"two_phase\"
specified more than once") to specify a more informative error
message.
+               else if (strcmp(defel->defname, "two_phase") == 0)
+               {
+                       if (two_phase_option_given)
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_SYNTAX_ERROR),
+                                                errmsg("conflicting
or redundant options")));
+                       two_phase_option_given = true;
+
+                       data->two_phase = defGetBoolean(defel);
+               }
9) We have a lot of function parameters for
parse_subscription_options, should we change it to struct?
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
                                                   char **synchronous_commit,
                                                   bool *refresh,
                                                   bool *binary_given,
bool *binary,
-                                                  bool
*streaming_given, bool *streaming)
+                                                  bool
*streaming_given, bool *streaming,
+                                                  bool
*twophase_given, bool *twophase)
10) Should we change " errhint("Use ALTER SUBSCRIPTION ...SET
PUBLICATION with refresh = false, or with copy_data = false, or use
DROP/CREATE SUBSCRIPTION.")" to  "errhint("Use ALTER SUBSCRIPTION
...SET/ADD PUBLICATION with refresh = false, or with copy_data =
false.")" as we don't support copy_data in ALTER subscription ... DROP
publication.
+                                       /*
+                                        * See
ALTER_SUBSCRIPTION_REFRESH for details why this is
+                                        * not allowed.
+                                        */
+                                       if (sub->twophasestate ==
LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+                                               ereport(ERROR,
+
(errcode(ERRCODE_SYNTAX_ERROR),
+
errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed
when two_phase is enabled"),
+
errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh =
false, or with copy_data = false"
+
          ", or use DROP/CREATE SUBSCRIPTION.")));
11) Should 140000 be 150000, as this feature will be committed in PG15?
+               if (options->proto.logical.twophase &&
+                       PQserverVersion(conn->streamConn) >= 140000)
+                       appendStringInfoString(&cmd, ", two_phase 'on'");
12) should we change "begin message" to "begin prepare message"
+       if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "final_lsn not set in begin message");
+       begin_data->end_lsn = pq_getmsgint64(in);
+       if (begin_data->end_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "end_lsn not set in begin message");
13) should we change "commit prepare message" to "commit prepared message"
+       if (flags != 0)
+               elog(ERROR, "unrecognized flags %u in commit prepare
message", flags);
+
+       /* read fields */
+       prepare_data->commit_lsn = pq_getmsgint64(in);
+       if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "commit_lsn is not set in commit prepared message");
+       prepare_data->end_lsn = pq_getmsgint64(in);
+       if (prepare_data->end_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "end_lsn is not set in commit prepared message");
+       prepare_data->commit_time = pq_getmsgint64(in);
14) should we change "commit prepared message" to "rollback prepared message"
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+
LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+       /* read flags */
+       uint8           flags = pq_getmsgbyte(in);
+
+       if (flags != 0)
+               elog(ERROR, "unrecognized flags %u in rollback prepare
message", flags);
+
+       /* read fields */
+       rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+       if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "prepare_end_lsn is not set in commit
prepared message");
+       rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+       if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "rollback_end_lsn is not set in commit
prepared message");
+       rollback_data->prepare_time = pq_getmsgint64(in);
+       rollback_data->rollback_time = pq_getmsgint64(in);
+       rollback_data->xid = pq_getmsgint(in, 4);
+
+       /* read gid (copy it into a pre-allocated buffer) */
+       strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
15) We can include a check of pg_stat_replication_slots to verify that
the statistics are getting updated.
+$node_publisher->safe_psql('postgres', "
+       BEGIN;
+       INSERT INTO tab_full VALUES (11);
+       PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED
'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
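
On point 9, a minimal sketch of what a struct-based interface could look like. This is only an illustration under stated assumptions: SubOpts, the SUBOPT_* bits and subopts_set() are hypothetical names that are not taken from the patch, and the boolean parsing is a simplified stand-in for defGetBoolean(). The idea is that one bitmask replaces each "*_given" out-parameter, so adding an option such as two_phase no longer changes the function signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option bits; one bit per option that was specified. */
#define SUBOPT_BINARY		0x01
#define SUBOPT_STREAMING	0x02
#define SUBOPT_TWO_PHASE	0x04

/* Hypothetical replacement for the long list of bool out-parameters. */
typedef struct SubOpts
{
	unsigned	specified;		/* bitmask of SUBOPT_* already seen */
	bool		binary;
	bool		streaming;
	bool		twophase;
} SubOpts;

/*
 * Record one "name = value" option in opts.
 *
 * Returns 0 on success, or -1 if the option is unknown, was already
 * specified ("conflicting or redundant options"), or has a value that
 * is not a recognized boolean.
 */
int
subopts_set(SubOpts *opts, const char *name, const char *value)
{
	unsigned	bit;
	bool	   *target;
	bool		parsed;

	assert(opts != NULL && name != NULL && value != NULL);

	if (strcmp(name, "binary") == 0)
	{
		bit = SUBOPT_BINARY;
		target = &opts->binary;
	}
	else if (strcmp(name, "streaming") == 0)
	{
		bit = SUBOPT_STREAMING;
		target = &opts->streaming;
	}
	else if (strcmp(name, "two_phase") == 0)
	{
		bit = SUBOPT_TWO_PHASE;
		target = &opts->twophase;
	}
	else
		return -1;				/* unrecognized option */

	if (opts->specified & bit)
		return -1;				/* option given more than once */

	/* Simplified boolean parsing; the real code accepts more spellings. */
	if (strcmp(value, "true") == 0 || strcmp(value, "on") == 0)
		parsed = true;
	else if (strcmp(value, "false") == 0 || strcmp(value, "off") == 0)
		parsed = false;
	else
		return -1;

	opts->specified |= bit;
	*target = parsed;
	return 0;
}
```

A caller would zero-initialize a SubOpts, call subopts_set() once per option in the list, and then test (opts.specified & SUBOPT_TWO_PHASE) wherever the current code tests *twophase_given.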

Regards,
Vignesh

#309vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#307)

On Wed, Apr 21, 2021 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v73*

Differences from v72* are:

* Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly)

* Minor documentation correction for protocol messages for Commit Prepared ('K')

* Non-functional code tidy (mostly proto.c) to reduce overloading
different meanings to same member names for prepare/commit times.

Please find attached a re-posting of patch set v73*

A few comments from looking at the added tests:
1) Can the below:
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');

be changed to:
$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM
tab_full where a IN (21,22);");
is($result, qq(21), 'Rows committed are on the subscriber');

And the test count needs to be reduced to "use Test::More tests => 19"

2) we can change tx to transaction:
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM
pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM
pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');

3) There are few more instances present in the same file, those also
can be changed.

4) Can the below:
# check inserts are visible at subscriber(s).
# 22 should be rolled back.
# 21 should be committed.
$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM
tab_full where a IN (21);");
is($result, qq(1), 'Rows committed are present on subscriber B');
$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM
tab_full where a IN (22);");
is($result, qq(0), 'Rows rolled back are not present on subscriber B');
$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM
tab_full where a IN (21);");
is($result, qq(1), 'Rows committed are present on subscriber C');
$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM
tab_full where a IN (22);");
is($result, qq(0), 'Rows rolled back are not present on subscriber C');

be changed to:
$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where
a IN (21,22);");
is($result, qq(21), 'Rows committed are on the subscriber');
$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where
a IN (21,22);");
is($result, qq(21), 'Rows committed are on the subscriber');

And the test count needs to be reduced to "use Test::More tests => 27"

5) should we change "Two phase commit" to "Two phase commit state" :
+               /*
+                * Binary, streaming, and two_phase are only supported
in v14 and
+                * higher
+                */
                if (pset.sversion >= 140000)
                        appendPQExpBuffer(&buf,
                                                          ", subbinary
AS \"%s\"\n"
-                                                         ", substream
AS \"%s\"\n",
+                                                         ", substream
AS \"%s\"\n"
+                                                         ",
subtwophasestate AS \"%s\"\n",

gettext_noop("Binary"),
-
gettext_noop("Streaming"));
+
gettext_noop("Streaming"),
+
gettext_noop("Two phase commit"));

Regards,
Vignesh

#310vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#307)

On Wed, Apr 21, 2021 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v73*

Differences from v72* are:

* Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly)

* Minor documentation correction for protocol messages for Commit Prepared ('K')

* Non-functional code tidy (mostly proto.c) to reduce overloading
different meanings to same member names for prepare/commit times.

Please find attached a re-posting of patch set v73*

This is the same as yesterday's v73 but with a contrib module compile
error fixed.

A few comments on the
v73-0002-Add-prepare-API-support-for-streaming-transactio.patch patch:
1) There are slight differences in error message in case of Alter
subscription ... drop publication, we can keep the error message
similar:
postgres=# ALTER SUBSCRIPTION mysub drop PUBLICATION mypub WITH
(refresh = false, copy_data=true, two_phase=true);
ERROR: unrecognized subscription parameter: "copy_data"
postgres=# ALTER SUBSCRIPTION mysub drop PUBLICATION mypub WITH
(refresh = false, two_phase=true, streaming=true);
ERROR: cannot alter two_phase option

2) We are sending txn->xid twice; I feel we should send it only once in
logicalrep_write_stream_prepare:
+       /* transaction ID */
+       Assert(TransactionIdIsValid(txn->xid));
+       pq_sendint32(out, txn->xid);
+
+       /* send the flags field */
+       pq_sendbyte(out, flags);
+
+       /* send fields */
+       pq_sendint64(out, prepare_lsn);
+       pq_sendint64(out, txn->end_lsn);
+       pq_sendint64(out, txn->u_op_time.prepare_time);
+       pq_sendint32(out, txn->xid);
+
3) We could remove xid and return prepare_data->xid
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in,
LogicalRepPreparedTxnData *prepare_data)
+{
+       TransactionId xid;
+       uint8           flags;
+
+       xid = pq_getmsgint(in, 4);
4) Here comments can be above apply_spooled_messages for better readability
+       /*
+        * 1. Replay all the spooled operations - Similar code as for
+        * apply_handle_stream_commit (i.e. non two-phase stream commit)
+        */
+
+       ensure_transaction();
+
+       nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
5) Similarly this below comment can be above PrepareTransactionBlock
+       /*
+        * 2. Mark the transaction as prepared. - Similar code as for
+        * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+        */
+
+       /*
+        * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+        * called within the PrepareTransactionBlock below.
+        */
+       BeginTransactionBlock();
+       CommitTransactionCommand();
+
+       /*
+        * Update origin state so we can restart streaming from correct position
+        * in case of crash.
+        */
+       replorigin_session_origin_lsn = prepare_data.end_lsn;
+       replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+       PrepareTransactionBlock(gid);
+       CommitTransactionCommand();
+
+       pgstat_report_stat(false);
6) There is a lot of common code between apply_handle_stream_prepare
and apply_handle_prepare, if possible try to have a common function to
avoid fixing at both places.
+       /*
+        * 2. Mark the transaction as prepared. - Similar code as for
+        * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+        */
+
+       /*
+        * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+        * called within the PrepareTransactionBlock below.
+        */
+       BeginTransactionBlock();
+       CommitTransactionCommand();
+
+       /*
+        * Update origin state so we can restart streaming from correct position
+        * in case of crash.
+        */
+       replorigin_session_origin_lsn = prepare_data.end_lsn;
+       replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+       PrepareTransactionBlock(gid);
+       CommitTransactionCommand();
+
+       pgstat_report_stat(false);
+
+       store_flush_position(prepare_data.end_lsn);
7) two-phase commit is slightly misleading, we can just mention
streaming prepare.
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+                                                       ReorderBufferTXN *txn,
+                                                       XLogRecPtr prepare_lsn)
8) should we include Assert of in_streaming similar to other
pgoutput_stream*** functions.
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+                                                       ReorderBufferTXN *txn,
+                                                       XLogRecPtr prepare_lsn)
+{
+       Assert(rbtxn_is_streamed(txn));
+
+       OutputPluginUpdateProgress(ctx);
+       OutputPluginPrepareWrite(ctx, true);
+       logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+       OutputPluginWrite(ctx, true);
+}
9) Here also, we can verify that the transaction is streamed by
checking the pg_stat_replication_slots.
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed
on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
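
On points 2 and 3, a minimal standalone sketch of a Stream Prepare layout in which the xid is sent exactly once, after the timestamp, and read back from the same position. It is an illustration rather than the patch's code: the StreamPrepare struct, GID_MAX, and the put_be()/get_be() helpers are hypothetical stand-ins for LogicalRepPreparedTxnData and the pq_sendint*/pq_getmsgint* routines, and the field order (type byte 'p', flags, prepare LSN, end LSN, prepare time, xid, GID) is assumed from the code quoted above with the leading xid dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GID_MAX 200				/* assumption: illustrative GID buffer size */
#define FIXED_LEN (1 + 1 + 8 + 8 + 8 + 4)	/* type, flags, 3 x int64, xid */

/* Hypothetical stand-in for the prepare data carried by the message. */
typedef struct StreamPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[GID_MAX];
} StreamPrepare;

/* Write the low nbytes of v in network (big-endian) byte order. */
static size_t
put_be(uint8_t *buf, uint64_t v, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return nbytes;
}

/* Read nbytes in network byte order. */
static uint64_t
get_be(const uint8_t *buf, size_t nbytes)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
size_t
stream_prepare_encode(const StreamPrepare *msg, uint8_t *buf, size_t buflen)
{
	size_t		gidlen;
	size_t		off = 0;

	assert(msg != NULL && buf != NULL);

	gidlen = strlen(msg->gid) + 1;	/* include terminating NUL */
	if (buflen < FIXED_LEN + gidlen)
		return 0;

	buf[off++] = 'p';			/* message type */
	buf[off++] = 0;				/* flags, currently unused */
	off += put_be(buf + off, msg->prepare_lsn, 8);
	off += put_be(buf + off, msg->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) msg->prepare_time, 8);
	off += put_be(buf + off, msg->xid, 4);	/* xid is sent only here */
	memcpy(buf + off, msg->gid, gidlen);
	return off + gidlen;
}

/* Returns 0 on success, -1 if the message is malformed. */
int
stream_prepare_decode(const uint8_t *buf, size_t len, StreamPrepare *msg)
{
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	assert(buf != NULL && msg != NULL);

	if (len <= FIXED_LEN || buf[0] != 'p' || buf[1] != 0)
		return -1;

	msg->prepare_lsn = get_be(buf + 2, 8);
	msg->end_lsn = get_be(buf + 10, 8);
	msg->prepare_time = (int64_t) get_be(buf + 18, 8);
	msg->xid = (uint32_t) get_be(buf + 26, 4);

	/* The GID must be NUL-terminated inside the message and fit the buffer. */
	gid = buf + FIXED_LEN;
	nul = memchr(gid, '\0', len - FIXED_LEN);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid) + 1;
	if (gidlen > sizeof(msg->gid))
		return -1;
	memcpy(msg->gid, gid, gidlen);
	return 0;
}
```

With this shape the reader can simply return msg->xid to its caller, so no separate local xid variable is needed.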

Regards,
Vignesh

#311Ajin Cherian
itsajin@gmail.com
In reply to: vignesh C (#310)

I modified pgbench's "tpcb-like" builtin script as below to do two-phase
commits, and then ran it against a 4-node cascaded replication setup.

"BEGIN;\n"
"UPDATE pgbench_accounts SET abalance = abalance + :delta
WHERE aid = :aid;\n"
"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE
tid = :tid;\n"
"UPDATE pgbench_branches SET bbalance = bbalance + :delta
WHERE bid = :bid;\n"
"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
"PREPARE TRANSACTION ':aid:';\n"
"COMMIT PREPARED ':aid:';\n"

The tests ran fine and all 4 cascaded servers replicated the changes
correctly. All the subscriptions were configured with two_phase
enabled.

regards,
Ajin Cherian
Fujitsu Australia

#312Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#307)
2 attachment(s)

Attachments:

v74-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From aecae4a637cfc5fdaeba515142b322b1c95f3fed Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 29 Apr 2021 18:42:14 +1000
Subject: [PATCH v74] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  70 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 450 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 268 ++++++++++++
 11 files changed, 1012 insertions(+), 78 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 871088e..8bffe54 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Commit Prepared messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains Stream Prepare or Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7388,7 +7388,8 @@ Stream Abort
 <!-- ==================== TWO_PHASE Messages ==================== -->
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared,
+Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7655,6 +7656,71 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies this message as a two-phase prepare for a large in-progress transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f777872..ec5c409 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -909,12 +894,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 98d2b00..7ebfd91 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -319,6 +319,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 72149a1..cad5b30 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -322,6 +322,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -1009,6 +1011,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to deadlock
+	 * when multiple subscriptions on the same node point to publications on
+	 * the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1206,30 +1288,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1237,7 +1310,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1252,7 +1325,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1327,6 +1400,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2257,6 +2355,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
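
To illustrate the comment in apply_handle_stream_prepare about not reusing the publisher's GID: the subscriber derives its own identifier from the subscription OID and the remote xid, so two subscriptions applying the same remote transaction prepare under different names. The body of TwoPhaseTransactionGid is not shown in this part of the patch, so the "pg_gid_%u_%u" pattern below is an assumption, and sketch_twophase_gid is a hypothetical stand-in rather than the real function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical sketch: build a subscriber-local GID from the subscription
 * OID and the remote transaction id. The format string is assumed.
 * Returns 1 on success, 0 if the result did not fit in the buffer.
 */
static int
sketch_twophase_gid(unsigned int subid, unsigned int xid,
                    char *gid, size_t szgid)
{
    int         n;

    assert(gid != NULL && szgid > 0);

    n = snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
    return n > 0 && (size_t) n < szgid;
}
```

With this scheme, subscriptions 16394 and 16395 applying remote xid 750 would prepare "pg_gid_16394_750" and "pg_gid_16395_750" respectively, whatever GID the publisher used.
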
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..0ac4433
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,450 @@
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..e8bb726
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,268 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with the same-named table on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
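
A note on the expected counts used in the TAP tests above: the value 3334 follows from the test transaction itself. Keys 1 and 2 pre-exist, the INSERT adds keys 3..5000, the UPDATE leaves the row count unchanged, and the DELETE removes every key divisible by 3. The C sketch below is not part of the patch. The function names are made up for illustration, and it only reproduces the arithmetic, assuming the table holds exactly the keys 1..max_a just before the DELETE runs.

```c
#include <assert.h>

/*
 * Hypothetical helper (not in the patch): closed-form row count after the
 * test transaction, assuming keys 1..max_a exist before the DELETE.
 */
int
expected_rows_closed_form(int max_a)
{
	/* DELETE ... WHERE mod(a,3) = 0 removes floor(max_a / 3) keys. */
	return max_a - max_a / 3;
}

/* Hypothetical helper: the same count, obtained by walking every key. */
int
expected_rows_brute_force(int max_a)
{
	int			kept = 0;

	for (int a = 1; a <= max_a; a++)
	{
		/* A key survives the DELETE only if it is not a multiple of 3. */
		if (a % 3 != 0)
			kept++;
	}
	return kept;
}

/* Cross-check the two helpers against each other for sizes 0..limit. */
void
check_row_count_helpers(int limit)
{
	for (int max_a = 0; max_a <= limit; max_a++)
		assert(expected_rows_closed_form(max_a) ==
			   expected_rows_brute_force(max_a));
}
```

With max_a = 5000 both helpers give 3334, which is the figure the qq(3334|3334|3334) checks expect. The ROLLBACK TO SAVEPOINT case discards everything after the savepoint, so only the 2 original rows plus (9999, 'foobar') remain, hence the expected count of 3.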

Attachment: v74-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 1e785763125684e91440cdc6c3b70f37d29e25e2 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 29 Apr 2021 18:22:46 +1000
Subject: [PATCH v74] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch, but it will allow external replication solutions
that use the streaming replication protocol to use it.

* Add new subscription TAP tests and new subscription.sql regression tests.

* Update the PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase, and
moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 313 ++++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  69 +++++
 src/backend/catalog/pg_subscription.c              |  35 +++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 135 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 218 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 200 ++++++++++--
 src/backend/replication/logical/worker.c           | 341 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 291 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 232 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 44 files changed, 2342 insertions(+), 206 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..9393c85 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 492ed34..de1f8ed 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7640,6 +7640,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a7ec5c3..493432d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this, users must refrain from
+       taking locks on catalog tables (via an explicit <command>LOCK</command>
+       command) in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..871088e 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Commit Prepared messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7364,6 +7385,278 @@ Stream Abort
 
 </variablelist>
 
+<!-- ==================== TWO_PHASE Messages ==================== -->
+
+<para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies this message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies this message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies this message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the commit transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies this message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the rollback transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
 <para>
 
 The following message parts are shared by the above messages.
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See the <literal>subtwophasestate</literal> column of
+   <xref linkend="catalog-pg-subscription"/> for the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See the
+          <literal>subtwophasestate</literal> column of <xref linkend="catalog-pg-subscription"/>
+          for the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b658134..2fb2e69 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2460,3 +2460,72 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check whether a prepared transaction with the given GID, LSN and
+ *		timestamp exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid matching prepared xacts
+ * from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither GXACT collisions (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 4039768..67d010d 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -449,6 +450,40 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	int			nkeys = 0;
+	ScanKeyData skey[2];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[nkeys++],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, nkeys, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 08f95c4..a90ff7c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1268,5 +1268,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 517c8ed..f777872 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica. See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. It is enabled once all the tables are
+				 * synced and ready. This avoids race conditions such as
+				 * prepared transactions being skipped because changes are not
+				 * applied due to checks in should_apply_changes_for_rel() while
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -814,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +909,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +938,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +984,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1001,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details on why this
+					 * is not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1043,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1061,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details on why this
+					 * is not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1101,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7924581..99f2afc 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 00543ed..b73d08c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -209,7 +209,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -435,10 +435,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -542,10 +551,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -612,7 +632,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 39471fd..b258174 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..98d2b00 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 }
 
 /*
@@ -107,6 +107,218 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1053,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 }
 
 /*
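For readers following the protocol part of the patch: the body that logicalrep_write_rollback_prepared() emits after the message-type byte is a flags byte, four 64-bit fields, the 32-bit xid, and a NUL-terminated gid, with integers in network byte order as pq_sendint64/pq_sendint32 produce them. Below is a minimal standalone sketch of that layout. The helper names put_u64, put_u32 and encode_rollback_prepared_body are hypothetical and not part of the patch, and the sketch deliberately omits the message-type byte and any encoding conversion of the gid.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: append a 64-bit value in network (big-endian) order. */
static size_t
put_u64(uint8_t *buf, size_t off, uint64_t v)
{
	for (int i = 7; i >= 0; i--)
		buf[off++] = (uint8_t) (v >> (8 * i));
	return off;
}

/* Hypothetical helper: append a 32-bit value in network (big-endian) order. */
static size_t
put_u32(uint8_t *buf, size_t off, uint32_t v)
{
	for (int i = 3; i >= 0; i--)
		buf[off++] = (uint8_t) (v >> (8 * i));
	return off;
}

/*
 * Sketch of the ROLLBACK PREPARED body, in the field order used by the patch:
 * flags, prepare_end_lsn, rollback_end_lsn, prepare_time, rollback_time,
 * xid, gid (NUL-terminated). Returns the number of bytes written; the caller
 * must supply a buffer of at least 37 + strlen(gid) + 1 bytes.
 */
static size_t
encode_rollback_prepared_body(uint8_t *buf,
							  uint64_t prepare_end_lsn,
							  uint64_t rollback_end_lsn,
							  int64_t prepare_time,
							  int64_t rollback_time,
							  uint32_t xid,
							  const char *gid)
{
	size_t		off = 0;
	size_t		gidlen = strlen(gid) + 1;	/* include the terminating NUL */

	buf[off++] = 0;				/* flags, currently always 0 */
	off = put_u64(buf, off, prepare_end_lsn);
	off = put_u64(buf, off, rollback_end_lsn);
	off = put_u64(buf, off, (uint64_t) prepare_time);
	off = put_u64(buf, off, (uint64_t) rollback_time);
	off = put_u32(buf, off, xid);
	memcpy(buf + off, gid, gidlen);
	return off + gidlen;
}
```

The reader side in the patch consumes the fields in exactly this order and rejects a non-zero flags byte or an invalid LSN.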
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c27f710..5de70f9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2546,7 +2546,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->u_op_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2637,7 +2637,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->u_op_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2684,7 +2684,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->u_op_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2704,7 +2704,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2723,19 +2723,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->u_op_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2753,12 +2754,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->u_op_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->u_op_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 9118e21..1f834f6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * Prepared transactions that were skipped, either because two-phase was
+	 * not previously enabled or because they are not covered by the initial
+	 * snapshot, need to be sent later along with commit prepared, and they
+	 * must be before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 0638f5c..a1bae34 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -359,7 +363,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -367,42 +370,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -416,16 +391,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1058,7 +1054,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(wrconn, slotname, false /* permanent */ ,
+	walrcv_create_slot(wrconn, slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1144,3 +1140,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
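To make the tri-state rule in the tablesync.c changes above easier to follow, here is a small standalone sketch of the decision, assuming only what the patch's comments state: a subscription with no tables is never "all ready", and PENDING becomes ENABLED only once every tablesync is READY. The enum and both function names are hypothetical stand-ins; the real code uses the LOGICALREP_TWOPHASE_STATE_* catalog values and reads the table states from pg_subscription_rel.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum
{
	TWOPHASE_DISABLED,
	TWOPHASE_PENDING,
	TWOPHASE_ENABLED
} TwoPhaseState;

/*
 * Mirrors the return rule of AllTablesyncsReady(): false when the
 * subscription has no tables, otherwise true only if no table is still
 * in a not-READY state.
 */
static bool
all_tablesyncs_ready(int total_tables, int not_ready_tables)
{
	bool		has_subrels = total_tables > 0;
	bool		found_busy = not_ready_tables > 0;

	return has_subrels && !found_busy;
}

/*
 * State the (re-started) apply worker would end up with: only PENDING can
 * change, and only when all tablesyncs are READY.
 */
static TwoPhaseState
state_after_worker_start(TwoPhaseState cur, int total_tables,
						 int not_ready_tables)
{
	if (cur == TWOPHASE_PENDING &&
		all_tablesyncs_ready(total_tables, not_ready_tables))
		return TWOPHASE_ENABLED;

	return cur;
}
```

Note how the empty-subscription case stays PENDING, which is what keeps ALTER SUBSCRIPTION ... REFRESH PUBLICATION usable.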
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index d09703f..72149a1 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ... REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that initially it was on and
+ * we have received some prepare, then we turn it off; now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we can allow changing this option from off to on, but
+ * it is not clear that this alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can legitimately get a prepare of the same GID more than once when we
+ * have defined multiple subscriptions for publications on the same server and
+ * the prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by the publisher, one
+ * of the prepares will be successful and the others will fail, in which case
+ * the server will send them again. Now, this can lead to a deadlock if the
+ * user has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting of
+ * the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,9 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -759,6 +835,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server as that can lead to deadlock
+	 * when we have multiple subscriptions from the same node pointing to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1991,6 +2241,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2467,6 +2733,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2953,6 +3222,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3119,15 +3402,67 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(wrconn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(wrconn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(wrconn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(wrconn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(wrconn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
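As a quick illustration of the subscriber-side GID scheme added in worker.c above, the following standalone sketch reproduces the "pg_gid_%u_%u" format that TwoPhaseTransactionGid() uses. The function name format_subscriber_gid and the plain unsigned int parameters are assumptions made so the example compiles outside the backend; in the patch the arguments are an Oid and a TransactionId, and the buffer is GIDSIZE bytes.

```c
#include <stdio.h>

/*
 * Hypothetical standalone equivalent of TwoPhaseTransactionGid(): build the
 * GID from the subscription oid and the remote xid, so that two
 * subscriptions receiving the same remote prepared transaction never
 * collide on the subscriber.
 */
static void
format_subscriber_gid(unsigned int subid, unsigned int xid,
					  char *gid, size_t szgid)
{
	/* snprintf truncates safely if the buffer is too small */
	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}
```

Because the subscription oid is part of the name, the same remote xid arriving through two subscriptions yields two distinct prepared transactions locally, which is what avoids the deadlock described in the worker.c header comment.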
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by
+		 * the plugin and decide whether to enable it at a later point. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
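
For readers skimming the pgoutput.c hunks above, here is a minimal standalone sketch of the option handling that parse_output_parameters() gains: each option may be given only once, and two_phase together with streaming is rejected. It does not use any PostgreSQL headers. The Option, ParsedOptions and ParseResult types and the parse_options() function are hypothetical stand-ins invented for illustration (the patch itself works on DefElem nodes and reports problems through ereport()).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for DefElem and PGOutputData; not PostgreSQL types. */
typedef struct
{
	const char *name;
	bool		value;
} Option;

typedef struct
{
	bool		streaming;
	bool		two_phase;
} ParsedOptions;

/* The patch raises ERROR instead; return codes keep the sketch testable. */
typedef enum
{
	PARSE_OK,
	PARSE_DUPLICATE,
	PARSE_UNKNOWN,
	PARSE_CONFLICT
} ParseResult;

static ParseResult
parse_options(const Option *opts, int nopts, ParsedOptions *out)
{
	bool		streaming_given = false;
	bool		two_phase_given = false;

	out->streaming = false;
	out->two_phase = false;

	for (int i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "streaming") == 0)
		{
			if (streaming_given)
				return PARSE_DUPLICATE;
			streaming_given = true;
			out->streaming = opts[i].value;
		}
		else if (strcmp(opts[i].name, "two_phase") == 0)
		{
			if (two_phase_given)
				return PARSE_DUPLICATE;
			two_phase_given = true;
			out->two_phase = opts[i].value;
		}
		else
			return PARSE_UNKNOWN;

		/* Re-checked after every option, mirroring the placement in the patch. */
		if (out->two_phase && out->streaming)
			return PARSE_CONFLICT;
	}

	return PARSE_OK;
}
```

Because the conflict check runs inside the loop, the order in which the two options arrive does not matter: the error is reported as soon as both are set.
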
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index cf261e2..8ada941 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9a0e380..92f3373 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 3cb3598..846a486 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4368,6 +4369,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4411,9 +4413,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4434,6 +4443,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4459,6 +4469,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4486,6 +4498,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4527,6 +4540,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5340843..70c072d 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3e39fdb..920f083 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 7c49333..edf06de 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2780,7 +2780,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See the comments atop worker.c for more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
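
As a small illustration of how the tri-state column is consumed, the pg_dump hunk earlier in this patch emits "two_phase = on" for any state other than disabled, so both pending and enabled subscriptions are dumped with the option turned on. The sketch below restates that mapping as a standalone function. The three macros are copied from the hunk above; two_phase_dump_clause() is a hypothetical helper name, not something the patch adds.

```c
#include <assert.h>
#include <string.h>

/* Values copied from the pg_subscription.h hunk above. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'

/*
 * Hypothetical helper: the text a dump would append to the WITH (...) list
 * for a given subtwophasestate value. Anything but "disabled" maps to on.
 */
static const char *
two_phase_dump_clause(char state)
{
	if (state != LOGICALREP_TWOPHASE_STATE_DISABLED)
		return ", two_phase = on";
	return "";
}
```
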
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index 7dfcb7b..72f049b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -88,11 +88,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * (prepare_end_lsn and prepare_time) is used to check whether the downstream
+ * has received this prepared transaction, in which case it can apply the
+ * rollback; otherwise, it can skip the rollback operation. The gid alone is
+ * not sufficient because the downstream node can have a prepared transaction
+ * with the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
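
The comment on LogicalRepRollbackPreparedTxnData above explains why a rollback is matched on more than the gid. A minimal standalone sketch of that idea follows, assuming the subscriber keeps some list of locally prepared transactions. The LocalPreparedTxn struct, the SKETCH_GIDSIZE buffer size and rollback_applies() are all hypothetical names made up for this example; how LookupGXact() actually performs the check is not shown in this part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* buffer size assumed for the sketch only */

/* Hypothetical record of a transaction prepared on the subscriber. */
typedef struct
{
	char		gid[SKETCH_GIDSIZE];
	uint64_t	prepare_end_lsn;	/* stand-in for XLogRecPtr */
	int64_t		prepare_time;		/* stand-in for TimestampTz */
} LocalPreparedTxn;

/*
 * Return true only if a local prepared transaction matches on gid AND on
 * the upstream prepare position and time. A gid-only match could hit an
 * unrelated local transaction that happens to reuse the identifier.
 */
static bool
rollback_applies(const LocalPreparedTxn *local, int nlocal,
				 const char *gid, uint64_t prepare_end_lsn,
				 int64_t prepare_time)
{
	for (int i = 0; i < nlocal; i++)
	{
		if (strcmp(local[i].gid, gid) == 0 &&
			local[i].prepare_end_lsn == prepare_end_lsn &&
			local[i].prepare_time == prepare_time)
			return true;
	}
	return false;
}
```

With this shape, a ROLLBACK PREPARED message whose gid matches but whose prepare_end_lsn or prepare_time differs is treated as "not received here" and can be skipped rather than rolling back the wrong transaction.
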
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index bfab830..bfd3423 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -311,7 +311,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} u_op_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -650,7 +654,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 1cac75e..daf6ad4 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..91d9032
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,291 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 19;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..2bea214
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,232 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 878b67a..24264ed 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1347,9 +1347,11 @@ LogicalRepBeginData
 LogicalRepCommitData
 LogicalRepCtxStruct
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

#313Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#308)

On Mon, Apr 26, 2021 at 9:22 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, Apr 21, 2021 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v73*

Differences from v72* are:

* Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly)

* Minor documentation correction for protocol messages for Commit Prepared ('K')

* Non-functional code tidy (mostly proto.c) to reduce overloading
different meanings to same member names for prepare/commit times.

Please find attached a re-posting of patch set v73*

This is the same as yesterday's v73 but with a contrib module compile
error fixed.

Thanks for the updated patch, few comments:

Thanks for your feedback comments. My replies are inline below.

1) Should "final_lsn not set in begin message" be "prepare_lsn not set
in begin message"
+logicalrep_read_begin_prepare(StringInfo in,
LogicalRepPreparedTxnData *begin_data)
+{
+       /* read fields */
+       begin_data->prepare_lsn = pq_getmsgint64(in);
+       if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "final_lsn not set in begin message");

OK. Updated in v74.

2) Should "These commands" be "ALTER SUBSCRIPTION ... REFRESH
PUBLICATION and ALTER SUBSCRIPTION ... SET/ADD PUBLICATION ..." as
copy_data cannot be specified with alter subscription .. drop
publication.
+   These commands also cannot be executed with <literal>copy_data =
true</literal>
+   when the subscription has <literal>two_phase</literal> commit enabled. See
+   column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual
two-phase state.

OK. Updated in v74. While technically more correct, I think rewording
it as suggested makes the doc harder to understand. But I have
reworded it slightly to account for the fact that the copy_data
setting is not possible with the DROP.

3) <term>Byte1('A')</term> should be <term>Byte1('r')</term> as we
have defined LOGICAL_REP_MSG_ROLLBACK_PREPARED as r.
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('A')</term>
+<listitem><para>
+                Identifies this message as the rollback of a
two-phase transaction message.
+</para></listitem>
+</varlistentry>

OK. Updated in v74.

4) Should "Check if the prepared transaction with the given GID and
lsn is around." be
"Check if the prepared transaction with the given GID, lsn & timestamp
is around."
+/*
+ * LookupGXact
+ *             Check if the prepared transaction with the given GID
and lsn is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */

OK. Updated in v74.

5) Should we change "The LSN of the prepare." to "The LSN of the begin prepare."
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies this message as the beginning of a
two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>

Not updated. The PG Docs is correct as-is I think.

6) Similarly in cases of "Commit Prepared" and "Rollback Prepared"

Not updated. AFAIK these are correct – it really is LSN of the PREPARE
just like it says.

7) No need to initialize has_subrels as we will always assign the
value returned by HeapTupleIsValid
+HasSubscriptionRelations(Oid subid)
+{
+       Relation        rel;
+       int                     nkeys = 0;
+       ScanKeyData skey[2];
+       SysScanDesc scan;
+       bool            has_subrels = false;
+
+       rel = table_open(SubscriptionRelRelationId, AccessShareLock);

OK. Updated in v74.

8) We could include errhint, like errhint("Option \"two_phase\"
specified more than once") to specify a more informative error
message.
+               else if (strcmp(defel->defname, "two_phase") == 0)
+               {
+                       if (two_phase_option_given)
+                               ereport(ERROR,
+                                               (errcode(ERRCODE_SYNTAX_ERROR),
+                                                errmsg("conflicting
or redundant options")));
+                       two_phase_option_given = true;
+
+                       data->two_phase = defGetBoolean(defel);
+               }

Not updated. Yes, maybe it would be better like you say, but the code
would then be inconsistent with every other option in this function.
Perhaps your idea can be raised as a separate patch to fix all of
them.

9) We have a lot of function parameters for
parse_subscription_options, should we change it to struct?
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
char **synchronous_commit,
bool *refresh,
bool *binary_given,
bool *binary,
-                                                  bool
*streaming_given, bool *streaming)
+                                                  bool
*streaming_given, bool *streaming,
+                                                  bool
*twophase_given, bool *twophase)

Not updated. This is not really related to the 2PC functionality so I
think your idea might be good, but it should be done as a later
refactoring patch after the 2PC patch is pushed.

10) Should we change " errhint("Use ALTER SUBSCRIPTION ...SET
PUBLICATION with refresh = false, or with copy_data = false, or use
DROP/CREATE SUBSCRIPTION.")" to  "errhint("Use ALTER SUBSCRIPTION
...SET/ADD PUBLICATION with refresh = false, or with copy_data =
false.")" as we don't support copy_data in ALTER subscription ... DROP
publication.
+                                       /*
+                                        * See
ALTER_SUBSCRIPTION_REFRESH for details why this is
+                                        * not allowed.
+                                        */
+                                       if (sub->twophasestate ==
LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+                                               ereport(ERROR,
+
(errcode(ERRCODE_SYNTAX_ERROR),
+
errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed
when two_phase is enabled"),
+
errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh =
false, or with copy_data = false"
+
", or use DROP/CREATE SUBSCRIPTION.")));

Not updated. The hint is saying that one workaround is to DROP and
re-CREATE the SUBSCRIPTION. It doesn’t say anything about “support of
copy_data in ALTER subscription ... DROP publication.” So I did not
understand the point of your comment.

11) Should 140000 be 150000, as this feature will be committed in PG15?
+               if (options->proto.logical.twophase &&
+                       PQserverVersion(conn->streamConn) >= 140000)
+                       appendStringInfoString(&cmd, ", two_phase 'on'");

Not updated. This is already a known TODO task; I will do this as soon
as PG15 development starts.

12) should we change "begin message" to "begin prepare message"
+       if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "final_lsn not set in begin message");
+       begin_data->end_lsn = pq_getmsgint64(in);
+       if (begin_data->end_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "end_lsn not set in begin message");

OK. Updated in v74.

13) should we change "commit prepare message" to "commit prepared message"
+       if (flags != 0)
+               elog(ERROR, "unrecognized flags %u in commit prepare
message", flags);
+
+       /* read fields */
+       prepare_data->commit_lsn = pq_getmsgint64(in);
+       if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "commit_lsn is not set in commit prepared message");
+       prepare_data->end_lsn = pq_getmsgint64(in);
+       if (prepare_data->end_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "end_lsn is not set in commit prepared message");
+       prepare_data->commit_time = pq_getmsgint64(in);

OK, updated in v74

14) should we change "commit prepared message" to "rollback prepared message"
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+
LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+       /* read flags */
+       uint8           flags = pq_getmsgbyte(in);
+
+       if (flags != 0)
+               elog(ERROR, "unrecognized flags %u in rollback prepare
message", flags);
+
+       /* read fields */
+       rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+       if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "prepare_end_lsn is not set in commit
prepared message");
+       rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+       if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+               elog(ERROR, "rollback_end_lsn is not set in commit
prepared message");
+       rollback_data->prepare_time = pq_getmsgint64(in);
+       rollback_data->rollback_time = pq_getmsgint64(in);
+       rollback_data->xid = pq_getmsgint(in, 4);
+
+       /* read gid (copy it into a pre-allocated buffer) */
+       strcpy(rollback_data->gid, pq_getmsgstring(in));
+}

OK. Updated in v74.

15) We can include a check of pg_stat_replication_slots to verify that
the statistics are getting updated.
+$node_publisher->safe_psql('postgres', "
+       BEGIN;
+       INSERT INTO tab_full VALUES (11);
+       PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED
'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);

Not updated. But I recorded this as a TODO task - I agree we need to
introduce some stats tests later.
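
As a rough illustration of the message-naming fixes in comments 1, 12, 13
and 14 above, here is a self-contained C sketch that reads two 64-bit LSN
fields from a byte buffer and reports an error naming the message that is
actually being parsed. It is not the patch's code: read_be64,
check_lsn_fields, INVALID_LSN and the buffer layout are hypothetical
stand-ins for pq_getmsgint64(), elog(ERROR, ...) and InvalidXLogRecPtr,
chosen only so that the example compiles without PostgreSQL headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for InvalidXLogRecPtr (an LSN of zero). */
#define INVALID_LSN ((uint64_t) 0)

/* Read a big-endian (network order) 64-bit value from p. */
static uint64_t
read_be64(const unsigned char *p)
{
    uint64_t    v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Parse two consecutive LSN fields.  On failure, write an error text that
 * names the offending field and the message kind into errbuf and return -1.
 * On success, clear errbuf and return 0.
 */
static int
check_lsn_fields(const unsigned char *buf, size_t len,
                 const char *first_name, const char *second_name,
                 const char *msg_kind,
                 uint64_t *first, uint64_t *second,
                 char *errbuf, size_t errlen)
{
    assert(buf != NULL && errbuf != NULL && errlen > 0);

    if (len < 16)
    {
        snprintf(errbuf, errlen, "%s message is too short", msg_kind);
        return -1;
    }

    *first = read_be64(buf);
    if (*first == INVALID_LSN)
    {
        snprintf(errbuf, errlen, "%s is not set in %s message",
                 first_name, msg_kind);
        return -1;
    }

    *second = read_be64(buf + 8);
    if (*second == INVALID_LSN)
    {
        snprintf(errbuf, errlen, "%s is not set in %s message",
                 second_name, msg_kind);
        return -1;
    }

    errbuf[0] = '\0';
    return 0;
}
```

With a helper of this shape, the rollback prepared reader would pass
"rollback prepared" as the message kind, so the error text can no longer
mention a commit prepared message by copy-and-paste accident.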

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#314Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#309)

On Tue, Apr 27, 2021 at 1:41 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, Apr 21, 2021 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v73*

Differences from v72* are:

* Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly)

* Minor documentation correction for protocol messages for Commit Prepared ('K')

* Non-functional code tidy (mostly proto.c) to reduce overloading
different meanings to same member names for prepare/commit times.

Please find attached a re-posting of patch set v73*

Few comments when I was having a look at the tests added:

Thanks for your feedback comments. My replies are inline below.

1) Can the below:
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
FROM tab_full where a IN (21);");
+is($result, qq(1), 'Rows committed are on the subscriber');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
FROM tab_full where a IN (22);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');

be changed to:
$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM
tab_full where a IN (21,22);");
is($result, qq(21), 'Rows committed are on the subscriber');

And the test count needs to be reduced to "use Test::More tests => 19"

OK. Updated in v74.

2) we can change tx to transaction:
+# check the tx state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM
pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM
pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');

OK. Updated in v74

3) There are few more instances present in the same file, those also
can be changed.

OK. I found no others in the same file, but there were similar cases
in the 021 TAP test. Those were also updated in v74.

4) Can the below:
check inserts are visible at subscriber(s).
# 22 should be rolled back.
# 21 should be committed.
$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM
tab_full where a IN (21);");
is($result, qq(1), 'Rows committed are present on subscriber B');
$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM
tab_full where a IN (22);");
is($result, qq(0), 'Rows rolled back are not present on subscriber B');
$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM
tab_full where a IN (21);");
is($result, qq(1), 'Rows committed are present on subscriber C');
$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM
tab_full where a IN (22);");
is($result, qq(0), 'Rows rolled back are not present on subscriber C');

be changed to:
$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where
a IN (21,22);");
is($result, qq(21), 'Rows committed are on the subscriber');
$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where
a IN (21,22);");
is($result, qq(21), 'Rows committed are on the subscriber');

And the test count needs to be reduced to "use Test::More tests => 27"

OK. Updated in v74.

5) should we change "Two phase commit" to "Two phase commit state" :
+               /*
+                * Binary, streaming, and two_phase are only supported
in v14 and
+                * higher
+                */
if (pset.sversion >= 140000)
appendPQExpBuffer(&buf,
", subbinary
AS \"%s\"\n"
-                                                         ", substream
AS \"%s\"\n",
+                                                         ", substream
AS \"%s\"\n"
+                                                         ",
subtwophasestate AS \"%s\"\n",

gettext_noop("Binary"),
-
gettext_noop("Streaming"));
+
gettext_noop("Streaming"),
+
gettext_noop("Two phase commit"));

Not updated. I think the column name is already the longest one and
this just makes it even longer - far too long IMO. I am not sure that
having the “state” suffix is any better. After all, booleans are also
states. Anyway, I did not make this change now but if people feel
strongly about it then I can revisit it.
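
The describe.c fragment in comment 5 depends on a server version gate. A
minimal sketch of that kind of gate follows, assuming the usual
PostgreSQL numeric version encoding (140000 for 14.0). The names
TWO_PHASE_MIN_VERSION, server_supports_two_phase and
append_subscription_columns are hypothetical; only the column names and
headings are taken from the quoted patch. Keeping the threshold in one
macro means it can be bumped in a single place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical name: first server version assumed to know two_phase. */
#define TWO_PHASE_MIN_VERSION 140000

/* Return 1 if a server with this version number supports two_phase. */
static int
server_supports_two_phase(int server_version_num)
{
    return server_version_num >= TWO_PHASE_MIN_VERSION;
}

/*
 * Build the extra column list for a describe-subscriptions style query.
 * The columns are only added for servers new enough to have them.
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the buffer is too small.
 */
static int
append_subscription_columns(char *buf, size_t buflen, int server_version_num)
{
    int         n;

    assert(buf != NULL && buflen > 0);

    if (!server_supports_two_phase(server_version_num))
    {
        /* Older server: no extra columns at all. */
        buf[0] = '\0';
        return 0;
    }

    n = snprintf(buf, buflen,
                 ", subbinary AS \"Binary\""
                 ", substream AS \"Streaming\""
                 ", subtwophasestate AS \"Two phase commit\"");

    /* snprintf reports the length it wanted; detect truncation. */
    if (n < 0 || (size_t) n >= buflen)
        return -1;
    return n;
}
```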

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#315Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#310)

On Tue, Apr 27, 2021 at 6:17 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, Apr 21, 2021 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v73*

Differences from v72* are:

* Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly)

* Minor documentation correction for protocol messages for Commit Prepared ('K')

* Non-functional code tidy (mostly proto.c) to reduce overloading
different meanings to same member names for prepare/commit times.

Please find attached a re-posting of patch set v73*

This is the same as yesterday's v73 but with a contrib module compile
error fixed.

Few comments on
v73-0002-Add-prepare-API-support-for-streaming-transactio.patch patch:

Thanks for your feedback comments. My replies are inline below.

1) There are slight differences in error message in case of Alter
subscription ... drop publication, we can keep the error message
similar:
postgres=# ALTER SUBSCRIPTION mysub drop PUBLICATION mypub WITH
(refresh = false, copy_data=true, two_phase=true);
ERROR: unrecognized subscription parameter: "copy_data"
postgres=# ALTER SUBSCRIPTION mysub drop PUBLICATION mypub WITH
(refresh = false, two_phase=true, streaming=true);
ERROR: cannot alter two_phase option

OK. Updated in v74.

2) We are sending txn->xid twice, I felt we should send only once in
logicalrep_write_stream_prepare:
+       /* transaction ID */
+       Assert(TransactionIdIsValid(txn->xid));
+       pq_sendint32(out, txn->xid);
+
+       /* send the flags field */
+       pq_sendbyte(out, flags);
+
+       /* send fields */
+       pq_sendint64(out, prepare_lsn);
+       pq_sendint64(out, txn->end_lsn);
+       pq_sendint64(out, txn->u_op_time.prepare_time);
+       pq_sendint32(out, txn->xid);
+

OK. Updated in v74.

3) We could remove xid and return prepare_data->xid
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in,
LogicalRepPreparedTxnData *prepare_data)
+{
+       TransactionId xid;
+       uint8           flags;
+
+       xid = pq_getmsgint(in, 4);

OK. Updated in v74.

4) Here comments can be above apply_spooled_messages for better readability
+       /*
+        * 1. Replay all the spooled operations - Similar code as for
+        * apply_handle_stream_commit (i.e. non two-phase stream commit)
+        */
+
+       ensure_transaction();
+
+       nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+

Not done. It was deliberately commented this way because the part
below the comment is what is in apply_handle_stream_commit.

5) Similarly this below comment can be above PrepareTransactionBlock
+       /*
+        * 2. Mark the transaction as prepared. - Similar code as for
+        * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+        */
+
+       /*
+        * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+        * called within the PrepareTransactionBlock below.
+        */
+       BeginTransactionBlock();
+       CommitTransactionCommand();
+
+       /*
+        * Update origin state so we can restart streaming from correct position
+        * in case of crash.
+        */
+       replorigin_session_origin_lsn = prepare_data.end_lsn;
+       replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+       PrepareTransactionBlock(gid);
+       CommitTransactionCommand();
+
+       pgstat_report_stat(false);

Not done. It is deliberately commented this way because the part below
the comment is what is in apply_handle_prepare.

6) There is a lot of common code between apply_handle_stream_prepare
and apply_handle_prepare, if possible try to have a common function to
avoid fixing at both places.
+       /*
+        * 2. Mark the transaction as prepared. - Similar code as for
+        * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+        */
+
+       /*
+        * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+        * called within the PrepareTransactionBlock below.
+        */
+       BeginTransactionBlock();
+       CommitTransactionCommand();
+
+       /*
+        * Update origin state so we can restart streaming from correct position
+        * in case of crash.
+        */
+       replorigin_session_origin_lsn = prepare_data.end_lsn;
+       replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+       PrepareTransactionBlock(gid);
+       CommitTransactionCommand();
+
+       pgstat_report_stat(false);
+
+       store_flush_position(prepare_data.end_lsn);

Not done. If you diff those functions there are really only ~ 10
statements in common so I felt it is more readable to keep it this way
than to try to make a “common” function out of an arbitrary code
fragment.

7) two-phase commit is slightly misleading, we can just mention
streaming prepare.
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+                                                       ReorderBufferTXN *txn,
+                                                       XLogRecPtr prepare_lsn)

OK. Updated in v74.

8) should we include Assert of in_streaming similar to other
pgoutput_stream*** functions.
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+                                                       ReorderBufferTXN *txn,
+                                                       XLogRecPtr prepare_lsn)
+{
+       Assert(rbtxn_is_streamed(txn));
+
+       OutputPluginUpdateProgress(ctx);
+       OutputPluginPrepareWrite(ctx, true);
+       logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+       OutputPluginWrite(ctx, true);
+}

Not done. AFAIK it is correct as-is.

9) Here also, we can verify that the transaction is streamed by
checking the pg_stat_replication_slots.
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed
on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');

Not done. If the purpose of this comment is just to confirm that the
SQL INSERT of 5000 rows of md5 data exceeds 64K then I think we can
simply take that as self-evident. We don’t need some SQL to confirm
it.

If the purpose of this is just to ensure that stats work properly with
2PC then I agree that there should be some test cases added for stats,
but this has already been recorded elsewhere as a future TODO task.
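
Comments 2 and 3 above both concern the xid in the stream prepare
message: it should be written once, and the reader can hand it back via
the struct it fills in. A self-contained sketch of that round trip
follows. It assumes a simplified field order (xid, flags, prepare LSN,
end LSN, prepare time) and omits the gid string; StreamPrepareSketch,
write_stream_prepare and read_stream_prepare are hypothetical names, not
the functions in proto.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for LogicalRepPreparedTxnData. */
typedef struct StreamPrepareSketch
{
    uint32_t    xid;
    uint64_t    prepare_lsn;
    uint64_t    end_lsn;
    int64_t     prepare_time;
} StreamPrepareSketch;

/* Encoded size: xid + flags + three 64-bit fields. */
#define STREAM_PREPARE_LEN (4 + 1 + 8 + 8 + 8)

/* Append nbytes of v to out in big-endian order; return next position. */
static unsigned char *
put_be(unsigned char *out, uint64_t v, int nbytes)
{
    for (int i = nbytes - 1; i >= 0; i--)
        *out++ = (unsigned char) (v >> (8 * i));
    return out;
}

/* Read nbytes from *in as a big-endian value and advance the cursor. */
static uint64_t
get_be(const unsigned char **in, int nbytes)
{
    uint64_t    v = 0;

    for (int i = 0; i < nbytes; i++)
        v = (v << 8) | *(*in)++;
    return v;
}

/* Encode the message; the xid is sent exactly once, at the front. */
static size_t
write_stream_prepare(unsigned char *out, const StreamPrepareSketch *txn)
{
    unsigned char *p = out;

    assert(txn->xid != 0);      /* 0 plays the role of an invalid xid */
    p = put_be(p, txn->xid, 4);
    p = put_be(p, 0, 1);        /* flags, unused in this sketch */
    p = put_be(p, txn->prepare_lsn, 8);
    p = put_be(p, txn->end_lsn, 8);
    p = put_be(p, (uint64_t) txn->prepare_time, 8);
    return (size_t) (p - out);
}

/* Decode into data and return the xid stored in the struct. */
static uint32_t
read_stream_prepare(const unsigned char *in, StreamPrepareSketch *data)
{
    data->xid = (uint32_t) get_be(&in, 4);
    (void) get_be(&in, 1);      /* flags; a real reader would check them */
    data->prepare_lsn = get_be(&in, 8);
    data->end_lsn = get_be(&in, 8);
    data->prepare_time = (int64_t) get_be(&in, 8);
    return data->xid;
}
```

Because the reader returns data->xid, there is no separate local xid
variable to keep in sync with the struct.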

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#316vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#312)

On Thu, Apr 29, 2021 at 2:23 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v74*

Differences from v73* are:

* Rebased to HEAD @ 2 days ago.

* v74 addresses most of the feedback comments from Vignesh posts [1][2][3].

Thanks for the updated patch.
Few comments:
1) I felt skey[2] should be skey as we are just using one key here.

+       ScanKeyData skey[2];
+       SysScanDesc scan;
+       bool            has_subrels;
+
+       rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+       ScanKeyInit(&skey[nkeys++],
+                               Anum_pg_subscription_rel_srsubid,
+                               BTEqualStrategyNumber, F_OIDEQ,
+                               ObjectIdGetDatum(subid));
+
+       scan = systable_beginscan(rel, InvalidOid, false,
+                                                         NULL, nkeys, skey);
+
2) I felt we can change lsn data type from Int64 to XLogRecPtr
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
3) I felt we can change xid data type from Int32 to TransactionId
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the
transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
4) Should we change this to "The end LSN of the prepared transaction"
just to avoid any confusion of it meaning commit/rollback.
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>

Similar problems related to comments 2 and 3 are being discussed in
thread [1].

[1]: /messages/by-id/CAHut+Ps2JsSd_OpBR9kXt1Rt4bwyXAjh875gUpFw6T210ttO7Q@mail.gmail.com

Regards,
Vignesh

#317Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#316)

On Mon, May 10, 2021 at 1:31 PM vignesh C <vignesh21@gmail.com> wrote:

4) Should we change this to "The end LSN of the prepared transaction"
just to avoid any confusion of it meaning commit/rollback.
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>

Can you please provide more details so I can be sure of the context of
this feedback, e.g. there are multiple places that match that patch
fragment provided. So was this suggestion to change all of them ( 'b',
'P', 'K' , 'r' of patch 0001; and also 'p' of patch 0002) ?

------
Kind Regards,
Peter Smith.
Fujitsu Australia.

#318vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#317)

On Mon, May 10, 2021 at 10:51 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, May 10, 2021 at 1:31 PM vignesh C <vignesh21@gmail.com> wrote:

4) Should we change this to "The end LSN of the prepared transaction"
just to avoid any confusion of it meaning commit/rollback.
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>

Can you please provide more details so I can be sure of the context of
this feedback, e.g. there are multiple places that match that patch
fragment provided. So was this suggestion to change all of them ( 'b',
'P', 'K' , 'r' of patch 0001; and also 'p' of patch 0002) ?

My suggestion was for all of them.

Regards,
Vignesh

#319Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#318)
2 attachment(s)

Please find attached the latest patch set v75*

Differences from v74* are:

* Rebased to HEAD @ today.

* v75 also addresses some of the feedback comments from Vignesh [1].

----
[1]: /messages/by-id/CALDaNm3U4fGxTnQfaT1TqUkgX5c0CSDvmW12Bfksis8zB_XinA@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v75-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 407a93a2d4960c4cb05483cd688c3a564a8bf622 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 13 May 2021 18:47:24 +1000
Subject: [PATCH v75] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 313 ++++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  69 +++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 135 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 218 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 201 ++++++++++--
 src/backend/replication/logical/worker.c           | 341 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 291 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 232 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 44 files changed, 2342 insertions(+), 206 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
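
Reviewer note (not part of the patch): the subscription-side state handling in the diff below can be summarized in a small standalone sketch. The macro and function names here are hypothetical stand-ins, not the identifiers used in the tree; the state codes 'd', 'p' and 'e' are the ones documented for subtwophasestate in the catalogs.sgml hunk, and the refresh rule mirrors the checks added to AlterSubscription().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the documented subtwophasestate codes (assumed names). */
#define TWOPHASE_DISABLED 'd'
#define TWOPHASE_PENDING  'p'
#define TWOPHASE_ENABLED  'e'

/* Human-readable name for a state code, or NULL for an unknown code. */
const char *
twophase_state_name(char state)
{
	switch (state)
	{
		case TWOPHASE_DISABLED:
			return "disabled";
		case TWOPHASE_PENDING:
			return "pending";
		case TWOPHASE_ENABLED:
			return "enabled";
		default:
			return NULL;
	}
}

/*
 * State stored at CREATE SUBSCRIPTION time: a requested two_phase starts
 * out as "pending" and is only enabled after initial table sync.
 */
char
initial_twophase_state(bool requested)
{
	return requested ? TWOPHASE_PENDING : TWOPHASE_DISABLED;
}

/*
 * Mirrors the refresh restriction: a refresh that copies data is rejected
 * once two_phase has reached the "enabled" state.
 */
bool
refresh_allowed(char state, bool copy_data)
{
	assert(twophase_state_name(state) != NULL);
	return !(state == TWOPHASE_ENABLED && copy_data);
}
```

With these helpers, a subscription created with two_phase = true starts at 'p', and a refresh with copy_data = true is still accepted until the state becomes 'e'.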

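Reviewer note (not part of the patch): for anyone writing a client against protocol version 3, here is a minimal sketch of decoding the Begin Prepare message as laid out in the protocol.sgml hunk below (Byte1('b'), three Int64 fields, an Int32 xid, then the GID string). It assumes the usual wire conventions of network byte order for integers and a NUL-terminated string; the struct and function names are hypothetical and do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the decoded fields; not a PostgreSQL struct. */
typedef struct BeginPrepareMsg
{
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepare */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* transaction id */
	const char *gid;			/* points into the caller's buffer */
} BeginPrepareMsg;

/* Read an n-byte big-endian (network order) unsigned integer. */
static uint64_t
read_be(const unsigned char *p, size_t n)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < n; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Decode one Begin Prepare message from buf. Returns 0 on success and
 * -1 if the type byte is wrong, the buffer is too short, or the GID is
 * not NUL-terminated within the buffer.
 */
int
parse_begin_prepare(const unsigned char *buf, size_t len, BeginPrepareMsg *out)
{
	/* type byte + three Int64 fields + Int32 xid */
	const size_t fixed = 1 + 8 + 8 + 8 + 4;

	assert(buf != NULL && out != NULL);

	if (len <= fixed || buf[0] != 'b')
		return -1;
	if (memchr(buf + fixed, '\0', len - fixed) == NULL)
		return -1;

	out->prepare_lsn = read_be(buf + 1, 8);
	out->end_lsn = read_be(buf + 9, 8);
	out->prepare_time = (int64_t) read_be(buf + 17, 8);
	out->xid = (uint32_t) read_be(buf + 25, 4);
	out->gid = (const char *) (buf + fixed);
	return 0;
}
```

The GID pointer aliases the input buffer, so the caller has to keep the buffer alive or copy the string. Prepare, Commit Prepared and Rollback Prepared would need their own parsers because they carry an extra flags byte (and, for Rollback Prepared, a second timestamp).
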
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..9393c85 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 6d06ad2..df9e41c 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a7ec5c3..493432d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this, users must not take
+       locks on catalog tables (for example, with an explicit
+       <command>LOCK</command> command) in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..80016c6 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Commit Prepared messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7364,6 +7385,278 @@ Stream Abort
 
 </variablelist>
 
+<!-- ==================== TWO_PHASE Messages ==================== -->
+
+<para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies this message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies this message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies this message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the commit transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies this message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the rollback transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
 <para>
 
 The following message parts are shared by the above messages.
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, the decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..93093ce 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,72 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching prepared xacts
+ * that came from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5c84d75..9b941e9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1254,5 +1254,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..020c7cf 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * missing of transactions and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables are in progress. See
+				 * comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -814,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +909,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +938,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +984,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1001,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1043,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1061,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1101,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..58b4e2c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing
+				 * preparing transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c387997 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transactions' commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..98d2b00 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 }
 
 /*
@@ -107,6 +107,218 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1053,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 }
 
 /*
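(Reviewer's illustration, not part of the patch.) The PREPARE message written by logicalrep_write_prepare above is laid out as: message type, flags byte, prepare_lsn, end_lsn, prepare_time, xid, then the NUL-terminated gid. The standalone sketch below mimics that layout with plain byte buffers. Everything prefixed sketch_/SKETCH_ is hypothetical: the tag value, the GID buffer size, and the helpers are assumptions standing in for the pq_send*/pq_getmsg* routines, and unlike the real reader (where the dispatcher consumes the type byte) the reader here checks the type byte itself. It also copies the gid with an explicit bound rather than strcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MSG_PREPARE 'P'	/* placeholder tag, assumed for this sketch */
#define SKETCH_GIDSIZE 200		/* assumed size of the pre-allocated GID buffer */

typedef struct SketchPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Store the low nbytes of v in big-endian (network) order. */
static size_t
sketch_put_be(unsigned char *p, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		p[i] = (unsigned char) (v >> (8 * (nbytes - 1 - i)));
	return (size_t) nbytes;
}

/* Load an nbytes-wide big-endian integer. */
static uint64_t
sketch_get_be(const unsigned char *p, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
static size_t
sketch_write_prepare(unsigned char *buf, size_t buflen, const SketchPrepare *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (buflen < 30 + gidlen)
		return 0;

	buf[off++] = SKETCH_MSG_PREPARE;
	buf[off++] = 0;				/* flags, currently always zero */
	off += sketch_put_be(buf + off, m->prepare_lsn, 8);
	off += sketch_put_be(buf + off, m->end_lsn, 8);
	off += sketch_put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += sketch_put_be(buf + off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Returns 0 on success, -1 if the message is malformed. */
static int
sketch_read_prepare(const unsigned char *buf, size_t len, SketchPrepare *m)
{
	const unsigned char *nul;
	size_t		off = 2;
	size_t		gidlen;

	/* 30 fixed bytes plus at least the GID terminator */
	if (len < 31 || buf[0] != SKETCH_MSG_PREPARE || buf[1] != 0)
		return -1;

	m->prepare_lsn = sketch_get_be(buf + off, 8);
	off += 8;
	m->end_lsn = sketch_get_be(buf + off, 8);
	off += 8;
	m->prepare_time = (int64_t) sketch_get_be(buf + off, 8);
	off += 8;
	m->xid = (uint32_t) sketch_get_be(buf + off, 4);
	off += 4;

	/* both LSNs must be set, as in the reader above */
	if (m->prepare_lsn == 0 || m->end_lsn == 0)
		return -1;

	/* bounded copy of the GID: reject unterminated or oversized strings */
	nul = memchr(buf + off, '\0', len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off)) + 1;
	if (gidlen > sizeof(m->gid))
		return -1;
	memcpy(m->gid, buf + off, gidlen);
	return 0;
}
```

A round trip through sketch_write_prepare and sketch_read_prepare should give back the same fields; a buffer truncated before the GID terminator is rejected rather than read past its end.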
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b0ab91c..bd36eb3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2548,7 +2548,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->u_op_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2639,7 +2639,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->u_op_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2686,7 +2686,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->u_op_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2706,7 +2706,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2725,19 +2725,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->u_op_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2755,12 +2756,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->u_op_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->u_op_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because two-phase was not
+	 * previously enabled, or that are not covered by the initial snapshot,
+	 * need to be sent later along with commit prepared, and they must be
+	 * before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
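(Reviewer's illustration, not part of the patch.) The renamed two_phase_at LSN is what ReorderBufferFinishPrepared compares against to decide whether the prepared transaction's changes still have to be sent together with COMMIT PREPARED. A minimal standalone restatement of that condition follows; the sketch_ names and the SketchLsn typedef are hypothetical stand-ins for XLogRecPtr and the real function.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLsn;		/* stand-in for XLogRecPtr in this sketch */

/*
 * Mirrors "(txn->final_lsn < two_phase_at) && is_commit": a transaction
 * prepared before two-phase decoding was enabled was never sent at prepare
 * time, so it must be replayed when COMMIT PREPARED is decoded. Nothing is
 * replayed for ROLLBACK PREPARED.
 */
static bool
sketch_replay_at_finish(SketchLsn prepare_lsn, SketchLsn two_phase_at,
						bool is_commit)
{
	return prepare_lsn < two_phase_at && is_commit;
}
```

For example, with two_phase_at = 200, a prepare at LSN 100 is replayed at commit prepared, while a prepare at LSN 300 was already sent at prepare time and is not.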
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..0c43a89 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY.
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
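(Reviewer's illustration, not part of the patch.) The tablesync.c and subscriptioncmds.c changes above amount to three small rules on the two_phase tri-state: when all tablesyncs count as READY, when the apply worker restarts, and when a refresh is refused. The standalone sketch below restates them. The enum and the sketch_ functions are hypothetical; the patch itself uses the LOGICALREP_TWOPHASE_STATE_* values stored in pg_subscription, and takes the table counts from the catalog rather than as arguments.

```c
#include <stdbool.h>

/* Stand-in for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum SketchTwoPhaseState
{
	SKETCH_TWOPHASE_DISABLED,
	SKETCH_TWOPHASE_PENDING,
	SKETCH_TWOPHASE_ENABLED
} SketchTwoPhaseState;

/* As in AllTablesyncsReady: false when there are no tables at all. */
static bool
sketch_all_tablesyncs_ready(int ntables, int nnotready)
{
	return ntables > 0 && nnotready == 0;
}

/*
 * As in process_syncing_tables_for_apply: the apply worker exits (to be
 * restarted) only while PENDING and once every tablesync is READY.
 */
static bool
sketch_apply_worker_should_restart(SketchTwoPhaseState state,
								   int ntables, int nnotready)
{
	return state == SKETCH_TWOPHASE_PENDING &&
		sketch_all_tablesyncs_ready(ntables, nnotready);
}

/*
 * As in AlterSubscription: a refresh is refused only when two_phase is
 * ENABLED and copy_data is requested.
 */
static bool
sketch_refresh_allowed(SketchTwoPhaseState state, bool copy_data)
{
	return !(state == SKETCH_TWOPHASE_ENABLED && copy_data);
}
```

Note how a subscription with zero tables never triggers the restart, so it stays PENDING and a later refresh with copy_data remains possible.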
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 60bf7f7..9b7276d 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because, say,
+ * it was prior to the initial consistent point, but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions, the subscription two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider: initially it was on and we
+ * have received some prepare, then we turn it off; now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we could allow changing this option from off to on, but
+ * it is not clear that that alone would be useful.
+ *
+ * Finally, to avoid the problems mentioned in previous paragraphs for any
+ * subsequent (not READY) tablesyncs (which would need two_phase toggled from
+ * 'on' to 'off' and then back to 'on'), there is a restriction on
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get a prepare of the same GID more than once in genuine cases where
+ * we have defined multiple subscriptions for publications on the same server
+ * and the prepared transaction has operations on tables subscribed to by
+ * those subscriptions. For such cases, if we use the GID sent by the
+ * publisher, one of the prepares will succeed and the others will fail, in
+ * which case the server will send them again. Now, this can lead to a
+ * deadlock if the user has set synchronous_standby_names for all the
+ * subscriptions on the subscriber. To avoid such deadlocks, we generate a
+ * unique GID (consisting of the subscription oid and the xid of the prepared
+ * transaction) for each prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
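The tri-state life cycle described in the header comment above can be sketched as a pair of small standalone functions. This is only an illustration, not code from the patch: the helper names `initial_twophase_state` and `twophase_state_after_worker_start`, and the `has_tables` / `all_tablesyncs_ready` flags, are assumptions made for the example. Only the three state characters are taken from the patch's pg_subscription.h.

```c
#include <assert.h>
#include <stdbool.h>

/* State characters as defined by the patch in pg_subscription.h. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'

/* Hypothetical helper: the state a new subscription starts in. */
static char
initial_twophase_state(bool two_phase_requested)
{
	return two_phase_requested ? LOGICALREP_TWOPHASE_STATE_PENDING
		: LOGICALREP_TWOPHASE_STATE_DISABLED;
}

/*
 * Hypothetical helper: the state after the apply worker (re)starts.
 * PENDING is promoted only when the subscription has tables and every
 * tablesync has reached READY; all other inputs leave the state unchanged.
 */
static char
twophase_state_after_worker_start(char state, bool has_tables,
								  bool all_tablesyncs_ready)
{
	assert(state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
		   state == LOGICALREP_TWOPHASE_STATE_PENDING ||
		   state == LOGICALREP_TWOPHASE_STATE_ENABLED);

	if (state == LOGICALREP_TWOPHASE_STATE_PENDING &&
		has_tables && all_tablesyncs_ready)
		return LOGICALREP_TWOPHASE_STATE_ENABLED;

	return state;
}
```

Under these assumptions a subscription created with two_phase = on and no tables stays at 'p', which is the case the comment says keeps REFRESH PUBLICATION usable.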
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,9 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -759,6 +835,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1988,6 +2238,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2465,6 +2731,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2951,6 +3220,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3117,15 +3400,67 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
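To illustrate why the worker.c changes above derive the subscriber-side GID from the subscription OID and the remote xid, here is a minimal standalone sketch. It reuses the "pg_gid_%u_%u" format from TwoPhaseTransactionGid in the patch; everything else (the names `make_subscriber_gid`, `gids_collide` and `GID_BUF_SIZE`, and the use of plain `unsigned int` in place of Oid and TransactionId) is an assumption made so the example compiles outside the PostgreSQL tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define GID_BUF_SIZE 200		/* assumption: stands in for GIDSIZE */

/* Build "pg_gid_<subid>_<xid>" using the same format string as the patch. */
static void
make_subscriber_gid(unsigned int subid, unsigned int xid,
					char *gid, size_t szgid)
{
	assert(subid != 0);			/* 0 is the invalid OID */
	assert(xid != 0);			/* 0 is the invalid transaction id */

	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}

/* True when two subscriptions would prepare under the same GID. */
static bool
gids_collide(unsigned int subid_a, unsigned int subid_b, unsigned int xid)
{
	char		a[GID_BUF_SIZE];
	char		b[GID_BUF_SIZE];

	make_subscriber_gid(subid_a, xid, a, sizeof(a));
	make_subscriber_gid(subid_b, xid, b, sizeof(b));

	return strcmp(a, b) == 0;
}
```

With this scheme two subscriptions that receive the same remote transaction prepare under different local GIDs, so neither PREPARE fails and has to be resent, which is the situation the header comment ties to the deadlock.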
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
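The option checks that the pgoutput.c changes above spread across parse_output_parameters and pgoutput_startup can be summarised in one standalone decision function. This is a sketch under assumptions, not the patch's code: the enum `TwoPhaseCheck`, the function `check_two_phase_request` and the constant `PROTO_TWOPHASE_VERSION` are hypothetical names, and the real code raises ereport(ERROR) where this sketch returns an error value.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors LOGICALREP_PROTO_TWOPHASE_VERSION_NUM (3) from the patch. */
#define PROTO_TWOPHASE_VERSION 3

/* Hypothetical result codes for the example. */
typedef enum TwoPhaseCheck
{
	TWOPHASE_OFF,				/* option not requested */
	TWOPHASE_ON,				/* option accepted */
	TWOPHASE_ERR_STREAMING,		/* two_phase and streaming both given */
	TWOPHASE_ERR_PROTO,			/* proto_version too old */
	TWOPHASE_ERR_PLUGIN			/* decoding context lacks two-phase support */
} TwoPhaseCheck;

static TwoPhaseCheck
check_two_phase_request(bool two_phase, bool streaming,
						int proto_version, bool ctx_twophase)
{
	assert(proto_version >= 1);

	/* Checked first, while the options are being parsed. */
	if (two_phase && streaming)
		return TWOPHASE_ERR_STREAMING;

	/* The remaining checks happen at start-up, in this order. */
	if (!two_phase)
		return TWOPHASE_OFF;
	if (proto_version < PROTO_TWOPHASE_VERSION)
		return TWOPHASE_ERR_PROTO;
	if (!ctx_twophase)
		return TWOPHASE_ERR_PLUGIN;

	return TWOPHASE_ON;
}
```

Only the TWOPHASE_ON outcome corresponds to setting ctx->twophase_opt_given to true in the patch.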
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
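The grammar change above places the optional TWO_PHASE keyword after the optional TEMPORARY keyword and before LOGICAL. A client-side sketch of assembling a command in that order might look as follows. The function `build_create_slot_cmd` is hypothetical, and the sketch deliberately skips identifier quoting and the trailing slot options, so it should not be read as what libpqwalreceiver actually sends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a CREATE_REPLICATION_SLOT command with the
 * keyword order accepted by the patched grammar. Returns what snprintf
 * returns, so the caller can detect truncation.
 */
static int
build_create_slot_cmd(char *buf, size_t buflen, const char *slotname,
					  bool temporary, bool two_phase, const char *plugin)
{
	assert(buf != NULL && slotname != NULL && plugin != NULL);
	assert(strlen(slotname) > 0 && strlen(plugin) > 0);

	return snprintf(buf, buflen,
					"CREATE_REPLICATION_SLOT %s%s%s LOGICAL %s",
					slotname,
					temporary ? " TEMPORARY" : "",
					two_phase ? " TWO_PHASE" : "",
					plugin);
}
```

Because opt_two_phase has an empty alternative, a command built with two_phase = false is identical to what the grammar accepted before the patch.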
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index e384690..d58ff5d 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4359,6 +4360,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4402,9 +4404,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4425,6 +4434,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4450,6 +4460,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4477,6 +4489,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4518,6 +4531,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
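One consequence of the pg_dump change above is that both the PENDING and ENABLED states are dumped as "two_phase = on", since only the DISABLED state suppresses the option. A standalone sketch of that decision, with the hypothetical name `dump_two_phase_option` and the state character 'd' copied from the patch's pg_subscription.h:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Mirrors LOGICALREP_TWOPHASE_STATE_DISABLED from the patch. */
#define TWOPHASE_STATE_DISABLED 'd'

/*
 * Hypothetical helper: decide whether ", two_phase = on" should be appended
 * to the dumped CREATE SUBSCRIPTION. The argument is the catalog column
 * value fetched as text, as pg_dump stores it.
 */
static bool
dump_two_phase_option(const char *subtwophasestate)
{
	const char	disabled[] = {TWOPHASE_STATE_DISABLED, '\0'};

	assert(subtwophasestate != NULL);

	return strcmp(subtwophasestate, disabled) != 0;
}
```

On restore, per the worker.c header comment, a subscription created with two_phase = on starts in PENDING again regardless of which of the two states it was dumped from.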
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3e39fdb..920f083 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6598c53..194c322 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * (prepare_end_lsn and prepare_time) is used to check whether the downstream
+ * has received this prepared transaction. If it has, it can apply the
+ * rollback; otherwise, it can skip the rollback operation. The gid alone is
+ * not sufficient because the downstream node can have a prepared transaction
+ * with the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 53cdfa5..61fda1e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -311,7 +311,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} u_op_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -650,7 +654,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..d72082c 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..91d9032
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,291 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 19;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..2bea214
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,232 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..2cfc1ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,9 +1388,11 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

Attachment: v75-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From b46ddc1b34ab6b07cfd374453aa78ecc8d9c61fb Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 13 May 2021 19:17:37 +1000
Subject: [PATCH v75] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  70 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 450 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 268 ++++++++++++
 11 files changed, 1012 insertions(+), 78 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
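
For readers following the protocol.sgml change in this patch, the Stream
Prepare layout can be sketched as a standalone decoder. This is only an
illustration, not code from the patch: StreamPrepareMsg, parse_stream_prepare,
read_be32, read_be64 and SKETCH_GIDSIZE are hypothetical names. It assumes the
message-type byte 'p' is still at the front of the buffer (in the patch,
apply_dispatch has already consumed it before logicalrep_read_stream_prepare
runs), that integers arrive in network byte order, and that the GID buffer is
200 bytes, mirroring GIDSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: same size as PostgreSQL's GIDSIZE. */
#define SKETCH_GIDSIZE 200

/* Hypothetical holder for the fields of one Stream Prepare message. */
typedef struct StreamPrepareMsg
{
	uint8_t		flags;			/* currently unused, must be 0 */
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* transaction id */
	char		gid[SKETCH_GIDSIZE];	/* user-defined GID */
} StreamPrepareMsg;

/* Read a big-endian (network order) 32-bit integer. */
static uint32_t
read_be32(const uint8_t *p)
{
	uint32_t	v = 0;

	for (int i = 0; i < 4; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian (network order) 64-bit integer. */
static uint64_t
read_be64(const uint8_t *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Decode one Stream Prepare message from buf[0..len).
 * Returns 0 on success, -1 if the buffer is malformed.
 */
static int
parse_stream_prepare(const uint8_t *buf, size_t len, StreamPrepareMsg *out)
{
	/* type byte + flags + three Int64 fields + Int32 xid */
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;
	size_t		pos = 0;
	const uint8_t *nul;
	size_t		gidlen;

	assert(buf != NULL && out != NULL);

	/* Need the fixed part plus at least the GID's terminating NUL. */
	if (len < fixed + 1 || buf[pos] != 'p')
		return -1;
	pos += 1;

	out->flags = buf[pos];
	pos += 1;
	if (out->flags != 0)
		return -1;				/* unrecognized flags */

	out->prepare_lsn = read_be64(buf + pos);
	pos += 8;
	out->end_lsn = read_be64(buf + pos);
	pos += 8;
	out->prepare_time = (int64_t) read_be64(buf + pos);
	pos += 8;
	out->xid = read_be32(buf + pos);
	pos += 4;

	/* The GID is a NUL-terminated string; reject it if it does not fit. */
	nul = memchr(buf + pos, 0, len - pos);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + pos));
	if (gidlen >= sizeof(out->gid))
		return -1;
	memcpy(out->gid, buf + pos, gidlen + 1);

	return 0;
}
```

A caller would hand parse_stream_prepare the raw payload of one logical
replication message and check the return value before using any field. The
length check on the GID is deliberate: the sketch refuses an oversized GID
rather than copying it unbounded into the fixed-size buffer.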

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 80016c6..d13b58b 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Commit Prepared messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7388,7 +7388,8 @@ Stream Abort
 <!-- ==================== TWO_PHASE Messages ==================== -->
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared,
+Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7655,6 +7656,71 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is the number
+                of microseconds since the PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 020c7cf..f7d175d 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -909,12 +894,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 98d2b00..7ebfd91 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -319,6 +319,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9b7276d..549af1e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -322,6 +322,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -1009,6 +1011,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to deadlock
+	 * when multiple subscriptions on the same node point to publications on
+	 * the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1206,30 +1288,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1237,7 +1310,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1252,7 +1325,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1327,6 +1400,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2254,6 +2352,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..0ac4433
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,450 @@
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (the delete at step 3 does nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..e8bb726
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,268 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#320Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#316)

On Mon, May 10, 2021 at 1:31 PM vignesh C <vignesh21@gmail.com> wrote:

On Thu, Apr 29, 2021 at 2:23 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v74*

Differences from v73* are:

* Rebased to HEAD @ 2 days ago.

* v74 addresses most of the feedback comments from Vignesh posts [1][2][3].

Thanks for the updated patch.
Few comments:
1) I felt skey[2] should be skey as we are just using one key here.

+       ScanKeyData skey[2];
+       SysScanDesc scan;
+       bool            has_subrels;
+
+       rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+       ScanKeyInit(&skey[nkeys++],
+                               Anum_pg_subscription_rel_srsubid,
+                               BTEqualStrategyNumber, F_OIDEQ,
+                               ObjectIdGetDatum(subid));
+
+       scan = systable_beginscan(rel, InvalidOid, false,
+                                                         NULL, nkeys, skey);
+

Fixed in v75.

2) I felt we can change lsn data type from Int64 to XLogRecPtr
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>

Deferred.

3) I felt we can change xid data type from Int32 to TransactionId
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>

Deferred.

4) Should we change this to "The end LSN of the prepared transaction"
just to avoid any confusion of it meaning commit/rollback.
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>

Modified in v75 for message types 'b', 'P', 'K', 'r', 'p'.

Similar problems related to comments 2 and 3 are being discussed at
[1], we can change it accordingly based on the conclusion in the other
thread.
[1] - /messages/by-id/CAHut+Ps2JsSd_OpBR9kXt1Rt4bwyXAjh875gUpFw6T210ttO7Q@mail.gmail.com

Yes, I will defer addressing those feedback comments 2 and 3 pending
the outcome of your other patch of the above thread.
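
For anyone following comments 2 and 3: the wire fields stay documented as Int64 and Int32, which are the same widths as XLogRecPtr and TransactionId. Below is a minimal sketch, not code from the patch, of how a client might read the Begin Prepare layout as it is described in protocol.sgml of this patch set. The struct, the function names and the 200-byte GID buffer are hypothetical and chosen only for illustration; it assumes the protocol's usual network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the documented Begin Prepare fields. */
typedef struct BeginPrepareMsg
{
	uint64_t	prepare_lsn;	/* Int64: the LSN of the prepare */
	uint64_t	end_lsn;		/* Int64: the end LSN of the prepare */
	int64_t		prepare_time;	/* Int64: microseconds since 2000-01-01 */
	uint32_t	xid;			/* Int32: transaction id */
	char		gid[200];		/* String: NUL-terminated GID (size assumed) */
} BeginPrepareMsg;

/* Read a big-endian (network order) 64-bit integer. */
static uint64_t
read_u64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian (network order) 32-bit integer. */
static uint32_t
read_u32(const unsigned char *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
		((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Returns 0 on success, -1 if the buffer is not a well-formed 'b' message. */
int
parse_begin_prepare(const unsigned char *buf, size_t len, BeginPrepareMsg *out)
{
	const size_t fixed = 1 + 8 + 8 + 8 + 4; /* type byte + fixed-width fields */
	const unsigned char *gid;
	const unsigned char *nul;
	size_t		gidlen;

	assert(buf != NULL && out != NULL);

	if (len <= fixed || buf[0] != 'b')
		return -1;

	out->prepare_lsn = read_u64(buf + 1);
	out->end_lsn = read_u64(buf + 9);
	out->prepare_time = (int64_t) read_u64(buf + 17);
	out->xid = read_u32(buf + 25);

	/* The GID must be terminated inside the message and fit our buffer. */
	gid = buf + fixed;
	nul = memchr(gid, '\0', len - fixed);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= sizeof(out->gid))
		return -1;
	memcpy(out->gid, gid, gidlen + 1);
	return 0;
}

/* Convert microseconds since the PostgreSQL epoch to Unix seconds. */
int64_t
pg_time_to_unix_seconds(int64_t pg_usecs)
{
	return pg_usecs / 1000000 + INT64_C(946684800);
}
```

Whatever the other thread concludes about naming the types in the docs, a reader like this would not need to change, since only the labels are under discussion and not the field widths.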

----------
Kind Regards,
Peter Smith.
Fujitsu Australia

#321Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#319)
3 attachment(s)

On Thu, May 13, 2021 at 7:50 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v75*

Differences from v74* are:

* Rebased to HEAD @ today.

* v75 also addresses some of the feedback comments from Vignesh [1].

Adding a patch to this patch-set that avoids sending empty transactions
to the subscriber/replica. This patch is based on the logic that was
proposed for empty transactions in the thread [1]. It builds on that
patch and handles empty prepared transactions as well, so empty prepared
transactions are no longer sent to the subscriber/replica. The patch also
avoids sending COMMIT PREPARED / ROLLBACK PREPARED if the prepared
transaction was skipped, provided the COMMIT / ROLLBACK happens prior to
a restart of the walsender. If the COMMIT / ROLLBACK PREPARED happens
after a restart, the walsender will not be able to know that the prepared
transaction prior to the restart was not sent. In this case the apply
worker of the subscription will check if a prepare of the same type
exists, and if it does not, it will silently ignore the COMMIT PREPARED
(the ROLLBACK PREPARED logic was already doing this).
Do have a look and let me know if you have any comments.
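
To make the two decision points above concrete, here is a rough standalone model of them. It is only a sketch and is not taken from the patch: TxnState, both function names and the plain array of GIDs are hypothetical, and it assumes that "a prepare of the same type" is looked up by GID on the subscriber.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory state the sender keeps for a prepared transaction. */
typedef struct TxnState
{
	bool		prepare_sent;	/* false if the prepare was empty and skipped */
} TxnState;

/*
 * Sender side: should COMMIT PREPARED / ROLLBACK PREPARED be sent?
 * A NULL state models a walsender restart, after which it is no longer
 * known whether the prepare was skipped, so the message has to be sent.
 */
bool
sender_should_send_finish(const TxnState *state)
{
	if (state == NULL)
		return true;
	return state->prepare_sent;
}

/*
 * Subscriber side: apply the finish message only if a prepared transaction
 * with this GID exists locally; otherwise it is silently ignored.
 */
bool
subscriber_should_apply_finish(const char *const *prepared_gids, size_t n,
							   const char *gid)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		if (strcmp(prepared_gids[i], gid) == 0)
			return true;
	}
	return false;
}
```

The NULL case is the one that makes the subscriber-side check necessary: after a restart the sender has to err on the side of sending, and the apply worker filters out the finish messages that have no matching prepare.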

[1]: /messages/by-id/CAFPTHDYegcoS3xjGBj0XHfcdZr6Y35+YG1jq79TBD1VCkK7v3A@mail.gmail.com

regards,
Ajin Cherian
Fujitsu Australia.

Attachments:

v76-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 224cdde13ee43214b6bf3e088e94c10d59ff09f3 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 14 May 2021 09:24:11 -0400
Subject: [PATCH v76] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 313 ++++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  69 +++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 135 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 218 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 201 ++++++++++--
 src/backend/replication/logical/worker.c           | 341 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 291 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 232 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 44 files changed, 2342 insertions(+), 206 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..9393c85 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 6d06ad2..df9e41c 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a7ec5c3..493432d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this, users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..80016c6 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Commit Prepared messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7364,6 +7385,278 @@ Stream Abort
 
 </variablelist>
 
+<!-- ==================== TWO_PHASE Messages ==================== -->
+
+<para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies this message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies this message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies this message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the commit transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies this message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the rollback transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
 <para>
 
 The following message parts are shared by the above messages.
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION. By default, a transaction prepared on
+          the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that replication has
+          successfully passed the initial table synchronization phase. This means
+          that even when <literal>two_phase</literal> is enabled for the subscription,
+          the internal two-phase state temporarily remains "pending" until the
+          initialization phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to find the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..93093ce 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,72 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only the
+ * GID is not sufficient because a different prepared xact with the same GID
+ * can exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid matching prepared xacts
+ * from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We expect neither collisions of GXACTs (same gid) between
+			 * publisher and subscribers nor apply worker restarts after
+			 * prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5c84d75..9b941e9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1254,5 +1254,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..020c7cf 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica. See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled; enable it once all the tables are synced
+				 * and ready. This avoids race conditions such as prepared
+				 * transactions being skipped because their changes are not
+				 * applied due to checks in should_apply_changes_for_rel()
+				 * while tablesync for the corresponding tables is in
+				 * progress. See comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -814,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +909,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +938,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +984,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1001,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1043,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1061,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1101,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state,
+				 * we must not allow any subsequent table initialization to
+				 * occur. So ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user has requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..58b4e2c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing
+				 * preparing transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c387997 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is enabled
+	 * at the time of slot creation, or when the two_phase option is given
+	 * at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is enabled
+	 * at the time of slot creation, or when the two_phase option is given
+	 * at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..98d2b00 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 }
 
 /*
@@ -107,6 +107,218 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1053,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 }
 
 /*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b0ab91c..bd36eb3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2548,7 +2548,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->u_op_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2639,7 +2639,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->u_op_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2686,7 +2686,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->u_op_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2706,7 +2706,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2725,19 +2725,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->u_op_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2755,12 +2756,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->u_op_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->u_op_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because two-phase was not
+	 * previously enabled, or that are not covered by the initial snapshot,
+	 * need to be sent later along with commit prepared, and they must be
+	 * before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..0c43a89 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If any not-READY relations were found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1432554..74f5678 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider: initially it was on and we
+ * received some prepare, then we turn it off; now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we could allow changing this option from off to on, but it is not
+ * clear that that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,9 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -759,6 +835,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1987,6 +2237,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2464,6 +2730,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2950,6 +3219,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3116,15 +3399,67 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option was given in the
+		 * plugin options; whether to enable it is decided at a later point. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable streaming and two-phase decoding during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send replication origin, if requested and known */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 339c393..cb70974 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4359,6 +4360,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4402,9 +4404,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4425,6 +4434,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4450,6 +4460,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4477,6 +4489,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4518,6 +4531,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3e39fdb..920f083 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6598c53..194c322 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Two-phase state: 'd', 'p' or 'e' */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Two-phase state: 'd', 'p' or 'e' */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check whether the downstream
+ * has received this prepared transaction. If so, it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 53cdfa5..61fda1e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -311,7 +311,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} u_op_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -650,7 +654,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..d72082c 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..91d9032
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,291 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 19;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..2bea214
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,232 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..2cfc1ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,9 +1388,11 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

Attachment: v76-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 261247ac0b0d02907fbda5d75fc58944c8d6b2ad Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 14 May 2021 10:02:33 -0400
Subject: [PATCH v76] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
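
For readers following the protocol change, here is a minimal standalone sketch of how a client might decode the new Stream Prepare message. The field order follows the protocol.sgml hunk in this patch, and integers are assumed to be in network byte order as elsewhere in the replication protocol. The names StreamPrepareSketch, decode_stream_prepare_sketch and SKETCH_GIDSIZE are hypothetical and are not part of the patch; the value 200 is an assumption meant to mirror the server's GIDSIZE. Unlike logicalrep_read_stream_prepare(), this sketch also checks the leading 'p' type byte, which the apply worker consumes before dispatching.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: mirrors the server-side GIDSIZE limit for two-phase GIDs. */
#define SKETCH_GIDSIZE 200

/* Hypothetical client-side view of a Stream Prepare message. */
typedef struct StreamPrepareSketch
{
	uint8_t		flags;			/* currently unused, must be 0 */
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* transaction id */
	char		gid[SKETCH_GIDSIZE];	/* user-defined GID */
} StreamPrepareSketch;

/* Read a big-endian (network order) 64-bit integer. */
static uint64_t
read_be64(const uint8_t *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian (network order) 32-bit integer. */
static uint32_t
read_be32(const uint8_t *p)
{
	uint32_t	v = 0;

	for (int i = 0; i < 4; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Decode one Stream Prepare message, including its leading type byte.
 * Returns 0 on success and -1 if the buffer is malformed.
 */
static int
decode_stream_prepare_sketch(const uint8_t *buf, size_t len,
							 StreamPrepareSketch *out)
{
	/* type + flags + prepare LSN + end LSN + timestamp + xid */
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	assert(buf != NULL && out != NULL);

	/* Need the fixed part plus at least the GID's terminating NUL. */
	if (len < fixed + 1 || buf[0] != 'p')
		return -1;

	out->flags = buf[1];
	if (out->flags != 0)
		return -1;

	out->prepare_lsn = read_be64(buf + 2);
	out->end_lsn = read_be64(buf + 10);
	out->prepare_time = (int64_t) read_be64(buf + 18);
	out->xid = read_be32(buf + 26);

	/* The GID is a NUL-terminated string; bound the copy explicitly. */
	gid = buf + fixed;
	nul = memchr(gid, 0, len - fixed);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= SKETCH_GIDSIZE)
		return -1;
	memcpy(out->gid, gid, gidlen + 1);

	return 0;
}
```

The decoder rejects nonzero flags, matching the check in logicalrep_read_stream_prepare(), and bounds the GID copy rather than trusting the sender.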
---
 doc/src/sgml/protocol.sgml                         |  70 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 450 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 268 ++++++++++++
 11 files changed, 1012 insertions(+), 78 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 80016c6..d13b58b 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Commit Prepared messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7388,7 +7388,8 @@ Stream Abort
 <!-- ==================== TWO_PHASE Messages ==================== -->
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared,
+Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7655,6 +7656,71 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 020c7cf..f7d175d 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -909,12 +894,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 98d2b00..7ebfd91 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -319,6 +319,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions. In which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 74f5678..c1470ab 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -322,6 +322,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -1009,6 +1011,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1206,30 +1288,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1237,7 +1310,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1252,7 +1325,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1327,6 +1400,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2253,6 +2351,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..0ac4433
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,450 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..e8bb726
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,268 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
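
The row counts asserted throughout the two TAP tests above (3334, and 3335 with the extra outside insert) follow from the test data: keys 1..5000 are populated, then every key with mod(a,3) = 0 is deleted inside the prepared transaction. A minimal sketch of that arithmetic, with hypothetical helper names that are not part of the patch:

```c
#include <assert.h>

/* Count the integers in [lo, hi] that are divisible by k. */
int
count_multiples(int lo, int hi, int k)
{
	assert(k > 0 && lo >= 1 && lo <= hi);
	return hi / k - (lo - 1) / k;
}

/*
 * Rows expected in test_tab once the prepared transaction commits:
 * keys 1..max_key all exist (2 seed rows plus generate_series(3, max_key)),
 * and then every key with mod(a,3) = 0 is deleted.
 */
int
expected_rows_after_commit(int max_key)
{
	return max_key - count_multiples(1, max_key, 3);
}
```

With max_key = 5000 this gives 5000 - 1666 = 3334. The single-row INSERT of key 99999 made outside the prepared transaction adds one more row, hence 3335 in that test.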

Attachment: v76-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From f9f7e8b8f1011a3b4a4b803a0a74e8035d36f16f Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Sat, 15 May 2021 09:43:23 -0400
Subject: [PATCH v76] Skip empty transactions for logical replication.

The current logical replication behavior is to send every transaction to
the subscriber even if the transaction is empty (because it does not
contain changes from the selected publications). It is a waste of CPU
cycles and network bandwidth to build/transmit these empty transactions.

This patch addresses the above problem by postponing the BEGIN / BEGIN PREPARE
message until the first change. While processing a COMMIT or PREPARE message,
if the transaction contains no other change, the COMMIT or PREPARE message is
not sent. This means that pgoutput skips the BEGIN / COMMIT or
BEGIN PREPARE / PREPARE messages for transactions that are empty.

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
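
A minimal sketch of the deferral described above, not the pgoutput code itself: the SketchTxn struct and the sketch_* functions are hypothetical names used only for illustration, and "sending" a message is modeled as incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-transaction state; not a struct from the patch. */
typedef struct SketchTxn
{
	bool		begin_sent;		/* has BEGIN been emitted yet? */
	int			messages_sent;	/* messages written to the imaginary stream */
} SketchTxn;

/* Begin callback: remember the transaction, but send nothing yet. */
void
sketch_begin(SketchTxn *txn)
{
	assert(txn != NULL);
	txn->begin_sent = false;
	txn->messages_sent = 0;
}

/* Change callback: emit the postponed BEGIN before the first change. */
void
sketch_change(SketchTxn *txn)
{
	assert(txn != NULL);
	if (!txn->begin_sent)
	{
		txn->messages_sent++;	/* the deferred BEGIN */
		txn->begin_sent = true;
	}
	txn->messages_sent++;		/* the change itself */
}

/* Commit callback: if BEGIN was never sent, skip COMMIT as well. */
void
sketch_commit(SketchTxn *txn)
{
	assert(txn != NULL);
	if (!txn->begin_sent)
		return;					/* empty transaction: nothing on the wire */
	txn->messages_sent++;		/* COMMIT */
}
```

In this model an empty transaction produces zero messages, while a transaction with two changes produces four (BEGIN, two changes, COMMIT). The same pattern would apply to BEGIN PREPARE / PREPARE.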
---
 doc/src/sgml/logicaldecoding.sgml               |  12 ++-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/proto.c         |  14 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  36 ++++---
 src/backend/replication/pgoutput/pgoutput.c     | 135 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/pgoutput.h              |   5 +
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 10 files changed, 209 insertions(+), 27 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 493432d..5cd4146 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -862,11 +862,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command>, in which case
+      it can apply the commit; otherwise, it can skip the commit operation. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index d13b58b..d2e2f82 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7543,6 +7543,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+<varlistentry>
+
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit.
 </para></listitem>
 </varlistentry>
@@ -7557,6 +7564,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Timestamp of the commit transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 7ebfd91..fd1cf6c 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -223,8 +225,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->u_op_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -245,12 +249,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bd36eb3..a4ce0a7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2766,7 +2766,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c1470ab..786e553 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -938,26 +938,38 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* there is no transaction when COMMIT PREPARED is called */
-	ensure_transaction();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In that case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	CommitTransactionCommand();
+		/* there is no transaction when COMMIT PREPARED is called */
+		ensure_transaction();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7c3a33d..89ce7ba 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -76,6 +78,7 @@ static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
 
 static bool publications_valid;
 static bool in_streaming;
+static bool in_prepared_txn;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
@@ -377,6 +380,9 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		else
 			ctx->twophase_opt_given = true;
 
+		/* Also remember we're currently not in a prepared transaction. */
+		in_prepared_txn = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -404,10 +410,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	(void)txn; /* keep compiler quiet */
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -422,8 +450,14 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (!data->sent_begin_txn)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -435,10 +469,31 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+	in_prepared_txn = true;
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -453,11 +508,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
+	in_prepared_txn = false;
 }
 
 /*
@@ -465,12 +527,28 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data && !data->sent_begin_txn)
+	{
+		pfree(data);
+		return;
+	}
+
+	if (data)
+		pfree(data);
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -483,8 +561,22 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data && !data->sent_begin_txn)
+	{
+		pfree(data);
+		return;
+	}
+
+	if (data)
+		pfree(data);
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -613,6 +705,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
@@ -651,6 +744,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (!in_prepared_txn)
+			pgoutput_begin(ctx, txn);
+		else
+			pgoutput_begin_prepare(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -750,6 +852,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -793,6 +896,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (!in_prepared_txn)
+				pgoutput_begin(ctx, txn);
+			else
+				pgoutput_begin_prepare(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -813,11 +925,15 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
 		return;
 
+	if (txn && txn->output_plugin_private)
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+
 	/*
 	 * Remember the xid for the message in streaming mode. See
 	 * pgoutput_change.
@@ -825,6 +941,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (!in_prepared_txn)
+				pgoutput_begin(ctx, txn);
+			else
+				pgoutput_begin_prepare(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 9b3e934..a6d9977 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 0dc460f..bec526f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -30,4 +30,9 @@ typedef struct PGOutputData
 	bool		two_phase;
 } PGOutputData;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;	/* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 61fda1e..0c61278 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -456,7 +456,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0940d0f..39bff1b 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -82,9 +82,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
-- 
1.8.3.1

#322Ajin Cherian
itsajin@gmail.com
In reply to: Ajin Cherian (#321)
3 attachment(s)

The above patch was missing some changes, which caused some TAP tests
to fail. Sending an updated patchset; the patchset version is
unchanged.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v76-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From e780c58c4cd59b9dbf44a214a7dbcd47c7443fe0 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 17 May 2021 08:32:51 -0400
Subject: [PATCH v76] Skip empty transactions for logical replication.

Currently, logical replication sends every transaction to the
subscriber, even when the transaction is empty (because it contains no
changes from the selected publications). Building and transmitting these
empty transactions wastes CPU cycles and network bandwidth.

This patch addresses the problem by postponing the BEGIN / BEGIN PREPARE
message until the first change is sent. When processing a COMMIT or
PREPARE, if no change was sent for that transaction, the COMMIT or
PREPARE message is not sent either. As a result, pgoutput skips the
BEGIN / COMMIT or BEGIN PREPARE / PREPARE messages for empty transactions.
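
The deferral described above can be sketched in isolation. This is only a
minimal illustration under stated assumptions, not the pgoutput code: the
names TxnState, Stream, emit, txn_begin, txn_change and txn_commit are
hypothetical, and the "stream" is just a string that records one letter per
message (B = BEGIN, I = a change, C = COMMIT).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-transaction state, loosely modeled on PGOutputTxnData. */
typedef struct TxnState
{
    bool sent_begin;            /* has BEGIN already gone out? */
} TxnState;

/* Hypothetical output sink: records one letter per message sent. */
typedef struct Stream
{
    char   buf[64];
    size_t len;
} Stream;

void
emit(Stream *out, char msg)
{
    /* Keep room for the terminating NUL. */
    if (out->len + 1 < sizeof(out->buf))
    {
        out->buf[out->len++] = msg;
        out->buf[out->len] = '\0';
    }
}

/* BEGIN callback: remember that nothing has been sent yet, send nothing. */
void
txn_begin(TxnState *txn)
{
    txn->sent_begin = false;
}

/* Change callback: the first published change sends the postponed BEGIN. */
void
txn_change(Stream *out, TxnState *txn, bool published)
{
    if (!published)
        return;                 /* filtered out, so the txn may stay empty */

    if (!txn->sent_begin)
    {
        emit(out, 'B');
        txn->sent_begin = true;
    }
    emit(out, 'I');
}

/* COMMIT callback: skip COMMIT if BEGIN was never sent. */
void
txn_commit(Stream *out, const TxnState *txn)
{
    if (!txn->sent_begin)
        return;
    emit(out, 'C');
}
```

In this sketch a transaction whose changes are all unpublished leaves the
stream untouched, while a transaction with at least one published change
produces the usual BEGIN, changes, COMMIT sequence.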

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 ++-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  14 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  36 ++++---
 src/backend/replication/pgoutput/pgoutput.c     | 135 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/pgoutput.h              |   5 +
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 13 files changed, 223 insertions(+), 33 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index 9393c85..f3329b8 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 493432d..5cd4146 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -862,11 +862,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command> in which case
+      it can apply the commit, otherwise, it can skip the commit operation. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index d13b58b..d2e2f82 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7543,6 +7543,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+<varlistentry>
+
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit.
 </para></listitem>
 </varlistentry>
@@ -7557,6 +7564,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Timestamp of the commit transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c387997..ed60719 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -940,7 +941,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -975,7 +977,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 7ebfd91..fd1cf6c 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -223,8 +225,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->u_op_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -245,12 +249,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bd36eb3..a4ce0a7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2766,7 +2766,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c1470ab..786e553 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -938,26 +938,38 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* there is no transaction when COMMIT PREPARED is called */
-	ensure_transaction();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In which case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	CommitTransactionCommand();
+		/* there is no transaction when COMMIT PREPARED is called */
+		ensure_transaction();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7c3a33d..89ce7ba 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -76,6 +78,7 @@ static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
 
 static bool publications_valid;
 static bool in_streaming;
+static bool in_prepared_txn;
 
 static List *LoadPublications(List *pubnames);
 static void publication_invalidation_cb(Datum arg, int cacheid,
@@ -377,6 +380,9 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		else
 			ctx->twophase_opt_given = true;
 
+		/* Also remember we're currently not in a prepared transaction. */
+		in_prepared_txn = false;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -404,10 +410,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	(void)txn; /* keep compiler quiet */
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -422,8 +450,14 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (!data->sent_begin_txn)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -435,10 +469,31 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+	in_prepared_txn = true;
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -453,11 +508,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
+	in_prepared_txn = false;
 }
 
 /*
@@ -465,12 +527,28 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data && !data->sent_begin_txn)
+	{
+		pfree(data);
+		return;
+	}
+
+	if (data)
+		pfree(data);
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -483,8 +561,22 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data && !data->sent_begin_txn)
+	{
+		pfree(data);
+		return;
+	}
+
+	if (data)
+		pfree(data);
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -613,6 +705,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
@@ -651,6 +744,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (!in_prepared_txn)
+			pgoutput_begin(ctx, txn);
+		else
+			pgoutput_begin_prepare(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -750,6 +852,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -793,6 +896,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (!in_prepared_txn)
+				pgoutput_begin(ctx, txn);
+			else
+				pgoutput_begin_prepare(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -813,11 +925,15 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
 		return;
 
+	if (txn && txn->output_plugin_private)
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+
 	/*
 	 * Remember the xid for the message in streaming mode. See
 	 * pgoutput_change.
@@ -825,6 +941,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (!in_prepared_txn)
+				pgoutput_begin(ctx, txn);
+			else
+				pgoutput_begin_prepare(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 9b3e934..a6d9977 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 0dc460f..bec526f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -30,4 +30,9 @@ typedef struct PGOutputData
 	bool		two_phase;
 } PGOutputData;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;	/* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 61fda1e..0c61278 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -456,7 +456,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0940d0f..39bff1b 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -82,9 +82,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
-- 
1.8.3.1
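
The empty-transaction optimization in the pgoutput.c hunks above defers BEGIN until the first change and skips the closing message when nothing was sent. A minimal sketch of that idea follows, assuming hypothetical stand-ins (TxnState, Sink, txn_begin, txn_change, txn_commit) that are not part of the patch; the real code keeps the flag in PGOutputTxnData and writes through the OutputPlugin* calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-transaction state (PGOutputTxnData). */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been emitted for this txn? */
} TxnState;

/* Hypothetical output sink: each message is recorded as one letter. */
typedef struct Sink
{
	char		buf[64];
	size_t		len;
} Sink;

static void
sink_put(Sink *s, char msg)
{
	assert(s->len + 1 < sizeof(s->buf));
	s->buf[s->len++] = msg;
	s->buf[s->len] = '\0';
}

/* Start of a transaction: remember that nothing has been sent yet. */
static void
txn_begin(TxnState *t)
{
	t->sent_begin = false;
}

/* First change of a transaction emits the deferred BEGIN ('B') first. */
static void
txn_change(TxnState *t, Sink *s)
{
	if (!t->sent_begin)
	{
		sink_put(s, 'B');
		t->sent_begin = true;
	}
	sink_put(s, 'I');			/* the change itself */
}

/* COMMIT ('C') is skipped when BEGIN was never sent (empty transaction). */
static void
txn_commit(TxnState *t, Sink *s)
{
	if (!t->sent_begin)
		return;
	sink_put(s, 'C');
}
```

With this shape an empty transaction produces no output at all, which matches the updated expectation in 020_messages.pl above.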

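The protocol.sgml hunk of the v76 patch that follows lists the Begin Prepare fields as Byte1('b'), three Int64 values (prepare LSN, prepare end LSN, prepare timestamp), an Int32 xid and the GID string. A rough sketch of that layout follows, assuming network byte order for the integers and a NUL-terminated string as elsewhere in the protocol; encode_begin_prepare, put_be64 and put_be32 are hypothetical helper names, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a 64-bit value in network (big-endian) byte order. */
static size_t
put_be64(uint8_t *dst, uint64_t v)
{
	int			i;

	for (i = 0; i < 8; i++)
		dst[i] = (uint8_t) (v >> (56 - 8 * i));
	return 8;
}

/* Write a 32-bit value in network (big-endian) byte order. */
static size_t
put_be32(uint8_t *dst, uint32_t v)
{
	int			i;

	for (i = 0; i < 4; i++)
		dst[i] = (uint8_t) (v >> (24 - 8 * i));
	return 4;
}

/*
 * Hypothetical encoder for the documented Begin Prepare layout.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
static size_t
encode_begin_prepare(uint8_t *buf, size_t cap,
					 uint64_t prepare_lsn, uint64_t prepare_end_lsn,
					 int64_t prepare_time_us, uint32_t xid,
					 const char *gid)
{
	size_t		gidlen = strlen(gid) + 1;	/* include the trailing NUL */
	size_t		need = 1 + 8 + 8 + 8 + 4 + gidlen;
	size_t		off = 0;

	if (cap < need)
		return 0;

	buf[off++] = 'b';			/* message type */
	off += put_be64(buf + off, prepare_lsn);
	off += put_be64(buf + off, prepare_end_lsn);
	off += put_be64(buf + off, (uint64_t) prepare_time_us);
	off += put_be32(buf + off, xid);
	memcpy(buf + off, gid, gidlen);
	off += gidlen;
	return off;
}
```

The patch itself builds these messages in proto.c; the sketch only makes the field order and sizes concrete.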
Attachment: v76-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 691e0d04fb7b9fdc413bf1684f7e437bed021dd8 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 17 May 2021 00:48:47 -0400
Subject: [PATCH v76] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time to the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 313 ++++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  69 +++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 135 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 218 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 201 ++++++++++--
 src/backend/replication/logical/worker.c           | 341 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 291 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 232 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 44 files changed, 2342 insertions(+), 206 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..9393c85 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->u_op_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 6d06ad2..df9e41c 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a7ec5c3..493432d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this, users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..80016c6 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Commit Prepared messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7364,6 +7385,278 @@ Stream Abort
 
 </variablelist>
 
+<!-- ==================== TWO_PHASE Messages ==================== -->
+
+<para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies this message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies this message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies this message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the commit transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies this message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the rollback transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
 <para>
 
 The following message parts are shared by the above messages.
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> for the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, the decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          for the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..93093ce 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,72 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are neither expecting the collisions of GXACTs (same gid)
+			 * between publisher and subscribers nor the apply worker restarts
+			 * after prepared xacts, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5c84d75..9b941e9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1254,5 +1254,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..020c7cf 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			{
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. It is enabled once all the tables are
+				 * synced and ready. This avoids race conditions such as
+				 * prepared transactions being skipped because changes are not
+				 * applied due to checks in should_apply_changes_for_rel()
+				 * while tablesync for the corresponding tables is in
+				 * progress. See comments atop worker.c.
+				 */
+				walrcv_create_slot(wrconn, slotname, false, false,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
@@ -814,7 +874,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +909,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +938,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +984,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1001,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1043,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1061,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1101,34 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription two_phase commit implementation requires
+				 * that replication has passed the initial table
+				 * synchronization phase before the two_phase becomes properly
+				 * enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
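For readers following the subscriptioncmds.c changes above, the option rules can be summarized as a small standalone sketch. This is not code from the patch: the enum, its character values, and the helper names are placeholders assumed for illustration, and only the conditions themselves are taken from the checks in parse_subscription_options() and AlterSubscription().

```c
#include <assert.h>             /* only needed for the checks in the usage note */
#include <stdbool.h>

/* Placeholder tri-state; the character values are assumed, not from the patch. */
typedef enum
{
    TWOPHASE_DISABLED = 'd',
    TWOPHASE_PENDING = 'p',
    TWOPHASE_ENABLED = 'e'
} TwoPhaseState;

/* CREATE SUBSCRIPTION: two_phase = true starts as PENDING, never ENABLED. */
static TwoPhaseState
initial_state(bool two_phase_requested)
{
    return two_phase_requested ? TWOPHASE_PENDING : TWOPHASE_DISABLED;
}

/* CREATE SUBSCRIPTION: two_phase = true and streaming = true are exclusive. */
static bool
create_options_ok(bool two_phase, bool streaming)
{
    return !(two_phase && streaming);
}

/* ALTER SUBSCRIPTION ... SET (streaming = ...): rejected unless DISABLED. */
static bool
alter_streaming_ok(TwoPhaseState state, bool streaming)
{
    return !(state != TWOPHASE_DISABLED && streaming);
}

/* REFRESH (or SET/ADD/DROP PUBLICATION with refresh): no copy_data once ENABLED. */
static bool
refresh_ok(TwoPhaseState state, bool copy_data)
{
    return !(state == TWOPHASE_ENABLED && copy_data);
}
```

Under these assumptions, assert(refresh_ok(TWOPHASE_PENDING, true)) holds while assert(!refresh_ok(TWOPHASE_ENABLED, true)) captures the new restriction. Note that the streaming check is stricter than the refresh check: it already applies in the PENDING state.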
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..eb03c53 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +838,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +852,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..58b4e2c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing
+				 * preparing transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c387997 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..98d2b00 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 }
 
 /*
@@ -107,6 +107,218 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID. Additionally, the transaction
+	 * must be prepared. See ReorderBufferFinishPrepared.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1053,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->u_op_time.commit_time);
 }
 
 /*
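As a reading aid for the proto.c changes above, here is a minimal standalone sketch of the PREPARE message layout: type byte, flags byte, prepare_lsn, end_lsn, prepare_time, xid, then the NUL-terminated gid, with integers in network byte order. It does not use the PostgreSQL pq_* routines. Buf, PreparedMsg, GID_MAX, the helper names, and the 'P' type byte are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GID_MAX 200             /* assumed buffer size, not from the patch */

/* Hypothetical stand-in for StringInfo: a fixed buffer with a read cursor. */
typedef struct
{
    uint8_t     data[256];
    size_t      len;            /* bytes written */
    size_t      pos;            /* read cursor */
} Buf;

typedef struct
{
    uint64_t    prepare_lsn;
    uint64_t    end_lsn;
    int64_t     prepare_time;
    uint32_t    xid;
    char        gid[GID_MAX];
} PreparedMsg;

/* Append the low nbytes of v in network (big-endian) byte order. */
static void
put_be(Buf *b, uint64_t v, int nbytes)
{
    assert(b->len + (size_t) nbytes <= sizeof(b->data));
    for (int i = nbytes - 1; i >= 0; i--)
        b->data[b->len++] = (uint8_t) (v >> (8 * i));
}

/* Read nbytes in network byte order; the caller has checked the bounds. */
static uint64_t
get_be(Buf *b, int nbytes)
{
    uint64_t    v = 0;

    for (int i = 0; i < nbytes; i++)
        v = (v << 8) | b->data[b->pos++];
    return v;
}

static void
write_prepare(Buf *b, const PreparedMsg *m)
{
    size_t      n = strlen(m->gid) + 1;

    put_be(b, 'P', 1);          /* message type byte (placeholder value) */
    put_be(b, 0, 1);            /* flags, always 0 in this sketch */
    put_be(b, m->prepare_lsn, 8);
    put_be(b, m->end_lsn, 8);
    put_be(b, (uint64_t) m->prepare_time, 8);
    put_be(b, m->xid, 4);
    assert(b->len + n <= sizeof(b->data));
    memcpy(b->data + b->len, m->gid, n);    /* gid including its NUL */
    b->len += n;
}

/*
 * Expects the cursor just past the type byte.  Returns 0 on success and
 * -1 on a malformed message.
 */
static int
read_prepare(Buf *b, PreparedMsg *m)
{
    const uint8_t *nul;

    if (b->len - b->pos < 1 + 8 + 8 + 8 + 4)
        return -1;              /* too short for the fixed fields */
    if (get_be(b, 1) != 0)
        return -1;              /* unrecognized flags */
    m->prepare_lsn = get_be(b, 8);
    m->end_lsn = get_be(b, 8);
    m->prepare_time = (int64_t) get_be(b, 8);
    m->xid = (uint32_t) get_be(b, 4);
    if (m->prepare_lsn == 0 || m->end_lsn == 0)
        return -1;              /* both LSNs must be set */

    nul = memchr(b->data + b->pos, '\0', b->len - b->pos);
    if (nul == NULL || (size_t) (nul - (b->data + b->pos)) >= GID_MAX)
        return -1;              /* unterminated or oversized gid */
    strcpy(m->gid, (const char *) (b->data + b->pos));
    b->pos = (size_t) (nul - b->data) + 1;
    return 0;
}
```

The sketch checks the gid length before copying because the string arrives from the peer; whether the real receive path needs the same guard depends on how the gid buffer is declared, which is outside this excerpt.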
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b0ab91c..bd36eb3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2548,7 +2548,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->u_op_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2639,7 +2639,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->u_op_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2686,7 +2686,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->u_op_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2706,7 +2706,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2725,19 +2725,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->u_op_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2755,12 +2756,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->u_op_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->u_op_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
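The interplay between the logical.c and reorderbuffer.c changes above can be sketched in isolation as follows. This is an illustration under assumptions rather than the patch's code: SlotSketch, CtxSketch, and both function names are hypothetical, ctx->twophase is assumed to start out true when the output plugin supports two-phase callbacks, and persisting the slot is reduced to a return value.

```c
#include <assert.h>             /* only needed for the checks in the usage note */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr */

typedef struct
{
    bool        two_phase;      /* slot already allows two-phase decoding */
    Lsn         two_phase_at;   /* LSN at which it was enabled */
} SlotSketch;

typedef struct
{
    bool        twophase;       /* assumed: plugin has two-phase callbacks */
    bool        twophase_opt_given; /* two_phase passed at streaming start */
} CtxSketch;

/*
 * Decide whether prepared transactions are decoded in this session.
 * Returns true when the slot was newly marked and would need saving.
 */
static bool
resolve_two_phase(CtxSketch *ctx, SlotSketch *slot, Lsn start_lsn)
{
    ctx->twophase = ctx->twophase &&
        (ctx->twophase_opt_given || slot->two_phase);

    if (ctx->twophase && !slot->two_phase)
    {
        slot->two_phase = true;
        slot->two_phase_at = start_lsn;
        return true;
    }
    return false;
}

/*
 * At COMMIT PREPARED: a prepare located before two_phase_at was not sent
 * at prepare time, so it has to be replayed together with the commit.
 * Rollbacks never trigger the replay.
 */
static bool
must_replay_prepare(Lsn prepare_lsn, Lsn two_phase_at, bool is_commit)
{
    return prepare_lsn < two_phase_at && is_commit;
}
```

For example, with a slot created without two-phase and a stream started at LSN 500 with the option given, resolve_two_phase() marks the slot and records 500; a transaction prepared at LSN 400 and committed later then satisfies must_replay_prepare(400, 500, true), while one prepared at 600 does not.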
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..0c43a89 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But
+		 * if table_states_not_ready was empty we still need to check again to
+		 * see if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY.
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the current subscription.
+ */
+void
+UpdateTwoPhaseState(char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(MySubscription->oid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 MySubscription->oid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
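
For readers following the tablesync.c changes, the decision made by AllTablesyncsReady() and the PENDING check in process_syncing_tables_for_apply() can be modelled outside the server. This is only a rough standalone sketch: the enum, the function names and the table counts passed as plain integers are hypothetical stand-ins, not the catalog values or the lists used by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the subscription's two_phase tri-state. */
typedef enum
{
	SKETCH_TWOPHASE_DISABLED,
	SKETCH_TWOPHASE_PENDING,
	SKETCH_TWOPHASE_ENABLED
} SketchTwoPhaseState;

/*
 * Mirrors the return expression of AllTablesyncsReady(): a subscription
 * with no tables is never "all ready", otherwise it is ready once no
 * relation is left in a not-READY state.
 */
static bool
sketch_all_tablesyncs_ready(size_t total_tables, size_t not_ready_tables)
{
	bool		has_subrels;
	bool		found_busy;

	assert(not_ready_tables <= total_tables);

	has_subrels = total_tables > 0;
	found_busy = not_ready_tables > 0;

	return has_subrels && !found_busy;
}

/*
 * Mirrors the condition under which the apply worker exits so that it can
 * be restarted with two_phase enabled.
 */
static bool
sketch_apply_worker_should_restart(SketchTwoPhaseState state,
								   size_t total_tables,
								   size_t not_ready_tables)
{
	return state == SKETCH_TWOPHASE_PENDING &&
		sketch_all_tablesyncs_ready(total_tables, not_ready_tables);
}
```

Note that a subscription with zero tables stays PENDING under this model, which matches the comment about keeping ALTER SUBSCRIPTION ... REFRESH PUBLICATION usable.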
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1432554..74f5678 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because, say, it
+ * was prior to the initial consistent point, but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ... REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that initially it was on and
+ * we have received some prepare, then we turn it off; now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we can allow changing this option from off to on, but
+ * we are not sure if that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if the user
+ * has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting
+ * of the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,9 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -759,6 +835,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from the same node pointing to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1987,6 +2237,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2464,6 +2730,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2950,6 +3219,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3116,15 +3399,67 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			UpdateTwoPhaseState(LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
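
To illustrate why the subscriber-side GID avoids collisions between subscriptions, here is a minimal standalone sketch of the format used by TwoPhaseTransactionGid() in the worker.c changes. The "pg_gid_%u_%u" format string is taken from the patch; the function name, the plain unsigned types standing in for Oid and TransactionId, and the buffer size of 200 (assumed to match GIDSIZE) are illustrative assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed buffer size; the patch uses GIDSIZE, whose value is not shown. */
#define SKETCH_GIDSIZE 200

/*
 * Build a subscriber-side GID from the subscription oid and the remote xid.
 * Two subscriptions receiving the same remote prepared transaction get
 * different GIDs because the subscription oid is part of the name.
 */
static void
sketch_two_phase_gid(unsigned int subid, unsigned int xid,
					 char *gid, size_t szgid)
{
	assert(subid != 0);			/* stand-in for a valid subscription oid */
	assert(xid != 0);			/* stand-in for a valid transaction id */

	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}
```

Calling it with subscription oids 16394 and 16395 for the same remote xid 739 would give "pg_gid_16394_739" and "pg_gid_16395_739", so the two apply workers would not compete for one prepared transaction name.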
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point of time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
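
The order of the two_phase checks in the pgoutput.c changes (first in parse_output_parameters, then in pgoutput_startup) can be summarised as a small decision function. This is a simplified sketch and not the plugin code: the enum, the function name and the flattened parameters are hypothetical, and the protocol version 3 is taken from the FIXME comment in worker.c that describes version 3 as the two_phase version.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value of LOGICALREP_PROTO_TWOPHASE_VERSION_NUM (see lead-in). */
#define SKETCH_PROTO_TWOPHASE_VERSION 3

/* Hypothetical outcome codes; the real code raises errors via ereport(). */
typedef enum
{
	SKETCH_TWO_PHASE_NOT_REQUESTED,
	SKETCH_TWO_PHASE_ACCEPTED,
	SKETCH_TWO_PHASE_ERR_WITH_STREAMING,
	SKETCH_TWO_PHASE_ERR_PROTO_TOO_OLD,
	SKETCH_TWO_PHASE_ERR_PLUGIN_UNSUPPORTED
} SketchTwoPhaseCheck;

static SketchTwoPhaseCheck
sketch_check_two_phase(bool two_phase, bool streaming,
					   int proto_version, bool ctx_twophase)
{
	assert(proto_version > 0);

	/* Option parsing rejects two_phase combined with streaming. */
	if (two_phase && streaming)
		return SKETCH_TWO_PHASE_ERR_WITH_STREAMING;

	/* Start-up: nothing more to check when the option was not given. */
	if (!two_phase)
		return SKETCH_TWO_PHASE_NOT_REQUESTED;

	/* The protocol version must be new enough to carry prepare messages. */
	if (proto_version < SKETCH_PROTO_TWOPHASE_VERSION)
		return SKETCH_TWO_PHASE_ERR_PROTO_TOO_OLD;

	/* The decoding context must have two-phase support enabled. */
	if (!ctx_twophase)
		return SKETCH_TWO_PHASE_ERR_PLUGIN_UNSUPPORTED;

	return SKETCH_TWO_PHASE_ACCEPTED;
}
```

Only the last outcome corresponds to ctx->twophase_opt_given being set to true in the patch; every error outcome corresponds to one of the ereport(ERROR, ...) calls.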
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 339c393..cb70974 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4359,6 +4360,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4402,9 +4404,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 150000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4425,6 +4434,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4450,6 +4460,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4477,6 +4489,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4518,6 +4531,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3e39fdb..920f083 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6598c53..194c322 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 53cdfa5..61fda1e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -311,7 +311,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} u_op_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -650,7 +654,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..d72082c 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..91d9032
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,291 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 19;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..2bea214
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,232 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..2cfc1ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,9 +1388,11 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

v76-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 91b78aaccc0bf109a463ff76d9e63ec11e21a625 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 17 May 2021 01:03:52 -0400
Subject: [PATCH v76] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  70 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 450 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 268 ++++++++++++
 11 files changed, 1012 insertions(+), 78 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 80016c6..d13b58b 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Commit Prepared messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7388,7 +7388,8 @@ Stream Abort
 <!-- ==================== TWO_PHASE Messages ==================== -->
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared,
+Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7655,6 +7656,71 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
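
A note on the Stream Prepare layout documented in the protocol.sgml hunk above: the fields could be read back with something like the following. This is only a minimal sketch, not code from the patch. The struct and function names (StreamPrepareSketch, parse_stream_prepare, read_be32, read_be64) are hypothetical, and it assumes the integers arrive in network byte order, as elsewhere in the PostgreSQL wire protocol, and that the GID is a NUL-terminated string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the fields listed in the Stream Prepare docs. */
typedef struct StreamPrepareSketch
{
	uint8_t		flags;			/* Int8, currently unused (must be 0) */
	uint64_t	prepare_lsn;	/* Int64, LSN of the prepare */
	uint64_t	end_lsn;		/* Int64, end LSN of the prepare */
	int64_t		prepare_time;	/* Int64, microseconds since 2000-01-01 */
	uint32_t	xid;			/* Int32, transaction id */
	const char *gid;			/* String, points into the input buffer */
} StreamPrepareSketch;

/* Read a big-endian 32-bit value (assumed network byte order). */
static uint32_t
read_be32(const unsigned char *p)
{
	uint32_t	v = 0;

	for (int i = 0; i < 4; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian 64-bit value (assumed network byte order). */
static uint64_t
read_be64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Decode one Stream Prepare message from buf[0..len).
 * Returns 0 on success, -1 if the buffer does not look like the
 * documented layout.
 */
int
parse_stream_prepare(const unsigned char *buf, size_t len,
					 StreamPrepareSketch *out)
{
	/* type byte + flags + 3 x Int64 + Int32 */
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;

	assert(out != NULL);

	if (buf == NULL || len <= fixed || buf[0] != 'p')
		return -1;

	/* The GID must be terminated inside the buffer. */
	if (memchr(buf + fixed, '\0', len - fixed) == NULL)
		return -1;

	out->flags = buf[1];
	out->prepare_lsn = read_be64(buf + 2);
	out->end_lsn = read_be64(buf + 10);
	out->prepare_time = (int64_t) read_be64(buf + 18);
	out->xid = read_be32(buf + 26);
	out->gid = (const char *) (buf + fixed);
	return 0;
}
```

The gid pointer aliases the input buffer, so the buffer has to outlive the struct. A real apply worker would go through the pq_getmsg* routines on a StringInfo rather than raw pointer arithmetic.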
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 020c7cf..f7d175d 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -909,12 +894,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 98d2b00..7ebfd91 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -319,6 +319,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase transactions, in which case
+	 * we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->u_op_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 74f5678..c1470ab 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -322,6 +322,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -1009,6 +1011,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See the comments atop worker.c for
+	 * details.
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1206,30 +1288,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1237,7 +1310,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1252,7 +1325,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1327,6 +1400,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2253,6 +2351,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..0ac4433
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,450 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..e8bb726
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,268 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#323vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#322)

On Mon, May 17, 2021 at 6:10 PM Ajin Cherian <itsajin@gmail.com> wrote:

The above patch had some changes missing which resulted in some tap
tests failing. Sending an updated patchset. Keeping the patchset
version the same.

Thanks for the updated patch; it fixes the TAP test failures.

Regards,
Vignesh

#324Peter Smith
smithpb2250@gmail.com
In reply to: Ajin Cherian (#321)

On Sun, May 16, 2021 at 12:07 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Thu, May 13, 2021 at 7:50 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v75*

Differences from v74* are:

* Rebased to HEAD @ today.

* v75 also addresses some of the feedback comments from Vignesh [1].

Adding a patch to this patch-set that avoids sending empty transactions
to the subscriber/replica. This patch is based on the logic that was
proposed for empty transactions in the thread [1], and extends it to
handle empty prepared transactions as well, so empty prepared
transactions are no longer sent to the subscriber/replica. This patch
also avoids sending COMMIT PREPARED / ROLLBACK PREPARED if the prepared
transaction was skipped, provided the COMMIT / ROLLBACK happens prior to
a restart of the walsender. If the COMMIT/ROLLBACK PREPARED happens
after a restart, the walsender will not be able to know that the
prepared transaction prior to the restart was not sent. In this case
the apply worker of the subscription will check if a prepare of the
same type exists, and if it does not, it will silently ignore the
COMMIT PREPARED (ROLLBACK PREPARED logic was already doing this).
Do have a look and let me know if you have any comments.

[1] - /messages/by-id/CAFPTHDYegcoS3xjGBj0XHfcdZr6Y35+YG1jq79TBD1VCkK7v3A@mail.gmail.com

Hi Ajin.

I have applied the latest patch set v76*.

The patches applied cleanly.

All of the make, make check, and TAP subscription tests worked OK.

Below are my REVIEW COMMENTS for the v76-0003 part.

==========

1. File: doc/src/sgml/logicaldecoding.sgml

1.1

@@ -862,11 +862,19 @@ typedef void (*LogicalDecodePrepareCB) (struct
LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has
been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command> in which case
+      it can apply the rollback, otherwise, it can skip the rollback
operation. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with same identifier.

This is in the commit prepared section, but the new text says
"it can apply the rollback" etc.
Is this deliberate, or maybe a cut/paste error?

==========

2. File: src/backend/replication/pgoutput/pgoutput.c

2.1

@@ -76,6 +78,7 @@ static void
pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,

static bool publications_valid;
static bool in_streaming;
+static bool in_prepared_txn;

Wondering why this is a module static flag. That makes it look like
it somehow applies globally to all the functions in this scope, but
really I think this is just a txn property, right?
- e.g. why not use another member of the private TXN data instead? or
- e.g. why not use rbtxn_prepared(txn) macro?
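
To illustrate the concern, here is a minimal standalone sketch (not code
from the patch) of keeping the prepared state on the transaction itself
rather than in a file-level static. DemoTxn, is_prepared, DemoBegin and
choose_begin are hypothetical names invented for this example; in
pgoutput.c the equivalent would be a member of the private TXN data or
the rbtxn_prepared(txn) macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ReorderBufferTXN plus its private plugin data. */
typedef struct DemoTxn
{
	const char *gid;			/* NULL unless the transaction was prepared */
	bool		is_prepared;	/* per-transaction state, not a module static */
} DemoTxn;

typedef enum DemoBegin
{
	DEMO_BEGIN,
	DEMO_BEGIN_PREPARE
} DemoBegin;

/*
 * Decide which BEGIN variant the first change of a transaction should emit.
 * The decision reads only the transaction passed in, so a prepared
 * transaction decoded earlier cannot influence a later ordinary one.
 */
DemoBegin
choose_begin(const DemoTxn *txn)
{
	assert(txn != NULL);
	return txn->is_prepared ? DEMO_BEGIN_PREPARE : DEMO_BEGIN;
}
```

With a static flag the answer would depend on whichever callback last
set it; with a per-transaction field it depends only on the argument.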

----------

2.2

@@ -404,10 +410,32 @@ pgoutput_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+ PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+ sizeof(PGOutputTxnData));
+
+ (void)txn; /* keep compiler quiet */

I guess since now the arg "txn" is being used the added statement to
"keep compiler quiet" is now redundant, so should be removed.

----------

2.3

+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
  bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+ PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;

OutputPluginPrepareWrite(ctx, !send_replication_origin);
logicalrep_write_begin(ctx->out, txn);
+ data->sent_begin_txn = true;

I wonder whether it is worth adding Assert(data); here.

----------

2.4

@@ -422,8 +450,14 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  XLogRecPtr commit_lsn)
 {
+ PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;
+
  OutputPluginUpdateProgress(ctx);

I wonder whether it is worthwhile to add Assert(data); here also.

----------

2.5
@@ -422,8 +450,14 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  XLogRecPtr commit_lsn)
 {
+ PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;
+
  OutputPluginUpdateProgress(ctx);
+ /* skip COMMIT message if nothing was sent */
+ if (!data->sent_begin_txn)
+ return;

Shouldn't this code also be freeing that allocated data? I think you
do free it in similar functions later in this patch.

----------

2.6

@@ -435,10 +469,31 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+ PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+ sizeof(PGOutputTxnData));
+
+ /*
+ * Don't send BEGIN message here. Instead, postpone it until the first
+ * change. In logical replication, a common scenario is to replicate a set
+ * of tables (instead of all tables) and transactions whose changes were on
+ * table(s) that are not published will produce empty transactions. These
+ * empty transactions will send BEGIN and COMMIT messages to subscribers,
+ * using bandwidth on something with little/no use for logical replication.
+ */
+ data->sent_begin_txn = false;
+ txn->output_plugin_private = data;
+ in_prepared_txn = true;
+}

Apart from setting the in_prepared_txn = true; this is all identical
code to pgoutput_begin_txn so you could consider just delegating to
call that other function to save all the cut/paste data allocation and
big comment. Or maybe this way is better - I am not sure.

----------

2.7

+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
  bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+ PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;

OutputPluginPrepareWrite(ctx, !send_replication_origin);
logicalrep_write_begin_prepare(ctx->out, txn);
+ data->sent_begin_txn = true;

I wonder whether it is worth adding Assert(data); here also.

----------

2.8

@@ -453,11 +508,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  XLogRecPtr prepare_lsn)
 {
+ PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
  OutputPluginUpdateProgress(ctx);

I wonder whether it is worth adding Assert(data); here also.

----------

2.9

@@ -465,12 +527,28 @@ pgoutput_prepare_txn(LogicalDecodingContext
*ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
- XLogRecPtr commit_lsn)
+ XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+ TimestampTz prepare_time)
 {
+ PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
  OutputPluginUpdateProgress(ctx);
+ /*
+ * skip sending COMMIT PREPARED message if prepared transaction
+ * has not been sent.
+ */
+ if (data && !data->sent_begin_txn)
+ {
+ pfree(data);
+ return;
+ }
+
+ if (data)
+ pfree(data);
  OutputPluginPrepareWrite(ctx, true);

I think this pfree logic might be refactored more simply to just be
done in one place. e.g. like:

if (data)
{
bool skip = !data->sent_begin_txn;
pfree(data);
if (skip)
return;
}

BTW, is it even possible to get in this function with NULL private
data? Perhaps that should be an Assert(data) ?
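
For what it is worth, the single-free shape above can be checked in
isolation with a small standalone sketch. This is not patch code:
DemoTxnData and release_and_check_skip are hypothetical names, and
free() stands in for pfree().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for PGOutputTxnData; only the flag matters here. */
typedef struct DemoTxnData
{
	bool		sent_begin_txn;
} DemoTxnData;

/*
 * Release the per-transaction data exactly once and report whether the
 * caller should return without sending anything.
 */
bool
release_and_check_skip(DemoTxnData **datap)
{
	bool		skip = false;

	if (*datap != NULL)
	{
		/* read the flag before freeing the memory that holds it */
		skip = !(*datap)->sent_begin_txn;
		free(*datap);
		*datap = NULL;		/* do not leave a dangling pointer behind */
	}
	return skip;
}
```

If the private data can never be NULL at this point, the NULL test
would become an assertion and the helper reduces to reading the flag,
freeing, and returning.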

----------

2.10

@@ -483,8 +561,22 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
     XLogRecPtr prepare_end_lsn,
     TimestampTz prepare_time)
 {
+ PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
  OutputPluginUpdateProgress(ctx);
+ /*
+ * skip sending COMMIT PREPARED message if prepared transaction
+ * has not been sent.
+ */
+ if (data && !data->sent_begin_txn)
+ {
+ pfree(data);
+ return;
+ }
+
+ if (data)
+ pfree(data);

Same comment as above for refactoring the pfree logic.

----------

2.11

@@ -483,8 +561,22 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
     XLogRecPtr prepare_end_lsn,
     TimestampTz prepare_time)
 {
+ PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
  OutputPluginUpdateProgress(ctx);
+ /*
+ * skip sending COMMIT PREPARED message if prepared transaction
+ * has not been sent.
+ */
+ if (data && !data->sent_begin_txn)
+ {
+ pfree(data);
+ return;
+ }
+
+ if (data)
+ pfree(data);

Is that comment correct, or is it a cut/paste error? Why does it say "COMMIT PREPARED"?

----------

2.12

@@ -613,6 +705,7 @@ pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
  Relation relation, ReorderBufferChange *change)
 {
  PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+ PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
  MemoryContext old;

I wonder whether it is worth adding Assert(txndata); here also.

----------

2.13

@@ -750,6 +852,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
    int nrelations, Relation relations[], ReorderBufferChange *change)
 {
  PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+ PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
  MemoryContext old;

I wonder whether it is worth adding Assert(txndata); here also.

----------

2.14

@@ -813,11 +925,15 @@ pgoutput_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
const char *message)
{
PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+ PGOutputTxnData *txndata;
TransactionId xid = InvalidTransactionId;

if (!data->messages)
return;

+ if (txn && txn->output_plugin_private)
+ txndata = (PGOutputTxnData *) txn->output_plugin_private;
+
  /*
  * Remember the xid for the message in streaming mode. See
  * pgoutput_change.
@@ -825,6 +941,19 @@ pgoutput_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
  if (in_streaming)
  xid = txn->xid;
+ /* output BEGIN if we haven't yet, avoid for streaming and
non-transactional messages */
+ if (!in_streaming && transactional)
+ {
+ txndata = (PGOutputTxnData *) txn->output_plugin_private;
+ if (!txndata->sent_begin_txn)
+ {
+ if (!in_prepared_txn)
+ pgoutput_begin(ctx, txn);
+ else
+ pgoutput_begin_prepare(ctx, txn);
+ }
+ }

That code:
+ if (txn && txn->output_plugin_private)
+ txndata = (PGOutputTxnData *) txn->output_plugin_private;
looked misplaced to me.

Shouldn't all that be relocated to be put inside the if block:
+ if (!in_streaming && transactional)

And when you do that maybe the condition can be simplified because you could
Assert(txn);
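
A rough standalone sketch of that relocation follows. It is not patch
code: DemoTxn, DemoTxnData and message_needs_begin are hypothetical
names, and plain assert() stands in for Assert(). It assumes txn may be
NULL for non-transactional messages, as the patch's own
"if (txn && ...)" check suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for PGOutputTxnData and ReorderBufferTXN. */
typedef struct DemoTxnData
{
	bool		sent_begin_txn;
} DemoTxnData;

typedef struct DemoTxn
{
	DemoTxnData *output_plugin_private;
} DemoTxn;

/*
 * Return true when a BEGIN still has to be sent before the message.
 * The private data is looked up only inside the branch that needs it,
 * so no variable is left uninitialized on the other paths.
 */
bool
message_needs_begin(const DemoTxn *txn, bool in_streaming, bool transactional)
{
	if (!in_streaming && transactional)
	{
		const DemoTxnData *txndata;

		assert(txn != NULL);
		txndata = txn->output_plugin_private;
		assert(txndata != NULL);
		return !txndata->sent_begin_txn;
	}
	return false;
}
```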

==========

3. File src/include/replication/pgoutput.h

3.1

@@ -30,4 +30,9 @@ typedef struct PGOutputData
bool two_phase;
} PGOutputData;

+typedef struct PGOutputTxnData
+{
+ bool sent_begin_txn; /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+

Why is this typedef here? IIUC it is only used inside pgoutput.c,
so shouldn't it be declared in that file instead?

----------

3.2

@@ -30,4 +30,9 @@ typedef struct PGOutputData
bool two_phase;
} PGOutputData;

+typedef struct PGOutputTxnData
+{
+ bool sent_begin_txn; /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+

That is a new typedef so maybe your patch also should update the
src/tools/pgindent/typedefs.list to name this new typedef.

----------
Kind Regards,
Peter Smith.
Fujitsu Australia

#325tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
In reply to: Peter Smith (#324)
RE: [HACKERS] logical decoding of two-phase transactions

Hi Ajin

The above patch had some changes missing which resulted in some tap
tests failing. Sending an updated patchset. Keeping the patchset
version the same.

Thanks for your patch. I hit a segmentation fault when using it. Please take a look.
The steps to reproduce the problem are as follows.

------publisher------
create table test (a int primary key, b varchar);
create publication pub for table test;

------subscriber------
create table test (a int primary key, b varchar);
create subscription sub connection 'dbname=postgres' publication pub with(two_phase=on);

Then I prepare, commit, and roll back transactions and TRUNCATE the table in a SQL file as follows:
-------------
BEGIN;
INSERT INTO test SELECT i, md5(i::text) FROM generate_series(1, 10000) s(i);
PREPARE TRANSACTION 't1';
COMMIT PREPARED 't1';

BEGIN;
INSERT INTO test SELECT i, md5(i::text) FROM generate_series(10001, 20000) s(i);
PREPARE TRANSACTION 't2';
ROLLBACK PREPARED 't2';

TRUNCATE test;
-------------

To make the problem easy to reproduce, I looped the above operations about 10 times in my SQL file. With that I can reproduce it 100% of the time, and I get a segmentation fault in the publisher log as follows:
-------------
2021-05-18 16:30:56.952 CST [548189] postmaster LOG: server process (PID 548222) was terminated by signal 11: Segmentation fault
2021-05-18 16:30:56.952 CST [548189] postmaster DETAIL: Failed process was running: START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '3', two_phase 'on', publication_names '"pub"')
-------------

Here is the core dump information :
-------------
#0 0x000000000090afe4 in pq_sendstring (buf=buf@entry=0x251ca80, str=0x0) at pqformat.c:199
#1 0x0000000000ab0a2b in logicalrep_write_begin_prepare (out=0x251ca80, txn=txn@entry=0x25346e8) at proto.c:124
#2 0x00007f9528842dd6 in pgoutput_begin_prepare (ctx=ctx@entry=0x2514700, txn=txn@entry=0x25346e8) at pgoutput.c:495
#3 0x00007f9528843f70 in pgoutput_truncate (ctx=0x2514700, txn=0x25346e8, nrelations=1, relations=0x262f678, change=0x25370b8) at pgoutput.c:905
#4 0x0000000000aa57cb in truncate_cb_wrapper (cache=<optimized out>, txn=<optimized out>, nrelations=<optimized out>, relations=<optimized out>, change=<optimized out>)
at logical.c:1103
#5 0x0000000000abf333 in ReorderBufferApplyTruncate (streaming=false, change=0x25370b8, relations=0x262f678, nrelations=1, txn=0x25346e8, rb=0x2516710)
at reorderbuffer.c:1918
#6 ReorderBufferProcessTXN (rb=rb@entry=0x2516710, txn=0x25346e8, commit_lsn=commit_lsn@entry=27517176, snapshot_now=<optimized out>, command_id=command_id@entry=0,
streaming=streaming@entry=false) at reorderbuffer.c:2278
#7 0x0000000000ac0b14 in ReorderBufferReplay (txn=<optimized out>, rb=rb@entry=0x2516710, xid=xid@entry=738, commit_lsn=commit_lsn@entry=27517176,
end_lsn=end_lsn@entry=27517544, commit_time=commit_time@entry=674644388404356, origin_id=0, origin_lsn=0) at reorderbuffer.c:2591
#8 0x0000000000ac1713 in ReorderBufferCommit (rb=0x2516710, xid=xid@entry=738, commit_lsn=27517176, end_lsn=27517544, commit_time=commit_time@entry=674644388404356,
origin_id=origin_id@entry=0, origin_lsn=0) at reorderbuffer.c:2615
#9 0x0000000000a9f702 in DecodeCommit (ctx=ctx@entry=0x2514700, buf=buf@entry=0x7ffdd027c2b0, parsed=parsed@entry=0x7ffdd027c140, xid=xid@entry=738,
two_phase=<optimized out>) at decode.c:742
#10 0x0000000000a9fc6c in DecodeXactOp (ctx=ctx@entry=0x2514700, buf=buf@entry=0x7ffdd027c2b0) at decode.c:278
#11 0x0000000000aa1b75 in LogicalDecodingProcessRecord (ctx=0x2514700, record=0x2514ac0) at decode.c:142
#12 0x0000000000af6db1 in XLogSendLogical () at walsender.c:2876
#13 0x0000000000afb6aa in WalSndLoop (send_data=send_data@entry=0xaf6d49 <XLogSendLogical>) at walsender.c:2306
#14 0x0000000000afbdac in StartLogicalReplication (cmd=cmd@entry=0x24da288) at walsender.c:1206
#15 0x0000000000afd646 in exec_replication_command (
cmd_string=cmd_string@entry=0x2452570 "START_REPLICATION SLOT \"sub\" LOGICAL 0/0 (proto_version '3', two_phase 'on', publication_names '\"pub\"')") at walsender.c:1646
#16 0x0000000000ba3514 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffdd027c560, dbname=<optimized out>, username=<optimized out>) at postgres.c:4482
#17 0x0000000000a7284a in BackendRun (port=port@entry=0x2477b60) at postmaster.c:4491
#18 0x0000000000a78bba in BackendStartup (port=port@entry=0x2477b60) at postmaster.c:4213
#19 0x0000000000a78ff9 in ServerLoop () at postmaster.c:1745
#20 0x0000000000a7bbdf in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x244bae0) at postmaster.c:1417
#21 0x000000000090dc80 in main (argc=3, argv=0x244bae0) at main.c:209
-------------
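
As an aid to reading the trace: the commit_time values in frames #7 and #8 are TimestampTz values, i.e. microseconds since the PostgreSQL epoch (2000-01-01). Below is a minimal sketch, not taken from the patch, that converts such a value to Unix seconds. The helper name pg_timestamp_to_unix and the macro name are made up for illustration, and the helper assumes a timestamp at or after 2000-01-01.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds between the Unix epoch (1970-01-01) and the PostgreSQL epoch (2000-01-01). */
#define PG_EPOCH_OFFSET_SECS INT64_C(946684800)

/*
 * Hypothetical helper (not a PostgreSQL function): convert microseconds
 * since 2000-01-01 into whole seconds since 1970-01-01.
 */
static int64_t
pg_timestamp_to_unix(int64_t pg_usecs)
{
    /* Assumption of this sketch: only timestamps at or after 2000-01-01. */
    assert(pg_usecs >= 0);

    return pg_usecs / INT64_C(1000000) + PG_EPOCH_OFFSET_SECS;
}
```

Applied to commit_time=674644388404356 this gives 1621329188, which falls on 2021-05-18 (UTC), the same day as the log excerpt above.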

I noticed that it called the pgoutput_truncate function and then the pgoutput_begin_prepare function. That seems odd, because the TRUNCATE is not in a prepared transaction in my case.

I tried to debug this to learn more and found that in the pgoutput_truncate function, the value of in_prepared_txn was true. It then hit the segmentation fault in logicalrep_write_begin_prepare when trying to send the gid: this transaction has no gid (str=0x0 in frame #0), so pq_sendstring was passed a NULL pointer.

FYI:
I also tested the case in synchronous mode, and it executes successfully. So I think the value of in_prepared_txn is sometimes incorrect in asynchronous mode. Maybe there's a better way to get this.
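
To illustrate why a missing gid is fatal here, below is a standalone sketch of an encoder for the Begin Prepare message. It follows the layout documented for protocol version 3 in this patch series: Byte1('b'), Int64 prepare LSN, Int64 end LSN, Int64 prepare timestamp, Int32 xid, then the GID as a null-terminated string. This is not the patch's code. The real implementation uses the pq_send* routines, and the names put_be and encode_begin_prepare are hypothetical. The only point is that the GID has to be checked before it is treated as a string.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store the low nbytes bytes of v at p in network (big-endian) byte order. */
static void
put_be(uint8_t *p, uint64_t v, int nbytes)
{
    for (int i = nbytes - 1; i >= 0; i--)
    {
        p[i] = (uint8_t) (v & 0xFF);
        v >>= 8;
    }
}

/*
 * Hypothetical encoder for a Begin Prepare message.
 * Returns the number of bytes written, or -1 if gid is NULL or the
 * buffer is too small. A transaction without a GID is rejected here
 * instead of being passed on to strlen().
 */
static int
encode_begin_prepare(uint8_t *buf, size_t buflen,
                     uint64_t prepare_lsn, uint64_t end_lsn,
                     int64_t prepare_time, uint32_t xid, const char *gid)
{
    size_t gidlen;
    size_t need;

    if (gid == NULL)
        return -1;              /* not a prepared transaction */

    gidlen = strlen(gid) + 1;   /* include the terminating NUL */
    need = 1 + 8 + 8 + 8 + 4 + gidlen;
    if (need > buflen)
        return -1;

    buf[0] = 'b';                                   /* message type */
    put_be(buf + 1, prepare_lsn, 8);                /* LSN of the prepare */
    put_be(buf + 9, end_lsn, 8);                    /* end LSN of the prepare */
    put_be(buf + 17, (uint64_t) prepare_time, 8);   /* prepare timestamp */
    put_be(buf + 25, xid, 4);                       /* xid */
    memcpy(buf + 29, gid, gidlen);                  /* GID, NUL-terminated */

    return (int) need;
}
```

With gid set to NULL, as for the TRUNCATE transaction in the trace, this sketch returns -1 rather than crashing. Whether the right fix is a guard like this or a correct in_prepared_txn value is a separate question for the patch authors.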

Regards
Tang

#326Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#319)

On Thu, May 13, 2021 at 3:20 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v75*

Review comments for v75-0001-Add-support-for-prepared-transactions-to-built-i:
===============================================================================
1.
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable
class="parameter">slot_name</replaceable> [
<literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [
<literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal>
<replaceable class="parameter">output_plugin</replaceable> [
<literal>EXPORT_SNAPSHOT</literal> |
<literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal>
] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable
class="parameter">slot_name</replaceable> [
<literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] {
<literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] |
<literal>LOGICAL</literal> <replaceable
class="parameter">output_plugin</replaceable> [
<literal>EXPORT_SNAPSHOT</literal> |
<literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal>
] }

Can we do some testing of the code related to this in some way? One
random idea could be to change the current subscriber-side code just
for testing purposes to see if this works. Can we enhance and use
pg_recvlogical to test this? It is possible that if you address
comment number 13 below, this can be tested with Create Subscription
command.

2.
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Commit Prepared messages belong to the same transaction.

I think here we need to write Prepare instead of Commit Prepared
because Commit Prepared for a transaction can come at a later point of
time and all the messages in-between won't belong to the same
transaction.

3.
+<!-- ==================== TWO_PHASE Messages ==================== -->
+
+<para>
+The following messages (Begin Prepare, Prepare, Commit Prepared,
Rollback Prepared)
+are available since protocol version 3.
+</para>

I am not sure here marker like "TWO_PHASE Messages" is required. We
don't have any such marker for streaming messages.

4.
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction.

Isn't it better to write this description as "Prepare timestamp of the
transaction" to match the similar description of the Commit
timestamp? There are similar occurrences at other places in the
patch; change those as well.

5.
+<term>Begin Prepare</term>
+<listitem>
+<para>
...
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the
transaction for top-level
+                transactions).

The above description seems wrong to me. It should be Xid of the
transaction as we won't receive Xid of subtransaction in Begin
message. The same applies to the prepare/commit prepared/rollback
prepared transaction messages as well, so change that as well
accordingly.

6.
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies this message as a two-phase prepare
transaction message.
+</para></listitem>

In all the similar messages, we are using "Identifies the message as
...". I feel it is better to be consistent in this and similar
messages in the patch.

7.
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
..
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>

This should be end LSN of the prepared transaction.

8.
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+ TimestampTz origin_prepare_timestamp)
..
..
+ /*
+ * We are neither expecting the collisions of GXACTs (same gid)
+ * between publisher and subscribers nor the apply worker restarts
+ * after prepared xacts,

The second part of the comment ".. nor the apply worker restarts after
prepared xacts .." is no longer true after commit 8bdb1332eb [1]. So,
we can remove it.

9.
+ /*
+ * Does the subscription have tables?
+ *
+ * If there were not-READY relations found then we know it does. But if
+ * table_state_no_ready was empty we still need to check again to see
+ * if there are 0 tables.
+ */
+ has_subrels = (list_length(table_states_not_ready) > 0) ||

Typo in comments. s/table_state_no_ready/table_states_not_ready/

10.
+ if (!twophase)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));

errmsg is not aligned properly. Can we make the error message clear,
something like: "cannot change two_phase option"

11.
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
     char **synchronous_commit,
     bool *refresh,
     bool *binary_given, bool *binary,
-    bool *streaming_given, bool *streaming)
+    bool *streaming_given, bool *streaming,
+    bool *twophase_given, bool *twophase)

This function already has 14 parameters and this patch adds 2 new
ones. Isn't it better to have a struct (ParseSubOptions) for these
parameters? I think that might lead to some code churn but we can have
that as a separate patch on top of which we can create two_pc patch.

12.
* The subscription two_phase commit implementation requires
+ * that replication has passed the initial table
+ * synchronization phase before the two_phase becomes properly
+ * enabled.

Can we slightly modify the start of this sentence to: "The
subscription option 'two_phase' requires that ..."

13.
@@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt,
bool isTopLevel)
{
Assert(slotname);

- walrcv_create_slot(wrconn, slotname, false,
+ /*
+ * Even if two_phase is set, don't create the slot with
+ * two-phase enabled. Will enable it once all the tables are
+ * synced and ready. This avoids race-conditions like prepared
+ * transactions being skipped due to changes not being applied
+ * due to checks in should_apply_changes_for_rel() when
+ * tablesync for the corresponding tables are in progress. See
+ * comments atop worker.c.
+ */
+ walrcv_create_slot(wrconn, slotname, false, false,

Can't we enable two_phase if copy_data is false? Because in that case,
all relations will be in a READY state. If we do that then we should
also set two_phase state as 'enabled' during createsubscription. I
think we need to be careful to check that connect option is given and
copy_data is false before setting such a state. Now, I guess we may
not be able to optimize this to not set 'enabled' state when the
subscription has no rels.

14.
+ if (options->proto.logical.twophase &&
+ PQserverVersion(conn->streamConn) >= 140000)
+ appendStringInfoString(&cmd, ", two_phase 'on'");
+

We need to check 150000 here but for now, maybe we can add a comment
similar to what you have added in ApplyWorkerMain to avoid forgetting
this change. Probably a similar comment is required in pg_dump.c.

15.
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)

  /* fixed fields */
  pq_sendint64(out, txn->final_lsn);
- pq_sendint64(out, txn->commit_time);
+ pq_sendint64(out, txn->u_op_time.prepare_time);
  pq_sendint32(out, txn->xid);

Why prepare_time here? It should be commit_time. We use prepare_time
in begin_prepare, not in begin.

16.
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+ /*
+ * This should only ever happen for two-phase commit transactions. In
+ * which case we expect to have a valid GID. Additionally, the transaction
+ * must be prepared. See ReorderBufferFinishPrepared.
+ */
+ Assert(txn->gid != NULL);
+

The second part of the comment ("Additionally, the transaction must be
prepared") is no longer true. Also, we can combine the first two
sentences here and at other places where a similar comment is used.

17.
+ union
+ {
+ TimestampTz commit_time;
+ TimestampTz prepare_time;
+ } u_op_time;

I think it is better to name this union as xact_time or trans_time.

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8bdb1332eb51837c15a10a972c179b84f654279e

--
With Regards,
Amit Kapila.

#327Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#326)
2 attachment(s)

Please find attached the latest patch set v77*

Differences from v76* are:

* Rebased to HEAD @ yesterday

* v77* addresses most of Amit's recent feedback comments [1]; I will
reply to that mail separately with the details.

* The v77-003 is temporarily omitted from this patch set. That will be
re-added in v78* early next week.

----
[1]: /messages/by-id/CAA4eK1Jz64rwLyB6H7Z_SmEDouJ41KN42=VkVFp6JTpafJFG8Q@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v77-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 58b8f12e09cedd2aa8d3637f98e8652e1a3572ad Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 21 May 2021 15:41:56 +1000
Subject: [PATCH v77] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 307 ++++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 150 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 201 ++++++++++--
 src/backend/replication/logical/worker.c           | 341 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 291 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 232 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 44 files changed, 2367 insertions(+), 206 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 6d06ad2..df9e41c 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a7ec5c3..493432d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..6683929 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decode of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be  decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction 
+   contains Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7365,6 +7386,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..776295c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if a prepared transaction with the given GID, LSN and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching prepared xacts
+ * from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5c84d75..9b941e9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1254,5 +1254,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..8c0e6a8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -505,10 +556,35 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Special case: if tables were specified but copy_data is
+				 * false then it is safe to enable two_phase up-front because
+				 * those tables start out in READY state. Note, if
+				 * the subscription has no tables then enablement cannot be
+				 * done here - we must leave the twophase state as PENDING, to
+				 * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -814,7 +890,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +925,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +954,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +1000,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1017,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1059,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1077,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1117,33 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..ccde3bc 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The below code is a temporary hack to check for
+		 * server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +847,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +861,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..58b4e2c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c387997 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
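For readers following the wire format, here is a minimal standalone sketch of the ROLLBACK PREPARED message body as written by logicalrep_write_rollback_prepared above: one flags byte, four 64-bit fields, a 32-bit xid, then the NUL-terminated gid, all integers in network byte order. It does not use the pq_* or StringInfo APIs; every demo_* name, the DemoRollbackPrepared struct, and the 200-byte gid size are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_GIDSIZE 200				/* assumed to match PostgreSQL's GIDSIZE */
#define DEMO_FIXED_LEN (1 + 8 * 4 + 4)	/* flags + 4 x int64 + xid */

/* Hypothetical stand-in for LogicalRepRollbackPreparedTxnData. */
typedef struct DemoRollbackPrepared
{
	uint64_t	prepare_end_lsn;
	uint64_t	rollback_end_lsn;
	int64_t		prepare_time;
	int64_t		rollback_time;
	uint32_t	xid;
	char		gid[DEMO_GIDSIZE];
} DemoRollbackPrepared;

/* Store the low nbytes of v in network (big-endian) byte order. */
static size_t
demo_put_be(uint8_t *buf, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return (size_t) nbytes;
}

/* Read an nbytes-wide big-endian integer. */
static uint64_t
demo_get_be(const uint8_t *buf, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/*
 * Encode the message body (everything after the message-type byte).
 * buf must hold at least DEMO_FIXED_LEN + DEMO_GIDSIZE bytes.
 */
size_t
demo_write_rollback_prepared(uint8_t *buf, const DemoRollbackPrepared *d)
{
	size_t		off = 0;
	size_t		gidlen;

	assert(buf != NULL && d != NULL);
	gidlen = strlen(d->gid) + 1;	/* include the terminating NUL */

	buf[off++] = 0;					/* flags, currently always zero */
	off += demo_put_be(buf + off, d->prepare_end_lsn, 8);
	off += demo_put_be(buf + off, d->rollback_end_lsn, 8);
	off += demo_put_be(buf + off, (uint64_t) d->prepare_time, 8);
	off += demo_put_be(buf + off, (uint64_t) d->rollback_time, 8);
	off += demo_put_be(buf + off, d->xid, 4);
	memcpy(buf + off, d->gid, gidlen);
	return off + gidlen;
}

/* Decode and validate; returns 0 on success, -1 on a malformed message. */
int
demo_read_rollback_prepared(const uint8_t *buf, size_t len,
							DemoRollbackPrepared *d)
{
	size_t		off = 0;
	size_t		gidlen;
	const uint8_t *nul;

	if (len < DEMO_FIXED_LEN + 1 || buf[off++] != 0)
		return -1;					/* too short, or unrecognized flags */

	d->prepare_end_lsn = demo_get_be(buf + off, 8); off += 8;
	d->rollback_end_lsn = demo_get_be(buf + off, 8); off += 8;
	if (d->prepare_end_lsn == 0 || d->rollback_end_lsn == 0)
		return -1;					/* an LSN of zero means "not set" */
	d->prepare_time = (int64_t) demo_get_be(buf + off, 8); off += 8;
	d->rollback_time = (int64_t) demo_get_be(buf + off, 8); off += 8;
	d->xid = (uint32_t) demo_get_be(buf + off, 4); off += 4;

	/* The gid must be NUL-terminated inside the message and fit the buffer. */
	nul = memchr(buf + off, 0, len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off));
	if (gidlen >= sizeof(d->gid))
		return -1;
	memcpy(d->gid, buf + off, gidlen + 1);
	return 0;
}
```

The reader in this sketch rejects an over-long or unterminated gid instead of copying it blindly, which is the main design point it is meant to show.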
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b0ab91c..b50aa24 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2548,7 +2548,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2639,7 +2639,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2686,7 +2686,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2706,7 +2706,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2725,19 +2725,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2755,12 +2756,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped, either because two-phase
+	 * was not enabled at the time or because they are not covered by the
+	 * initial snapshot, need to be sent later along with commit prepared, and
+	 * they must be before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
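The two_phase_at LSN kept in the snapshot builder is what ReorderBufferFinishPrepared compares against when deciding whether a prepare that was skipped earlier must be replayed together with its COMMIT PREPARED. A minimal sketch of that test follows; DemoLsn and both demo_* functions are hypothetical names, and the real code operates on ReorderBufferTXN fields rather than bare integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t DemoLsn;		/* stand-in for XLogRecPtr */

/*
 * True when the prepare was recorded before two-phase decoding was enabled
 * (or before the consistent point) and the transaction is now committing,
 * so its changes have to be sent along with the commit prepared.
 */
bool
demo_send_prepare_at_commit(DemoLsn prepare_final_lsn, DemoLsn two_phase_at,
							bool is_commit)
{
	return (prepare_final_lsn < two_phase_at) && is_commit;
}

/* Worked examples of the three interesting cases. */
void
demo_send_prepare_examples(void)
{
	/* prepared before two_phase_at, now committing: replay it */
	assert(demo_send_prepare_at_commit(100, 200, true));
	/* prepared after two_phase_at: it was already decoded at prepare time */
	assert(!demo_send_prepare_at_commit(300, 200, true));
	/* rollback prepared: nothing needs to be decoded */
	assert(!demo_send_prepare_at_commit(100, 200, false));
}
```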
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..4a9275d 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state to new_state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
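To make the tri-state rule in AllTablesyncsReady and UpdateTwoPhaseState easier to follow, here is a small standalone sketch of the decision, under the assumption that it can be reduced to two counters. The enum, its member names, and the demo_* functions are hypothetical; the real code reads pg_subscription_rel and stores the state as a char in pg_subscription.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the DISABLED / PENDING / ENABLED tri-state. */
typedef enum DemoTwoPhaseState
{
	DEMO_TWOPHASE_DISABLED,
	DEMO_TWOPHASE_PENDING,
	DEMO_TWOPHASE_ENABLED
} DemoTwoPhaseState;

/*
 * Same shape as AllTablesyncsReady(): a subscription with no tables is never
 * "all ready"; otherwise it is ready when no table is still syncing.
 */
bool
demo_all_tablesyncs_ready(int total_tables, int not_ready_tables)
{
	bool		has_subrels = total_tables > 0;
	bool		found_busy = not_ready_tables > 0;

	return has_subrels && !found_busy;
}

/*
 * State the (re-started) apply worker ends up with: PENDING is promoted to
 * ENABLED only once every tablesync is READY; anything else is unchanged.
 */
DemoTwoPhaseState
demo_state_on_apply_start(DemoTwoPhaseState cur, int total_tables,
						  int not_ready_tables)
{
	assert(cur == DEMO_TWOPHASE_DISABLED ||
		   cur == DEMO_TWOPHASE_PENDING ||
		   cur == DEMO_TWOPHASE_ENABLED);

	if (cur == DEMO_TWOPHASE_PENDING &&
		demo_all_tablesyncs_ready(total_tables, not_ready_tables))
		return DEMO_TWOPHASE_ENABLED;

	return cur;
}
```

Note that a subscription with zero tables stays PENDING in this sketch, matching the comment about keeping ALTER SUBSCRIPTION ... REFRESH PUBLICATION usable.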
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1432554..22ef22e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions the subscription two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ...
+ * REFRESH PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that initially it was on and
+ * we have received some prepare, then we turn it off; now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we can allow changing this option from off to on, but
+ * it is not clear that that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -246,6 +319,9 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 									   LogicalRepRelMapEntry *relmapentry,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -759,6 +835,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from the same node pointing to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1987,6 +2237,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2464,6 +2730,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2950,6 +3219,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3116,15 +3399,67 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
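As a standalone illustration of the subscriber-side GID scheme in TwoPhaseTransactionGid above ("pg_gid_" followed by the subscription oid and the remote xid), the sketch below builds the GID with plain snprintf and reports truncation. The demo_* names are hypothetical, and returning -1 for an invalid argument or a too-small buffer is an assumption of this sketch, not what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "pg_gid_<subid>_<xid>" into gid. Returns 0 on success, or -1 if an
 * id is zero (invalid) or the buffer is too small to hold the whole GID.
 */
int
demo_two_phase_gid(uint32_t subid, uint32_t xid, char *gid, size_t szgid)
{
	int			n;

	assert(gid != NULL && szgid > 0);

	if (subid == 0 || xid == 0)
		return -1;

	n = snprintf(gid, szgid, "pg_gid_%u_%u",
				 (unsigned int) subid, (unsigned int) xid);

	return (n < 0 || (size_t) n >= szgid) ? -1 : 0;
}

/* Two GIDs collide only if they are byte-for-byte identical. */
bool
demo_same_gid(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}
```

Because the subscription oid is part of the string, two subscriptions on the same subscriber that receive the same remote xid end up with different GIDs, which is the property the comment atop worker.c relies on to avoid the deadlock.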
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
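
The startup check added to pgoutput_startup() in the file above applies its
tests in a fixed order: option not given, protocol too old, plugin without
two-phase support, and only then "requested". A minimal standalone sketch of
that ordering follows. The enum and function names are hypothetical and exist
only for illustration; the patch itself raises ereport(ERROR) rather than
returning a code.

```c
#include <stdbool.h>

/* Mirrors LOGICALREP_PROTO_TWOPHASE_VERSION_NUM as defined in this patch. */
#define TWOPHASE_PROTO_VERSION 3

/* Hypothetical result codes; the patch reports the two failures via ereport(ERROR). */
typedef enum TwoPhaseStartupResult
{
	TWOPHASE_NOT_REQUESTED,			/* option absent: twophase_opt_given = false */
	TWOPHASE_PROTO_TOO_OLD,			/* proto_version below the two-phase minimum */
	TWOPHASE_PLUGIN_UNSUPPORTED,	/* ctx->twophase is false */
	TWOPHASE_REQUESTED				/* twophase_opt_given = true */
} TwoPhaseStartupResult;

/*
 * Apply the checks in the same order as the if / else-if chain in
 * pgoutput_startup(): the first condition that matches decides the outcome.
 */
static TwoPhaseStartupResult
check_two_phase_startup(bool option_given, int proto_version,
						bool plugin_supports)
{
	if (!option_given)
		return TWOPHASE_NOT_REQUESTED;
	if (proto_version < TWOPHASE_PROTO_VERSION)
		return TWOPHASE_PROTO_TOO_OLD;
	if (!plugin_supports)
		return TWOPHASE_PLUGIN_UNSUPPORTED;
	return TWOPHASE_REQUESTED;
}
```

Because the protocol check comes first, a client asking for two_phase with
proto_version 2 gets the version error even if the plugin also lacks support.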
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 339c393..cdfd063 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4359,6 +4360,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4402,9 +4404,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The code below is a temporary hack that checks
+	 * for server version 140000, even though this two-phase feature did
+
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4425,6 +4443,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4450,6 +4469,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4477,6 +4498,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4518,6 +4540,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
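
In the pg_dump changes above, subtwophasestate is carried as a one-character
string and compared against the "disabled" value only, so both the pending
and the enabled states are dumped as "two_phase = on". A small standalone
sketch of that comparison, assuming the state characters from
pg_subscription.h in this patch; the macro and function names here are
hypothetical stand-ins.

```c
#include <stdbool.h>
#include <string.h>

/* Stand-ins for the LOGICALREP_TWOPHASE_STATE_* values defined in this patch. */
#define SKETCH_TWOPHASE_DISABLED 'd'
#define SKETCH_TWOPHASE_PENDING  'p'
#define SKETCH_TWOPHASE_ENABLED  'e'

/*
 * Return true when the dumped CREATE SUBSCRIPTION should carry
 * "two_phase = on", i.e. for every state other than disabled.
 */
static bool
dump_emits_two_phase_on(const char *subtwophasestate)
{
	const char	disabled[] = {SKETCH_TWOPHASE_DISABLED, '\0'};

	return strcmp(subtwophasestate, disabled) != 0;
}
```

A subscription whose state is still 'p' at dump time is therefore restored
with two_phase requested, and is expected to go through the pending state
again on the new cluster.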
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3e39fdb..920f083 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6598c53..194c322 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare, and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
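
The comment on LogicalRepRollbackPreparedTxnData above says the gid alone is
not enough to decide whether a ROLLBACK PREPARED applies, because the
subscriber may hold an unrelated prepared transaction with the same
identifier. A minimal sketch of a match on all three values follows. The
struct, the function, and the integer types are hypothetical stand-ins, and
this is not claimed to be how LookupGXact() is implemented.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* stand-in for GIDSIZE */

/* Hypothetical record of a prepared transaction known to the subscriber. */
typedef struct PreparedXactInfo
{
	char		gid[SKETCH_GIDSIZE];
	uint64_t	prepare_end_lsn;	/* stand-in for XLogRecPtr */
	int64_t		prepare_time;		/* stand-in for TimestampTz */
} PreparedXactInfo;

/*
 * The rollback is applied only if gid, prepare end LSN and prepare time all
 * match the local prepared transaction; otherwise it can be skipped.
 */
static bool
rollback_applies(const PreparedXactInfo *local, const char *gid,
				 uint64_t prepare_end_lsn, int64_t prepare_time)
{
	return strcmp(local->gid, gid) == 0 &&
		local->prepare_end_lsn == prepare_end_lsn &&
		local->prepare_time == prepare_time;
}
```

With this shape, a locally prepared transaction that merely reuses the gid
does not match, since its prepare LSN and timestamp differ from the ones sent
by the publisher.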
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 53cdfa5..86628c7 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -311,7 +311,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -650,7 +654,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..91d9032
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,291 @@
+# Test logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 19;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..2bea214
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,232 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..2cfc1ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,9 +1388,11 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

Attachment: v77-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From cfa38d78a9423ee8c10b997e32b025c39ee1d00f Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 21 May 2021 18:07:31 +1000
Subject: [PATCH v77] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
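Reviewer note (this text sits after the "---" marker, so it is not part of the
commit message): as a reading aid, the Stream Prepare wire layout documented in
the protocol.sgml hunk of this patch can be sketched as a standalone
encoder/decoder. This is only a minimal sketch under stated assumptions, not
the patch's code: every Sketch*/sketch_* name is hypothetical, plain byte
buffers stand in for StringInfo and the pq_* helpers, SKETCH_GIDSIZE = 200 is
assumed to mirror PostgreSQL's GIDSIZE, and the decoder consumes the leading
'p' type byte itself, whereas in the patch that byte is read by the caller
before logicalrep_read_stream_prepare() runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumption: mirrors PostgreSQL's GIDSIZE */

/* Hypothetical stand-in for the fields carried by a Stream Prepare message. */
typedef struct SketchStreamPrepare
{
	uint8_t		flags;			/* currently unused, must be 0 */
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* transaction id */
	char		gid[SKETCH_GIDSIZE];	/* NUL-terminated GID */
} SketchStreamPrepare;

/* Store the low nbytes of v at buf in network (big-endian) byte order. */
static size_t
put_be(uint8_t *buf, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return (size_t) nbytes;
}

/* Load an nbytes-wide big-endian integer from buf. */
static uint64_t
get_be(const uint8_t *buf, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Encode m into buf; returns bytes written, or 0 if cap is too small. */
size_t
sketch_write_stream_prepare(uint8_t *buf, size_t cap,
							const SketchStreamPrepare *m)
{
	size_t		gidlen;
	size_t		off = 0;

	assert(m != NULL && buf != NULL);
	gidlen = strlen(m->gid) + 1;	/* include the terminating NUL */
	if (1 + 1 + 8 + 8 + 8 + 4 + gidlen > cap)
		return 0;

	buf[off++] = 'p';			/* Byte1('p'): message type */
	buf[off++] = m->flags;		/* Int8: flags */
	off += put_be(buf + off, m->prepare_lsn, 8);
	off += put_be(buf + off, m->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += put_be(buf + off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);	/* String: GID */
	return off + gidlen;
}

/* Decode buf into m; returns 0 on success, -1 if the message is malformed. */
int
sketch_read_stream_prepare(const uint8_t *buf, size_t len,
						   SketchStreamPrepare *m)
{
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;
	const uint8_t *nul;
	size_t		gidlen;

	if (len <= fixed || buf[0] != 'p' || buf[1] != 0)
		return -1;				/* short, wrong type, or nonzero flags */

	nul = memchr(buf + fixed, '\0', len - fixed);
	if (nul == NULL)
		return -1;				/* GID is not NUL-terminated */
	gidlen = (size_t) (nul - (buf + fixed));
	if (gidlen >= SKETCH_GIDSIZE)
		return -1;				/* GID would overflow the fixed buffer */

	m->flags = buf[1];
	m->prepare_lsn = get_be(buf + 2, 8);
	m->end_lsn = get_be(buf + 10, 8);
	m->prepare_time = (int64_t) get_be(buf + 18, 8);
	m->xid = (uint32_t) get_be(buf + 26, 4);
	memcpy(m->gid, buf + fixed, gidlen + 1);
	return 0;
}
```

The decoder rejects a GID that would not fit in the fixed-size buffer rather
than copying it unchecked; that is a choice made for this sketch.
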
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 450 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 268 ++++++++++++
 11 files changed, 1010 insertions(+), 78 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
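
Reviewer note on the expected row counts in 023_twophase_stream.pl: the
"3334|3334|3334" result follows from the test's own statements. The table
starts with a = 1 and 2, the streamed transaction inserts a = 3..5000, the
UPDATE changes no row count, and the DELETE removes every row where
mod(a,3) = 0. A minimal sketch of that arithmetic (the function name is
hypothetical and exists only for illustration):

```c
/*
 * Count the rows left in test_tab after the test's transaction commits:
 * keys 1..max_a are present, then every key divisible by 3 is deleted.
 */
int
sketch_expected_rows(int max_a)
{
	int			remaining = 0;

	for (int a = 1; a <= max_a; a++)
	{
		if (a % 3 != 0)			/* survives DELETE ... WHERE mod(a,3) = 0 */
			remaining++;
	}
	return remaining;
}
```

With max_a = 5000 this gives 5000 - 1666 = 3334. The count(c) and
count(d = 999) columns match count(*) because the subscriber-side defaults
make both expressions non-null for every replicated row.
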

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 6683929..19cd6a2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7386,7 +7386,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7649,6 +7649,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress (streamed) transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8c0e6a8..e5e2723 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -925,12 +910,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 22ef22e..6fcf314 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -322,6 +322,8 @@ static void apply_handle_tuple_routing(ResultRelInfo *relinfo,
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
 
+static int	apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -1009,6 +1011,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1206,30 +1288,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1237,7 +1310,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1252,7 +1325,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1327,6 +1400,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2253,6 +2351,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..0ac4433
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,450 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..e8bb726
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,268 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#328Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#326)

On Tue, May 18, 2021 at 9:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 13, 2021 at 3:20 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v75*

Review comments for v75-0001-Add-support-for-prepared-transactions-to-built-i:
===============================================================================
1.
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable
class="parameter">slot_name</replaceable> [
<literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [
<literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal>
<replaceable class="parameter">output_plugin</replaceable> [
<literal>EXPORT_SNAPSHOT</literal> |
<literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal>
] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable
class="parameter">slot_name</replaceable> [
<literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] {
<literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] |
<literal>LOGICAL</literal> <replaceable
class="parameter">output_plugin</replaceable> [
<literal>EXPORT_SNAPSHOT</literal> |
<literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal>
] }

Can we do some testing of the code related to this in some way? One
random idea could be to change the current subscriber-side code just
for testing purposes to see if this works. Can we enhance and use
pg_recvlogical to test this? It is possible that if you address
comment number 13 below, this can be tested with Create Subscription
command.

TODO

2.
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Commit Prepared messages belong to the same transaction.

I think here we need to write Prepare instead of Commit Prepared
because Commit Prepared for a transaction can come at a later point of
time and all the messages in-between won't belong to the same
transaction.

Fixed in v77-0001
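
To illustrate the distinction, here is a minimal sketch in C. It is not code from the patch, and the enum and function names are hypothetical. It assumes a consumer that wants to know which change messages belong to a prepared transaction: the changes are bracketed by Begin Prepare and Prepare, while Commit Prepared arrives later as a standalone message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message kinds; the real protocol uses single-byte tags. */
typedef enum
{
    MSG_BEGIN_PREPARE,
    MSG_CHANGE,
    MSG_PREPARE,
    MSG_COMMIT_PREPARED
} MsgKind;

/*
 * Count the change messages that fall between a Begin Prepare and its
 * matching Prepare.  Changes seen after Prepare (even if they precede
 * Commit Prepared) belong to other transactions and are not counted.
 */
int
count_changes_in_prepared_xact(const MsgKind *msgs, size_t n)
{
    bool        in_xact = false;
    int         count = 0;

    assert(msgs != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        switch (msgs[i])
        {
            case MSG_BEGIN_PREPARE:
                in_xact = true;
                break;
            case MSG_PREPARE:
                in_xact = false;
                break;
            case MSG_CHANGE:
                if (in_xact)
                    count++;
                break;
            case MSG_COMMIT_PREPARED:
                /* standalone message; it does not close a message group */
                break;
        }
    }
    return count;
}
```

With the sequence Begin Prepare, change, change, Prepare, change, Commit Prepared, only the first two changes are attributed to the prepared transaction.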

3.
+<!-- ==================== TWO_PHASE Messages ==================== -->
+
+<para>
+The following messages (Begin Prepare, Prepare, Commit Prepared,
Rollback Prepared)
+are available since protocol version 3.
+</para>

I am not sure here marker like "TWO_PHASE Messages" is required. We
don't have any such marker for streaming messages.

Fixed in v77-0001

4.
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Timestamp of the prepare transaction.

Isn't it better to write this description as "Prepare timestamp of the
transaction" to match with the similar description of Commit
timestamp. Also, there are similar occurrences in the patch at other
places, change those as well.

Fixed in v77-0001, v77-0002

5.
+<term>Begin Prepare</term>
+<listitem>
+<para>
...
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the
transaction for top-level
+                transactions).

The above description seems wrong to me. It should be Xid of the
transaction as we won't receive Xid of subtransaction in Begin
message. The same applies to the prepare/commit prepared/rollback
prepared transaction messages as well, so change that as well
accordingly.

Fixed in v77-0001, v77-0002

6.
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies this message as a two-phase prepare
transaction message.
+</para></listitem>

In all the similar messages, we are using "Identifies the message as
...". I feel it is better to be consistent in this and similar
messages in the patch.

Fixed in v77-0001, v77-0002

7.
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
..
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>

This should be end LSN of the prepared transaction.

Fixed in v77-0001

8.
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+ TimestampTz origin_prepare_timestamp)
..
..
+ /*
+ * We are neither expecting the collisions of GXACTs (same gid)
+ * between publisher and subscribers nor the apply worker restarts
+ * after prepared xacts,

The second part of the comment ".. nor the apply worker restarts after
prepared xacts .." is no longer true after commit 8bdb1332eb[1]. So,
we can remove it.

Fixed in v77-0001
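
For readers following along, a rough sketch of the kind of lookup this signature suggests. The struct, the array-based storage, and the rule that all three arguments must agree are assumptions made for illustration; the patch's actual matching logic is not shown in this excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of one prepared transaction on the subscriber. */
typedef struct PreparedXact
{
    const char *gid;
    uint64_t    prepare_end_lsn;
    int64_t     prepare_timestamp;
} PreparedXact;

/*
 * Return true only if an entry matches the GID, the prepare end LSN and
 * the origin prepare timestamp.  Requiring all three (an assumption here)
 * keeps an unrelated transaction that merely reuses the GID from matching.
 */
bool
lookup_prepared_xact(const PreparedXact *xacts, size_t n, const char *gid,
                     uint64_t prepare_end_lsn,
                     int64_t origin_prepare_timestamp)
{
    assert(gid != NULL);

    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(xacts[i].gid, gid) == 0 &&
            xacts[i].prepare_end_lsn == prepare_end_lsn &&
            xacts[i].prepare_timestamp == origin_prepare_timestamp)
            return true;
    }
    return false;
}
```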

9.
+ /*
+ * Does the subscription have tables?
+ *
+ * If there were not-READY relations found then we know it does. But if
+ * table_state_no_ready was empty we still need to check again to see
+ * if there are 0 tables.
+ */
+ has_subrels = (list_length(table_states_not_ready) > 0) ||

Typo in comments. /table_state_no_ready/table_states_not_ready

Fixed in v77-0001

10.
+ if (!twophase)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));

errmsg is not aligned properly. Can we make the error message clear,
something like: "cannot change two_phase option"

Fixed in v77-0001.

I fixed the alignment, but did not modify the message text. This
message was already changed in v74 to make it more consistent with
similar errors. Please see Vignesh's feedback [1] (/messages/by-id/CALDaNm0u=QGwd7jDAj-4u=7vvPn5rarFjBMCgfiJbDte55CWAA@mail.gmail.com), comment #1.

11.
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
char **synchronous_commit,
bool *refresh,
bool *binary_given, bool *binary,
-    bool *streaming_given, bool *streaming)
+    bool *streaming_given, bool *streaming,
+    bool *twophase_given, bool *twophase)

This function already has 14 parameters and this patch adds 2 new
ones. Isn't it better to have a struct (ParseSubOptions) for these
parameters? I think that might lead to some code churn but we can have
that as a separate patch on top of which we can create two_pc patch.

This same modification is already being addressed in another thread
[2] (/messages/by-id/CALj2ACWEjphPsfpyX9M+RdqmoRwRbWVKMoW7Tx1o+h+oNEs4pQ@mail.gmail.com).
This patch needs to be re-based later, after the other patch is pushed.
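
As a rough illustration of the struct idea, here is a minimal sketch in C. The field list is trimmed to three options and the helper set_sub_option() is hypothetical; it is not the design from the other thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical bundle of parsed subscription options.  Each option keeps
 * its "given" flag next to its value, replacing a pair of output
 * parameters in the function signature.
 */
typedef struct ParseSubOptions
{
    bool        binary_given;
    bool        binary;
    bool        streaming_given;
    bool        streaming;
    bool        twophase_given;
    bool        twophase;
} ParseSubOptions;

/*
 * Record one boolean option by name.  Returns false for an unrecognized
 * name so that the caller can report the error.
 */
bool
set_sub_option(ParseSubOptions *opts, const char *name, bool value)
{
    assert(opts != NULL && name != NULL);

    if (strcmp(name, "binary") == 0)
    {
        opts->binary_given = true;
        opts->binary = value;
    }
    else if (strcmp(name, "streaming") == 0)
    {
        opts->streaming_given = true;
        opts->streaming = value;
    }
    else if (strcmp(name, "two_phase") == 0)
    {
        opts->twophase_given = true;
        opts->twophase = value;
    }
    else
        return false;

    return true;
}
```

Adding another option then means adding two struct fields and one branch, without touching every caller's argument list.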

12.
* The subscription two_phase commit implementation requires
+ * that replication has passed the initial table
+ * synchronization phase before the two_phase becomes properly
+ * enabled.

Can we slightly modify the start of this sentence as: "The
subscription option 'two_phase' requires that ..."

Fixed in v77-0001

13.
@@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt,
bool isTopLevel)
{
Assert(slotname);

- walrcv_create_slot(wrconn, slotname, false,
+ /*
+ * Even if two_phase is set, don't create the slot with
+ * two-phase enabled. Will enable it once all the tables are
+ * synced and ready. This avoids race-conditions like prepared
+ * transactions being skipped due to changes not being applied
+ * due to checks in should_apply_changes_for_rel() when
+ * tablesync for the corresponding tables are in progress. See
+ * comments atop worker.c.
+ */
+ walrcv_create_slot(wrconn, slotname, false, false,

Can't we enable two_phase if copy_data is false? Because in that case,
all relations will be in a READY state. If we do that then we should
also set two_phase state as 'enabled' during createsubscription. I
think we need to be careful to check that connect option is given and
copy_data is false before setting such a state. Now, I guess we may
not be able to optimize this to not set 'enabled' state when the
subscription has no rels.

Fixed in v77-0001
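
The decision discussed in comment 13 can be sketched as below. This is only an illustration under stated assumptions: the enum and function names are hypothetical, the "pending" state stands for "enable once all tables are synced and ready", and the assert encodes the sketch's own assumption that copy_data is never requested without connect.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state names; the review only names the 'enabled' state. */
typedef enum DemoTwoPhaseState
{
	DEMO_TWOPHASE_DISABLED,
	DEMO_TWOPHASE_PENDING,
	DEMO_TWOPHASE_ENABLED
} DemoTwoPhaseState;

/*
 * Pick the two_phase state to record at subscription creation time.
 */
static DemoTwoPhaseState
demo_initial_twophase_state(bool two_phase, bool connect, bool copy_data)
{
	/* Assumption of this sketch: copying data implies connecting. */
	assert(connect || !copy_data);

	if (!two_phase)
		return DEMO_TWOPHASE_DISABLED;

	/* With connect and no initial copy, all relations start out READY. */
	if (connect && !copy_data)
		return DEMO_TWOPHASE_ENABLED;

	/* Otherwise wait until table synchronization has finished. */
	return DEMO_TWOPHASE_PENDING;
}
```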

14.
+ if (options->proto.logical.twophase &&
+ PQserverVersion(conn->streamConn) >= 140000)
+ appendStringInfoString(&cmd, ", two_phase 'on'");
+

We need to check 150000 here but for now, maybe we can add a comment
similar to what you have added in ApplyWorkerMain to avoid forgetting
this change. Probably a similar comment is required in pg_dump.c.

Fixed in v77-0001
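
A minimal standalone sketch of the version gate from comment 14 follows. It uses a plain char buffer instead of StringInfo, and the macro and function names are hypothetical. The threshold is kept in one place so that the later bump to 150000 is a one-line change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical constant. The patch under review checks 140000; the review
 * notes that this needs to become 150000.
 */
#define DEMO_TWOPHASE_MIN_SERVER_VERSION 140000

/*
 * Append the two_phase option to the command only when it was requested
 * and the remote server is new enough to understand it.
 */
static void
demo_append_twophase_option(char *cmd, size_t cmdsize,
							bool twophase, int server_version)
{
	size_t		len;

	assert(cmd != NULL);
	len = strlen(cmd);
	assert(len < cmdsize);

	if (twophase && server_version >= DEMO_TWOPHASE_MIN_SERVER_VERSION)
		snprintf(cmd + len, cmdsize - len, ", two_phase 'on'");
}
```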

15.
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)

/* fixed fields */
pq_sendint64(out, txn->final_lsn);
- pq_sendint64(out, txn->commit_time);
+ pq_sendint64(out, txn->u_op_time.prepare_time);
pq_sendint32(out, txn->xid);

Why here prepare_time? It should be commit_time. We use prepare_time
in begin_prepare not in begin.

Fixed in v77-0001

16.
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+ /*
+ * This should only ever happen for two-phase commit transactions. In
+ * which case we expect to have a valid GID. Additionally, the transaction
+ * must be prepared. See ReorderBufferFinishPrepared.
+ */
+ Assert(txn->gid != NULL);
+

The second part of the comment ("Additionally, the transaction must be
prepared") is no longer true. Also, we can combine the first two
sentences here and at other places where a similar comment is used.

Fixed in v77-0001, v77-0002

17.
+ union
+ {
+ TimestampTz commit_time;
+ TimestampTz prepare_time;
+ } u_op_time;

I think it is better to name this union as xact_time or trans_time.

Fixed in v77-0001, v77-0002

--------
[1]: /messages/by-id/CALDaNm0u=QGwd7jDAj-4u=7vvPn5rarFjBMCgfiJbDte55CWAA@mail.gmail.com
[2]: /messages/by-id/CALj2ACWEjphPsfpyX9M+RdqmoRwRbWVKMoW7Tx1o+h+oNEs4pQ@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

#329Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#328)
3 attachment(s)

On Fri, May 21, 2021 at 6:43 PM Peter Smith <smithpb2250@gmail.com> wrote:

Fixed in v77-0001, v77-0002

Attaching a new patch-set that rebases the patch and addresses the
review comments from Peter as well as a test failure reported by Tang.
I've also added some new test cases, authored by Tang, to patch-2.

I've addressed the following comments:

On Tue, May 18, 2021 at 6:53 PM Peter Smith <smithpb2250@gmail.com> wrote:

1. File: doc/src/sgml/logicaldecoding.sgml

1.1

@@ -862,11 +862,19 @@ typedef void (*LogicalDecodePrepareCB) (struct
LogicalDecodingContext *ctx,
The required <function>commit_prepared_cb</function> callback is called
whenever a transaction <command>COMMIT PREPARED</command> has
been decoded.
The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command> in which case
+      it can apply the rollback, otherwise, it can skip the rollback
operation. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with same identifier.

This is in the commit prepared section, but that new text is referring
to "it can apply to the rollback" etc.
Is this deliberate text, or maybe cut/paste error?

==========

Fixed.
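
As a rough illustration of why the gid alone is not sufficient, the sketch below matches a prepared transaction on gid, prepare end LSN and prepare time together. The struct, the in-memory array and demo_lookup_prepared() are hypothetical; they are not how the patch's LookupGXact() is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t XLogRecPtr;	/* stand-ins for the PostgreSQL typedefs */
typedef int64_t TimestampTz;

/* Hypothetical record of one prepared transaction known downstream. */
typedef struct DemoPreparedXact
{
	const char *gid;
	XLogRecPtr	prepare_end_lsn;
	TimestampTz prepare_time;
} DemoPreparedXact;

/*
 * Return true only if a prepared transaction with the same gid AND the
 * same prepare end LSN AND the same prepare time is known locally.
 */
static bool
demo_lookup_prepared(const DemoPreparedXact *xacts, size_t n,
					 const char *gid, XLogRecPtr prepare_end_lsn,
					 TimestampTz prepare_time)
{
	size_t		i;

	assert(gid != NULL);
	assert(xacts != NULL || n == 0);

	for (i = 0; i < n; i++)
	{
		if (strcmp(xacts[i].gid, gid) == 0 &&
			xacts[i].prepare_end_lsn == prepare_end_lsn &&
			xacts[i].prepare_time == prepare_time)
			return true;
	}
	return false;
}
```

A locally created prepared transaction that happens to reuse the same gid will differ in LSN or timestamp, so it is not mistaken for the replicated one.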

2. File: src/backend/replication/pgoutput/pgoutput.c

2.1

@@ -76,6 +78,7 @@ static void
pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,

static bool publications_valid;
static bool in_streaming;
+static bool in_prepared_txn;

Wondering why this is a module static flag. That makes it look like
it somehow applies globally to all the functions in this scope, but
really I think this is just a txn property, right?
- e.g. why not use another member of the private TXN data instead? or
- e.g. why not use rbtxn_prepared(txn) macro?

----------

I've removed this flag and used the rbtxn_prepared(txn) macro instead.
This also seems to fix the crash reported by Tang, where it was trying
to send a "BEGIN PREPARE" as part of a non-prepared tx. I've changed
the logic to rely on the prepared flag in the txn to decide whether
BEGIN or BEGIN PREPARE needs to be sent.
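
The deferred-BEGIN logic described here can be sketched in isolation as follows. Everything in the block is a hypothetical stand-in: messages are recorded in a small array instead of being written to the output stream, and the prepared field plays the role of rbtxn_prepared(txn).

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_MSGS 8

typedef enum DemoMsg
{
	DEMO_MSG_BEGIN,
	DEMO_MSG_BEGIN_PREPARE,
	DEMO_MSG_CHANGE,
	DEMO_MSG_COMMIT,
	DEMO_MSG_PREPARE
} DemoMsg;

/* Hypothetical per-transaction state. */
typedef struct DemoTxn
{
	bool		prepared;		/* stands in for rbtxn_prepared(txn) */
	bool		sent_begin_txn; /* has BEGIN / BEGIN PREPARE gone out? */
	int			nmsgs;
	DemoMsg		msgs[DEMO_MAX_MSGS];
} DemoTxn;

static void
demo_send(DemoTxn *txn, DemoMsg msg)
{
	assert(txn->nmsgs < DEMO_MAX_MSGS);
	txn->msgs[txn->nmsgs++] = msg;
}

/* Send a change, emitting the postponed BEGIN first if needed. */
static void
demo_change(DemoTxn *txn, bool published)
{
	if (!published)
		return;					/* nothing to send for this relation */

	if (!txn->sent_begin_txn)
	{
		demo_send(txn, txn->prepared ? DEMO_MSG_BEGIN_PREPARE : DEMO_MSG_BEGIN);
		txn->sent_begin_txn = true;
	}
	demo_send(txn, DEMO_MSG_CHANGE);
}

/* End of transaction: skip COMMIT / PREPARE if nothing was sent. */
static void
demo_end(DemoTxn *txn)
{
	if (!txn->sent_begin_txn)
		return;
	demo_send(txn, txn->prepared ? DEMO_MSG_PREPARE : DEMO_MSG_COMMIT);
}
```

Because the choice is made from the transaction's own flag at the time of the first change, there is no module-level state that could be left over from an earlier transaction.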

2.2

@@ -404,10 +410,32 @@ pgoutput_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
static void
pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
+ PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+ sizeof(PGOutputTxnData));
+
+ (void)txn; /* keep compiler quiet */

I guess since now the arg "txn" is being used the added statement to
"keep compiler quiet" is now redundant, so should be removed.

Removed this.

----------

2.3

+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+ PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;

OutputPluginPrepareWrite(ctx, !send_replication_origin);
logicalrep_write_begin(ctx->out, txn);
+ data->sent_begin_txn = true;

I wondered is it worth adding Assert(data); here?

----------

Added.

2.4

@@ -422,8 +450,14 @@ static void
pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn)
{
+ PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;
+
OutputPluginUpdateProgress(ctx);

I wondered is it worthwhile to add Assert(data); here also?

----------

Added.

2.5
@@ -422,8 +450,14 @@ static void
pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn)
{
+ PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;
+
OutputPluginUpdateProgress(ctx);
+ /* skip COMMIT message if nothing was sent */
+ if (!data->sent_begin_txn)
+ return;

Shouldn't this code also be freeing that allocated data? I think you
do free it in similar functions later in this patch.

----------

Modified this.

2.6

@@ -435,10 +469,31 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
static void
pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
+ PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+ sizeof(PGOutputTxnData));
+
+ /*
+ * Don't send BEGIN message here. Instead, postpone it until the first
+ * change. In logical replication, a common scenario is to replicate a set
+ * of tables (instead of all tables) and transactions whose changes were on
+ * table(s) that are not published will produce empty transactions. These
+ * empty transactions will send BEGIN and COMMIT messages to subscribers,
+ * using bandwidth on something with little/no use for logical replication.
+ */
+ data->sent_begin_txn = false;
+ txn->output_plugin_private = data;
+ in_prepared_txn = true;
+}

Apart from setting the in_prepared_txn = true; this is all identical
code to pgoutput_begin_txn so you could consider just delegating to
call that other function to save all the cut/paste data allocation and
big comment. Or maybe this way is better - I am not sure.

----------

Updated this.

2.7

+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
bool send_replication_origin = txn->origin_id != InvalidRepOriginId;
+ PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;

OutputPluginPrepareWrite(ctx, !send_replication_origin);
logicalrep_write_begin_prepare(ctx->out, txn);
+ data->sent_begin_txn = true;

I wondered is it worth adding Assert(data); here also?

----------

Added Assert.

2.8

@@ -453,11 +508,18 @@ static void
pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
XLogRecPtr prepare_lsn)
{
+ PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
OutputPluginUpdateProgress(ctx);

I wondered is it worth adding Assert(data); here also?

----------

Added.

2.9

@@ -465,12 +527,28 @@ pgoutput_prepare_txn(LogicalDecodingContext
*ctx, ReorderBufferTXN *txn,
*/
static void
pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
- XLogRecPtr commit_lsn)
+ XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+ TimestampTz prepare_time)
{
+ PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
OutputPluginUpdateProgress(ctx);
+ /*
+ * skip sending COMMIT PREPARED message if prepared transaction
+ * has not been sent.
+ */
+ if (data && !data->sent_begin_txn)
+ {
+ pfree(data);
+ return;
+ }
+
+ if (data)
+ pfree(data);
OutputPluginPrepareWrite(ctx, true);

I think this pfree logic might be refactored more simply to just be
done in one place. e.g. like:

if (data)
{
bool skip = !data->sent_begin_txn;
pfree(data);
if (skip)
return;
}

BTW, is it even possible to get in this function with NULL private
data? Perhaps that should be an Assert(data) ?

----------

Changed the logic as per your suggestion, but did not add the Assert
because this function can be entered with NULL private data. This can
happen when the commit prepared for the transaction arrives after a
restart of the WALSENDER, at which point the previously set up private
data is lost. This only applies to commit prepared and rollback
prepared.
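
A standalone sketch of the refactored free-and-skip logic, including the NULL case, is given below. It uses free() in place of pfree(), and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-transaction private data. */
typedef struct DemoTxnData
{
	bool		sent_begin_txn;
} DemoTxnData;

/*
 * Decide whether COMMIT PREPARED / ROLLBACK PREPARED should be sent.
 *
 * The private data is released in exactly one place. A NULL pointer is
 * allowed: in that case there is no record of what was sent, so the
 * message is sent.
 */
static bool
demo_should_send_finish_prepared(DemoTxnData **datap)
{
	DemoTxnData *data;

	assert(datap != NULL);
	data = *datap;

	if (data != NULL)
	{
		bool		skip = !data->sent_begin_txn;

		free(data);
		*datap = NULL;
		if (skip)
			return false;
	}
	return true;
}
```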

2.10

@@ -483,8 +561,22 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
XLogRecPtr prepare_end_lsn,
TimestampTz prepare_time)
{
+ PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
OutputPluginUpdateProgress(ctx);
+ /*
+ * skip sending COMMIT PREPARED message if prepared transaction
+ * has not been sent.
+ */
+ if (data && !data->sent_begin_txn)
+ {
+ pfree(data);
+ return;
+ }
+
+ if (data)
+ pfree(data);

Same comment as above for refactoring the pfree logic.

----------

Refactored.

2.11

@@ -483,8 +561,22 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
XLogRecPtr prepare_end_lsn,
TimestampTz prepare_time)
{
+ PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
OutputPluginUpdateProgress(ctx);
+ /*
+ * skip sending COMMIT PREPARED message if prepared transaction
+ * has not been sent.
+ */
+ if (data && !data->sent_begin_txn)
+ {
+ pfree(data);
+ return;
+ }
+
+ if (data)
+ pfree(data);

Is that comment correct or cut/paste error? Why does it say "COMMIT PREPARED" ?

----------

Fixed.

2.12

@@ -613,6 +705,7 @@ pgoutput_change(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
Relation relation, ReorderBufferChange *change)
{
PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+ PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
MemoryContext old;

I wondered is it worth adding Assert(txndata); here also?

----------

Added.

2.13

@@ -750,6 +852,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
int nrelations, Relation relations[], ReorderBufferChange *change)
{
PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+ PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
MemoryContext old;

I wondered is it worth adding Assert(txndata); here also?

----------

Added.

2.14

@@ -813,11 +925,15 @@ pgoutput_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
const char *message)
{
PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+ PGOutputTxnData *txndata;
TransactionId xid = InvalidTransactionId;

if (!data->messages)
return;

+ if (txn && txn->output_plugin_private)
+ txndata = (PGOutputTxnData *) txn->output_plugin_private;
+
/*
* Remember the xid for the message in streaming mode. See
* pgoutput_change.
@@ -825,6 +941,19 @@ pgoutput_message(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn,
if (in_streaming)
xid = txn->xid;
+ /* output BEGIN if we haven't yet, avoid for streaming and
non-transactional messages */
+ if (!in_streaming && transactional)
+ {
+ txndata = (PGOutputTxnData *) txn->output_plugin_private;
+ if (!txndata->sent_begin_txn)
+ {
+ if (!in_prepared_txn)
+ pgoutput_begin(ctx, txn);
+ else
+ pgoutput_begin_prepare(ctx, txn);
+ }
+ }
That code:
+ if (txn && txn->output_plugin_private)
+ txndata = (PGOutputTxnData *) txn->output_plugin_private;
looked misplaced to me.

Shouldn't all that be relocated to be put inside the if block:
+ if (!in_streaming && transactional)

And when you do that maybe the condition can be simplified because you could
Assert(txn);

==========

Removed that redundant code, but I cannot add the Assert here because
for streaming and non-transactional messages there will be no
output_plugin_private.

3. File src/include/replication/pgoutput.h

3.1

@@ -30,4 +30,9 @@ typedef struct PGOutputData
bool two_phase;
} PGOutputData;

+typedef struct PGOutputTxnData
+{
+ bool sent_begin_txn; /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+

Why is this typedef here? IIUC it is only used inside the pgoutput.c,
so shouldn't it be declared in that file also?

----------

Changed this accordingly.

3.2

@@ -30,4 +30,9 @@ typedef struct PGOutputData
bool two_phase;
} PGOutputData;

+typedef struct PGOutputTxnData
+{
+ bool sent_begin_txn; /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+

That is a new typedef so maybe your patch also should update the
src/tools/pgindent/typedefs.list to name this new typedef.

----------

Added.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v78-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From 2ccd3bfd7844773e64d2ed37343727d4d9f076af Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 24 May 2021 22:58:19 -0400
Subject: [PATCH v78] Skip empty transactions for logical replication.

The current logical replication behavior is to send every transaction to
subscriber even though the transaction is empty (because it does not
contain changes from the selected publications). It is a waste of CPU
cycles and network bandwidth to build/transmit these empty transactions.

This patch addresses the above problem by postponing the BEGIN / BEGIN PREPARE
message until the first change. While processing a COMMIT message or a PREPARE
message, if there is no other change for that transaction, do not send the
COMMIT message or PREPARE message. This means that pgoutput will skip the
BEGIN / COMMIT or BEGIN PREPARE / PREPARE messages for transactions that are
empty.

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  14 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  36 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 141 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/pgoutput.h              |   1 -
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 225 insertions(+), 34 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 493432d..aceff9a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -862,11 +862,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command> in which case
+      it can commit the transaction, otherwise, it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 19cd6a2..ae2cd11 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+<varlistentry>
+
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit.
 </para></listitem>
 </varlistentry>
@@ -7552,6 +7559,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c387997..ed60719 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -940,7 +941,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -975,7 +977,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..f1d8bf7 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR,"prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b50aa24..bacf849 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2766,7 +2766,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 174d43f..cb87326 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -971,26 +971,38 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* there is no transaction when COMMIT PREPARED is called */
-	ensure_transaction();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In which case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	CommitTransactionCommand();
+		/* there is no transaction when COMMIT PREPARED is called */
+		ensure_transaction();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7c3a33d..84e9cfe 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -135,6 +137,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -404,10 +411,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -422,8 +451,18 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -435,10 +474,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -453,8 +510,15 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -465,12 +529,28 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -483,8 +563,21 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -613,11 +706,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -651,6 +749,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -750,6 +857,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -757,6 +865,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -793,6 +905,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -813,6 +934,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -825,6 +947,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 9b3e934..a6d9977 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 0dc460f..93c6731 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -29,5 +29,4 @@ typedef struct PGOutputData
 	bool		messages;
 	bool		two_phase;
 } PGOutputData;
-
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 86628c7..0ff430d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -456,7 +456,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2cfc1ae..f0941ad 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1596,6 +1596,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
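
For anyone skimming the pgoutput.c hunks above, the "skip empty transactions" idea can be sketched outside the server roughly as follows. This is only an illustration, not the patch code: DemoTxnData, DemoStream and the demo_* functions are hypothetical stand-ins for PGOutputTxnData and the pgoutput callbacks, and the counters stand in for messages written to the subscriber.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PGOutputTxnData; only the flag matters here. */
typedef struct DemoTxnData
{
	bool		sent_begin_txn; /* has BEGIN (or BEGIN PREPARE) gone out? */
} DemoTxnData;

/* Hypothetical output sink: counts the messages that would be sent. */
typedef struct DemoStream
{
	int			nbegin;
	int			nchange;
	int			ncommit;
} DemoStream;

/* BEGIN callback: remember the transaction, but send nothing yet. */
static void
demo_begin(DemoTxnData *txn)
{
	txn->sent_begin_txn = false;
}

/* The first change emits the deferred BEGIN, then the change itself. */
static void
demo_change(DemoStream *out, DemoTxnData *txn)
{
	assert(txn != NULL);

	if (!txn->sent_begin_txn)
	{
		out->nbegin++;
		txn->sent_begin_txn = true;
	}
	out->nchange++;
}

/* COMMIT is skipped when BEGIN was never sent, i.e. the txn was empty. */
static bool
demo_commit(DemoStream *out, DemoTxnData *txn)
{
	if (!txn->sent_begin_txn)
		return false;

	out->ncommit++;
	return true;
}
```

The point of deferring BEGIN is that a transaction with no publishable changes produces no traffic at all, which is what the updated expectation in 020_messages.pl checks.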

Attachment: v78-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 4f2f2fedfbb2b3af392431191e9d084416c3ab28 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 24 May 2021 22:23:58 -0400
Subject: [PATCH v78] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time to the
built-in logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch, but it allows external replication solutions
that use the streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
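
As a reading aid for the "two_phase is enabled once the initial data sync is over" point in the commit message, here is a small sketch of the state codes documented for subtwophasestate ('d', 'p', 'e'). The DEMO_* macros and demo_* functions are hypothetical names, and the transition rule is an assumption drawn from the documentation text in this patch rather than a copy of the worker code.

```c
#include <assert.h>
#include <stdbool.h>

/* State codes as documented for pg_subscription.subtwophasestate. */
#define DEMO_TWOPHASE_DISABLED	'd' /* two_phase was not requested */
#define DEMO_TWOPHASE_PENDING	'p' /* requested, waiting for table sync */
#define DEMO_TWOPHASE_ENABLED	'e' /* requested and now in effect */

/* State assigned when the subscription is created. */
static char
demo_initial_twophase_state(bool two_phase_requested)
{
	return two_phase_requested ? DEMO_TWOPHASE_PENDING
		: DEMO_TWOPHASE_DISABLED;
}

/*
 * Assumed rule: a pending subscription becomes enabled only once the
 * initial table synchronization has finished; other states never change.
 */
static char
demo_after_table_sync(char state, bool all_tables_synced)
{
	assert(state == DEMO_TWOPHASE_DISABLED ||
		   state == DEMO_TWOPHASE_PENDING ||
		   state == DEMO_TWOPHASE_ENABLED);

	if (state == DEMO_TWOPHASE_PENDING && all_tables_synced)
		return DEMO_TWOPHASE_ENABLED;

	return state;
}
```

Keeping "requested" and "in effect" as separate states is what lets a user tell from the catalog whether prepared transactions are already being sent at prepare time.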
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 307 ++++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 150 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 201 ++++++++++--
 src/backend/replication/logical/worker.c           | 341 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 291 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 232 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 44 files changed, 2367 insertions(+), 206 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
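
For readers who want to see the Begin Prepare message from the protocol.sgml changes in this patch as bytes, a standalone sketch of the layout follows: Byte1('b'), three Int64 fields (prepare LSN, end LSN, prepare timestamp), an Int32 xid, then the GID as a null-terminated string, with integers in network byte order. The function names and the caller-supplied buffer are hypothetical; the server builds the message with its own StringInfo routines, which are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store the low nbytes of v at buf in big-endian (network) order. */
static size_t
demo_put_be(uint8_t *buf, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return (size_t) nbytes;
}

/*
 * Serialize a Begin Prepare message into buf.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
static size_t
demo_write_begin_prepare(uint8_t *buf, size_t buflen,
						 uint64_t prepare_lsn, uint64_t end_lsn,
						 int64_t prepare_time, uint32_t xid,
						 const char *gid)
{
	size_t		gidlen = strlen(gid) + 1;	/* include trailing NUL */
	size_t		need = 1 + 8 + 8 + 8 + 4 + gidlen;
	size_t		off = 0;

	if (buflen < need)
		return 0;

	buf[off++] = 'b';								/* message type */
	off += demo_put_be(buf + off, prepare_lsn, 8);	/* LSN of the prepare */
	off += demo_put_be(buf + off, end_lsn, 8);		/* end LSN of the prepare */
	off += demo_put_be(buf + off, (uint64_t) prepare_time, 8);	/* usec since 2000-01-01 */
	off += demo_put_be(buf + off, xid, 4);			/* transaction id */
	memcpy(buf + off, gid, gidlen);					/* user-defined GID */
	off += gidlen;

	assert(off == need);
	return off;
}
```

With a four-character GID the message is 34 bytes long: 1 type byte, 24 bytes of Int64 fields, 4 bytes of xid and 5 bytes of string.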

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 8aebc4d..8a3d350 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a7ec5c3..493432d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..6683929 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction 
+   contains Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7365,6 +7386,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, the decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..776295c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5c84d75..9b941e9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1254,5 +1254,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..8c0e6a8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica. See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -505,10 +556,35 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. We enable it once all the tables are
+				 * synced and ready. This avoids race conditions such as
+				 * prepared transactions being skipped because their changes
+				 * are not applied by should_apply_changes_for_rel() while
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Special case: if tables were specified but copy_data is
+				 * false, then it is safe to enable two_phase up-front because
+				 * those tables start out in READY state. Note, if the
+				 * subscription has no tables then enablement cannot be done
+				 * here - we must leave the twophase state as PENDING, to
+				 * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -814,7 +890,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +925,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +954,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +1000,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1017,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1059,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1077,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1117,33 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..ccde3bc 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The below code is a temporary hack to check for
+		 * server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +847,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +861,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..58b4e2c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c387997 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b0ab91c..b50aa24 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2548,7 +2548,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2639,7 +2639,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2686,7 +2686,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2706,7 +2706,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2725,19 +2725,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2755,12 +2756,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..4a9275d 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state to new_state. */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6ba447e..ef8c38f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions the subscription two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider the case where it was
+ * initially on and we received some prepare, then we turn it off: at commit
+ * time the server will send the entire transaction data along with the
+ * commit. With some more analysis, we could allow changing this option from
+ * off to on, but it is not clear that that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -794,6 +870,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2032,6 +2282,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2509,6 +2775,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2995,6 +3264,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3161,15 +3444,67 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time.
+		 * It remains enabled if the previous start-up has done so. But we
+		 * only allow the option to be passed in with a sufficient version of
+		 * the protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send the replication origin, if requested and if its name can be found */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 339c393..cdfd063 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4359,6 +4360,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4402,9 +4404,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The below code is a temporary hack to check for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4425,6 +4443,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4450,6 +4469,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4477,6 +4498,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4518,6 +4540,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3e39fdb..920f083 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6598c53..194c322 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
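
LookupGXact() as declared above takes the gid together with the prepare LSN and the origin prepare timestamp. The comment on LogicalRepRollbackPreparedTxnData further down in this patch gives the reason: the gid alone is not sufficient, because the downstream node can have a prepared transaction with the same identifier. What follows is only a rough sketch of that matching rule, not the patch's implementation. The PreparedEntry struct, the lookup_prepared() function and the two typedefs are hypothetical stand-ins for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for XLogRecPtr and TimestampTz (assumed here to be 64-bit). */
typedef uint64_t SketchLsn;
typedef int64_t SketchTimestamp;

/* Hypothetical record of one prepared transaction known to the subscriber. */
typedef struct PreparedEntry
{
	const char *gid;
	SketchLsn	prepare_end_lsn;	/* origin LSN at which PREPARE ended */
	SketchTimestamp prepare_time;	/* origin timestamp of the PREPARE */
} PreparedEntry;

/*
 * Return true only if some entry matches on the gid AND on the origin's
 * prepare LSN and prepare timestamp.  A match on the gid alone is rejected.
 */
static bool
lookup_prepared(const PreparedEntry *entries, size_t n, const char *gid,
				SketchLsn prepare_end_lsn, SketchTimestamp prepare_time)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		if (strcmp(entries[i].gid, gid) == 0 &&
			entries[i].prepare_end_lsn == prepare_end_lsn &&
			entries[i].prepare_time == prepare_time)
			return true;
	}
	return false;
}
```

With this rule, a ROLLBACK PREPARED whose gid happens to collide with an unrelated local prepared transaction finds no match and can be skipped, which is the behaviour the protocol comment describes.
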
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
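
To make the tri-state concrete, here is a small sketch of how the three subtwophasestate values relate to what pg_dump emits. The #defines are copied from the hunk above. The two helper functions are hypothetical names invented for this illustration; dump_emits_two_phase() only mirrors the strcmp() test in the dumpSubscription() hunk earlier in this patch, where anything other than the disabled state is dumped as "two_phase = on".

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the pg_subscription.h hunk in this patch. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'

/* Hypothetical helper: readable label for a subtwophasestate value. */
static const char *
twophase_state_name(char state)
{
	switch (state)
	{
		case LOGICALREP_TWOPHASE_STATE_DISABLED:
			return "disabled";
		case LOGICALREP_TWOPHASE_STATE_PENDING:
			return "pending";
		case LOGICALREP_TWOPHASE_STATE_ENABLED:
			return "enabled";
		default:
			return "unknown";
	}
}

/*
 * Hypothetical helper mirroring the dumpSubscription() check: both the
 * pending and the enabled state are dumped as "two_phase = on".
 */
static bool
dump_emits_two_phase(char state)
{
	return state != LOGICALREP_TWOPHASE_STATE_DISABLED;
}
```

Note that a dump taken while the state is still 'p' comes out the same as one taken at 'e'. The pending state itself is not preserved in the dump.
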
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
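
As a reading aid for the logicalproto.h changes above, here is a rough sketch of how the new protocol version and the four new message bytes fit together. The constants and message bytes are copied from the hunks above. The three functions are hypothetical and exist only for this illustration; the patch's own version checks live elsewhere and may well be shaped differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Constants copied from the logicalproto.h hunk in this patch. */
#define LOGICALREP_PROTO_MIN_VERSION_NUM 1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
#define LOGICALREP_PROTO_MAX_VERSION_NUM 3

/* Hypothetical check: can a peer at this version stream large transactions? */
static bool
proto_allows_streaming(int version)
{
	return version >= LOGICALREP_PROTO_STREAM_VERSION_NUM &&
		version <= LOGICALREP_PROTO_MAX_VERSION_NUM;
}

/* Hypothetical check: can a peer at this version receive 2PC messages? */
static bool
proto_allows_two_phase(int version)
{
	return version >= LOGICALREP_PROTO_TWOPHASE_VERSION_NUM &&
		version <= LOGICALREP_PROTO_MAX_VERSION_NUM;
}

/*
 * Hypothetical helper: name the four message bytes added to
 * LogicalRepMsgType by this patch; returns NULL for any other byte.
 */
static const char *
twophase_msg_name(char type)
{
	switch (type)
	{
		case 'b':
			return "BEGIN_PREPARE";
		case 'P':
			return "PREPARE";
		case 'K':
			return "COMMIT_PREPARED";
		case 'r':
			return "ROLLBACK_PREPARED";
		default:
			return NULL;
	}
}
```

A version 2 peer would therefore accept streaming but none of the four new message types.
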
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 53cdfa5..86628c7 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -311,7 +311,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -650,7 +654,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..91d9032
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,291 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 19;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..2bea214
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,232 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..2cfc1ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,9 +1388,11 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1
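
Reviewer note (not part of the patch): the "Two phase commit" column in the expected \dRs+ output above shows single-letter state codes. The regression output shows 'd' when two_phase is not requested and 'p' right after CREATE SUBSCRIPTION ... WITH (two_phase = true), and the TAP tests poll until subtwophasestate becomes 'e'. A minimal sketch of how a reader might label those codes follows. The function name and the label strings are hypothetical and only illustrate the reading of the test output; they are not taken from the patch.

```c
#include <stddef.h>

/*
 * Hypothetical helper: map a subtwophasestate code, as seen in the test
 * output above, to a human-readable label. Returns NULL for any code that
 * does not appear in those tests.
 */
const char *
twophase_state_label(char state)
{
    switch (state)
    {
        case 'd':
            return "disabled";  /* two_phase was not requested */
        case 'p':
            return "pending";   /* requested, not yet enabled */
        case 'e':
            return "enabled";   /* the state the TAP tests wait for */
        default:
            return NULL;
    }
}
```

Under this reading, the poll_query_until() calls in 021_twophase.pl and 022_twophase_cascade.pl wait for every subscription to move from 'p' to 'e' before preparing any transaction.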

v78-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 99628b7d592b16e8fe4fc94239a3d6b50e12b179 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 24 May 2021 22:46:46 -0400
Subject: [PATCH v78] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 ++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/021_twophase.pl            |   8 +
 src/test/subscription/t/022_twophase_cascade.pl    |  15 +
 src/test/subscription/t/023_twophase_stream.pl     | 459 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 285 +++++++++++++
 13 files changed, 1058 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
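
Reviewer note (not part of the patch): the protocol.sgml hunk in this patch documents the Stream Prepare message as Byte1('p'), an Int8 flags field that must be 0, three Int64 values (prepare LSN, end LSN, prepare timestamp), an Int32 xid, and a GID string. The sketch below shows one way a client could decode that layout. It assumes integers are sent in network byte order and the string is NUL-terminated, as elsewhere in the replication protocol. The struct, the function names, and the 200-byte GID buffer are hypothetical choices for illustration, not the code in proto.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the fields of one Stream Prepare message. */
typedef struct StreamPrepareMsg
{
    uint8_t  flags;         /* currently unused, must be 0 */
    uint64_t prepare_lsn;   /* LSN of the prepare */
    uint64_t end_lsn;       /* end LSN of the prepared transaction */
    int64_t  prepare_time;  /* microseconds since 2000-01-01 */
    uint32_t xid;           /* transaction id */
    char     gid[200];      /* assumed upper bound on GID length */
} StreamPrepareMsg;

/* Read a big-endian (network order) 64-bit integer. */
static uint64_t
read_be64(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Read a big-endian (network order) 32-bit integer. */
static uint32_t
read_be32(const unsigned char *p)
{
    uint32_t v = 0;

    for (int i = 0; i < 4; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Returns 0 on success, -1 if the buffer is not a valid Stream Prepare. */
int
parse_stream_prepare(const unsigned char *buf, size_t len,
                     StreamPrepareMsg *out)
{
    /* type + flags + 3 * Int64 + Int32 */
    const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;
    const unsigned char *gid;
    const unsigned char *nul;
    size_t gidlen;

    /* Need the fixed part plus at least the GID terminator. */
    if (len < fixed + 1 || buf[0] != 'p' || buf[1] != 0)
        return -1;

    out->flags = buf[1];
    out->prepare_lsn = read_be64(buf + 2);
    out->end_lsn = read_be64(buf + 10);
    out->prepare_time = (int64_t) read_be64(buf + 18);
    out->xid = read_be32(buf + 26);

    /* The GID must be terminated inside the buffer and fit in out->gid. */
    gid = buf + fixed;
    nul = memchr(gid, '\0', len - fixed);
    if (nul == NULL)
        return -1;
    gidlen = (size_t) (nul - gid);
    if (gidlen >= sizeof(out->gid))
        return -1;
    memcpy(out->gid, gid, gidlen + 1);
    return 0;
}
```

The sketch rejects a non-zero flags byte because the documentation says the field must be 0. A more forward-compatible reader might choose to ignore unknown flag bits instead.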

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 6683929..19cd6a2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7386,7 +7386,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7649,6 +7649,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8c0e6a8..e5e2723 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -925,12 +910,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ef8c38f..174d43f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1044,6 +1044,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See the comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1241,30 +1321,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1272,7 +1343,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1287,7 +1358,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1362,6 +1433,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2298,6 +2394,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 91d9032..90430f4 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -79,6 +79,14 @@ $node_publisher->wait_for_catchup($appname);
 my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, qq(1), 'transaction is prepared on subscriber');
 
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
 # check that 2PC gets committed on subscriber
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
 
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 2bea214..ac27384 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -101,6 +101,21 @@ is($result, qq(1), 'transaction is prepared on subscriber B');
 $result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, qq(1), 'transaction is prepared on subscriber C');
 
+# Wait for the statistics to be updated
+$node_A->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_b'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+$node_B->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_c'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
 # 2PC COMMIT
 $node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
 
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..b2d52cd
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,459 @@
+# Test logical replication of 2PC with streaming
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..9d5c6f5
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,285 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# Wait for the statistics to be updated
+$node_A->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_b'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+$node_B->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_c'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#330tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
In reply to: Peter Smith (#328)
2 attachment(s)
RE: [HACKERS] logical decoding of two-phase transactions

13.
@@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt,
bool isTopLevel)
{
Assert(slotname);

- walrcv_create_slot(wrconn, slotname, false,
+ /*
+ * Even if two_phase is set, don't create the slot with
+ * two-phase enabled. Will enable it once all the tables are
+ * synced and ready. This avoids race-conditions like prepared
+ * transactions being skipped due to changes not being applied
+ * due to checks in should_apply_changes_for_rel() when
+ * tablesync for the corresponding tables are in progress. See
+ * comments atop worker.c.
+ */
+ walrcv_create_slot(wrconn, slotname, false, false,

Can't we enable two_phase if copy_data is false? Because in that case,
all relations will be in a READY state. If we do that then we should
also set two_phase state as 'enabled' during createsubscription. I
think we need to be careful to check that connect option is given and
copy_data is false before setting such a state. Now, I guess we may
not be able to optimize this to not set 'enabled' state when the
subscription has no rels.
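
The rule proposed in the comment above can be sketched as a small decision function. This is only an illustration under stated assumptions: `initial_twophase_state` and its parameters are hypothetical names, not code from the patch. The state letters follow the `subtwophasestate` codes (d, p, e) that the patch documents.

```c
#include <stdbool.h>

/* State codes as documented for pg_subscription.subtwophasestate. */
#define TWOPHASE_DISABLED 'd'
#define TWOPHASE_PENDING  'p'
#define TWOPHASE_ENABLED  'e'

/*
 * Hypothetical helper: pick the two_phase state to store when the
 * subscription is created.
 *
 * two_phase  - the two_phase subscription option
 * connect    - the connect subscription option
 * copy_data  - the copy_data subscription option
 */
static char
initial_twophase_state(bool two_phase, bool connect, bool copy_data)
{
    /* two_phase was never requested */
    if (!two_phase)
        return TWOPHASE_DISABLED;

    /*
     * With connect given and copy_data = false there is no tablesync
     * phase, so the relations are already READY and two_phase can be
     * enabled immediately.
     */
    if (connect && !copy_data)
        return TWOPHASE_ENABLED;

    /* Otherwise wait until all tables are synced. */
    return TWOPHASE_PENDING;
}
```

With these assumptions, `two_phase = on, copy_data = false` (and a connection) yields 'e', while every other `two_phase = on` combination starts as 'p'.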

Fixed in v77-0001

I noticed this modification in v77-0001 and executed "CREATE SUBSCRIPTION ... WITH (two_phase = on, copy_data = false)", but it crashed.
-------------
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub WITH(two_phase = on, copy_data = false);
WARNING: relcache reference leak: relation "pg_subscription" not closed
WARNING: snapshot 0x34278d0 still active
NOTICE: created replication slot "sub" on publisher
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>
-------------

There are two warnings and a segmentation fault in subscriber log:
-------------
2021-05-24 15:08:32.435 CST [2848572] WARNING: relcache reference leak: relation "pg_subscription" not closed
2021-05-24 15:08:32.435 CST [2848572] WARNING: snapshot 0x32ce8b0 still active
2021-05-24 15:08:33.012 CST [2848555] LOG: server process (PID 2848572) was terminated by signal 11: Segmentation fault
2021-05-24 15:08:33.012 CST [2848555] DETAIL: Failed process was running: CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub WITH(two_phase = on, copy_data = false);
-------------

The backtrace for the segmentation fault is attached. It happened in the table_close function because "CurrentResourceOwner" was NULL.

I think it is related to the first warning, which reported "relcache reference leak". The backtrace for that warning is attached, too. When updating the two-phase state in the CreateSubscription function, the CommitTransaction function released the resource owner and set "CurrentResourceOwner" to NULL.

The second warning, "snapshot still active", was also raised in the CommitTransaction function. It called the AtEOXact_Snapshot function, which checked for leftover snapshots and reported the warning.
I debugged and found that the snapshot was pushed in the PortalRunUtility function by calling PushActiveSnapshot; the address of "ActiveSnapshot" at that time was the same as the address in the warning.

In summary, when creating a subscription with two_phase = on and copy_data = false, CreateSubscription calls the UpdateTwoPhaseState function to set the two_phase state to 'enabled', and the relcache reference and snapshot are checked and released too early, which causes the failure. I think some change should be made to avoid it. Thoughts?

FYI
I also tested the newly released V78* patches at [1]; the above problem still exists.
[1]: /messages/by-id/CAFPTHDab56twVmC+0a=RNcRw4KuyFdqzW0JAcvJdS63n_fRnOQ@mail.gmail.com

Regards
Tang

Attachments:

backtrace_segmentation_fault.txt (text/plain)
backtrace_first_warning.txt (text/plain)
#331vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#329)

On Tue, May 25, 2021 at 8:54 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Fri, May 21, 2021 at 6:43 PM Peter Smith <smithpb2250@gmail.com> wrote:

Fixed in v77-0001, v77-0002

Attaching a new patch-set that rebases the patch, addresses review
comments from Peter as well as a test failure reported by Tang. I've
also added some new test cases, authored by Tang, to patch-2.

Thanks for the updated patch. A few comments:
1) Should "The end LSN of the prepare." be changed to "end LSN of the
prepare transaction."?

--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+<varlistentry>
+
+<term>Int64</term>
+<listitem><para>
2) Should the ";" be "," here?
+++ b/doc/src/sgml/catalogs.sgml
@@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration
count&gt;</replaceable>:<replaceable>&l
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is
pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
3) Should end_lsn be commit_end_lsn?
+       prepare_data->commit_end_lsn = pq_getmsgint64(in);
+       if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
                elog(ERROR, "end_lsn is not set in commit prepared message");
+       prepare_data->prepare_time = pq_getmsgint64(in);

4) This change is not required

diff --git a/src/include/replication/pgoutput.h
b/src/include/replication/pgoutput.h
index 0dc460f..93c6731 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -29,5 +29,4 @@ typedef struct PGOutputData
        bool            messages;
        bool            two_phase;
 } PGOutputData;
-
 #endif                                                 /* PGOUTPUT_H */

5) Will the worker receive commit prepared / rollback prepared, given
that we have logic to skip commit prepared / rollback prepared in
pgoutput_rollback_prepared_txn and pgoutput_commit_prepared_txn:

+        * It is possible that we haven't received the prepare because
+        * the transaction did not have any changes relevant to this
+        * subscription and was essentially an empty prepare. In which case,
+        * the walsender is optimized to drop the empty transaction and the
+        * accompanying prepare. Silently ignore if we don't find the prepared
+        * transaction.
         */
-       replorigin_session_origin_lsn = prepare_data.end_lsn;
-       replorigin_session_origin_timestamp = prepare_data.commit_time;
+       if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+                                       prepare_data.prepare_time))
+       {

6) I'm not sure whether we can add tests for skipping empty prepared
transactions; if possible, add a few tests.

7) We could add some debug-level log messages for the transactions
that will be skipped.
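
To make points 5 and 7 concrete, here is a minimal, self-contained sketch of the subscriber-side behaviour being discussed: if a COMMIT PREPARED arrives for a GID that was never prepared locally (because the walsender dropped an empty prepare), it is skipped and a debug message is emitted. Everything here is an assumption for illustration: `gid_is_prepared`, `handle_commit_prepared` and the fixed GID table are hypothetical, and the real patch looks the transaction up with LookupGXact() using the prepare end LSN and prepare time as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for the transactions currently prepared on the subscriber. */
static const char *prepared_gids[] = {"test_prepared_tab", "outer"};

/* Hypothetical lookup by GID only. */
static bool
gid_is_prepared(const char *gid)
{
    size_t n = sizeof(prepared_gids) / sizeof(prepared_gids[0]);

    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(prepared_gids[i], gid) == 0)
            return true;
    }
    return false;
}

/*
 * Hypothetical handler for an incoming COMMIT PREPARED message.
 * Returns true if the commit would be applied, false if it is skipped.
 */
static bool
handle_commit_prepared(const char *gid)
{
    assert(gid != NULL);

    if (!gid_is_prepared(gid))
    {
        /* Point 7: leave a trace for transactions that are skipped. */
        fprintf(stderr,
                "DEBUG: skipping COMMIT PREPARED for unknown gid \"%s\"\n",
                gid);
        return false;
    }

    /* A real apply worker would finish the prepared transaction here. */
    return true;
}
```

In this sketch a call with "outer" returns true, while a call with a GID that was never prepared returns false and only logs the skip.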

Regards,
Vignesh

#332Ajin Cherian
itsajin@gmail.com
In reply to: tanghy.fnst@fujitsu.com (#330)
3 attachment(s)

On Tue, May 25, 2021 at 4:41 PM tanghy.fnst@fujitsu.com
<tanghy.fnst@fujitsu.com> wrote:

Fixed in v77-0001

I noticed this modification in v77-0001 and executed "CREATE SUBSCRIPTION ... WITH (two_phase = on, copy_data = false)", but it crashed.
-------------
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub WITH(two_phase = on, copy_data = false);
WARNING: relcache reference leak: relation "pg_subscription" not closed
WARNING: snapshot 0x34278d0 still active
NOTICE: created replication slot "sub" on publisher
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>
-------------

There are two warnings and a segmentation fault in subscriber log:
-------------
2021-05-24 15:08:32.435 CST [2848572] WARNING: relcache reference leak: relation "pg_subscription" not closed
2021-05-24 15:08:32.435 CST [2848572] WARNING: snapshot 0x32ce8b0 still active
2021-05-24 15:08:33.012 CST [2848555] LOG: server process (PID 2848572) was terminated by signal 11: Segmentation fault
2021-05-24 15:08:33.012 CST [2848555] DETAIL: Failed process was running: CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub WITH(two_phase = on, copy_data = false);
-------------

Hi Tang,
I've attached a patch that fixes this issue. Do test and confirm.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v79-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From ba0173740af1b32bd2f05ad51ecae3ff9b2d16b5 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 24 May 2021 22:46:46 -0400
Subject: [PATCH v79] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 ++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/tablesync.c        |   5 -
 src/backend/replication/logical/worker.c           | 134 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/021_twophase.pl            |   8 +
 src/test/subscription/t/022_twophase_cascade.pl    |  15 +
 src/test/subscription/t/023_twophase_stream.pl     | 459 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 285 +++++++++++++
 14 files changed, 1060 insertions(+), 84 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 6683929..19cd6a2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains Stream Prepare or Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7386,7 +7386,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7649,6 +7649,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8c0e6a8..e5e2723 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -925,12 +910,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 4a9275d..e3cbe32 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1256,9 +1256,6 @@ UpdateTwoPhaseState(Oid suboid, char new_state)
 		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
 		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
 
-	if (!IsTransactionState())
-		StartTransactionCommand();
-
 	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
 	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
 	if (!HeapTupleIsValid(tup))
@@ -1281,6 +1278,4 @@ UpdateTwoPhaseState(Oid suboid, char new_state)
 
 	heap_freetuple(tup);
 	table_close(rel, RowExclusiveLock);
-
-	CommitTransactionCommand();
 }
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ef8c38f..40b00c9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1044,6 +1044,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from same node point to publications on
+	 * the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1241,30 +1321,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1272,7 +1343,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1287,7 +1358,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1362,6 +1433,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2298,6 +2394,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -3484,8 +3584,10 @@ ApplyWorkerMain(Datum main_arg)
 			options.proto.logical.twophase = true;
 			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
 
+			StartTransactionCommand();
 			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
 			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
 		}
 		else
 		{
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 91d9032..90430f4 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -79,6 +79,14 @@ $node_publisher->wait_for_catchup($appname);
 my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, qq(1), 'transaction is prepared on subscriber');
 
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
 # check that 2PC gets committed on subscriber
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
 
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 2bea214..ac27384 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -101,6 +101,21 @@ is($result, qq(1), 'transaction is prepared on subscriber B');
 $result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, qq(1), 'transaction is prepared on subscriber C');
 
+# Wait for the statistics to be updated
+$node_A->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_b'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+$node_B->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_c'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
 # 2PC COMMIT
 $node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
 
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..b2d52cd
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,459 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After the server is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..9d5c6f5
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,285 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# Wait for the statistics to be updated
+$node_A->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_b'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+$node_B->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_c'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v79-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From 2f3b6b2dc1d30048099927d5862483fb92a83e4c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 26 May 2021 10:06:19 -0400
Subject: [PATCH v79] Skip empty transactions for logical replication.

The current logical replication behavior is to send every transaction to
subscriber even though the transaction is empty (because it does not
contain changes from the selected publications). It is a waste of CPU
cycles and network bandwidth to build/transmit these empty transactions.

This patch addresses the above problem by postponing the BEGIN / BEGIN
PREPARE message until the first change is sent. While processing a COMMIT
or PREPARE message, if no change was sent for that transaction, the COMMIT
or PREPARE message is not sent either. As a result, pgoutput skips the
BEGIN / COMMIT or BEGIN PREPARE / PREPARE messages for empty transactions.
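
The deferral described above can be sketched in isolation as follows. This is
only an illustrative model, not the pgoutput code from the patch: the names
TxnState, Stream, emit, on_begin, on_change and on_commit are all hypothetical,
and the "published" flag stands in for whatever check decides that a change
belongs to a selected publication.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-transaction state (not the patch's actual struct). */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been emitted for this txn? */
} TxnState;

/* Hypothetical output stream: messages are appended, space-separated. */
typedef struct Stream
{
	char		buf[256];
} Stream;

static void
emit(Stream *out, const char *msg)
{
	size_t		used;

	assert(out != NULL && msg != NULL);
	used = strlen(out->buf);
	snprintf(out->buf + used, sizeof(out->buf) - used, "%s%s",
			 used > 0 ? " " : "", msg);
}

/* BEGIN callback: send nothing yet, only remember that BEGIN is pending. */
void
on_begin(TxnState *txn)
{
	txn->sent_begin = false;
}

/* Change callback: the first published change triggers the deferred BEGIN. */
void
on_change(TxnState *txn, Stream *out, bool published)
{
	if (!published)
		return;					/* filtered out, transaction may stay empty */

	if (!txn->sent_begin)
	{
		emit(out, "BEGIN");
		txn->sent_begin = true;
	}
	emit(out, "CHANGE");
}

/* COMMIT callback: skip the message if BEGIN was never sent. */
void
on_commit(TxnState *txn, Stream *out)
{
	if (!txn->sent_begin)
		return;					/* empty transaction: nothing to send */

	emit(out, "COMMIT");
}
```

In this model a transaction whose changes are all filtered out produces no
output at all, while a transaction with at least one published change
produces BEGIN, its changes, and COMMIT in the usual order. The same flag
would cover the BEGIN PREPARE / PREPARE pair.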

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  14 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  36 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 141 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/pgoutput.h              |   1 -
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 225 insertions(+), 34 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 493432d..aceff9a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -862,11 +862,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command> in which case
+      it can commit the transaction, otherwise, it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 19cd6a2..ae2cd11 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+<varlistentry>
+
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit.
 </para></listitem>
 </varlistentry>
@@ -7552,6 +7559,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c387997..ed60719 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -940,7 +941,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -975,7 +977,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..f1d8bf7 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b50aa24..bacf849 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2766,7 +2766,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 40b00c9..f7db5ef 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -971,26 +971,38 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* there is no transaction when COMMIT PREPARED is called */
-	ensure_transaction();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In that case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	CommitTransactionCommand();
+		/* there is no transaction when COMMIT PREPARED is called */
+		ensure_transaction();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7c3a33d..84e9cfe 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -135,6 +137,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -404,10 +411,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -422,8 +451,18 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -435,10 +474,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -453,8 +510,15 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -465,12 +529,28 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -483,8 +563,21 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -613,11 +706,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -651,6 +749,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -750,6 +857,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -757,6 +865,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -793,6 +905,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -813,6 +934,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -825,6 +947,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 9b3e934..a6d9977 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 0dc460f..93c6731 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -29,5 +29,4 @@ typedef struct PGOutputData
 	bool		messages;
 	bool		two_phase;
 } PGOutputData;
-
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 86628c7..0ff430d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -456,7 +456,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2cfc1ae..f0941ad 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1596,6 +1596,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
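
The empty-transaction handling in the pgoutput changes above (postpone BEGIN until the first publishable change, then skip COMMIT if BEGIN was never sent) can be tried outside PostgreSQL. What follows is only a minimal sketch under stated assumptions: TxnState stands in for PGOutputTxnData, the single-letter messages B/I/C and the in-memory "stream" buffer stand in for the real protocol output, and run_txn() with its 'p'/'n' pattern string is a hypothetical driver. None of these names come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-transaction flag kept by the plugin. */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been emitted yet? */
} TxnState;

/* Hypothetical in-memory replacement for the replication stream. */
static char stream[64];

/* Append a message to the stream, truncating if the buffer is full. */
static void
emit(const char *msg)
{
	strncat(stream, msg, sizeof(stream) - strlen(stream) - 1);
}

/* BEGIN callback: remember that nothing has been sent, emit nothing. */
static void
txn_begin(TxnState *txn)
{
	txn->sent_begin = false;
}

/* Change callback: emit the deferred BEGIN before the first sent change. */
static void
txn_change(TxnState *txn, bool publishable)
{
	assert(txn != NULL);

	if (!publishable)
		return;					/* filtered out, so still nothing to send */

	if (!txn->sent_begin)
	{
		emit("B");
		txn->sent_begin = true;
	}
	emit("I");
}

/* COMMIT callback: skip COMMIT when BEGIN was never sent. */
static void
txn_commit(TxnState *txn)
{
	if (!txn->sent_begin)
		return;
	emit("C");
}

/*
 * Drive one transaction.  Each character of "changes" is one change:
 * 'p' is a change on a published table, anything else is unpublished.
 * Returns the messages that reached the stream.
 */
static const char *
run_txn(const char *changes)
{
	TxnState	txn;

	stream[0] = '\0';
	txn_begin(&txn);
	for (const char *c = changes; *c != '\0'; c++)
		txn_change(&txn, *c == 'p');
	txn_commit(&txn);
	return stream;
}
```

Calling run_txn("nn") leaves the stream empty, which mirrors the updated expectation in 020_messages.pl, while run_txn("npp") produces one B, two I and one C. The flag lives with the transaction rather than with the plugin as a whole because each transaction is decoded separately and needs its own answer to whether BEGIN went out.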

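The proto.c hunk above also changes the field order of the Commit Prepared message: after the flags byte it now carries prepare_end_lsn, commit_lsn, the commit end LSN, prepare_time, commit_time and the xid. A rough round-trip sketch of just that fixed-size part follows. It assumes big-endian integers as sent by pq_sendint64, and it leaves out anything the real message carries before the flags byte or after the xid (the struct in the patch also has a gid field). CommitPreparedFields, encode_commit_prepared() and decode_commit_prepared() are hypothetical names, not PostgreSQL functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the fixed-size fields read in the patch. */
typedef struct CommitPreparedFields
{
	uint64_t	prepare_end_lsn;
	uint64_t	commit_lsn;
	uint64_t	commit_end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	int64_t		commit_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
} CommitPreparedFields;

/* flags byte + five 8-byte fields + 4-byte xid */
enum { COMMIT_PREPARED_FIXED_LEN = 1 + 5 * 8 + 4 };

/* Store the low nbytes of v in big-endian order; returns nbytes. */
static size_t
put_be(uint8_t *buf, uint64_t v, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return nbytes;
}

/* Read an nbytes-wide big-endian unsigned integer. */
static uint64_t
get_be(const uint8_t *buf, size_t nbytes)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Write the fields in the order used by the patched writer. */
static size_t
encode_commit_prepared(const CommitPreparedFields *f, uint8_t *buf)
{
	size_t		off = 0;

	buf[off++] = 0;				/* flags, currently always zero */
	off += put_be(buf + off, f->prepare_end_lsn, 8);
	off += put_be(buf + off, f->commit_lsn, 8);
	off += put_be(buf + off, f->commit_end_lsn, 8);
	off += put_be(buf + off, (uint64_t) f->prepare_time, 8);
	off += put_be(buf + off, (uint64_t) f->commit_time, 8);
	off += put_be(buf + off, f->xid, 4);
	return off;
}

/* Read the fields back; returns -1 where the patched reader would error out. */
static int
decode_commit_prepared(const uint8_t *buf, size_t len, CommitPreparedFields *f)
{
	size_t		off = 0;

	if (len < (size_t) COMMIT_PREPARED_FIXED_LEN || buf[off++] != 0)
		return -1;				/* short message or unrecognized flags */

	f->prepare_end_lsn = get_be(buf + off, 8);
	off += 8;
	f->commit_lsn = get_be(buf + off, 8);
	off += 8;
	f->commit_end_lsn = get_be(buf + off, 8);
	off += 8;
	f->prepare_time = (int64_t) get_be(buf + off, 8);
	off += 8;
	f->commit_time = (int64_t) get_be(buf + off, 8);
	off += 8;
	f->xid = (uint32_t) get_be(buf + off, 4);

	/* An LSN of zero is the invalid LSN, which the reader rejects. */
	if (f->prepare_end_lsn == 0 || f->commit_lsn == 0 || f->commit_end_lsn == 0)
		return -1;
	return 0;
}
```

Because the new fields are inserted between existing ones rather than appended, a reader built against the old layout would misinterpret every later field, so writer and reader have to change together as they do in this patch.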
Attachment: v79-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 4f2f2fedfbb2b3af392431191e9d084416c3ab28 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 24 May 2021 22:23:58 -0400
Subject: [PATCH v79] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time to built-in
logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch, but it will allow external replication solutions
that use the streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase is enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase is enabled.

* Prepare API for in-progress transactions.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 307 ++++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 150 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 201 ++++++++++--
 src/backend/replication/logical/worker.c           | 341 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 291 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 232 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 44 files changed, 2367 insertions(+), 206 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 8aebc4d..8a3d350 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a7ec5c3..493432d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this, users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..6683929 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7365,6 +7386,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, prepared transactions are decoded and
+          sent to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..776295c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
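As a reading aid for LookupGXact above, here is a minimal standalone sketch of the matching rule it applies: a GID match alone is not enough, the origin LSN and origin timestamp must also agree. The struct and function names (SketchPreparedXact, sketch_lookup_gxact) are hypothetical and model only the comparison; the locking and the reading of the 2PC file or WAL record are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the fields LookupGXact compares. */
typedef struct SketchPreparedXact
{
	bool		valid;				/* false while the entry is still being set up */
	const char *gid;				/* user-supplied global identifier */
	uint64_t	origin_lsn;			/* end LSN of the remote PREPARE */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
} SketchPreparedXact;

/*
 * Return true only if a valid entry matches on GID, origin LSN and origin
 * timestamp together. Entries with the same GID but a different origin are
 * skipped, which is the point of comparing all three values.
 */
static bool
sketch_lookup_gxact(const SketchPreparedXact *xacts, int nxacts,
					const char *gid, uint64_t prepare_end_lsn,
					int64_t origin_prepare_timestamp)
{
	assert(gid != NULL);

	for (int i = 0; i < nxacts; i++)
	{
		const SketchPreparedXact *x = &xacts[i];

		/* Ignore not-yet-valid entries and entries with another GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```

With this rule, a locally prepared transaction that happens to reuse a GID from the publisher is not mistaken for the replicated one, because its origin fields will differ.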
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5c84d75..9b941e9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1254,5 +1254,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..8c0e6a8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica. See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -505,10 +556,35 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables are in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Special case: when tables were specified but copy_data is
+				 * false, it is safe to enable two_phase up-front because
+				 * those tables are initially already in READY state. Note, if
+				 * the subscription has no tables then enablement cannot be
+				 * done here - we must leave the twophase state as PENDING, to
+				 * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -814,7 +890,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +925,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +954,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +1000,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1017,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1059,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1077,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1117,33 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..ccde3bc 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The code below is a temporary hack to check for
+		 * server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +847,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +861,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
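To make the resulting command text concrete, here is a rough standalone sketch of how the optional keywords are ordered, following the CREATE_REPLICATION_SLOT synopsis in the protocol.sgml hunk (slot name, then TEMPORARY, then TWO_PHASE, then the LOGICAL clause). The helper name sketch_build_create_slot_cmd and the use of snprintf instead of StringInfo are assumptions made for illustration, and the snapshot-action suffix is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a logical-slot creation command into dst.
 * Returns false if the command did not fit into the buffer.
 */
static bool
sketch_build_create_slot_cmd(char *dst, size_t dstlen, const char *slotname,
							 bool temporary, bool two_phase)
{
	int			n;

	assert(dst != NULL && slotname != NULL && strlen(slotname) > 0);

	/* Keyword order mirrors the documented synopsis. */
	n = snprintf(dst, dstlen,
				 "CREATE_REPLICATION_SLOT \"%s\"%s%s LOGICAL pgoutput",
				 slotname,
				 temporary ? " TEMPORARY" : "",
				 two_phase ? " TWO_PHASE" : "");

	return n >= 0 && (size_t) n < dstlen;
}
```

For a slot named sub1 with two_phase set and temporary unset, this sketch produces: CREATE_REPLICATION_SLOT "sub1" TWO_PHASE LOGICAL pgoutput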
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..58b4e2c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing to
+				 * prepare transactions that have locked [user] catalog tables
+				 * exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c387997 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
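For reviewers who want the replorigin_advance guard spelled out, the following is a small sketch of the same comparison in isolation. The name advance_remote_lsn and the MockLsn type are made up for illustration; the real function also handles locking and WAL logging, which are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t MockLsn;		/* assumption: stand-in for XLogRecPtr */

/*
 * Return the remote LSN to remember.  An older value is ignored unless the
 * caller explicitly asks to move backwards.
 */
MockLsn
advance_remote_lsn(MockLsn current, MockLsn remote_commit, bool go_backward)
{
	if (go_backward || current < remote_commit)
		return remote_commit;

	return current;
}
```

With this shape, a prepare that is sent late (together with its commit prepared) carries an older LSN and simply does not move the remembered position.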
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
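The field order used by logicalrep_write_begin_prepare and logicalrep_read_begin_prepare above can be checked with a standalone round trip. This is only a sketch under assumptions: MockBeginPrepare, MOCK_GIDSIZE, put_be, get_be and the mock_* functions are hypothetical, the one-byte message type is omitted, and the integers are written big-endian the way pq_sendint64 and pq_sendint32 send them. The reader uses a bounded copy for the gid, which is a deliberate variation on the strcpy in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_GIDSIZE 200		/* assumption: stand-in for GIDSIZE */

typedef struct MockBeginPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[MOCK_GIDSIZE];
} MockBeginPrepare;

/* Append an nbytes-wide integer in network (big-endian) byte order. */
static size_t
put_be(unsigned char *buf, size_t off, uint64_t v, int nbytes)
{
	for (int i = nbytes - 1; i >= 0; i--)
		buf[off++] = (unsigned char) (v >> (8 * i));
	return off;
}

/* Read an nbytes-wide big-endian integer and advance the offset. */
static uint64_t
get_be(const unsigned char *buf, size_t *off, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[(*off)++];
	return v;
}

/* Write the message body; buf must hold at least 28 + MOCK_GIDSIZE bytes. */
size_t
mock_write_begin_prepare(unsigned char *buf, const MockBeginPrepare *m)
{
	size_t		off = 0;
	size_t		len = strlen(m->gid) + 1;	/* include the terminating NUL */

	off = put_be(buf, off, m->prepare_lsn, 8);
	off = put_be(buf, off, m->end_lsn, 8);
	off = put_be(buf, off, (uint64_t) m->prepare_time, 8);
	off = put_be(buf, off, m->xid, 4);
	memcpy(buf + off, m->gid, len);
	return off + len;
}

/* Read the body back; returns 0 on success, -1 if an LSN is unset. */
int
mock_read_begin_prepare(const unsigned char *buf, MockBeginPrepare *m)
{
	size_t		off = 0;

	m->prepare_lsn = get_be(buf, &off, 8);
	m->end_lsn = get_be(buf, &off, 8);
	if (m->prepare_lsn == 0 || m->end_lsn == 0)
		return -1;				/* the patch raises elog(ERROR) here */
	m->prepare_time = (int64_t) get_be(buf, &off, 8);
	m->xid = (uint32_t) get_be(buf, &off, 4);

	/* Bounded copy so an oversized gid cannot overrun the buffer. */
	strncpy(m->gid, (const char *) buf + off, MOCK_GIDSIZE - 1);
	m->gid[MOCK_GIDSIZE - 1] = '\0';
	return 0;
}
```

The PREPARE, COMMIT PREPARED and ROLLBACK PREPARED messages in the patch follow the same pattern, with an extra flags byte up front and, for rollback, two timestamps instead of one.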
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b0ab91c..b50aa24 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2548,7 +2548,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2639,7 +2639,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2686,7 +2686,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2706,7 +2706,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2725,19 +2725,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2755,12 +2756,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
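The condition in ReorderBufferFinishPrepared above decides whether the prepare still has to be replayed at commit prepared time. A reduced sketch of that decision follows; MockFinishAction and mock_finish_prepared_action are hypothetical names, and everything except the LSN comparison is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t MockLsn;		/* assumption: stand-in for XLogRecPtr */

typedef enum MockFinishAction
{
	MOCK_SEND_FINISH_ONLY,		/* only COMMIT/ROLLBACK PREPARED is sent */
	MOCK_REPLAY_PREPARE_THEN_FINISH /* prepare is replayed first */
} MockFinishAction;

/*
 * A prepare that ended up before two_phase_at was never sent downstream, so
 * a commit has to replay it first.  An abort does not need the contents.
 */
MockFinishAction
mock_finish_prepared_action(MockLsn prepare_lsn, MockLsn two_phase_at,
							bool is_commit)
{
	if (prepare_lsn < two_phase_at && is_commit)
		return MOCK_REPLAY_PREPARE_THEN_FINISH;

	return MOCK_SEND_FINISH_ONLY;
}
```

This is also why the rename from initial_consistent_point to two_phase_at matters: the boundary is now either the consistent point at slot creation or the LSN at which two_phase was enabled later.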
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * Prepared transactions that were skipped, either because two-phase was
+	 * not yet enabled or because they are not covered by the initial
+	 * snapshot, need to be sent later along with commit prepared, and they
+	 * must be before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..4a9275d 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, return false. Otherwise, when no tablesyncs
+	 * are busy, all are READY.
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
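The interaction between AllTablesyncsReady and the PENDING check in process_syncing_tables_for_apply can be summarised in a few lines. The sketch below is an illustration only: MockTwoPhaseState and both mock_* functions are hypothetical, the enum values are not the catalog's LOGICALREP_TWOPHASE_STATE_* codes, and the table counts replace the catalog lookups done by FetchTableStates.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: plain enum in place of the real tri-state catalog values. */
typedef enum MockTwoPhaseState
{
	MOCK_TWOPHASE_DISABLED,
	MOCK_TWOPHASE_PENDING,
	MOCK_TWOPHASE_ENABLED
} MockTwoPhaseState;

/* False when the subscription has no tables; else true iff none are busy. */
bool
mock_all_tablesyncs_ready(int total_tables, int not_ready_tables)
{
	bool		has_subrels = total_tables > 0;
	bool		found_busy = not_ready_tables > 0;

	return has_subrels && !found_busy;
}

/* The apply worker exits (to be restarted) only while still PENDING. */
bool
mock_apply_worker_should_restart(MockTwoPhaseState state,
								 int total_tables, int not_ready_tables)
{
	return state == MOCK_TWOPHASE_PENDING &&
		mock_all_tablesyncs_ready(total_tables, not_ready_tables);
}
```

Note the empty-subscription case: with zero tables the function returns false, so the state stays PENDING and no restart is triggered.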
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6ba447e..ef8c38f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider the case where it was
+ * initially on and we have received some prepare, and then we turn it off: at
+ * commit time the server will send the entire transaction data along with the
+ * commit. With some more analysis, we could allow changing this option from
+ * off to on, but it is not clear that this alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and the prepared transaction has operations on tables subscribed to
+ * those subscriptions. For such cases, if we use the GID sent by the publisher,
+ * one of the prepares will be successful and the others will fail, in which
+ * case the server will send them again. Now, this can lead to a deadlock if
+ * the user has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting of
+ * the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -794,6 +870,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from the same node pointing to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2032,6 +2282,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2509,6 +2775,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2995,6 +3264,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3161,15 +3444,67 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * in the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then can it become ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
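For anyone reading along, here is a minimal standalone sketch (not part of the patch) of two pieces of the apply-worker logic above: how the subscriber-side GID is formed, and how the tri-state value moves from PENDING to ENABLED at worker start-up. All `demo_` names and the plain `unsigned int` stand-ins for Oid and TransactionId are assumptions made for illustration; the real code uses the PostgreSQL types and also updates the catalog via UpdateTwoPhaseState().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for PostgreSQL's Oid and TransactionId. */
typedef unsigned int demo_oid;
typedef unsigned int demo_xid;

/* Tri-state values, copied from the pg_subscription.h hunk of the patch. */
#define DEMO_TWOPHASE_DISABLED 'd'
#define DEMO_TWOPHASE_PENDING  'p'
#define DEMO_TWOPHASE_ENABLED  'e'

/* Build the GID with the same format string as TwoPhaseTransactionGid(). */
static void
demo_twophase_gid(demo_oid subid, demo_xid xid, char *gid, size_t szgid)
{
	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}

/*
 * State after apply-worker start-up: PENDING is promoted to ENABLED only
 * when every tablesync has reached READY; any other state is left alone.
 */
static char
demo_next_twophase_state(char current, bool all_tablesyncs_ready)
{
	if (current == DEMO_TWOPHASE_PENDING && all_tablesyncs_ready)
		return DEMO_TWOPHASE_ENABLED;
	return current;
}
```

Because the GID is derived only from the subscription OID and the remote xid, the COMMIT PREPARED and ROLLBACK PREPARED handlers can recompute it without the publisher resending it.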
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option was passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
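The option checks that the pgoutput hunks above spread across parse_output_parameters() and pgoutput_startup() can be summarised in one place. The sketch below is only an illustration under that reading of the patch: the enum, the function name, and the idea of returning a result code instead of calling ereport() are all hypothetical, and the constant 3 is taken from the logicalproto.h hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Value of LOGICALREP_PROTO_TWOPHASE_VERSION_NUM in this patch. */
#define DEMO_PROTO_TWOPHASE_VERSION 3

/* Hypothetical result codes standing in for the ereport(ERROR, ...) calls. */
typedef enum
{
	DEMO_TWOPHASE_NOT_REQUESTED,
	DEMO_TWOPHASE_ACCEPTED,
	DEMO_ERR_EXCLUSIVE_WITH_STREAMING,
	DEMO_ERR_PROTO_TOO_OLD,
	DEMO_ERR_PLUGIN_UNSUPPORTED
} demo_twophase_check;

/*
 * Mirror the order of the checks in the patch: first the mutual exclusion
 * with streaming (done while parsing), then the protocol version, then
 * whether the decoding context supports two-phase at all.
 */
static demo_twophase_check
demo_check_two_phase(bool two_phase, bool streaming,
					 int proto_version, bool ctx_twophase)
{
	if (two_phase && streaming)
		return DEMO_ERR_EXCLUSIVE_WITH_STREAMING;
	if (!two_phase)
		return DEMO_TWOPHASE_NOT_REQUESTED;
	if (proto_version < DEMO_PROTO_TWOPHASE_VERSION)
		return DEMO_ERR_PROTO_TOO_OLD;
	if (!ctx_twophase)
		return DEMO_ERR_PLUGIN_UNSUPPORTED;
	return DEMO_TWOPHASE_ACCEPTED;
}
```

Only the DEMO_TWOPHASE_ACCEPTED outcome corresponds to ctx->twophase_opt_given being set to true in the patch.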
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 339c393..cdfd063 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4359,6 +4360,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4402,9 +4404,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The below code is a temporary hack to check for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4425,6 +4443,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4450,6 +4469,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4477,6 +4498,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4518,6 +4540,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
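As a small illustration of the pg_dump change above (a sketch, not code from the patch): dumpSubscription() emits ", two_phase = on" for any state other than DISABLED, so a subscription that is still PENDING is dumped the same way as one that is ENABLED. The function name below is hypothetical, and the literal 'd' is the value of LOGICALREP_TWOPHASE_STATE_DISABLED from the pg_subscription.h hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true when the dumped CREATE SUBSCRIPTION should carry
 * "two_phase = on". The argument is the catalog value as text, as pg_dump
 * gets it from PQgetvalue().
 */
static bool
demo_dump_two_phase_on(const char *subtwophasestate)
{
	const char	disabled[] = {'d', '\0'};

	return strcmp(subtwophasestate, disabled) != 0;
}
```

On restore the subscription therefore starts over in the requested state and goes through the PENDING to ENABLED transition again.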
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3e39fdb..920f083 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6598c53..194c322 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See the comments atop worker.c for more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare, and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time is used to check whether the downstream
+ * has received this prepared transaction, in which case it can apply the
+ * rollback; otherwise, it can skip the rollback operation. The gid alone is
+ * not sufficient because the downstream node can have a prepared transaction
+ * with the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
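To make the four new message-type bytes from the LogicalRepMsgType hunk easier to spot in a protocol trace, here is a tiny lookup sketch. It is an illustration only: the function name is hypothetical, and the byte values are copied from the enum additions in this patch.

```c
#include <assert.h>
#include <string.h>

/*
 * Map a two-phase message-type byte to a readable name. Returns NULL for
 * any byte that is not one of the four message types added by this patch.
 */
static const char *
demo_twophase_msg_name(char msgtype)
{
	switch (msgtype)
	{
		case 'b':
			return "BEGIN PREPARE";
		case 'P':
			return "PREPARE";
		case 'K':
			return "COMMIT PREPARED";
		case 'r':
			return "ROLLBACK PREPARED";
		default:
			return NULL;
	}
}
```

Note that the bytes are case-sensitive: lowercase 'c' is already STREAM COMMIT, which is why this sketch returns NULL for it.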
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 53cdfa5..86628c7 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -311,7 +311,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -650,7 +654,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..91d9032
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,291 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 19;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..2bea214
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,232 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..2cfc1ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,9 +1388,11 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

#333tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
In reply to: Ajin Cherian (#332)
RE: [HACKERS] logical decoding of two-phase transactions

On Wed, May 26, 2021 10:13 PM Ajin Cherian <itsajin@gmail.com> wrote:

I've attached a patch that fixes this issue. Do test and confirm.

Thanks for your patch.
I have tested and confirmed that the issue I reported has been fixed.

Regards
Tang

#334Ajin Cherian
itsajin@gmail.com
In reply to: tanghy.fnst@fujitsu.com (#333)

On Thu, May 27, 2021 at 11:20 AM tanghy.fnst@fujitsu.com
<tanghy.fnst@fujitsu.com> wrote:

On Wed, May 26, 2021 10:13 PM Ajin Cherian <itsajin@gmail.com> wrote:

I've attached a patch that fixes this issue. Do test and confirm.

Thanks for your patch.
I have tested and confirmed that the issue I reported has been fixed.

Thanks for the confirmation. The problem, as you reported, was a table
left open when a transaction was committed.
This happened because UpdateTwoPhaseState() committed a transaction
internally while its caller, CreateSubscription, still had a table
open. The call was newly added to the CreateSubscription code to handle
the new use case of two_phase being enabled on create subscription if
"copy_data = false". I don't think CreateSubscription required this to
be inside a transaction; the commit was only meant for the place where
this function was originally used, the apply worker code
(ApplyWorkerMain()).
So I removed the commit from inside UpdateTwoPhaseState() and instead
start and commit the transaction before and after the call in the apply
worker code.
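
To illustrate the ownership change described above, here is a minimal toy
model in Python. It is only a sketch, not the patch's C code: the
Transaction class, the exception and the function bodies are all
hypothetical stand-ins, and the "tables still open" check merely mimics
the kind of complaint reported. The point is that the helper updates
state without committing, so each caller decides when to commit.

```python
class OpenResourceError(Exception):
    """Raised when a transaction commits while a table is still open."""


class Transaction:
    """Toy transaction that refuses to commit while tables remain open."""

    def __init__(self):
        self.open_tables = set()
        self.committed = False

    def open_table(self, name):
        self.open_tables.add(name)

    def close_table(self, name):
        self.open_tables.discard(name)

    def commit(self):
        if self.open_tables:
            raise OpenResourceError(
                f"tables still open at commit: {sorted(self.open_tables)}")
        self.committed = True


def update_two_phase_state(catalog, sub_name, new_state):
    """Hypothetical stand-in for the helper: updates state, never commits."""
    catalog[sub_name] = new_state


def create_subscription(catalog, sub_name):
    """Caller that keeps a table open across the helper call."""
    txn = Transaction()
    txn.open_table("pg_subscription")
    update_two_phase_state(catalog, sub_name, "e")
    txn.close_table("pg_subscription")
    txn.commit()  # safe: the helper did not commit underneath us
    return txn


def apply_worker_enable_two_phase(catalog, sub_name):
    """Caller that wraps the helper in a transaction of its own."""
    txn = Transaction()  # start the transaction before the call
    update_two_phase_state(catalog, sub_name, "e")
    txn.commit()  # commit after the call
    return txn
```

Had the helper committed on its own, the create_subscription() path in
this model would fail at the point where the table is still open.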

regards,
Ajin Cherian
Fujitsu Australia

#335Ajin Cherian
itsajin@gmail.com
In reply to: vignesh C (#331)
3 attachment(s)

On Wed, May 26, 2021 at 6:53 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, May 25, 2021 at 8:54 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Fri, May 21, 2021 at 6:43 PM Peter Smith <smithpb2250@gmail.com> wrote:

Fixed in v77-0001, v77-0002

Attaching a new patch-set that rebases the patch, addresses review
comments from Peter as well as a test failure reported by Tang. I've
also added some new test case into patch-2 authored by Tang.

Thanks for the updated patch, few comments:
1) Should "The end LSN of the prepare." be changed to "end LSN of the
prepare transaction."?

No, this is the end LSN of the prepare. The prepare consists of multiple LSNs.

2) Should the ";" be "," here?
+++ b/doc/src/sgml/catalogs.sgml
@@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration
count&gt;</replaceable>:<replaceable>&l
<row>
<entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is
pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>

No, I think the ";" is correct here; it connects multiple parts of the sentence.

3) Should end_lsn be commit_end_lsn?
+       prepare_data->commit_end_lsn = pq_getmsgint64(in);
+       if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
elog(ERROR, "end_lsn is not set in commit prepared message");
+       prepare_data->prepare_time = pq_getmsgint64(in);

Changed this.
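
For readers following the message layout, here is a rough Python sketch
of decoding a Commit Prepared message and applying the same "end LSN must
be set" check. The field order follows the protocol.sgml text in this
patch version (Byte1('K'), Int8 flags, Int64 commit LSN, Int64 end LSN,
Int64 commit timestamp, Int32 xid, String gid). The function name and the
returned dict keys are made up for illustration, and the layout may
differ in other patch versions.

```python
import struct

# Fixed-size part of the message, network byte order, no padding:
# type byte, flags, commit LSN, end LSN, commit timestamp, xid.
_COMMIT_PREPARED_HEADER = struct.Struct("!cBQQqI")


def parse_commit_prepared(buf: bytes) -> dict:
    """Decode a Commit Prepared message (layout assumed from this patch)."""
    if len(buf) < _COMMIT_PREPARED_HEADER.size + 1:
        raise ValueError("commit prepared message is too short")

    msg_type, flags, commit_lsn, end_lsn, commit_time, xid = (
        _COMMIT_PREPARED_HEADER.unpack_from(buf, 0))

    if msg_type != b"K":
        raise ValueError("not a commit prepared message")
    if flags != 0:
        raise ValueError("unrecognized flags in commit prepared message")
    if end_lsn == 0:  # 0 corresponds to InvalidXLogRecPtr
        raise ValueError(
            "commit_end_lsn is not set in commit prepared message")

    # The GID is a null-terminated string following the fixed header.
    gid_start = _COMMIT_PREPARED_HEADER.size
    gid_end = buf.index(b"\x00", gid_start)
    return {
        "commit_lsn": commit_lsn,
        "commit_end_lsn": end_lsn,
        "commit_time": commit_time,  # microseconds since 2000-01-01
        "xid": xid,
        "gid": buf[gid_start:gid_end].decode("utf-8"),
    }
```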

4) This change is not required

diff --git a/src/include/replication/pgoutput.h
b/src/include/replication/pgoutput.h
index 0dc460f..93c6731 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -29,5 +29,4 @@ typedef struct PGOutputData
bool            messages;
bool            two_phase;
} PGOutputData;
-

removed.

#endif /* PGOUTPUT_H */

5) Will the worker receive commit prepared/rollback prepared, given
that we have logic to skip commit prepared / rollback prepared in
pgoutput_rollback_prepared_txn and pgoutput_commit_prepared_txn:

+        * It is possible that we haven't received the prepare because
+        * the transaction did not have any changes relevant to this
+        * subscription and was essentially an empty prepare. In which case,
+        * the walsender is optimized to drop the empty transaction and the
+        * accompanying prepare. Silently ignore if we don't find the prepared
+        * transaction.
*/
-       replorigin_session_origin_lsn = prepare_data.end_lsn;
-       replorigin_session_origin_timestamp = prepare_data.commit_time;
+       if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+                                       prepare_data.prepare_time))
+       {

Commit prepared will be skipped if it happens within the same
walsender's lifetime. But if the walsender restarts, it no longer knows
about the skipped prepare, and in that case it will not skip the commit
prepared. Hence the logic for handling a stray commit prepared in the
apply worker.
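
The restart scenario can be sketched with a small Python model. This is
purely illustrative: the Walsender and ApplyWorker classes below are
invented for the example, and the worker looks up the prepared
transaction by GID only, whereas the patch's LookupGXact() also takes the
prepare end LSN and prepare time.

```python
class Walsender:
    """Toy sender that drops empty prepares and remembers them in memory."""

    def __init__(self):
        self.skipped_gids = set()  # in-memory only, so lost on restart

    def send_prepare(self, gid, changes):
        if not changes:
            self.skipped_gids.add(gid)
            return None  # empty prepare is not sent
        return ("prepare", gid, list(changes))

    def send_commit_prepared(self, gid):
        if gid in self.skipped_gids:
            self.skipped_gids.discard(gid)
            return None  # matching prepare was skipped, so skip this too
        return ("commit_prepared", gid)


class ApplyWorker:
    """Toy receiver that silently ignores a stray commit prepared."""

    def __init__(self):
        self.prepared = {}
        self.committed = []

    def apply(self, msg):
        if msg is None:
            return "nothing received"
        kind, gid = msg[0], msg[1]
        if kind == "prepare":
            self.prepared[gid] = msg[2]
            return "prepared"
        if kind == "commit_prepared":
            if gid not in self.prepared:
                return "ignored"  # prepare never arrived: stray commit
            self.committed.extend(self.prepared.pop(gid))
            return "committed"
        raise ValueError(f"unknown message kind: {kind}")
```

In this model, replacing the Walsender object with a fresh one between
the empty prepare and the commit prepared plays the role of a restart:
the new sender forwards the commit prepared, and the worker ignores it.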

6) I'm not sure whether we can add tests for skipping empty prepared
transactions; if possible, add a few tests.

I've added a test case using pg_logical_slot_peek_binary_changes() for
empty prepares; have a look.

7) We could add some debug level log messages for the transaction that
will be skipped.

If this is for the test, I was able to add a test without debug messages.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v80-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 4f2f2fedfbb2b3af392431191e9d084416c3ab28 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 24 May 2021 22:23:58 -0400
Subject: [PATCH v80] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 307 ++++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 150 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 201 ++++++++++--
 src/backend/replication/logical/worker.c           | 341 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 291 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 232 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 44 files changed, 2367 insertions(+), 206 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 8aebc4d..8a3d350 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a7ec5c3..493432d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..6683929 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7365,6 +7386,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
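The Prepare message layout documented above (type byte, flags, two LSNs, timestamp, xid, GID) could be decoded on the receiving side roughly as follows. This is a minimal standalone sketch, not code from the patch: PrepareMsg, read_be and parse_prepare are hypothetical names, and it assumes the integer fields arrive in network byte order, as they do when written with pq_sendint64/pq_sendint32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the decoded fields; not a PostgreSQL type. */
typedef struct PrepareMsg
{
	uint8_t		flags;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	const char *gid;			/* points into the input buffer */
} PrepareMsg;

/* Read a big-endian (network order) unsigned integer of n bytes. */
static uint64_t
read_be(const unsigned char *p, size_t n)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < n; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Decode a Prepare message. Returns 0 on success, -1 if the buffer is too
 * short, has the wrong type byte, nonzero flags, or an unterminated GID.
 */
static int
parse_prepare(const unsigned char *buf, size_t len, PrepareMsg *out)
{
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;	/* bytes before the GID */

	if (len <= fixed || buf[0] != 'P')
		return -1;
	out->flags = buf[1];
	if (out->flags != 0)
		return -1;
	out->prepare_lsn = read_be(buf + 2, 8);
	out->end_lsn = read_be(buf + 10, 8);
	out->prepare_time = (int64_t) read_be(buf + 18, 8);
	out->xid = (uint32_t) read_be(buf + 26, 4);

	/* The GID is a NUL-terminated string; make sure the terminator exists. */
	if (memchr(buf + fixed, '\0', len - fixed) == NULL)
		return -1;
	out->gid = (const char *) (buf + fixed);
	return 0;
}
```

Commit Prepared uses the same field order with 'K' as the type byte; Rollback Prepared carries one additional Int64 timestamp, so its fixed part is 8 bytes longer.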
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> for the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at <command>PREPARE TRANSACTION</command> time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          for the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..776295c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, LSN and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
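The matching rule in LookupGXact (same GID is not enough; the origin LSN and origin timestamp must agree too) can be illustrated outside the backend with a small sketch. PreparedEntry and lookup_prepared are hypothetical stand-ins: the real function reads the origin fields from the two-phase state file or from WAL while holding TwoPhaseStateLock, which this sketch does not model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a prepared-transaction entry. */
typedef struct PreparedEntry
{
	bool		valid;
	const char *gid;
	uint64_t	origin_lsn;			/* end LSN of the remote prepare */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
} PreparedEntry;

/* Return true only if GID, origin LSN and origin timestamp all match. */
static bool
lookup_prepared(const PreparedEntry *entries, size_t n, const char *gid,
				uint64_t prepare_end_lsn, int64_t prepare_timestamp)
{
	for (size_t i = 0; i < n; i++)
	{
		const PreparedEntry *e = &entries[i];

		/* Skip not-yet-valid entries and entries with a different GID. */
		if (!e->valid || strcmp(e->gid, gid) != 0)
			continue;

		if (e->origin_lsn == prepare_end_lsn &&
			e->origin_timestamp == prepare_timestamp)
			return true;
	}
	return false;
}
```

A locally prepared transaction that happens to reuse the GID has different origin fields, so it is not mistaken for the one received from upstream.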
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function when only a true/false answer is needed and you have no
+ * use for the List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5c84d75..9b941e9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1254,5 +1254,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..8c0e6a8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica. See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -505,10 +556,35 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled; enable it once all the tables are synced
+				 * and ready. This avoids race conditions such as prepared
+				 * transactions being skipped because their changes are not
+				 * applied, due to checks in should_apply_changes_for_rel(),
+				 * while tablesync for the corresponding tables is in
+				 * progress. See comments atop worker.c.
+				 *
+				 * Special case: if tables were specified but copy_data is
+				 * false, then it is safe to enable two_phase up-front because
+				 * those tables start out in READY state. Note, if the
+				 * subscription has no tables then enablement cannot be done
+				 * here - we must leave the twophase state as PENDING, to
+				 * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -814,7 +890,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +925,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +954,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +1000,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1017,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details on why this
+					 * is not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1059,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1077,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details on why this
+					 * is not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1117,33 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state,
+				 * we must not allow any subsequent table initialization to
+				 * occur. So ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
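The two-phase state handling spread across the subscriptioncmds.c hunks above can be summarized in a small standalone sketch. The enum and both function names are hypothetical; the patch itself stores the state as a char (LOGICALREP_TWOPHASE_STATE_*) in pg_subscription, and the "enabled up-front" case only arises on the path where the slot is actually created on the publisher.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the DISABLED / PENDING / ENABLED states. */
typedef enum
{
	TP_DISABLED,
	TP_PENDING,
	TP_ENABLED
} TwoPhaseState;

/*
 * State right after CREATE SUBSCRIPTION: two_phase starts as PENDING and is
 * only ENABLED immediately when the slot is created, no data copy is needed,
 * and the subscription already has tables.
 */
static TwoPhaseState
initial_twophase_state(bool twophase, bool create_slot, bool copy_data,
					   int ntables)
{
	if (!twophase)
		return TP_DISABLED;
	if (create_slot && !copy_data && ntables > 0)
		return TP_ENABLED;
	return TP_PENDING;
}

/* A refresh is rejected only for an ENABLED subscription with copy_data. */
static bool
refresh_allowed(TwoPhaseState state, bool copy_data)
{
	return !(state == TP_ENABLED && copy_data);
}
```

Leaving a table-less subscription in PENDING is what keeps a later ALTER SUBSCRIPTION ... REFRESH PUBLICATION with copy_data usable, as the comment in CreateSubscription notes.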
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..ccde3bc 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The code below is a temporary hack to check
+		 * for server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +847,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +861,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..58b4e2c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c387997 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is enabled
+	 * at the time of slot creation, or when the two_phase option is given
+	 * when streaming starts.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is enabled
+	 * at the time of slot creation, or when the two_phase option is given
+	 * when streaming starts.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared, and other
+	 * transactions commit between the prepare and the commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
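The guard in replorigin_advance shown above amounts to "never move the remembered remote LSN backwards unless explicitly asked to". A minimal sketch of that rule, with a hypothetical helper name and plain integers standing in for XLogRecPtr:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return the remote LSN to remember: take the incoming value only if it is
 * newer than the current one, or if the caller explicitly allows going
 * backward.
 */
static uint64_t
advance_remote_lsn(uint64_t current, uint64_t incoming, bool go_backward)
{
	if (go_backward || current < incoming)
		return incoming;
	return current;
}
```

This is why an older prepare LSN arriving together with a late commit prepared does not overwrite a newer position already recorded for the origin.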
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
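For anyone reviewing the wire format without the rest of the tree at hand, here is a standalone sketch of the PREPARE payload that follows the message-type byte: one flags byte, three 64-bit fields, a 32-bit xid, and a NUL-terminated gid, all integers in network byte order. Every name prefixed with sketch_ or Sketch is made up for illustration, SKETCH_GIDSIZE is only a stand-in for GIDSIZE, and the plain byte buffer replaces StringInfo and the pq_send*/pq_getmsg* calls used by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumed stand-in for GIDSIZE */
#define SKETCH_FIXED_LEN 29		/* flags + 3 * int64 + int32 */

typedef struct SketchPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Append the low n bytes of v in big-endian order; return the new offset. */
static size_t
sketch_put_be(unsigned char *buf, size_t off, uint64_t v, int n)
{
	for (int i = n - 1; i >= 0; i--)
		buf[off++] = (unsigned char) (v >> (8 * i));
	return off;
}

/* Read an n-byte big-endian integer and advance *off. */
static uint64_t
sketch_get_be(const unsigned char *buf, size_t *off, int n)
{
	uint64_t	v = 0;

	for (int i = 0; i < n; i++)
		v = (v << 8) | buf[(*off)++];
	return v;
}

/*
 * Encode the payload; the caller must supply at least
 * SKETCH_FIXED_LEN + SKETCH_GIDSIZE bytes. Returns the bytes written.
 */
static size_t
sketch_write_prepare(unsigned char *buf, const SketchPrepare *p)
{
	size_t		off = 0;
	size_t		gidlen = strlen(p->gid) + 1;	/* include the NUL */

	buf[off++] = 0;				/* flags, currently always zero */
	off = sketch_put_be(buf, off, p->prepare_lsn, 8);
	off = sketch_put_be(buf, off, p->end_lsn, 8);
	off = sketch_put_be(buf, off, (uint64_t) p->prepare_time, 8);
	off = sketch_put_be(buf, off, p->xid, 4);
	memcpy(buf + off, p->gid, gidlen);
	return off + gidlen;
}

/* Decode the payload; returns 0 on success, -1 on any validation failure. */
static int
sketch_read_prepare(const unsigned char *buf, size_t len, SketchPrepare *p)
{
	size_t		off = 0;
	const unsigned char *nul;

	if (len < SKETCH_FIXED_LEN + 1 || buf[off++] != 0)
		return -1;				/* too short, or unrecognized flags */
	p->prepare_lsn = sketch_get_be(buf, &off, 8);
	p->end_lsn = sketch_get_be(buf, &off, 8);
	p->prepare_time = (int64_t) sketch_get_be(buf, &off, 8);
	p->xid = (uint32_t) sketch_get_be(buf, &off, 4);
	if (p->prepare_lsn == 0 || p->end_lsn == 0)
		return -1;				/* both LSNs must be set */

	/* bounded copy: the gid must be terminated and fit the buffer */
	nul = memchr(buf + off, '\0', len - off);
	if (nul == NULL || (size_t) (nul - (buf + off)) >= SKETCH_GIDSIZE)
		return -1;
	memcpy(p->gid, buf + off, (size_t) (nul - (buf + off)) + 1);
	return 0;
}
```

The reader rejects nonzero flags and unset LSNs in the same spirit as logicalrep_read_prepare, and it bounds the gid copy rather than trusting the sender.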
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b0ab91c..b50aa24 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2548,7 +2548,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2639,7 +2639,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2686,7 +2686,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2706,7 +2706,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2725,19 +2725,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2755,12 +2756,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
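To make the role of two_phase_at concrete, the check in ReorderBufferFinishPrepared boils down to the predicate below. This is only a sketch: sketch_must_send_prepare_first and SketchLsn are hypothetical names, and prepare_lsn stands for the value txn->final_lsn holds on entry to that function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLsn;		/* stand-in for XLogRecPtr */

/*
 * True when the transaction's changes and its PREPARE still have to be
 * replayed at COMMIT PREPARED time, because the prepare happened before
 * two-phase decoding was enabled. Rollbacks never need the replay.
 */
static bool
sketch_must_send_prepare_first(SketchLsn prepare_lsn, SketchLsn two_phase_at,
							   bool is_commit)
{
	return prepare_lsn < two_phase_at && is_commit;
}
```

A prepare at or after two_phase_at is assumed to have been decoded when it was first seen, so only the COMMIT PREPARED itself goes out for it.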
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..4a9275d 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
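The readiness rule and the restart decision from the tablesync.c changes can be reduced to the sketch below, which may help when reasoning about the "no tables" corner case. The function and enum names are invented for illustration, and plain table counts stand in for the catalog lookups done by FetchTableStates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the two_phase tri-state. */
typedef enum
{
	SKETCH_2PC_DISABLED,
	SKETCH_2PC_PENDING,
	SKETCH_2PC_ENABLED
} Sketch2pcState;

/* A subscription with no tables is never "all ready". */
static bool
sketch_all_tablesyncs_ready(int total_tables, int not_ready_tables)
{
	bool		has_subrels = total_tables > 0;
	bool		found_busy = not_ready_tables > 0;

	return has_subrels && !found_busy;
}

/*
 * Should a running apply worker exit so that the restarted worker can
 * switch the subscription from PENDING to ENABLED?
 */
static bool
sketch_should_restart_for_two_phase(Sketch2pcState state,
									int total_tables, int not_ready_tables)
{
	return state == SKETCH_2PC_PENDING &&
		sketch_all_tablesyncs_ready(total_tables, not_ready_tables);
}
```

With zero tables the state stays PENDING indefinitely, which is what keeps ALTER SUBSCRIPTION ... REFRESH PUBLICATION usable on an initially empty subscription.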
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6ba447e..ef8c38f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that it was initially on and
+ * we received some prepare, then we turn it off; now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we can allow changing this option from off to on but not sure if
+ * that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and the prepared transaction has operations on tables subscribed to
+ * those subscriptions. For such cases, if we use the GID sent by the publisher,
+ * one of the prepares will be successful and the others will fail, in which
+ * case the server will send them again. Now, this can lead to a deadlock if the
+ * user has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting of
+ * the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -794,6 +870,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server as that can lead to deadlock
+	 * when we have multiple subscriptions from the same node pointing to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2032,6 +2282,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2509,6 +2775,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2995,6 +3264,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3161,15 +3444,67 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
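As a quick illustration of why the subscriber-side GID avoids collisions between subscriptions, here is a standalone sketch of the "pg_gid_%u_%u" scheme used by TwoPhaseTransactionGid. The function name and SKETCH_GIDSIZE are hypothetical, uint32_t stands in for Oid and TransactionId, and the truncation check is an addition of this sketch rather than something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumed stand-in for GIDSIZE */

/*
 * Build a subscriber-local GID from the subscription oid and the remote
 * xid. Returns 0 on success, -1 on invalid input or truncation.
 */
static int
sketch_two_phase_gid(uint32_t subid, uint32_t xid, char *gid, size_t szgid)
{
	int			n;

	if (subid == 0 || xid == 0)
		return -1;				/* 0 is the invalid oid and the invalid xid */

	n = snprintf(gid, szgid, "pg_gid_%u_%u", (unsigned) subid, (unsigned) xid);
	return (n < 0 || (size_t) n >= szgid) ? -1 : 0;
}
```

Two subscriptions applying the same remote xid get different GIDs because the subscription oid is part of the name, so neither PREPARE fails with a duplicate identifier and has to be resent.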
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 339c393..cdfd063 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4359,6 +4360,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4402,9 +4404,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The below code is a temporary hack to check for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4425,6 +4443,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4450,6 +4469,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4477,6 +4498,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4518,6 +4540,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3e39fdb..920f083 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6598c53..194c322 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 53cdfa5..86628c7 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -311,7 +311,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -650,7 +654,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..91d9032
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,291 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 19;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..2bea214
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,232 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..2cfc1ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,9 +1388,11 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

Attachment: v80-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From edcb9484b4ec56abc7d16fae14ac14c632060e2b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 27 May 2021 23:29:33 -0400
Subject: [PATCH v80] Skip empty transactions for logical replication.

Currently, logical replication sends every transaction to the subscriber,
even when the transaction is empty because it contains no changes from the
selected publications. Building and transmitting these empty transactions
wastes CPU cycles and network bandwidth.

This patch addresses the problem by postponing the BEGIN / BEGIN PREPARE
message until the first change is sent. If a transaction has produced no
changes by the time its COMMIT or PREPARE is processed, that COMMIT or
PREPARE message is not sent either. As a result, pgoutput skips the
BEGIN / COMMIT or BEGIN PREPARE / PREPARE messages for empty transactions.

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  14 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  36 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 141 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/pgoutput.h              |   1 -
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  41 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 15 files changed, 265 insertions(+), 35 deletions(-)
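
For readers skimming the diff below, the deferred-BEGIN idea described in the
commit message can be sketched in isolation. This is only a toy model under
stated assumptions, not the pgoutput code: TxnState, Stream, emit() and the
txn_* functions are hypothetical names invented for illustration, and each
"message" is reduced to a single character (B = BEGIN, I = change, C = COMMIT).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-transaction state, loosely modelled on PGOutputTxnData. */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been emitted for this txn? */
} TxnState;

/* Hypothetical output stream: one character per message sent downstream. */
typedef struct Stream
{
	char		buf[64];
	size_t		len;
} Stream;

static void
emit(Stream *s, char msg)
{
	assert(s->len + 1 < sizeof(s->buf));
	s->buf[s->len++] = msg;
	s->buf[s->len] = '\0';
}

/* BEGIN callback: remember that nothing has been sent yet, emit nothing. */
static void
txn_begin(TxnState *t)
{
	t->sent_begin = false;
}

/* Change callback: emit the postponed BEGIN just before the first change. */
static void
txn_change(Stream *s, TxnState *t, bool published)
{
	if (!published)
		return;					/* change is filtered out, still "empty" */
	if (!t->sent_begin)
	{
		emit(s, 'B');
		t->sent_begin = true;
	}
	emit(s, 'I');
}

/* COMMIT callback: skip COMMIT when BEGIN was never sent. */
static void
txn_commit(Stream *s, TxnState *t)
{
	if (!t->sent_begin)
		return;
	emit(s, 'C');
}
```

In this model, a transaction whose changes are all unpublished leaves the
stream untouched, while a transaction with at least one published change
produces BEGIN, the changes, and COMMIT in the usual order.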

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 493432d..aceff9a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -862,11 +862,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check whether the
+      plugin has received the corresponding <command>PREPARE TRANSACTION</command>;
+      if it has, it can commit the transaction; otherwise, it can skip the commit.
+      The <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 19cd6a2..ae2cd11 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit.
 </para></listitem>
 </varlistentry>
@@ -7552,6 +7559,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c387997..ed60719 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -940,7 +941,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -975,7 +977,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..f1d8bf7 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b50aa24..bacf849 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2766,7 +2766,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 40b00c9..f7db5ef 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -971,26 +971,38 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* there is no transaction when COMMIT PREPARED is called */
-	ensure_transaction();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In that case,
+	 * the walsender drops the empty transaction and the accompanying
+	 * prepare, so silently skip the commit if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	CommitTransactionCommand();
+		/* there is no transaction when COMMIT PREPARED is called */
+		ensure_transaction();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7c3a33d..84e9cfe 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -135,6 +137,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -404,10 +411,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions would otherwise send BEGIN and COMMIT messages to
+	 * subscribers, using bandwidth for little or no benefit.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -422,8 +451,18 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -435,10 +474,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions would otherwise send BEGIN PREPARE and COMMIT PREPARED
+	 * messages to subscribers, using bandwidth for little or no benefit to
+	 * logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -453,8 +510,15 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -465,12 +529,28 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -483,8 +563,21 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -613,11 +706,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -651,6 +749,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -750,6 +857,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -757,6 +865,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -793,6 +905,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -813,6 +934,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -825,6 +947,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 9b3e934..a6d9977 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 0dc460f..93c6731 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -29,5 +29,4 @@ typedef struct PGOutputData
 	bool		messages;
 	bool		two_phase;
 } PGOutputData;
-
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 86628c7..0ff430d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -456,7 +456,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 90430f4..3428c6d 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 19;
+use Test::More tests => 20;
 
 ###############################
 # Setup
@@ -277,6 +277,45 @@ $node_publisher->wait_for_catchup($appname);
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, qq(0), 'transaction is aborted on subscriber');
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+		"CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot.
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_nopub SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'empty_transaction';
+	COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+	'postgres', qq(
+		SELECT get_byte(data, 0)
+		FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+			'proto_version', '1',
+			'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+	'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2cfc1ae..f0941ad 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1596,6 +1596,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1

v80-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From ba0173740af1b32bd2f05ad51ecae3ff9b2d16b5 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 24 May 2021 22:46:46 -0400
Subject: [PATCH v80] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 ++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/tablesync.c        |   5 -
 src/backend/replication/logical/worker.c           | 134 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/021_twophase.pl            |   8 +
 src/test/subscription/t/022_twophase_cascade.pl    |  15 +
 src/test/subscription/t/023_twophase_stream.pl     | 459 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 285 +++++++++++++
 14 files changed, 1060 insertions(+), 84 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
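
For illustration only (not part of the patch): a minimal, standalone sketch of
decoding the Stream Prepare message as laid out in the protocol.sgml hunk of
this patch (Byte1('p'), Int8 flags, Int64 prepare LSN, Int64 end LSN, Int64
prepare timestamp, Int32 xid, NUL-terminated GID). The names StreamPrepareMsg,
parse_stream_prepare and DEMO_GID_MAX are hypothetical and exist only for this
sketch; it is not PostgreSQL's logicalrep_read_stream_prepare(). It assumes the
buffer starts at the message type byte, that integers are in network byte
order, and that 200 bytes is a sufficient GID bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed upper bound on the GID length, including the trailing NUL. */
#define DEMO_GID_MAX 200

/* Hypothetical holder for the decoded fields (illustration only). */
typedef struct StreamPrepareMsg
{
	uint8_t		flags;
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	char		gid[DEMO_GID_MAX];
} StreamPrepareMsg;

/* Read a big-endian 64-bit integer from p. */
static uint64_t
read_be64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian 32-bit integer from p. */
static uint32_t
read_be32(const unsigned char *p)
{
	uint32_t	v = 0;

	for (int i = 0; i < 4; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Decode one Stream Prepare message. Returns 0 on success and -1 if the
 * buffer is too short, has the wrong type byte, carries non-zero flags, or
 * holds a GID that is unterminated or does not fit in DEMO_GID_MAX bytes.
 */
int
parse_stream_prepare(const unsigned char *buf, size_t len,
					 StreamPrepareMsg *out)
{
	/* type + flags + three Int64 fields + Int32 xid */
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;
	const unsigned char *gid;
	const unsigned char *nul;
	size_t		gidlen;

	assert(buf != NULL && out != NULL);

	if (len < fixed + 1 || buf[0] != 'p')
		return -1;

	out->flags = buf[1];
	if (out->flags != 0)
		return -1;				/* flags are currently unused, must be 0 */

	out->prepare_lsn = read_be64(buf + 2);
	out->end_lsn = read_be64(buf + 10);
	out->prepare_time = (int64_t) read_be64(buf + 18);
	out->xid = read_be32(buf + 26);

	/* The GID must be NUL-terminated inside the buffer and fit our array. */
	gid = buf + fixed;
	nul = memchr(gid, '\0', len - fixed);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= sizeof(out->gid))
		return -1;

	memcpy(out->gid, gid, gidlen + 1);
	return 0;
}
```

A caller would pass the raw message bytes and their length, then inspect the
struct only when the function returns 0. The GID copy is length-checked here
so that an oversized string is rejected rather than written past the buffer.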

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 6683929..19cd6a2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7386,7 +7386,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7649,6 +7649,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8c0e6a8..e5e2723 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -925,12 +910,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 4a9275d..e3cbe32 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1256,9 +1256,6 @@ UpdateTwoPhaseState(Oid suboid, char new_state)
 		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
 		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
 
-	if (!IsTransactionState())
-		StartTransactionCommand();
-
 	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
 	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
 	if (!HeapTupleIsValid(tup))
@@ -1281,6 +1278,4 @@ UpdateTwoPhaseState(Oid suboid, char new_state)
 
 	heap_freetuple(tup);
 	table_close(rel, RowExclusiveLock);
-
-	CommitTransactionCommand();
 }
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ef8c38f..40b00c9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1044,6 +1044,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to deadlock
+	 * when multiple subscriptions on the same node point to publications on
+	 * the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1241,30 +1321,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1272,7 +1343,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1287,7 +1358,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1362,6 +1433,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2298,6 +2394,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -3484,8 +3584,10 @@ ApplyWorkerMain(Datum main_arg)
 			options.proto.logical.twophase = true;
 			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
 
+			StartTransactionCommand();
 			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
 			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
 		}
 		else
 		{
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 91d9032..90430f4 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -79,6 +79,14 @@ $node_publisher->wait_for_catchup($appname);
 my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, qq(1), 'transaction is prepared on subscriber');
 
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
 # check that 2PC gets committed on subscriber
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
 
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 2bea214..ac27384 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -101,6 +101,21 @@ is($result, qq(1), 'transaction is prepared on subscriber B');
 $result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, qq(1), 'transaction is prepared on subscriber C');
 
+# Wait for the statistics to be updated
+$node_A->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_b'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+$node_B->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_c'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
 # 2PC COMMIT
 $node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
 
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..b2d52cd
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,459 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..9d5c6f5
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,285 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# Wait for the statistics to be updated
+$node_A->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_b'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+$node_B->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_c'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#336Ajin Cherian
itsajin@gmail.com
In reply to: Ajin Cherian (#335)
3 attachment(s)

On Fri, May 28, 2021 at 1:44 PM Ajin Cherian <itsajin@gmail.com> wrote:

Sorry, please ignore the previous patch-set. I attached the wrong
files. Here's the correct patch-set.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v80-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From ba0173740af1b32bd2f05ad51ecae3ff9b2d16b5 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 24 May 2021 22:46:46 -0400
Subject: [PATCH v80] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
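Reviewer-side note (not part of the commit): the Stream Prepare message this
patch documents in protocol.sgml can be decoded as in the sketch below. It is
a minimal illustration, not code from the patch. It assumes the usual
replication-protocol conventions (network byte order, null-terminated String)
and assumes the GID is UTF-8 text. The names parse_stream_prepare,
StreamPrepare and format_lsn are hypothetical helpers made up for this example.

```python
import struct
from datetime import datetime, timedelta, timezone
from typing import NamedTuple

# PostgreSQL timestamps count microseconds from 2000-01-01.
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

# Fixed-size prefix, in the field order given by the protocol.sgml hunk:
# Byte1('p'), Int8 flags, Int64 prepare LSN, Int64 end LSN,
# Int64 prepare timestamp, Int32 xid. "!" selects network byte order.
FIXED = struct.Struct("!cBQQqI")


class StreamPrepare(NamedTuple):
    flags: int
    prepare_lsn: int
    end_lsn: int
    prepare_time: datetime
    xid: int
    gid: str


def parse_stream_prepare(buf: bytes) -> StreamPrepare:
    """Decode one Stream Prepare message body (hypothetical helper)."""
    if len(buf) < FIXED.size + 1:
        raise ValueError("message too short for Stream Prepare")
    kind, flags, prepare_lsn, end_lsn, ts_us, xid = FIXED.unpack_from(buf, 0)
    if kind != b"p":
        raise ValueError(f"not a Stream Prepare message: {kind!r}")
    if flags != 0:
        raise ValueError("flags are currently unused and must be 0")
    # The GID is a null-terminated string; index() raises ValueError
    # if the terminator is missing.
    end = buf.index(b"\x00", FIXED.size)
    gid = buf[FIXED.size:end].decode("utf-8")  # assumption: UTF-8 GID
    prepare_time = PG_EPOCH + timedelta(microseconds=ts_us)
    return StreamPrepare(flags, prepare_lsn, end_lsn, prepare_time, xid, gid)


def format_lsn(lsn: int) -> str:
    """Render a 64-bit LSN in the familiar hi/lo hexadecimal form."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"
```

An empty GID, as exercised by the "PREPARE TRANSACTION ''" case in
023_twophase_stream.pl, decodes to an empty string because only the
terminating null byte follows the fixed-size prefix.
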
 doc/src/sgml/protocol.sgml                         |  68 ++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/tablesync.c        |   5 -
 src/backend/replication/logical/worker.c           | 134 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/021_twophase.pl            |   8 +
 src/test/subscription/t/022_twophase_cascade.pl    |  15 +
 src/test/subscription/t/023_twophase_stream.pl     | 459 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 285 +++++++++++++
 14 files changed, 1060 insertions(+), 84 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 6683929..19cd6a2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7386,7 +7386,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7649,6 +7649,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
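
For readers following the Stream Prepare layout documented above, here is a minimal standalone sketch of how the fields line up on the wire. It is not the patch's code: `StreamPrepareMsg`, `stream_prepare_encode`, `stream_prepare_decode`, `pg_time_to_unix_usecs` and the 200-byte GID buffer are hypothetical names and sizes chosen for illustration. It assumes integers are sent in network byte order and the GID as a NUL-terminated string, which is what the pq_send* calls in proto.c produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct for illustration; not PostgreSQL's own type. */
typedef struct StreamPrepareMsg
{
	uint8_t		flags;			/* currently unused, must be 0 */
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* transaction id */
	char		gid[200];		/* assumed buffer size for the GID */
} StreamPrepareMsg;

/* Bytes before the GID: type + flags + three Int64 + one Int32. */
#define STREAM_PREPARE_FIXED (1 + 1 + 8 + 8 + 8 + 4)

/* Store the low nbytes of v in big-endian (network) order. */
static size_t
put_be(uint8_t *buf, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return (size_t) nbytes;
}

/* Read nbytes of big-endian data into a host integer. */
static uint64_t
get_be(const uint8_t *buf, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
size_t
stream_prepare_encode(const StreamPrepareMsg *m, uint8_t *buf, size_t cap)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (cap < STREAM_PREPARE_FIXED + gidlen)
		return 0;

	buf[off++] = 'p';			/* Byte1('p') */
	buf[off++] = m->flags;		/* Int8 flags */
	off += put_be(buf + off, m->prepare_lsn, 8);
	off += put_be(buf + off, m->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += put_be(buf + off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Returns 0 on success, -1 if the message is malformed. */
int
stream_prepare_decode(const uint8_t *buf, size_t len, StreamPrepareMsg *m)
{
	const uint8_t *nul;
	size_t		gidlen;

	if (len < STREAM_PREPARE_FIXED + 1 || buf[0] != 'p')
		return -1;
	if (buf[1] != 0)			/* flags must be 0 */
		return -1;

	m->flags = buf[1];
	m->prepare_lsn = get_be(buf + 2, 8);
	m->end_lsn = get_be(buf + 10, 8);
	m->prepare_time = (int64_t) get_be(buf + 18, 8);
	m->xid = (uint32_t) get_be(buf + 26, 4);

	/* The GID must be NUL-terminated and fit the destination buffer. */
	nul = memchr(buf + STREAM_PREPARE_FIXED, 0, len - STREAM_PREPARE_FIXED);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + STREAM_PREPARE_FIXED));
	if (gidlen >= sizeof(m->gid))
		return -1;
	memcpy(m->gid, buf + STREAM_PREPARE_FIXED, gidlen + 1);
	return 0;
}

/* 2000-01-01 is 10957 days (946684800 seconds) after the Unix epoch. */
int64_t
pg_time_to_unix_usecs(int64_t pg_usecs)
{
	return pg_usecs + INT64_C(946684800) * 1000000;
}
```

A caller would fill a `StreamPrepareMsg`, encode it into a byte buffer, and decode it on the other side. With a 17-character GID such as 'test_prepared_tab' the encoded message is 48 bytes: 30 fixed bytes plus the GID and its terminator.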
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8c0e6a8..e5e2723 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -925,12 +910,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
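
A side note on the read path above: the GID arrives from the wire and is copied into a fixed-size buffer, so a bounded copy is the safer pattern. The following is a minimal sketch of such a copy with the same contract as BSD strlcpy. `bounded_copy` and `GID_BUF_SIZE` are hypothetical names used only for illustration, and the value 200 is assumed to mirror GIDSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: stands in for PostgreSQL's GIDSIZE. */
#define GID_BUF_SIZE 200

/*
 * Copy src into dst, writing at most dstsize bytes including the
 * terminating NUL. Returns strlen(src), so a return value >= dstsize
 * tells the caller that the copy was truncated.
 */
size_t
bounded_copy(char *dst, const char *src, size_t dstsize)
{
	size_t		srclen = strlen(src);

	if (dstsize > 0)
	{
		size_t		n = (srclen < dstsize - 1) ? srclen : dstsize - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return srclen;
}
```

A reader built this way can compare the return value against the buffer size and raise an error on truncation, rather than silently overrunning the buffer.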
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 4a9275d..e3cbe32 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1256,9 +1256,6 @@ UpdateTwoPhaseState(Oid suboid, char new_state)
 		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
 		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
 
-	if (!IsTransactionState())
-		StartTransactionCommand();
-
 	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
 	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
 	if (!HeapTupleIsValid(tup))
@@ -1281,6 +1278,4 @@ UpdateTwoPhaseState(Oid suboid, char new_state)
 
 	heap_freetuple(tup);
 	table_close(rel, RowExclusiveLock);
-
-	CommitTransactionCommand();
 }
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ef8c38f..40b00c9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1044,6 +1044,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1241,30 +1321,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1272,7 +1343,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1287,7 +1358,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1362,6 +1433,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2298,6 +2394,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -3484,8 +3584,10 @@ ApplyWorkerMain(Datum main_arg)
 			options.proto.logical.twophase = true;
 			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
 
+			StartTransactionCommand();
 			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
 			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
 		}
 		else
 		{
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 91d9032..90430f4 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -79,6 +79,14 @@ $node_publisher->wait_for_catchup($appname);
 my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, qq(1), 'transaction is prepared on subscriber');
 
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
 # check that 2PC gets committed on subscriber
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
 
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index 2bea214..ac27384 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -101,6 +101,21 @@ is($result, qq(1), 'transaction is prepared on subscriber B');
 $result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, qq(1), 'transaction is prepared on subscriber C');
 
+# Wait for the statistics to be updated
+$node_A->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_b'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+$node_B->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_c'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
 # 2PC COMMIT
 $node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
 
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..b2d52cd
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,459 @@
+# Test logical replication of 2PC with streaming enabled
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..9d5c6f5
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,285 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# Wait for the statistics to be updated
+$node_A->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_b'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+$node_B->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_c'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
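
A note on the counts asserted in the streaming tests above (3334, 3335 and 3): they follow from the workload itself. After the INSERT the table holds keys 1..5000, the UPDATE only rewrites column b, and the DELETE removes every key divisible by 3. The arithmetic can be sketched in C as below; rows_after_workload() is a hypothetical helper written only for this note and is not part of the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): count the rows left in
 * test_tab once keys 1..max_key exist and every key divisible by 3 has
 * been deleted.  The UPDATE in the workload changes column b only, so
 * it has no effect on the row count.
 */
static int
rows_after_workload(int max_key)
{
	int			count = 0;

	assert(max_key >= 0);

	for (int a = 1; a <= max_key; a++)
	{
		if (a % 3 != 0)
			count++;			/* row survives the DELETE */
	}
	return count;
}
```

With max_key = 5000 this gives 5000 - 1666 = 3334. The row (99999, 'foobar') is inserted after the PREPARE, outside the 2PC transaction, so the transaction's DELETE never sees it: the COMMIT PREPARED case ends with 3334 + 1 = 3335 rows, and the ROLLBACK PREPARED case with the 2 original rows + 1 = 3.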

Attachment: v80-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From afa27f2df7e18815c2fb0eb2ef759691704735a5 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Fri, 28 May 2021 00:37:34 -0400
Subject: [PATCH v80] Skip empty transactions for logical replication.

The current logical replication behavior is to send every transaction to
the subscriber even if the transaction is empty (because it does not
contain changes from the selected publications). Building and transmitting
these empty transactions wastes CPU cycles and network bandwidth.

This patch addresses the problem by postponing the BEGIN / BEGIN PREPARE
message until the first change is sent. When a COMMIT or PREPARE message is
processed and no change has been sent for that transaction, the COMMIT or
PREPARE message is not sent either. As a result, pgoutput skips the
BEGIN / COMMIT or BEGIN PREPARE / PREPARE messages for empty transactions.

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
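Note for readers: the deferred-BEGIN idea described in the commit message can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the pgoutput code; TxnState, txn_begin(), txn_change() and txn_commit() are hypothetical names, and "sending" a message is modelled as incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-transaction state; not the structure used by pgoutput. */
typedef struct TxnState
{
	bool		sent_begin;		/* has the postponed BEGIN been emitted? */
	int			messages_sent;	/* number of messages written so far */
} TxnState;

/* BEGIN callback: emit nothing, only remember that BEGIN is pending. */
static void
txn_begin(TxnState *txn)
{
	assert(txn != NULL);
	txn->sent_begin = false;
	txn->messages_sent = 0;
}

/* Change callback: the first published change also emits the BEGIN. */
static void
txn_change(TxnState *txn, bool published)
{
	if (!published)
		return;					/* filtered out by the publication */

	if (!txn->sent_begin)
	{
		txn->messages_sent++;	/* the postponed BEGIN */
		txn->sent_begin = true;
	}
	txn->messages_sent++;		/* the change itself */
}

/* COMMIT callback: returns false when the whole transaction was skipped. */
static bool
txn_commit(TxnState *txn)
{
	if (!txn->sent_begin)
		return false;			/* empty transaction: send nothing */

	txn->messages_sent++;		/* the COMMIT */
	return true;
}
```

A transaction whose changes are all filtered out produces no messages at all, while one with two published changes produces four (BEGIN, two changes, COMMIT). The same shape applies to BEGIN PREPARE / PREPARE.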
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  36 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 141 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  41 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 266 insertions(+), 35 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 493432d..aceff9a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -862,11 +862,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check whether the
+      plugin has received this <command>PREPARE TRANSACTION</command>; if so,
+      it can commit the transaction, otherwise it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
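
Reviewer-style note on the documentation hunk above: the reason the gid alone is not enough can be illustrated with a small lookup that matches on all three values. This is only a sketch under assumptions; PreparedEntry and prepare_was_received() are hypothetical names, and how a real downstream records the prepares it has applied is not shown in this patch excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record a downstream keeps for each prepare it has applied. */
typedef struct PreparedEntry
{
	const char *gid;
	uint64_t	prepare_end_lsn;
	int64_t		prepare_time;
} PreparedEntry;

/*
 * Return true only if a prepare with the same gid AND the same
 * prepare_end_lsn and prepare_time was recorded.  A locally created
 * prepared transaction may reuse the gid, but it will not carry the
 * upstream LSN and timestamp.
 */
static bool
prepare_was_received(const PreparedEntry *entries, size_t n,
					 const char *gid, uint64_t prepare_end_lsn,
					 int64_t prepare_time)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		if (strcmp(entries[i].gid, gid) == 0 &&
			entries[i].prepare_end_lsn == prepare_end_lsn &&
			entries[i].prepare_time == prepare_time)
			return true;
	}
	return false;
}
```

A commit_prepared_cb implementation could call such a check and skip the commit when it returns false.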
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 19cd6a2..ae2cd11 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit.
 </para></listitem>
 </varlistentry>
@@ -7552,6 +7559,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c387997..ed60719 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -940,7 +941,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -975,7 +977,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b50aa24..bacf849 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2766,7 +2766,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 40b00c9..f7db5ef 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -971,26 +971,38 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* there is no transaction when COMMIT PREPARED is called */
-	ensure_transaction();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In that case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	CommitTransactionCommand();
+		/* there is no transaction when COMMIT PREPARED is called */
+		ensure_transaction();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7c3a33d..84e9cfe 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -135,6 +137,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -404,10 +411,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -422,8 +451,18 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -435,10 +474,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -453,8 +510,15 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -465,12 +529,28 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -483,8 +563,21 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -613,11 +706,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -651,6 +749,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -750,6 +857,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -757,6 +865,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -793,6 +905,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -813,6 +934,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -825,6 +947,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 9b3e934..a6d9977 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 86628c7..0ff430d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -456,7 +456,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 90430f4..3428c6d 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 19;
+use Test::More tests => 20;
 
 ###############################
 # Setup
@@ -277,6 +277,45 @@ $node_publisher->wait_for_catchup($appname);
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, qq(0), 'transaction is aborted on subscriber');
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+		"CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot.
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_nopub SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'empty_transaction';
+	COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+	'postgres', qq(
+		SELECT get_byte(data, 0)
+		FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+			'proto_version', '1',
+			'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+	'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2cfc1ae..f0941ad 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1596,6 +1596,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1

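The empty-transaction handling in the pgoutput changes above can be pictured with a small standalone model. This is only a sketch, not code from the patch: TxnState stands in for PGOutputTxnData, and emit(), txn_begin(), txn_change() and txn_commit() are hypothetical names that append one-letter messages to a buffer instead of writing to a replication stream. It shows BEGIN being deferred until the first published change, and COMMIT being skipped when BEGIN was never sent.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PGOutputTxnData in the patch. */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been emitted yet? */
} TxnState;

/* Hypothetical output sink: each message is recorded as a single letter. */
static char wire[64];

static void
emit(char msg)
{
	size_t		len = strlen(wire);

	if (len + 1 < sizeof(wire))
	{
		wire[len] = msg;
		wire[len + 1] = '\0';
	}
}

/* BEGIN callback: remember the transaction state, but send nothing yet. */
static void
txn_begin(TxnState *txn)
{
	txn->sent_begin = false;
}

/* Change callback: only a published change forces the deferred BEGIN out. */
static void
txn_change(TxnState *txn, bool published)
{
	if (!published)
		return;

	if (!txn->sent_begin)
	{
		emit('B');
		txn->sent_begin = true;
	}
	emit('I');
}

/* COMMIT callback: skip COMMIT when BEGIN was never sent. */
static void
txn_commit(TxnState *txn)
{
	if (!txn->sent_begin)
		return;
	emit('C');
}
```

In this model a transaction that touches only unpublished tables leaves the buffer empty, which mirrors what the new 021_twophase.pl test expects from the slot.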
Attachment: v80-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 4f2f2fedfbb2b3af392431191e9d084416c3ab28 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Mon, 24 May 2021 22:23:58 -0400
Subject: [PATCH v80] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time to built-in
logical replication, we need to do the following:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch, but it will allow outside replication solutions
that use the streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

The following are not supported:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase is enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase is enabled.

* Prepare API for in-progress transactions.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase, and
moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 307 ++++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 150 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 201 ++++++++++--
 src/backend/replication/logical/worker.c           | 341 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 291 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 232 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 44 files changed, 2367 insertions(+), 206 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl
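
The Begin Prepare message that this patch documents in protocol.sgml lists its fields in order: Byte1('b'), two Int64 LSNs, an Int64 prepare timestamp, an Int32 xid, and the GID as a String. A minimal sketch of that layout follows. It is an illustration under stated assumptions, not the patch's code: encode_begin_prepare(), put_int64() and put_int32() are hypothetical helpers, integers are assumed to be sent in network byte order, and the String is assumed to be null-terminated.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append a 64-bit value in network (big-endian) byte order. */
static size_t
put_int64(uint8_t *buf, size_t off, uint64_t v)
{
	for (int i = 7; i >= 0; i--)
		buf[off++] = (uint8_t) (v >> (i * 8));
	return off;
}

/* Append a 32-bit value in network (big-endian) byte order. */
static size_t
put_int32(uint8_t *buf, size_t off, uint32_t v)
{
	for (int i = 3; i >= 0; i--)
		buf[off++] = (uint8_t) (v >> (i * 8));
	return off;
}

/*
 * Hypothetical helper: lay out a Begin Prepare message in buf.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
static size_t
encode_begin_prepare(uint8_t *buf, size_t cap,
					 uint64_t prepare_lsn, uint64_t end_lsn,
					 int64_t prepare_time, uint32_t xid,
					 const char *gid)
{
	size_t		gidlen = strlen(gid) + 1;	/* include the terminator */
	size_t		need = 1 + 8 + 8 + 8 + 4 + gidlen;
	size_t		off = 0;

	if (need > cap)
		return 0;

	buf[off++] = 'b';			/* message type */
	off = put_int64(buf, off, prepare_lsn);
	off = put_int64(buf, off, end_lsn);
	off = put_int64(buf, off, (uint64_t) prepare_time);
	off = put_int32(buf, off, xid);
	memcpy(buf + off, gid, gidlen);

	return off + gidlen;
}
```

With a two-character GID the message is 1 + 24 + 4 + 3 = 32 bytes long under these assumptions.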

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 8aebc4d..8a3d350 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index a7ec5c3..493432d 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1250,9 +1250,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this, users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..6683929 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7365,6 +7386,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
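A reader-side sketch of the Begin Prepare layout documented above (Byte1('b'), three Int64 fields, an Int32 xid, then a NUL-terminated GID), assuming the usual network byte order for protocol integers. The struct and function names here are hypothetical and are not part of this patch or of PostgreSQL; this is only meant to make the field offsets concrete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the decoded fields; not a PostgreSQL type. */
typedef struct BeginPrepareMsg
{
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepare */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	const char *gid;			/* points into the caller's buffer */
} BeginPrepareMsg;

/* Read a big-endian (network order) 64-bit integer. */
static uint64_t
read_be64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian (network order) 32-bit integer. */
static uint32_t
read_be32(const unsigned char *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
		((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/*
 * Decode one Begin Prepare message. Returns 0 on success, -1 if the
 * buffer is too short, has the wrong type byte, or the GID is not
 * NUL-terminated inside the buffer.
 */
static int
decode_begin_prepare(const unsigned char *buf, size_t len,
					 BeginPrepareMsg *msg)
{
	/* type byte + 3 x Int64 + Int32 */
	const size_t fixed = 1 + 8 + 8 + 8 + 4;

	assert(buf != NULL && msg != NULL);

	if (len <= fixed || buf[0] != 'b')
		return -1;
	if (memchr(buf + fixed, '\0', len - fixed) == NULL)
		return -1;

	msg->prepare_lsn = read_be64(buf + 1);
	msg->end_lsn = read_be64(buf + 9);
	msg->prepare_time = (int64_t) read_be64(buf + 17);
	msg->xid = read_be32(buf + 25);
	msg->gid = (const char *) (buf + fixed);
	return 0;
}
```

The Prepare, Commit Prepared and Rollback Prepared messages would need an extra Int8 flags byte right after the type byte, so their fixed offsets shift by one.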
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, the decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..776295c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
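To illustrate why LookupGXact compares more than the GID, here is a simplified, standalone sketch of the same matching rule over a plain array. The PreparedXactEntry struct and lookup_prepared() are hypothetical stand-ins; the real function walks TwoPhaseState under TwoPhaseStateLock and reads origin_lsn and origin_timestamp from the 2PC file header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a prepared-transaction entry. */
typedef struct PreparedXactEntry
{
	bool		valid;
	const char *gid;
	uint64_t	origin_lsn;			/* end LSN of the remote PREPARE */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
} PreparedXactEntry;

/*
 * Return true only if a valid entry matches the GID *and* the origin LSN
 * *and* the origin timestamp. A GID-only match keeps scanning, because a
 * different prepared xact may be using the same GID.
 */
static bool
lookup_prepared(const PreparedXactEntry *entries, size_t n,
				const char *gid, uint64_t prepare_end_lsn,
				int64_t origin_prepare_timestamp)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		const PreparedXactEntry *e = &entries[i];

		/* Skip not-yet-valid entries and entries with another GID. */
		if (!e->valid || strcmp(e->gid, gid) != 0)
			continue;

		if (e->origin_lsn == prepare_end_lsn &&
			e->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```

With two valid entries sharing the GID "gid-1" but carrying different origin LSNs, only the caller that supplies the matching LSN and timestamp pair gets true.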
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5c84d75..9b941e9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1254,5 +1254,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..8c0e6a8 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica. See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -505,10 +556,35 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool		twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Special case: when tables were specified but copy_data is
+				 * false, it is safe to enable two_phase up-front because
+				 * those tables are already initially in READY state. Note, if
+				 * the subscription has no tables then enablement cannot be
+				 * done here - we must leave the twophase state as PENDING, to
+				 * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -814,7 +890,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +925,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +954,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +1000,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1017,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1059,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1077,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1117,33 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
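The two-phase state handling in this file can be condensed into the sketch below, in case it helps review. It assumes a hypothetical three-value enum in place of the LOGICALREP_TWOPHASE_STATE_* constants, and the function names are made up for illustration; it only mirrors the conditions visible in CreateSubscription and in the ALTER SUBSCRIPTION ... REFRESH checks above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum TwoPhaseState
{
	TWOPHASE_DISABLED,
	TWOPHASE_PENDING,
	TWOPHASE_ENABLED
} TwoPhaseState;

/*
 * State recorded for a newly created subscription. two_phase starts as
 * PENDING and is promoted to ENABLED up-front only when a slot is created,
 * copy_data is false, and at least one table is already known.
 */
static TwoPhaseState
initial_twophase_state(bool twophase, bool create_slot, bool copy_data,
					   size_t ntables)
{
	if (!twophase)
		return TWOPHASE_DISABLED;
	if (create_slot && !copy_data && ntables > 0)
		return TWOPHASE_ENABLED;
	return TWOPHASE_PENDING;
}

/*
 * A refresh is rejected only when two-phase is fully enabled and the
 * refresh would copy data (i.e. start a new table synchronization).
 */
static bool
refresh_allowed(TwoPhaseState state, bool copy_data)
{
	assert(state >= TWOPHASE_DISABLED && state <= TWOPHASE_ENABLED);
	return !(state == TWOPHASE_ENABLED && copy_data);
}
```

Note that a subscription with no tables stays PENDING even with copy_data = false, which is what keeps a later REFRESH PUBLICATION possible.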
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..ccde3bc 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The below code is a temporary hack to check
+		 * for server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +847,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +861,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..58b4e2c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c387997 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
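The BEGIN PREPARE layout written by logicalrep_write_begin_prepare() above (type byte, prepare LSN, end LSN, prepare time, xid, then a NUL-terminated GID) can be sketched outside the backend as below. This is only an illustrative stand-alone sketch, not the patch's code: every Demo*/demo_* name, the 'b' type byte and the 200-byte GID buffer are assumptions made for the example, and it hand-rolls the big-endian encoding instead of calling the pq_send*/pq_getmsg* routines.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MSG_BEGIN_PREPARE 'b'	/* placeholder type byte (assumption) */
#define DEMO_GIDSIZE 200			/* placeholder buffer size (assumption) */

typedef struct DemoBeginPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[DEMO_GIDSIZE];
} DemoBeginPrepare;

/* Store the low nbytes bytes of v, most significant byte first. */
static size_t
demo_put(unsigned char *buf, size_t off, uint64_t v, int nbytes)
{
	for (int i = nbytes - 1; i >= 0; i--)
		buf[off++] = (unsigned char) (v >> (8 * i));
	return off;
}

/* Load an nbytes-wide big-endian integer and advance *off. */
static uint64_t
demo_get(const unsigned char *buf, size_t *off, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[(*off)++];
	return v;
}

/* Serialize one message; returns bytes written, or 0 if buf is too small. */
size_t
demo_write_begin_prepare(unsigned char *buf, size_t bufsize,
						 const DemoBeginPrepare *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (bufsize < 1 + 8 + 8 + 8 + 4 + gidlen)
		return 0;
	buf[off++] = DEMO_MSG_BEGIN_PREPARE;
	off = demo_put(buf, off, m->prepare_lsn, 8);
	off = demo_put(buf, off, m->end_lsn, 8);
	off = demo_put(buf, off, (uint64_t) m->prepare_time, 8);
	off = demo_put(buf, off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Parse one message; returns 0 on success, -1 on any malformed input. */
int
demo_read_begin_prepare(const unsigned char *buf, size_t len,
						DemoBeginPrepare *m)
{
	size_t		off = 1;		/* skip the type byte */
	const unsigned char *nul;

	if (len < 1 + 8 + 8 + 8 + 4 + 1 || buf[0] != DEMO_MSG_BEGIN_PREPARE)
		return -1;
	m->prepare_lsn = demo_get(buf, &off, 8);
	m->end_lsn = demo_get(buf, &off, 8);
	m->prepare_time = (int64_t) demo_get(buf, &off, 8);
	m->xid = (uint32_t) demo_get(buf, &off, 4);
	if (m->prepare_lsn == 0 || m->end_lsn == 0)
		return -1;				/* mirrors the InvalidXLogRecPtr checks */

	/* The GID must be NUL-terminated within the message and fit the buffer. */
	nul = memchr(buf + off, '\0', len - off);
	if (nul == NULL || (size_t) (nul - (buf + off)) >= sizeof(m->gid))
		return -1;
	memcpy(m->gid, buf + off, (size_t) (nul - (buf + off)) + 1);
	return 0;
}
```

The reader in this sketch rejects a GID that does not fit the destination buffer instead of copying it blindly, which is the same concern a bounded copy addresses in the read functions above.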
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b0ab91c..b50aa24 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2548,7 +2548,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2639,7 +2639,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2686,7 +2686,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2706,7 +2706,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2725,19 +2725,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2755,12 +2756,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * Prepared transactions that were skipped, either because two-phase was
+	 * not yet enabled or because they are not covered by the initial
+	 * snapshot, need to be sent later along with commit prepared, and they
+	 * must be before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..4a9275d 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,139 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * If there are no tables, return false. Otherwise, all tablesyncs are
+	 * READY when none of them is busy.
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	if (!IsTransactionState())
+		StartTransactionCommand();
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set the two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+
+	CommitTransactionCommand();
+}
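The readiness rule implemented by FetchTableStates() and AllTablesyncsReady() above, and the PENDING to ENABLED transition it gates, can be summarised in a small stand-alone sketch. This is an illustration under stated assumptions rather than the patch's code: the Demo* enum and demo_* functions are hypothetical names, the table counts stand in for the catalog lookups, and the real LOGICALREP_TWOPHASE_STATE_* constants are chars whose values are not shown in this part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder for the tri-state stored in pg_subscription (assumption). */
typedef enum DemoTwoPhaseState
{
	DEMO_TWOPHASE_DISABLED,
	DEMO_TWOPHASE_PENDING,
	DEMO_TWOPHASE_ENABLED
} DemoTwoPhaseState;

/*
 * Same rule as AllTablesyncsReady(): a subscription with no tables is never
 * "all ready"; otherwise it is ready only when no table is still syncing.
 */
bool
demo_all_tablesyncs_ready(int total_tables, int not_ready_tables)
{
	bool		has_subrels = total_tables > 0;
	bool		found_busy = not_ready_tables > 0;

	return has_subrels && !found_busy;
}

/*
 * State an apply worker would end up with at start-up: only PENDING can move
 * forward, and only once every tablesync is READY.  DISABLED and ENABLED are
 * returned unchanged.
 */
DemoTwoPhaseState
demo_state_at_worker_start(DemoTwoPhaseState current,
						   int total_tables, int not_ready_tables)
{
	if (current == DEMO_TWOPHASE_PENDING &&
		demo_all_tablesyncs_ready(total_tables, not_ready_tables))
		return DEMO_TWOPHASE_ENABLED;

	return current;
}
```

Note how a subscription with zero tables stays PENDING indefinitely in this sketch, which matches the comment in process_syncing_tables_for_apply() about keeping REFRESH PUBLICATION usable.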
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6ba447e..ef8c38f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that it was initially on and
+ * we have received some prepare, then we turn it off; now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we could allow changing this option from off to on, but
+ * it is not clear that that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -794,6 +870,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from same node point to publications on
+	 * the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2032,6 +2282,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2509,6 +2775,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2995,6 +3264,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(subid != InvalidOid);
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3161,15 +3444,67 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when two_phase mode is requested by the user, the tri-state
+		 * stays PENDING until all tablesyncs have reached READY state. Only
+		 * then can it become ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s.",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
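
For reference, the version selection that the FIXME comment in ApplyWorkerMain() describes could look roughly like the sketch below once PG_VERSION_NUM is bumped. This is only an illustration: choose_proto_version() is a hypothetical helper that is not part of the patch, and the three constants are copied from the patch's logicalproto.h.

```c
#include <assert.h>

/* Values copied from the patch's logicalproto.h. */
#define LOGICALREP_PROTO_VERSION_NUM 1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3

/*
 * Hypothetical helper (not in the patch): pick the logical replication
 * protocol version to request, given the publisher's server version number.
 */
int
choose_proto_version(int server_version_num)
{
	assert(server_version_num > 0);

	if (server_version_num >= 150000)
		return LOGICALREP_PROTO_TWOPHASE_VERSION_NUM;
	if (server_version_num >= 140000)
		return LOGICALREP_PROTO_STREAM_VERSION_NUM;
	return LOGICALREP_PROTO_VERSION_NUM;
}
```

With this shape, a PG 14 publisher would be asked for the streaming protocol version rather than the two-phase one, which is what the temporary hack cannot express.
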
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin; whether to enable it is decided at a later point. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
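
The order of the two_phase checks in pgoutput_startup() may be easier to follow in isolation. The following is a minimal standalone sketch, assuming the same three inputs the patch looks at (data->two_phase, data->protocol_version and ctx->twophase). The enum and the function check_two_phase_option() are hypothetical names used only for illustration; the patch itself raises ereport(ERROR) where the sketch returns an error value.

```c
#include <assert.h>
#include <stdbool.h>

/* Value copied from the patch's logicalproto.h. */
#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3

/* Hypothetical result codes; the patch uses ereport(ERROR) for the two errors. */
typedef enum TwoPhaseCheck
{
	TWOPHASE_NOT_REQUESTED,			/* twophase_opt_given = false */
	TWOPHASE_ERR_PROTO_TOO_OLD,		/* proto_version below 3 */
	TWOPHASE_ERR_PLUGIN_UNSUPPORTED,	/* plugin has no two-phase support */
	TWOPHASE_REQUESTED				/* twophase_opt_given = true */
} TwoPhaseCheck;

/* Hypothetical helper mirroring the if/else chain in pgoutput_startup(). */
TwoPhaseCheck
check_two_phase_option(bool two_phase_opt, int protocol_version,
					   bool plugin_supports_twophase)
{
	assert(protocol_version >= 1);

	if (!two_phase_opt)
		return TWOPHASE_NOT_REQUESTED;
	if (protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
		return TWOPHASE_ERR_PROTO_TOO_OLD;
	if (!plugin_supports_twophase)
		return TWOPHASE_ERR_PLUGIN_UNSUPPORTED;
	return TWOPHASE_REQUESTED;
}
```

Note that the protocol version is tested before plugin support, so a client that asks for two_phase with proto_version 2 gets the version error even if the plugin could not have supported it anyway.
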
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 339c393..cdfd063 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4359,6 +4360,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4402,9 +4404,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The below code is a temporary hack to check for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4425,6 +4443,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4450,6 +4469,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4477,6 +4498,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4518,6 +4540,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3e39fdb..920f083 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6598c53..194c322 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare, and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
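
To illustrate the comment on LogicalRepRollbackPreparedTxnData: the subscriber should only apply a ROLLBACK PREPARED when the gid, the prepare end LSN and the prepare time all match a local prepared transaction. Below is a rough standalone sketch of that comparison. PreparedXactInfo and rollback_matches_prepared() are hypothetical names, plain integer types stand in for XLogRecPtr and TimestampTz, and the real lookup in the patch is LookupGXact(), whose body is not shown in this part of the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for what the subscriber knows about a prepared xact. */
typedef struct PreparedXactInfo
{
	const char *gid;			/* global transaction identifier */
	uint64_t	prepare_end_lsn;	/* stands in for XLogRecPtr */
	int64_t		prepare_time;	/* stands in for TimestampTz */
} PreparedXactInfo;

/*
 * Return true only if all three fields match. Matching on gid alone is not
 * enough, because the subscriber can hold an unrelated prepared transaction
 * with the same identifier.
 */
bool
rollback_matches_prepared(const PreparedXactInfo *local, const char *gid,
						  uint64_t prepare_end_lsn, int64_t prepare_time)
{
	assert(local != NULL && gid != NULL);

	return strcmp(local->gid, gid) == 0 &&
		local->prepare_end_lsn == prepare_end_lsn &&
		local->prepare_time == prepare_time;
}
```

If the function returns false, the apply worker can skip the rollback, since the prepare it refers to was never received by this subscriber.
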
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 53cdfa5..86628c7 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -311,7 +311,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -650,7 +654,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..91d9032
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,291 @@
+# Test logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 19;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction has ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..2bea214
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,232 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..2cfc1ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,9 +1388,11 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

#337vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#335)

On Fri, May 28, 2021 at 9:14 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Wed, May 26, 2021 at 6:53 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, May 25, 2021 at 8:54 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Fri, May 21, 2021 at 6:43 PM Peter Smith <smithpb2250@gmail.com> wrote:

Fixed in v77-0001, v77-0002

Attaching a new patch-set that rebases the patch and addresses review
comments from Peter as well as a test failure reported by Tang. I've
also added a new test case, authored by Tang, to patch-2.

Thanks for the updated patch. A few comments:
1) Should "The end LSN of the prepare." be changed to "end LSN of the
prepare transaction."?

No, this is the end LSN of the prepare. The prepare consists of multiple LSNs.
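
For reference, the Begin Prepare message that carries both LSNs is laid out in this patch's protocol.sgml changes as Byte1('b'), Int64 prepare LSN, Int64 end LSN, Int64 prepare timestamp, Int32 xid, and a GID string. Below is a minimal sketch of decoding it. It assumes network byte order and a NUL-terminated string, as elsewhere in the protocol; the helper name and the sample values are hypothetical and not part of the patch.

```python
import struct
from datetime import datetime, timedelta, timezone

# PostgreSQL timestamps count microseconds from 2000-01-01 (per the patch docs).
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

# Fixed-size fields after the 1-byte tag: prepare LSN, end LSN,
# prepare timestamp, xid. "!" selects network byte order (assumed).
_FIXED = struct.Struct("!QQqI")


def parse_begin_prepare(buf: bytes) -> dict:
    """Decode one Begin Prepare message (hypothetical helper)."""
    if buf[:1] != b"b":
        raise ValueError("not a Begin Prepare message")
    prepare_lsn, end_lsn, ts_us, xid = _FIXED.unpack_from(buf, 1)
    gid_start = 1 + _FIXED.size
    gid_end = buf.index(b"\x00", gid_start)  # GID assumed NUL-terminated
    return {
        "prepare_lsn": prepare_lsn,
        "prepare_end_lsn": end_lsn,
        "prepare_time": PG_EPOCH + timedelta(microseconds=ts_us),
        "xid": xid,
        "gid": buf[gid_start:gid_end].decode(),
    }


# Usage: build a message by hand (made-up values) and decode it again.
msg = b"b" + _FIXED.pack(0x16B3748, 0x16B3780, 0, 742) + b"test_prepared_tab_full\x00"
decoded = parse_begin_prepare(msg)
```

The two LSN fields are kept separate in the result because, as noted above, the prepare spans more than one LSN.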

2) Should the ";" be "," here?
+++ b/doc/src/sgml/catalogs.sgml
@@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
<row>
<entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>

No, I think the ";" is correct here; it connects multiple parts of the sentence.

3) Should end_lsn be commit_end_lsn?
+       prepare_data->commit_end_lsn = pq_getmsgint64(in);
+       if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
elog(ERROR, "end_lsn is not set in commit prepared message");
+       prepare_data->prepare_time = pq_getmsgint64(in);

Changed this.

4) This change is not required

diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 0dc460f..93c6731 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -29,5 +29,4 @@ typedef struct PGOutputData
bool            messages;
bool            two_phase;
} PGOutputData;
-

removed.

#endif /* PGOUTPUT_H */

5) Will the worker ever receive commit prepared/rollback prepared, given
that we have logic to skip commit prepared / rollback prepared in
pgoutput_rollback_prepared_txn and pgoutput_commit_prepared_txn:

+        * It is possible that we haven't received the prepare because
+        * the transaction did not have any changes relevant to this
+        * subscription and was essentially an empty prepare. In which case,
+        * the walsender is optimized to drop the empty transaction and the
+        * accompanying prepare. Silently ignore if we don't find the prepared
+        * transaction.
*/
-       replorigin_session_origin_lsn = prepare_data.end_lsn;
-       replorigin_session_origin_timestamp = prepare_data.commit_time;
+       if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+                                       prepare_data.prepare_time))
+       {

Commit prepared will be skipped only if it happens within the lifetime of
the same walsender that skipped the prepare. If the walsender restarts, it
no longer knows about the skipped prepare, so it will not skip the commit
prepared. Hence the logic for handling a stray commit prepared in the
apply worker.
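
The restart scenario described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class and method names are hypothetical, the walsender's knowledge of skipped prepares is modeled as an in-memory set, and the apply worker's LookupGXact() check is modeled as a set lookup.

```python
class WalSender:
    """Toy walsender: remembers skipped empty prepares only in memory."""

    def __init__(self):
        self.skipped_gids = set()  # lost whenever the walsender restarts

    def on_prepare(self, gid, has_changes):
        if not has_changes:
            # Empty prepare: drop it and remember that we did.
            self.skipped_gids.add(gid)
            return None
        return ("PREPARE", gid)

    def on_commit_prepared(self, gid):
        if gid in self.skipped_gids:
            # Same lifetime: the matching prepare was skipped, so skip this too.
            self.skipped_gids.discard(gid)
            return None
        return ("COMMIT PREPARED", gid)


class ApplyWorker:
    """Toy apply worker: silently ignores a commit for an unknown GID."""

    def __init__(self):
        self.prepared = set()

    def apply(self, msg):
        kind, gid = msg
        if kind == "PREPARE":
            self.prepared.add(gid)
            return "prepared"
        if gid in self.prepared:  # stand-in for the LookupGXact() check
            self.prepared.remove(gid)
            return "committed"
        return "ignored"  # stray commit prepared


# Usage: an empty prepare is skipped, then the walsender restarts.
sender = WalSender()
first = sender.on_prepare("empty_txn", has_changes=False)
sender = WalSender()  # restart: the skipped-prepare memory is gone
stray = sender.on_commit_prepared("empty_txn")
outcome = ApplyWorker().apply(stray)
```

In this model the restarted walsender forwards the commit prepared, and the worker ignores it because it never saw the prepare.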

6) I'm not sure whether we can add tests for skipping empty prepare
transactions; if possible, please add a few.

I've added a test case for empty prepares using
pg_logical_slot_peek_binary_changes(); have a look.

7) We could add some debug level log messages for the transaction that
will be skipped.

If this is for the test, I was able to add a test without debug messages.

The idea here is to include debug logs that will help in analyzing any
bugs reported from an environment where debug access might not be
available.

Thanks for fixing the comments and posting an updated patch.

Regards,
Vignesh

#338Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#334)

On Thu, May 27, 2021 at 8:05 AM Ajin Cherian <itsajin@gmail.com> wrote:

Thanks for the confirmation. As you reported, the problem was a table
left open when a transaction was committed. This happened because
UpdateTwoPhaseState() committed a transaction internally while its caller,
CreateSubscription, still had a table open. The call was newly added to
CreateSubscription to handle the new case of enabling two_phase at
create subscription time when "copy_data = false". I don't think
CreateSubscription required this to be inside a transaction, and the
commit was only meant for the place the function was originally written
for, the apply worker code (ApplyWorkerMain()). So I removed the commit
from inside UpdateTwoPhaseState() and instead start and commit the
transaction before and after the call in the apply worker code.

You have made these changes in 0002 whereas they should be part of 0001.

One minor comment for 0001.
* Special case: if when tables were specified but copy_data is
+ * false then it is safe to enable two_phase up-front because
+ * those tables are already initially READY state. Note, if
+ * the subscription has no tables then enablement cannot be
+ * done here - we must leave the twophase state as PENDING, to
+ * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.

Can we slightly modify this comment as: "Note that if tables were
specified but copy_data is false then it is safe to enable two_phase
up-front because those tables are already initially READY state. When
the subscription has no tables, we leave the twophase state as
PENDING, to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work."

Also, I don't see any test for this special case now that it is enabled.
Is it covered by existing tests? If not, let's try to add a test for it.
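
The rule in the suggested comment can be summarized as a small decision function. This is a sketch only: the function name and parameters are hypothetical, and the state codes 'd', 'p' and 'e' are taken from the subtwophasestate documentation quoted earlier in this thread.

```python
def initial_twophase_state(two_phase_requested: bool,
                           has_tables: bool,
                           copy_data: bool) -> str:
    """Return the subtwophasestate code chosen at CREATE SUBSCRIPTION time."""
    if not two_phase_requested:
        return "d"  # two_phase not requested, so disabled
    if has_tables and not copy_data:
        # Tables start out READY, so two_phase can be enabled up-front.
        return "e"
    # No tables yet, or initial sync still to run: stay PENDING so that
    # ALTER SUBSCRIPTION ... REFRESH PUBLICATION keeps working.
    return "p"


# Usage: the special case discussed above.
special_case = initial_twophase_state(True, has_tables=True, copy_data=False)
```

A test for the special case would then only need to check that the subscription reports the enabled state immediately after creation.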

--
With Regards,
Amit Kapila.

#339Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#338)
3 attachment(s)

On Fri, May 28, 2021 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 27, 2021 at 8:05 AM Ajin Cherian <itsajin@gmail.com> wrote:

Thanks for the confirmation. As you reported, the problem was a table
left open when a transaction was committed. This happened because
UpdateTwoPhaseState() committed a transaction internally while its caller,
CreateSubscription, still had a table open. The call was newly added to
CreateSubscription to handle the new case of enabling two_phase at
create subscription time when "copy_data = false". I don't think
CreateSubscription required this to be inside a transaction, and the
commit was only meant for the place the function was originally written
for, the apply worker code (ApplyWorkerMain()). So I removed the commit
from inside UpdateTwoPhaseState() and instead start and commit the
transaction before and after the call in the apply worker code.

You have made these changes in 0002 whereas they should be part of 0001.

One minor comment for 0001.
* Special case: if when tables were specified but copy_data is
+ * false then it is safe to enable two_phase up-front because
+ * those tables are already initially READY state. Note, if
+ * the subscription has no tables then enablement cannot be
+ * done here - we must leave the twophase state as PENDING, to
+ * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.

Can we slightly modify this comment as: "Note that if tables were
specified but copy_data is false then it is safe to enable two_phase
up-front because those tables are already initially READY state. When
the subscription has no tables, we leave the twophase state as
PENDING, to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work."

Created v81: rebased to head, and corrected the patch-set so that the fix
as well as Tang's test cases are now part of patch-1. Also added the
minor comment update suggested above.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v81-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 9325c2335ce58f1b077efee901871acee0eb3385 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Sun, 30 May 2021 22:43:18 -0400
Subject: [PATCH v81] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/logicaldecoding.sgml                  |   6 +-
 doc/src/sgml/protocol.sgml                         | 307 +++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 149 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  10 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 ++++++++++--
 src/backend/replication/logical/worker.c           | 343 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 299 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 247 +++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 44 files changed, 2386 insertions(+), 206 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 1649320..c5e078f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7644,6 +7644,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index d2c6e15..33a3b81 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1253,9 +1253,9 @@ stream_commit_cb(...);  &lt;-- commit of the streamed transaction
       <para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this, users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>
      </listitem>
     </itemizedlist>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..6683929 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7365,6 +7386,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See the <literal>subtwophasestate</literal> column of
+   <xref linkend="catalog-pg-subscription"/> for the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used together with
+          the <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at <command>PREPARE TRANSACTION</command>. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that replication has
+          successfully passed the initial table synchronization phase. This means
+          that even when <literal>two_phase</literal> is enabled for the subscription,
+          the internal two-phase state remains temporarily "pending" until the
+          initialization phase is completed. See the
+          <literal>subtwophasestate</literal> column of <xref linkend="catalog-pg-subscription"/>
+          for the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used together with
+          the <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..776295c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, LSN and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching prepared xacts
+ * that came from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5c84d75..9b941e9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1254,5 +1254,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..9788f01 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica. See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -505,10 +556,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool		twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. It is enabled once all the tables are
+				 * synced and ready. This avoids race conditions such as
+				 * prepared transactions being skipped because their changes
+				 * are not applied by should_apply_changes_for_rel() while
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -814,7 +889,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +924,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +953,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +999,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1016,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details on why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1058,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1076,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details on why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1116,33 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything which could interfere with the apply
+				 * worker's message handling.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..ccde3bc 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The code below is a temporary hack to check for
+		 * server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +847,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +861,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..58b4e2c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,9 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +733,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c387997 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared, and other
+	 * transactions commit between the prepare and the commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..da0e5e8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2552,7 +2552,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2643,7 +2643,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2690,7 +2690,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2710,7 +2710,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2729,19 +2729,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2759,12 +2760,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions, that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot, need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..e3cbe32 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * Once they have, we restart the apply worker and (if the conditions are
+	 * still ok) the two_phase tri-state will become properly 'enabled' at
+	 * that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		found_busy = false;
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	found_busy = list_length(table_states_not_ready) > 0;
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * Return false when there are no tables. Otherwise, when no tablesyncs
+	 * are busy, all of them are READY.
+	 */
+	return has_subrels && !found_busy;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6ba447e..98a57e7 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that initially it was on and
+ * we have received some prepare, and then we turn it off: now at commit time
+ * the server will send the entire transaction data along with the commit.
+ * With some more analysis, we can allow changing this option from off to on
+ * but not sure if that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get a prepare of the same GID more than once in the genuine case
+ * where we have defined multiple subscriptions for publications on the same
+ * server and a prepared transaction has operations on tables subscribed to by
+ * those subscriptions. In such cases, if we use the GID sent by the publisher,
+ * one of the prepares will succeed and the others will fail, in which case
+ * the server will send them again. Now, this can lead to a deadlock if the
+ * user has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting
+ * of the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -794,6 +870,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2032,6 +2282,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2509,6 +2775,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2995,6 +3264,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3161,15 +3444,69 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
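For anyone reviewing the PENDING -> ENABLED transition in the worker.c hunk above, here is a minimal standalone sketch of the decision it makes. This is not code from the patch: decide_two_phase and TwoPhaseDecision are hypothetical names, and the sketch assumes the caller already knows whether it is a tablesync worker and what AllTablesyncsReady() would return. It only mirrors what the hunk shows, namely that two_phase is requested at START_REPLICATION time solely when the state is PENDING and all tablesyncs are READY.

```c
#include <stdbool.h>

/* State letters copied from the LOGICALREP_TWOPHASE_STATE_* defines in this patch. */
#define TWOPHASE_STATE_DISABLED 'd'
#define TWOPHASE_STATE_PENDING  'p'
#define TWOPHASE_STATE_ENABLED  'e'

/* Hypothetical result type, used only for this illustration. */
typedef struct TwoPhaseDecision
{
	bool		request_twophase;	/* value for options.proto.logical.twophase */
	char		new_state;			/* state the subscription holds afterwards */
} TwoPhaseDecision;

/*
 * Hypothetical helper mirroring the branch in the hunk above: only an apply
 * worker whose subscription is PENDING, with every tablesync READY, starts
 * streaming with two_phase and flips the state to ENABLED. In every other
 * case the state is left untouched and two_phase is not requested.
 */
static TwoPhaseDecision
decide_two_phase(char current_state, bool is_tablesync_worker,
				 bool all_tablesyncs_ready)
{
	TwoPhaseDecision d = {false, current_state};

	if (!is_tablesync_worker &&
		current_state == TWOPHASE_STATE_PENDING &&
		all_tablesyncs_ready)
	{
		d.request_twophase = true;
		d.new_state = TWOPHASE_STATE_ENABLED;
	}
	return d;
}
```

Per the Note in the hunk, a subscription with no tables stays PENDING, so in this sketch all_tablesyncs_ready would be passed as false for it.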
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
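The pgoutput.c changes above validate a two_phase request in two places (parse_output_parameters and pgoutput_startup). As a reading aid, the combined outcome can be sketched as one standalone function. This is an illustration only: check_two_phase_request and the TwoPhaseCheck enum are hypothetical, the real code raises ereport(ERROR) instead of returning a code, and the sketch assumes the checks run in the order they appear in the patch.

```c
#include <stdbool.h>

/* Same value as LOGICALREP_PROTO_TWOPHASE_VERSION_NUM in this patch. */
#define PROTO_TWOPHASE_VERSION 3

/* Hypothetical outcome codes; the patch reports the error cases via ereport(). */
typedef enum TwoPhaseCheck
{
	TWOPHASE_NOT_REQUESTED,
	TWOPHASE_ACCEPTED,
	TWOPHASE_ERR_WITH_STREAMING,
	TWOPHASE_ERR_PROTO_TOO_OLD,
	TWOPHASE_ERR_NO_PLUGIN_SUPPORT
} TwoPhaseCheck;

static TwoPhaseCheck
check_two_phase_request(bool two_phase, bool streaming,
						int protocol_version, bool plugin_supports_twophase)
{
	/* Option parsing: two_phase and streaming are mutually exclusive. */
	if (two_phase && streaming)
		return TWOPHASE_ERR_WITH_STREAMING;

	/* Startup: nothing more to check when the option was not given. */
	if (!two_phase)
		return TWOPHASE_NOT_REQUESTED;

	/* The requested protocol version must be new enough. */
	if (protocol_version < PROTO_TWOPHASE_VERSION)
		return TWOPHASE_ERR_PROTO_TOO_OLD;

	/* The decoding context must report two-phase support (ctx->twophase). */
	if (!plugin_supports_twophase)
		return TWOPHASE_ERR_NO_PLUGIN_SUPPORT;

	return TWOPHASE_ACCEPTED;
}
```

TWOPHASE_ACCEPTED corresponds to ctx->twophase_opt_given = true in the patch, and TWOPHASE_NOT_REQUESTED to ctx->twophase_opt_given = false.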
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 8f53cc7..8141311 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4304,6 +4305,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4347,9 +4349,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The code below is a temporary hack to check for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4370,6 +4388,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4395,6 +4414,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4422,6 +4443,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4463,6 +4485,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 195f8d8..14623d5 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6598c53..194c322 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare, and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
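The comment on LogicalRepRollbackPreparedTxnData above says the gid alone is not enough to decide whether a ROLLBACK PREPARED should be applied downstream. A minimal standalone sketch of that matching rule follows. It is an assumption-laden illustration, not the patch's LookupGXact(): LocalPrepared and should_apply_rollback are hypothetical, and plain integer types stand in for XLogRecPtr and TimestampTz.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of a prepared transaction known to the subscriber. */
typedef struct LocalPrepared
{
	const char *gid;
	uint64_t	prepare_end_lsn;	/* stand-in for XLogRecPtr */
	int64_t		prepare_time;		/* stand-in for TimestampTz */
} LocalPrepared;

/*
 * Return true only if a local prepared transaction matches on all three of
 * gid, prepare end LSN and prepare timestamp. A gid-only match could hit an
 * unrelated local transaction that happens to reuse the same identifier.
 */
static bool
should_apply_rollback(const LocalPrepared *local, size_t nlocal,
					  const char *gid, uint64_t prepare_end_lsn,
					  int64_t prepare_time)
{
	for (size_t i = 0; i < nlocal; i++)
	{
		if (strcmp(local[i].gid, gid) == 0 &&
			local[i].prepare_end_lsn == prepare_end_lsn &&
			local[i].prepare_time == prepare_time)
			return true;
	}
	return false;
}
```

When no entry matches, the subscriber in this sketch would simply skip the rollback, which is the behaviour the struct comment describes.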
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..109000d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -296,7 +296,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -635,7 +639,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..90430f4
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,299 @@
+# Test logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 19;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..ac27384
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,247 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# Wait for the statistics to be updated
+$node_A->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_b'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+$node_B->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_c'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..2cfc1ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,9 +1388,11 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

Attachment: v81-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 29b834365ae1e8c273fe6b8ae4d53625ea4841f6 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Sun, 30 May 2021 22:56:18 -0400
Subject: [PATCH v81] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 ++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 459 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 285 +++++++++++++
 11 files changed, 1035 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

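Reviewer note (not part of the commit): the Stream Prepare message that this
patch documents in protocol.sgml has the field order Byte1('p'), Int8 flags,
Int64 prepare LSN, Int64 end LSN, Int64 prepare timestamp, Int32 xid, and a
NUL-terminated GID string. The following is a minimal standalone sketch of that
layout, assuming the network byte order used by the pq_sendint* helpers. Every
name prefixed with sketch_ or Sketch is hypothetical and not a PostgreSQL API,
and SKETCH_GIDSIZE is an assumed stand-in for GIDSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: stands in for PostgreSQL's GIDSIZE; not taken from the patch. */
#define SKETCH_GIDSIZE 200

/* Hypothetical in-memory form of one Stream Prepare message. */
typedef struct StreamPrepareSketch
{
	uint8_t		flags;			/* currently unused, must be 0 */
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* transaction id */
	char		gid[SKETCH_GIDSIZE];	/* NUL-terminated GID */
} StreamPrepareSketch;

/* Store the low nbytes of v in big-endian order; returns nbytes. */
static size_t
put_be(uint8_t *buf, uint64_t v, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return nbytes;
}

/* Load nbytes (at most 8) of big-endian data. */
static uint64_t
get_be(const uint8_t *buf, size_t nbytes)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Serialize msg into buf. Returns bytes written, or 0 if buf is too small. */
size_t
sketch_write_stream_prepare(uint8_t *buf, size_t buflen,
							const StreamPrepareSketch *msg)
{
	size_t		gidlen;
	size_t		off = 0;

	/* The GID must be NUL-terminated inside its fixed-size array. */
	assert(memchr(msg->gid, '\0', SKETCH_GIDSIZE) != NULL);

	gidlen = strlen(msg->gid) + 1;	/* include the terminator */
	if (buflen < 1 + 1 + 8 + 8 + 8 + 4 + gidlen)
		return 0;

	buf[off++] = 'p';			/* message type byte */
	buf[off++] = msg->flags;
	off += put_be(buf + off, msg->prepare_lsn, 8);
	off += put_be(buf + off, msg->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) msg->prepare_time, 8);
	off += put_be(buf + off, msg->xid, 4);
	memcpy(buf + off, msg->gid, gidlen);
	return off + gidlen;
}

/* Parse buf into msg. Returns 0 on success, -1 if the message is malformed. */
int
sketch_read_stream_prepare(const uint8_t *buf, size_t len,
						   StreamPrepareSketch *msg)
{
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;	/* bytes before the GID */
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	if (len <= fixed || buf[0] != 'p')
		return -1;
	if (buf[1] != 0)			/* flags are unused and must be 0 */
		return -1;

	msg->flags = buf[1];
	msg->prepare_lsn = get_be(buf + 2, 8);
	msg->end_lsn = get_be(buf + 10, 8);
	msg->prepare_time = (int64_t) get_be(buf + 18, 8);
	msg->xid = (uint32_t) get_be(buf + 26, 4);

	/* Bounded copy of the GID: reject unterminated or oversized strings. */
	gid = buf + fixed;
	nul = memchr(gid, '\0', len - fixed);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= SKETCH_GIDSIZE)
		return -1;
	memcpy(msg->gid, gid, gidlen + 1);
	return 0;
}
```

A caller would fill a StreamPrepareSketch, pass it to
sketch_write_stream_prepare() with a buffer of at least 30 bytes plus the GID
length and its terminator, and hand the result to sketch_read_stream_prepare()
on the other side. The reader bounds the GID copy explicitly, which is a design
choice of this sketch rather than a description of what proto.c does.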
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 6683929..19cd6a2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7386,7 +7386,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7649,6 +7649,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9788f01..ee15b0e 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -924,12 +909,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 98a57e7..40b00c9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1044,6 +1044,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on one node point to publications
+	 * on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1241,30 +1321,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1272,7 +1343,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1287,7 +1358,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1362,6 +1433,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2298,6 +2394,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..b2d52cd
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,459 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..9d5c6f5
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,285 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# Wait for the statistics to be updated
+$node_A->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_b'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+$node_B->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_c'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

Attachment: v81-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From 6af495f832b8f767e4dce49f388c9c2e1e9afb39 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Sun, 30 May 2021 23:05:53 -0400
Subject: [PATCH v81] Skip empty transactions for logical replication.

The current logical replication behavior is to send every transaction to
the subscriber even if the transaction is empty (because it does not
contain changes from the selected publications). Building and transmitting
these empty transactions wastes CPU cycles and network bandwidth.

This patch addresses the problem by postponing the BEGIN / BEGIN PREPARE
message until the first change is sent. When processing a COMMIT or PREPARE
message, if no change was sent for that transaction, the COMMIT or PREPARE
message is not sent either. This means that pgoutput skips the
BEGIN / COMMIT or BEGIN PREPARE / PREPARE messages for empty transactions.
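
The deferred-BEGIN idea described above can be sketched roughly as follows.
This is only an illustration and not the code in this patch: TxnState,
begin_sent, emit() and the on_*() handlers are hypothetical names, and the
"wire" is just a string buffer rather than the real pgoutput protocol.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-transaction state; not the struct used by the patch. */
typedef struct TxnState
{
	bool		begin_sent;		/* has BEGIN been emitted for this txn? */
} TxnState;

/* Stand-in for the output stream: messages are appended as "MSG;". */
static char out[256];

static void
emit(const char *msg)
{
	strcat(out, msg);
	strcat(out, ";");
}

/* BEGIN callback: send nothing yet, just remember that BEGIN is pending. */
static void
on_begin(TxnState *txn)
{
	txn->begin_sent = false;
}

/* Change callback: emit the postponed BEGIN before the first sent change. */
static void
on_change(TxnState *txn, bool published)
{
	if (!published)
		return;					/* change filtered out by the publication */

	if (!txn->begin_sent)
	{
		emit("BEGIN");
		txn->begin_sent = true;
	}
	emit("CHANGE");
}

/* COMMIT callback: skip COMMIT if BEGIN was never sent (empty txn). */
static void
on_commit(TxnState *txn)
{
	if (!txn->begin_sent)
		return;
	emit("COMMIT");
}
```

With this shape, a transaction whose changes are all filtered out produces
no output at all, while a transaction with at least one published change
produces the usual BEGIN / change / COMMIT sequence. The same flag would be
consulted for the BEGIN PREPARE / PREPARE pair.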

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  36 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 141 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  41 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 266 insertions(+), 35 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 33a3b81..67c3268 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -865,11 +865,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command> in which case
+      it can commit the transaction, otherwise, it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 19cd6a2..ae2cd11 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+<varlistentry>
+
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit.
 </para></listitem>
 </varlistentry>
@@ -7552,6 +7559,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Timestamp of the prepare. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c387997..ed60719 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -940,7 +941,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -975,7 +977,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index da0e5e8..282da49 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2770,7 +2770,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 40b00c9..f7db5ef 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -971,26 +971,38 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* there is no transaction when COMMIT PREPARED is called */
-	ensure_transaction();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In which case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	CommitTransactionCommand();
+		/* there is no transaction when COMMIT PREPARED is called */
+		ensure_transaction();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7c3a33d..84e9cfe 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -135,6 +137,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -404,10 +411,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -422,8 +451,18 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -435,10 +474,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -453,8 +510,15 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -465,12 +529,28 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -483,8 +563,21 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -613,11 +706,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -651,6 +749,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -750,6 +857,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -757,6 +865,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -793,6 +905,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -813,6 +934,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -825,6 +947,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 9b3e934..a6d9977 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 109000d..7cf4499 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -441,7 +441,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 90430f4..3428c6d 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 19;
+use Test::More tests => 20;
 
 ###############################
 # Setup
@@ -277,6 +277,45 @@ $node_publisher->wait_for_catchup($appname);
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, qq(0), 'transaction is aborted on subscriber');
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+		"CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot.
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_nopub SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'empty_transaction';
+	COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+	'postgres', qq(
+		SELECT get_byte(data, 0)
+		FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+			'proto_version', '1',
+			'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+	'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2cfc1ae..f0941ad 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1596,6 +1596,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
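
For readers skimming the v81-0003 patch above, the deferred-BEGIN idea can be sketched in isolation. This is only a minimal illustration, not the pgoutput code: `TxnState`, `Sink`, `sink_emit` and the single-character message tags are hypothetical stand-ins for `PGOutputTxnData` and the real protocol writers, and the "published" flag stands in for the publication filtering that pgoutput does per relation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PGOutputTxnData in the patch. */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been emitted yet? */
} TxnState;

/* Hypothetical output sink: each message is recorded as one character. */
typedef struct Sink
{
	char		buf[64];
	size_t		len;
} Sink;

static void
sink_emit(Sink *s, char msg)
{
	assert(s->len + 1 < sizeof(s->buf));
	s->buf[s->len++] = msg;
	s->buf[s->len] = '\0';
}

/* BEGIN callback: only remember that nothing has been sent so far. */
static void
txn_begin(TxnState *txn)
{
	txn->sent_begin = false;
}

/* Change callback: emit the postponed BEGIN before the first published change. */
static void
txn_change(TxnState *txn, Sink *out, bool published)
{
	if (!published)
		return;					/* filtered out, transaction may stay empty */

	if (!txn->sent_begin)
	{
		sink_emit(out, 'B');
		txn->sent_begin = true;
	}
	sink_emit(out, 'I');
}

/* COMMIT callback: skip COMMIT if BEGIN was never sent. Returns true if sent. */
static bool
txn_commit(TxnState *txn, Sink *out)
{
	if (!txn->sent_begin)
		return false;

	sink_emit(out, 'C');
	return true;
}
```

With this shape, a transaction that touches only unpublished tables produces no output at all, while a transaction with at least one published change produces BEGIN, the changes, and COMMIT in the usual order.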

#340Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#338)

On Fri, May 28, 2021 at 11:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

One minor comment for 0001.
* Special case: if when tables were specified but copy_data is
+ * false then it is safe to enable two_phase up-front because
+ * those tables are already initially READY state. Note, if
+ * the subscription has no tables then enablement cannot be
+ * done here - we must leave the twophase state as PENDING, to
+ * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.

Can we slightly modify this comment as: "Note that if tables were
specified but copy_data is false then it is safe to enable two_phase
up-front because those tables are already initially READY state. When
the subscription has no tables, we leave the twophase state as
PENDING, to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work."

Also, I don't see any test after you enable this special case. Is it
covered by existing tests, if not then let's try to add a test for
this?

I see that Ajin's latest patch has addressed the other comments except
for the above test case suggestion. I have again reviewed the first
patch and have some comments.

Comments on v81-0001-Add-support-for-prepared-transactions-to-built-i
============================================================================
1.
<para>
        The logical replication solution that builds distributed two phase commit
        using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command> command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this users must refrain from
+       having locks on catalog tables (via explicit <command>LOCK</command> command)
+       in such transactions.
       </para>

This change doesn't belong to this patch. I see the proposed text
could be considered as an improvement but still we can do this
separately. We are already trying to improve things in this regard in
the thread [1], so you can propose this change there.

2.
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit transaction.
+</para></listitem>
+</varlistentry>

Can we change the descriptions of the LSNs to "The LSN of the commit
prepared." and "The end LSN of the commit prepared transaction."
respectively? This will distinguish them from those of a regular
commit, and I think it defines them better.
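
To make the field order in the quoted documentation concrete, here is a rough sketch of the leading fields of the message: the 'K' type byte, the flags byte, and the two LSNs, each sent as a 64-bit integer in network byte order. The struct and function names are hypothetical, the remaining fields of the message are deliberately left out, and the zero check mirrors the InvalidXLogRecPtr test in the reader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the two LSN fields discussed above. */
typedef struct CommitPreparedHead
{
	uint64_t	commit_lsn;		/* "The LSN of the commit prepared." */
	uint64_t	end_lsn;		/* "The end LSN of the commit prepared transaction." */
} CommitPreparedHead;

/* Append a 64-bit value in network (big-endian) byte order. */
static size_t
put_u64(uint8_t *buf, size_t off, uint64_t v)
{
	for (int i = 7; i >= 0; i--)
		buf[off++] = (uint8_t) (v >> (8 * i));
	return off;
}

/* Read a 64-bit big-endian value and advance the offset. */
static uint64_t
get_u64(const uint8_t *buf, size_t *off)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | buf[(*off)++];
	return v;
}

/* Write Byte1('K'), Int8 flags (must be 0), Int64 LSN, Int64 end LSN. */
static size_t
write_commit_prepared_head(uint8_t *buf, const CommitPreparedHead *m)
{
	size_t		off = 0;

	assert(buf != NULL && m != NULL);
	buf[off++] = 'K';
	buf[off++] = 0;				/* flags; currently unused */
	off = put_u64(buf, off, m->commit_lsn);
	off = put_u64(buf, off, m->end_lsn);
	return off;
}

/* Returns 0 on success, -1 if the message is malformed. */
static int
read_commit_prepared_head(const uint8_t *buf, size_t len, CommitPreparedHead *m)
{
	size_t		off = 0;

	if (len < 18 || buf[off++] != 'K')
		return -1;
	if (buf[off++] != 0)		/* unrecognized flags */
		return -1;
	m->commit_lsn = get_u64(buf, &off);
	m->end_lsn = get_u64(buf, &off);
	if (m->commit_lsn == 0 || m->end_lsn == 0)	/* 0 is the invalid LSN */
		return -1;
	return 0;
}
```

The reader rejects a non-zero flags byte and unset LSNs, which is the same kind of sanity checking the patch does with elog(ERROR, ...).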

3.
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback transaction.
+</para></listitem>
+</varlistentry>

Similar to above, can we change the description here as: "The end LSN
of the rollback prepared transaction."?

4.
+ * The exception to this restriction is when copy_data =
+ * false, because when copy_data is false the tablesync will
+ * start already in READY state and will exit directly without
+ * doing anything which could interfere with the apply
+ * worker's message handling.
+ *
+ * For more details see comments atop worker.c.
+ */
+ if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+ errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+ ", or use DROP/CREATE SUBSCRIPTION.")));

The above comment is a bit unclear because it seems you are saying
there is some problem even when copy_data is false. Are you missing
'not' after 'could' in the comment?

5.
 XXX Now, this can even lead to a deadlock if the prepare
  * transaction is waiting to get it logically replicated for
- * distributed 2PC. Currently, we don't have an in-core
- * implementation of prepares for distributed 2PC but some
- * out-of-core logical replication solution can have such an
- * implementation. They need to inform users to not have locks
- * on catalog tables in such transactions.
+ * distributed 2PC. This can be avoided by disallowing to
+ * prepare transactions that have locked [user] catalog tables
+ * exclusively.

Can we slightly modify this part of the comment as: "This can be
avoided by disallowing to prepare transactions that have locked [user]
catalog tables exclusively but as of now we ask users not to do such
operation"?

6.
+AllTablesyncsReady(void)
+{
+ bool found_busy = false;
+ bool started_tx = false;
+ bool has_subrels = false;
+
+ /* We need up-to-date sync state info for subscription tables here. */
+ has_subrels = FetchTableStates(&started_tx);
+
+ found_busy = list_length(table_states_not_ready) > 0;
+
+ if (started_tx)
+ {
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+ }
+
+ /*
+ * When there are no tables, then return false.
+ * When no tablesyncs are busy, then all are READY
+ */
+ return has_subrels && !found_busy;
+}

Do we really need the found_busy variable in the above function? Can't we
change the return to (has_subrels) && (table_states_not_ready == NIL)?
If so, then change the comments above the return.
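
As a quick sanity check of the simplified return, here is a standalone sketch. `Node` is a hypothetical stand-in for a PostgreSQL List, with NULL playing the role of NIL, and the function takes the two inputs as parameters instead of calling FetchTableStates(). Note that `!found_busy` corresponds to the not-ready list being empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PostgreSQL List cell; NULL acts as NIL. */
typedef struct Node
{
	struct Node *next;
} Node;

/*
 * All tablesyncs are ready only if the subscription has at least one table
 * and the list of not-ready tables is empty.
 */
static bool
all_tablesyncs_ready(bool has_subrels, const Node *not_ready)
{
	return has_subrels && (not_ready == NULL);
}
```

The three interesting cases are: no tables at all (false), tables with an empty not-ready list (true), and tables with at least one not-ready entry (false).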

7.
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)

Can we update comments indicating that if this function starts the
transaction then the caller is responsible to commit it?

8.
(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+ MySubscription->name)));

Can we slightly change the message as: "logical replication apply
worker for subscription \"%s\" will restart so that two_phase can be
enabled"?

9.
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
{
..
+ /* And update/set two_phase ENABLED */
+ values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+ replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
..
}

The above comment seems wrong to me as we are updating the state as
passed by the caller.

[1]: /messages/by-id/20210222222847.tpnb6eg3yiykzpky@alap3.anarazel.de

--
With Regards,
Amit Kapila.

#341Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#340)
3 attachment(s)

Please find attached the latest patch set v82*

Differences from v81* are:

* Rebased to HEAD @ yesterday

* v82 addresses all of Amit's feedback comments from [1]; I will reply
to that mail separately with any details.

----
[1]: /messages/by-id/CAA4eK1Jd9sqWtt5kEJZL1ehJB2y_DFnvDjY9vJ51k8Wq6XWVyw@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v82-0001-Add-support-for-prepared-transactions-to-built-i.patch
From b9c8cf892ae4c1826d19188bf92f4d51b6ec8204 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 1 Jun 2021 18:32:29 +1000
Subject: [PATCH v82] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the below things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.

We don't support the below operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions is not supported.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 307 +++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 148 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 ++++++++++--
 src/backend/replication/logical/worker.c           | 343 ++++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  13 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  74 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 299 ++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 247 +++++++++++++++
 src/tools/pgindent/typedefs.list                   |   2 +
 43 files changed, 2383 insertions(+), 203 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 1649320..c5e078f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7644,6 +7644,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..8d4fdf3 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decode of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction 
+   contains Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7365,6 +7386,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on publisher is decoded as normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..776295c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5c84d75..9b941e9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1254,5 +1254,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 
 -- All columns of pg_subscription except subconninfo are readable.
 REVOKE ALL ON pg_subscription FROM public;
-GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subslotname, subpublications)
+GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subslotname, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..f8826fb 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * missing of transactions and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -505,10 +556,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables are in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -814,7 +889,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +924,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +953,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +999,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1016,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1058,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1076,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1116,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..ccde3bc 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The below code is a temporary hack to check for
+		 * server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +847,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +861,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..b106588 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparation of transactions that have locked [user] catalog
+				 * tables exclusively, but as of now we ask users not to perform
+				 * such operations.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c387997 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the start of streaming.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..da0e5e8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2552,7 +2552,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2643,7 +2643,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2690,7 +2690,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2710,7 +2710,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2729,19 +2729,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2759,12 +2760,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..f4290f6 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, return false.
+	 * When no tablesyncs are busy, all are READY.
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
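A minimal sketch of the readiness rule used by AllTablesyncsReady() above, and of the PENDING -> ENABLED promotion described in the worker.c comments of this patch. It is an illustration only, not code from the patch: the enum and every sketch_* name are hypothetical stand-ins for the LOGICALREP_TWOPHASE_STATE_* values and for the catalog-backed checks, and the table counts are passed in as plain arguments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for LOGICALREP_TWOPHASE_STATE_*; not the patch's values. */
typedef enum
{
	SKETCH_TWOPHASE_DISABLED,
	SKETCH_TWOPHASE_PENDING,
	SKETCH_TWOPHASE_ENABLED
} SketchTwoPhaseState;

/*
 * Mirrors the return expression of AllTablesyncsReady(): a subscription with
 * no tables is never reported as "all ready".
 */
static bool
sketch_all_tablesyncs_ready(bool has_subrels, int n_not_ready)
{
	assert(n_not_ready >= 0);
	return has_subrels && n_not_ready == 0;
}

/* Only PENDING can be promoted; DISABLED and ENABLED are left untouched. */
static SketchTwoPhaseState
sketch_next_state(SketchTwoPhaseState cur, bool has_subrels, int n_not_ready)
{
	if (cur == SKETCH_TWOPHASE_PENDING &&
		sketch_all_tablesyncs_ready(has_subrels, n_not_ready))
		return SKETCH_TWOPHASE_ENABLED;
	return cur;
}
```

With these definitions, a PENDING subscription that has no tables stays PENDING, which is the case the patch relies on to keep ALTER SUBSCRIPTION ... REFRESH PUBLICATION usable.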
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6ba447e..98a57e7 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because, say, it
+ * was prior to the initial consistent point, but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ... REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider: initially it was on and we
+ * received some prepare, then we turn it off; now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we can allow changing this option from off to on but not sure if
+ * that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get a prepare of the same GID more than once in genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and a prepared transaction has operations on tables subscribed to by
+ * those subscriptions. For such cases, if we use the GID sent by the publisher,
+ * one of the prepares will be successful and the others will fail, in which
+ * case the server will send them again. Now, this can lead to a deadlock if the
+ * user has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting of
+ * the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -794,6 +870,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here we always prepare the transaction even if no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2032,6 +2282,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2509,6 +2775,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2995,6 +3264,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3161,15 +3444,69 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
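A minimal, standalone sketch of the subscriber-side GID scheme used by TwoPhaseTransactionGid() in the worker.c hunks above: the GID is derived from the subscription OID and the remote xid, so two subscriptions replaying the same publisher transaction cannot collide. The function name, the plain unsigned types, and SKETCH_GIDSIZE are assumptions for illustration; only the "pg_gid_%u_%u" format string is taken from the patch.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed to match PostgreSQL's GIDSIZE; treat the value as illustrative. */
#define SKETCH_GIDSIZE 200

/*
 * Build "pg_gid_<subid>_<xid>" into gid. Returns the length snprintf would
 * have written, so a caller can detect truncation (result >= szgid).
 */
static int
sketch_twophase_gid(unsigned int subid, unsigned int xid,
					char *gid, size_t szgid)
{
	assert(subid != 0);			/* stands in for a valid subscription OID */
	assert(xid != 0);			/* stands in for TransactionIdIsValid() */

	return snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}
```

For example, with made-up subscription OIDs 16394 and 16401 and the same remote xid 735, the two apply workers would prepare under "pg_gid_16394_735" and "pg_gid_16401_735" respectively, whatever GID the publisher's user originally chose.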
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348d..ecf9b9a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -148,6 +161,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -159,6 +177,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -170,10 +190,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -249,8 +271,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -322,6 +365,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option was passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -334,8 +398,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -350,29 +418,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -392,6 +439,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -819,18 +928,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1236,3 +1335,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
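A minimal sketch of the option checks that the pgoutput.c hunks above apply to two_phase, in the order they run: first the two_phase/streaming exclusion in parse_output_parameters(), then the protocol-version and plugin-support checks in pgoutput_startup(). This is an illustration under assumptions, not the patch's code: the enum, the function name, and the flat argument list are hypothetical, and the value 3 is taken from the patch's own FIXME comment ("protocol version 3 (for two_phase)").

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value of LOGICALREP_PROTO_TWOPHASE_VERSION_NUM, per the FIXME note. */
#define SKETCH_PROTO_TWOPHASE_VERSION 3

/* Hypothetical outcome codes; the real code raises ereport(ERROR, ...). */
typedef enum
{
	SKETCH_TWOPHASE_OFF,		/* option not requested */
	SKETCH_TWOPHASE_ON,			/* option accepted */
	SKETCH_ERR_STREAMING,		/* two_phase and streaming both requested */
	SKETCH_ERR_PROTO,			/* proto_version too old */
	SKETCH_ERR_PLUGIN			/* decoding context lacks two-phase support */
} SketchTwoPhaseCheck;

static SketchTwoPhaseCheck
sketch_check_two_phase(bool two_phase, bool streaming,
					   int proto_version, bool ctx_twophase)
{
	/* Checked while parsing options. */
	if (two_phase && streaming)
		return SKETCH_ERR_STREAMING;

	/* Checked at start-up, after parsing succeeded. */
	if (!two_phase)
		return SKETCH_TWOPHASE_OFF;
	if (proto_version < SKETCH_PROTO_TWOPHASE_VERSION)
		return SKETCH_ERR_PROTO;
	if (!ctx_twophase)
		return SKETCH_ERR_PLUGIN;
	return SKETCH_TWOPHASE_ON;
}
```

The success results correspond to ctx->twophase_opt_given being set to false or true in the patch; the error results correspond to the three ereport(ERROR) calls.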
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 8f53cc7..8141311 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4304,6 +4305,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4347,9 +4349,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The below code is a temporary hack to check for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4370,6 +4388,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4395,6 +4414,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4422,6 +4443,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4463,6 +4485,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 195f8d8..14623d5 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,13 +6415,18 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary, streaming, and two_phase are only supported in v14 and
+		 * higher
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
-							  ", substream AS \"%s\"\n",
+							  ", substream AS \"%s\"\n"
+							  ", subtwophasestate AS \"%s\"\n",
 							  gettext_noop("Binary"),
-							  gettext_noop("Streaming"));
+							  gettext_noop("Streaming"),
+							  gettext_noop("Two phase commit"));
 
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6598c53..194c322 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index a5d6efd..ca9814f 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -54,6 +62,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -91,6 +101,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..36fa320 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -115,6 +124,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +181,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..109000d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -296,7 +296,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -635,7 +639,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..90430f4
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,299 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 19;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..ac27384
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,247 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# Wait for the statistics to be updated
+$node_A->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_b'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+$node_B->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_c'
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..2cfc1ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,9 +1388,11 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
-- 
1.8.3.1

v82-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From 163c9e5d7a6be6655c079a8151378d7bbb2e4a3b Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 2 Jun 2021 08:49:50 +1000
Subject: [PATCH v82] Skip empty transactions for logical replication.

The current logical replication behavior is to send every transaction to the
subscriber, even when the transaction is empty (because it contains no
changes from the selected publications). Building and transmitting these
empty transactions wastes CPU cycles and network bandwidth.

This patch addresses the problem by postponing the BEGIN / BEGIN PREPARE
message until the first change is sent. When a COMMIT or PREPARE message is
processed and no change has been sent for that transaction, the COMMIT or
PREPARE message is not sent either. As a result, pgoutput skips the
BEGIN / COMMIT or BEGIN PREPARE / PREPARE messages of empty transactions.

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  36 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 141 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  41 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 266 insertions(+), 35 deletions(-)
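
The deferral described in the commit message can be sketched roughly as
follows. This is a simplified standalone model, not the pgoutput code from
this patch: TxnState, emit, on_begin, on_change and on_commit are
hypothetical names, and "sending" a message just prints it and bumps a
counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-transaction state; not the real pgoutput structure. */
typedef struct TxnState
{
    bool sent_begin;     /* has the postponed BEGIN been emitted yet? */
    int  messages_sent;  /* number of messages emitted for this txn */
} TxnState;

/* Stand-in for writing a protocol message to the subscriber. */
static void
emit(TxnState *txn, const char *msg)
{
    assert(txn != NULL && msg != NULL);
    printf("%s\n", msg);
    txn->messages_sent++;
}

/* BEGIN callback: only reset the state, send nothing yet. */
static void
on_begin(TxnState *txn)
{
    txn->sent_begin = false;
    txn->messages_sent = 0;
}

/* Change callback: emit the postponed BEGIN before the first change. */
static void
on_change(TxnState *txn, const char *change)
{
    if (!txn->sent_begin)
    {
        emit(txn, "BEGIN");
        txn->sent_begin = true;
    }
    emit(txn, change);
}

/*
 * COMMIT callback: if BEGIN was never sent the transaction was empty,
 * so COMMIT is skipped as well. Returns true if COMMIT was emitted.
 */
static bool
on_commit(TxnState *txn)
{
    if (!txn->sent_begin)
        return false;
    emit(txn, "COMMIT");
    return true;
}
```

Calling on_begin followed directly by on_commit emits nothing, while a
transaction with at least one on_change call emits BEGIN, the changes and
COMMIT. The same shape would apply to BEGIN PREPARE / PREPARE.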

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index d2c6e15..940f80c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -865,11 +865,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check whether the
+      plugin has received this <command>PREPARE TRANSACTION</command>; if so,
+      it can commit the transaction, and otherwise it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
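
To illustrate why the documentation above says the gid alone is not
sufficient, here is a minimal standalone sketch of matching a prepared
transaction on all three values. The types and the function
have_matching_prepare are hypothetical stand-ins for illustration only;
they are not the lookup used by the apply worker in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t Lsn;        /* stand-in for XLogRecPtr */
typedef int64_t  Timestamp;  /* stand-in for TimestampTz */

/* Hypothetical record of a prepared transaction known downstream. */
typedef struct PreparedXact
{
    const char *gid;
    Lsn         prepare_end_lsn;
    Timestamp   prepare_time;
} PreparedXact;

/*
 * Return true only if some entry matches the gid AND the prepare end LSN
 * AND the prepare time. A gid-only match could be an unrelated local
 * prepared transaction that happens to use the same identifier.
 */
static bool
have_matching_prepare(const PreparedXact *xacts, size_t n,
                      const char *gid, Lsn prepare_end_lsn,
                      Timestamp prepare_time)
{
    assert(gid != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(xacts[i].gid, gid) == 0 &&
            xacts[i].prepare_end_lsn == prepare_end_lsn &&
            xacts[i].prepare_time == prepare_time)
            return true;
    }
    return false;
}
```

A caller would commit the prepared transaction when this returns true and
skip the commit otherwise.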
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 5a38433..0add083 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7552,6 +7559,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
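
As a rough illustration of the resulting field order, the sketch below
packs and unpacks the fields in the order used by
logicalrep_write_commit_prepared in this patch: flags, prepare end LSN,
commit LSN, commit end LSN, prepare time, commit time, xid, all in network
byte order. The struct and helper names are hypothetical, and anything the
real message carries before or after these fields is outside the hunks
shown here and is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of the fields, in wire order. */
typedef struct CommitPreparedFields
{
    uint8_t  flags;
    uint64_t prepare_end_lsn;
    uint64_t commit_lsn;
    uint64_t commit_end_lsn;
    int64_t  prepare_time;   /* microseconds since 2000-01-01 */
    int64_t  commit_time;    /* microseconds since 2000-01-01 */
    uint32_t xid;
} CommitPreparedFields;

#define COMMIT_PREPARED_FIELDS_SIZE (1 + 5 * 8 + 4)

/* Write the low nbytes of v in big-endian order; returns nbytes. */
static size_t
put_be(uint8_t *buf, uint64_t v, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
    return nbytes;
}

/* Read nbytes in big-endian order. */
static uint64_t
get_be(const uint8_t *buf, size_t nbytes)
{
    uint64_t v = 0;

    for (size_t i = 0; i < nbytes; i++)
        v = (v << 8) | buf[i];
    return v;
}

static size_t
encode_fields(const CommitPreparedFields *f, uint8_t *buf)
{
    size_t off = 0;

    off += put_be(buf + off, f->flags, 1);
    off += put_be(buf + off, f->prepare_end_lsn, 8);
    off += put_be(buf + off, f->commit_lsn, 8);
    off += put_be(buf + off, f->commit_end_lsn, 8);
    off += put_be(buf + off, (uint64_t) f->prepare_time, 8);
    off += put_be(buf + off, (uint64_t) f->commit_time, 8);
    off += put_be(buf + off, f->xid, 4);
    assert(off == COMMIT_PREPARED_FIELDS_SIZE);
    return off;
}

static void
decode_fields(const uint8_t *buf, CommitPreparedFields *f)
{
    f->flags = (uint8_t) get_be(buf, 1);
    f->prepare_end_lsn = get_be(buf + 1, 8);
    f->commit_lsn = get_be(buf + 9, 8);
    f->commit_end_lsn = get_be(buf + 17, 8);
    f->prepare_time = (int64_t) get_be(buf + 25, 8);
    f->commit_time = (int64_t) get_be(buf + 33, 8);
    f->xid = (uint32_t) get_be(buf + 41, 4);
}
```

The point of the sketch is only the ordering: the prepare end LSN now
precedes the commit LSN, and the prepare timestamp precedes the commit
timestamp.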
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c387997..ed60719 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -940,7 +941,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -975,7 +977,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index da0e5e8..282da49 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2770,7 +2770,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 40b00c9..f7db5ef 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -971,26 +971,38 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* there is no transaction when COMMIT PREPARED is called */
-	ensure_transaction();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In which case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	CommitTransactionCommand();
+		/* there is no transaction when COMMIT PREPARED is called */
+		ensure_transaction();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7c3a33d..84e9cfe 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -135,6 +137,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -404,10 +411,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -422,8 +451,18 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -435,10 +474,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -453,8 +510,15 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -465,12 +529,28 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -483,8 +563,21 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -613,11 +706,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -651,6 +749,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -750,6 +857,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -757,6 +865,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -793,6 +905,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -813,6 +934,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -825,6 +947,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 9b3e934..a6d9977 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 109000d..7cf4499 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -441,7 +441,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 90430f4..3428c6d 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 19;
+use Test::More tests => 20;
 
 ###############################
 # Setup
@@ -277,6 +277,45 @@ $node_publisher->wait_for_catchup($appname);
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, qq(0), 'transaction is aborted on subscriber');
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+		"CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot.
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_nopub SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'empty_transaction';
+	COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+	'postgres', qq(
+		SELECT get_byte(data, 0)
+		FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+			'proto_version', '1',
+			'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+	'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2cfc1ae..f0941ad 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1596,6 +1596,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
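
For anyone skimming the pgoutput.c hunks above: the empty-transaction optimization comes down to deferring BEGIN until the first change that actually gets sent, and then skipping COMMIT if BEGIN never went out. Below is a minimal standalone sketch of that pattern, outside PostgreSQL. The names TxnState, sink, emit and the on_* functions are hypothetical stand-ins made up for illustration; they are not pgoutput APIs, and the real callbacks also handle origins, streaming and prepared transactions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-transaction state, playing the role of PGOutputTxnData. */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been emitted yet? */
} TxnState;

/* Hypothetical output sink: one character per emitted message. */
static char sink[64];
static size_t sink_len = 0;

static void
emit(char msg)
{
	if (sink_len < sizeof(sink) - 1)
	{
		sink[sink_len++] = msg;
		sink[sink_len] = '\0';
	}
}

/* BEGIN callback: remember the transaction but send nothing yet. */
static void
on_begin(TxnState *txn)
{
	txn->sent_begin = false;
}

/* Change callback: unpublished changes are dropped before BEGIN is sent. */
static void
on_change(TxnState *txn, bool published)
{
	if (!published)
		return;

	/* first change that will be sent: emit the postponed BEGIN */
	if (!txn->sent_begin)
	{
		emit('B');
		txn->sent_begin = true;
	}
	emit('I');
}

/* COMMIT callback: skip the message if BEGIN was never sent. */
static void
on_commit(TxnState *txn)
{
	if (!txn->sent_begin)
		return;
	emit('C');
}
```

A transaction that touches only unpublished tables leaves the sink empty, which is the behaviour the updated 020_messages.pl and 021_twophase.pl tests expect from the slot.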

Attachment: v82-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 66e40df89a6fd553421895440493f7d08bb1fda2 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 1 Jun 2021 18:56:11 +1000
Subject: [PATCH v82] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 ++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |   9 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 459 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 285 +++++++++++++
 11 files changed, 1035 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
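
Note for reviewers: the Stream Prepare message this patch adds is laid out on the wire as Byte1('p'), an Int8 flags byte (must be 0), Int64 prepare LSN, Int64 end LSN, Int64 prepare timestamp, Int32 xid, and a NUL-terminated GID, with integers in network byte order as written by pq_sendint64/pq_sendint32. The following standalone sketch encodes and decodes that layout without any PostgreSQL headers. StreamPrepareMsg, SKETCH_GIDSIZE, put_be, get_be and the encode/decode functions are hypothetical names for illustration only, and the 200-byte GID buffer is an assumption meant to mirror GIDSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumption: mirrors PostgreSQL's GIDSIZE */
#define SKETCH_FIXED_LEN (1 + 1 + 8 + 8 + 8 + 4)	/* bytes before the GID */

/* Hypothetical in-memory form of a Stream Prepare message. */
typedef struct StreamPrepareMsg
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} StreamPrepareMsg;

/* Write the low nbytes of value in big-endian order; returns nbytes. */
static size_t
put_be(uint8_t *dst, uint64_t value, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		dst[i] = (uint8_t) (value >> (8 * (nbytes - 1 - i)));
	return nbytes;
}

/* Read nbytes of big-endian data into an integer. */
static uint64_t
get_be(const uint8_t *src, size_t nbytes)
{
	uint64_t	value = 0;

	for (size_t i = 0; i < nbytes; i++)
		value = (value << 8) | src[i];
	return value;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
static size_t
encode_stream_prepare(uint8_t *buf, size_t buflen, const StreamPrepareMsg *msg)
{
	size_t		gidlen = strlen(msg->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (buflen < SKETCH_FIXED_LEN + gidlen)
		return 0;

	buf[off++] = 'p';			/* message type */
	buf[off++] = 0;				/* flags, currently unused */
	off += put_be(buf + off, msg->prepare_lsn, 8);
	off += put_be(buf + off, msg->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) msg->prepare_time, 8);
	off += put_be(buf + off, msg->xid, 4);
	memcpy(buf + off, msg->gid, gidlen);
	return off + gidlen;
}

/* Returns 0 on success, -1 if the message is malformed or the GID is too long. */
static int
decode_stream_prepare(const uint8_t *buf, size_t len, StreamPrepareMsg *msg)
{
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	if (len <= SKETCH_FIXED_LEN || buf[0] != 'p' || buf[1] != 0)
		return -1;

	/* the GID must be NUL-terminated inside the message and fit the buffer */
	gid = buf + SKETCH_FIXED_LEN;
	nul = memchr(gid, '\0', len - SKETCH_FIXED_LEN);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= sizeof(msg->gid))
		return -1;

	msg->prepare_lsn = get_be(buf + 2, 8);
	msg->end_lsn = get_be(buf + 10, 8);
	msg->prepare_time = (int64_t) get_be(buf + 18, 8);
	msg->xid = (uint32_t) get_be(buf + 26, 4);
	memcpy(msg->gid, gid, gidlen + 1);
	return 0;
}
```

The decoder checks the GID length before copying, so an oversized or unterminated GID is rejected rather than overrunning the fixed-size buffer.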

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 8d4fdf3..5a38433 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7386,7 +7386,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7649,6 +7649,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress (streamed) transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f8826fb..894a1b3 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -924,12 +909,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 98a57e7..40b00c9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1044,6 +1044,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1241,30 +1321,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1272,7 +1343,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1287,7 +1358,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1362,6 +1433,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2298,6 +2394,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ecf9b9a..7c3a33d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -178,7 +180,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -283,17 +285,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1010,6 +1001,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 36fa320..9b3e934 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -244,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
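A note on the logicalproto.h hunk above: the new STREAM PREPARE message gets the tag byte 'p', alongside the existing stream tags 'S', 'E', 'c' and 'A'. The tags are single case-sensitive bytes, so 'c' and 'p' must not be confused with any upper-case tag. Below is a minimal standalone sketch of how a receiver could map those bytes to names. The enum values are copied from the hunk; the DEMO_ names and demo_stream_msg_name() are hypothetical and exist only for illustration, they are not part of the patch or of PostgreSQL.

```c
#include <stddef.h>

/*
 * Stream-related tag bytes, copied from the LogicalRepMsgType hunk above.
 * The DEMO_ prefix marks these as illustration-only names.
 */
typedef enum DemoStreamMsgType
{
	DEMO_MSG_STREAM_START = 'S',
	DEMO_MSG_STREAM_END = 'E',
	DEMO_MSG_STREAM_COMMIT = 'c',
	DEMO_MSG_STREAM_ABORT = 'A',
	DEMO_MSG_STREAM_PREPARE = 'p'
} DemoStreamMsgType;

/*
 * Hypothetical helper: return a readable name for a stream tag byte,
 * or NULL when the byte is not one of the stream tags listed above.
 */
const char *
demo_stream_msg_name(char tag)
{
	switch (tag)
	{
		case DEMO_MSG_STREAM_START:
			return "STREAM START";
		case DEMO_MSG_STREAM_END:
			return "STREAM END";
		case DEMO_MSG_STREAM_COMMIT:
			return "STREAM COMMIT";
		case DEMO_MSG_STREAM_ABORT:
			return "STREAM ABORT";
		case DEMO_MSG_STREAM_PREPARE:
			return "STREAM PREPARE";
		default:
			return NULL;
	}
}
```

Calling demo_stream_msg_name('p') yields "STREAM PREPARE", while an unlisted byte such as 'x' yields NULL. In the patch itself the equivalent step is the new case in apply_dispatch(), which reports an error for unknown tags.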
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..b2d52cd
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,459 @@
+# Test logical replication of 2PC with streaming enabled.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then only the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then only the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
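A note on the expected counts in 023_twophase_stream.pl above, since 3334 and 3335 look like magic numbers. Each prepared transaction starts from keys 1 and 2, inserts keys 3..5000, updates the even keys (no change in row count) and deletes every key divisible by 3. The sketch below recomputes the result; demo_rows_after_commit() is a hypothetical helper written for this note, not something in the patch.

```c
/*
 * Hypothetical helper: number of rows left in test_tab once the prepared
 * transaction commits. Keys 1..max_key exist after the INSERT, the UPDATE
 * does not change the row count, and the DELETE removes every key that is
 * divisible by 3.
 */
int
demo_rows_after_commit(int max_key)
{
	int			rows = 0;

	for (int a = 1; a <= max_key; a++)
	{
		if (a % 3 != 0)
			rows++;
	}

	return rows;
}
```

demo_rows_after_commit(5000) gives 3334: 5000 keys minus the 1666 multiples of 3. The 3335 case is that result plus the single row with key 99999 inserted outside the prepared transaction. Although 99999 is divisible by 3, the DELETE ran inside the 2PC transaction before that row existed, so the row survives the commit.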
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..9d5c6f5
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,285 @@
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# Wait for the statistics to be updated
+$node_A->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_b'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+$node_B->poll_query_until(
+	'postgres', qq[
+	SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+	WHERE slot_name = 'tap_sub_c'
+	AND stream_txns > 0 AND stream_count > 0
+	AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#342Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#340)

On Mon, May 31, 2021 at 9:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 28, 2021 at 11:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

One minor comment for 0001.
* Special case: if when tables were specified but copy_data is
+ * false then it is safe to enable two_phase up-front because
+ * those tables are already initially READY state. Note, if
+ * the subscription has no tables then enablement cannot be
+ * done here - we must leave the twophase state as PENDING, to
+ * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.

Can we slightly modify this comment as: "Note that if tables were
specified but copy_data is false then it is safe to enable two_phase
up-front because those tables are already initially READY state. When
the subscription has no tables, we leave the twophase state as
PENDING, to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work."

Also, I don't see any test after you enable this special case. Is it
covered by existing tests, if not then let's try to add a test for
this?

I see that Ajin's latest patch has addressed the other comments except
for the above test case suggestion.

Yes, this is a known pending task.

I have again reviewed the first
patch and have some comments.

Comments on v81-0001-Add-support-for-prepared-transactions-to-built-i
============================================================================
1.
<para>
The logical replication solution that builds distributed two
phase commit
using this feature can deadlock if the prepared transaction has locked
-       [user] catalog tables exclusively. They need to inform users to not have
-       locks on catalog tables (via explicit <command>LOCK</command>
command) in
-       such transactions.
+       [user] catalog tables exclusively. To avoid this users must refrain from
+       having locks on catalog tables (via explicit
<command>LOCK</command> command)
+       in such transactions.
</para>

This change doesn't belong to this patch. I see the proposed text
could be considered as an improvement but still we can do this
separately. We are already trying to improve things in this regard in
the thread [1], so you can propose this change there.

OK. This change has been removed in v82, and a patch has been posted to the
other thread [1].

2.
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase
transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit transaction.
+</para></listitem>
+</varlistentry>

Can we change the description of LSN's as "The LSN of the commit
prepared." and "The end LSN of the commit prepared transaction."
respectively? This will make their description different from regular
commit and I think that defines them better.
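
To make the field list above concrete, here is a minimal sketch of how the leading part of such a message could be laid out on the wire. It covers only the four fields quoted in this comment (type byte, flags, and the two LSNs), and the function names are made up for illustration; this is not the proto.c code. Integers go out in network byte order, as elsewhere in the protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of the quoted prefix: Byte1 + Int8 + Int64 + Int64. */
#define COMMIT_PREPARED_PREFIX_LEN 18

/* Write a 64-bit value in network (big-endian) byte order. */
static void
put_int64(uint8_t *dst, uint64_t value)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t) (value >> (56 - 8 * i));
}

/*
 * Hypothetical encoder for the leading fields of a commit prepared
 * message. buf must have room for COMMIT_PREPARED_PREFIX_LEN bytes.
 * Any fields after the end LSN are deliberately left out.
 */
size_t
encode_commit_prepared_prefix(uint8_t *buf, uint64_t commit_lsn,
                              uint64_t end_lsn)
{
    assert(buf != NULL);

    buf[0] = 'K';                       /* message type */
    buf[1] = 0;                         /* flags; currently unused */
    put_int64(buf + 2, commit_lsn);     /* LSN of the commit prepared */
    put_int64(buf + 10, end_lsn);       /* end LSN of the commit prepared */

    return COMMIT_PREPARED_PREFIX_LEN;
}
```

The two LSN slots are the ones whose descriptions are being discussed; the sketch does not try to say what else the full message carries.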

3.
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback transaction.
+</para></listitem>
+</varlistentry>

Similar to above, can we change the description here as: "The end LSN
of the rollback prepared transaction."?

4.
+ * The exception to this restriction is when copy_data =
+ * false, because when copy_data is false the tablesync will
+ * start already in READY state and will exit directly without
+ * doing anything which could interfere with the apply
+ * worker's message handling.
+ *
+ * For more details see comments atop worker.c.
+ */
+ if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed
when two_phase is enabled"),
+ errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+ ", or use DROP/CREATE SUBSCRIPTION.")));

The above comment is a bit unclear because it seems you are saying
there is some problem even when copy_data is false. Are you missing
'not' after 'could' in the comment?
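
For what it's worth, the quoted check reduces to a two-input predicate. A minimal sketch, assuming the state codes 'd', 'p' and 'e' from the subtwophasestate catalog documentation in this patch; the function name refresh_is_allowed is hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* State codes as documented for pg_subscription.subtwophasestate. */
#define TWOPHASE_STATE_DISABLED 'd'
#define TWOPHASE_STATE_PENDING  'p'
#define TWOPHASE_STATE_ENABLED  'e'

/*
 * Hypothetical model of the restriction: REFRESH is rejected only when
 * two_phase is already enabled and copy_data is requested.
 */
bool
refresh_is_allowed(char twophasestate, bool copy_data)
{
    assert(twophasestate == TWOPHASE_STATE_DISABLED ||
           twophasestate == TWOPHASE_STATE_PENDING ||
           twophasestate == TWOPHASE_STATE_ENABLED);

    return !(twophasestate == TWOPHASE_STATE_ENABLED && copy_data);
}
```

Read this way, copy_data = false is the one case that is always allowed, which is what the code comment seems to be trying to say.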

5.
XXX Now, this can even lead to a deadlock if the prepare
* transaction is waiting to get it logically replicated for
- * distributed 2PC. Currently, we don't have an in-core
- * implementation of prepares for distributed 2PC but some
- * out-of-core logical replication solution can have such an
- * implementation. They need to inform users to not have locks
- * on catalog tables in such transactions.
+ * distributed 2PC. This can be avoided by disallowing to
+ * prepare transactions that have locked [user] catalog tables
+ * exclusively.

Can we slightly modify this part of the comment as: "This can be
avoided by disallowing to prepare transactions that have locked [user]
catalog tables exclusively but as of now we ask users not to do such
operation"?

6.
+AllTablesyncsReady(void)
+{
+ bool found_busy = false;
+ bool started_tx = false;
+ bool has_subrels = false;
+
+ /* We need up-to-date sync state info for subscription tables here. */
+ has_subrels = FetchTableStates(&started_tx);
+
+ found_busy = list_length(table_states_not_ready) > 0;
+
+ if (started_tx)
+ {
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+ }
+
+ /*
+ * When there are no tables, then return false.
+ * When no tablesyncs are busy, then all are READY
+ */
+ return has_subrels && !found_busy;
+}

Do we really need the found_busy variable in the above function? Can't we
change the return to (has_subrels) && (table_states_not_ready == NIL)?
If so, then change the comments above the return.
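
Going by the quoted function, where found_busy is true when the not-ready list is non-empty, the simplified return presumably has to require an empty list. A small stand-alone sketch of that logic follows; NotReadyList is a hypothetical stand-in for the real List (with NULL playing the role of NIL), and all_tablesyncs_ready is an illustrative name, not the patch's function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PostgreSQL List; NULL acts as NIL. */
typedef struct NotReadyList
{
    int         length;     /* number of tables not yet in READY state */
} NotReadyList;

/*
 * Model of the return value without a found_busy variable: true only
 * when the subscription has tables and none of them is still syncing.
 */
bool
all_tablesyncs_ready(bool has_subrels, const NotReadyList *not_ready)
{
    bool        none_busy;

    assert(not_ready == NULL || not_ready->length >= 0);

    none_busy = (not_ready == NULL || not_ready->length == 0);

    return has_subrels && none_busy;
}
```

A subscription with no tables still yields false here, matching the "When there are no tables, then return false" comment in the quoted code.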

7.
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)

Can we update comments indicating that if this function starts the
transaction then the caller is responsible to commit it?

8.
(errmsg("logical replication apply worker for subscription \"%s\" will
restart so two_phase can be enabled",
+ MySubscription->name)));

Can we slightly change the message as: "logical replication apply
worker for subscription \"%s\" will restart so that two_phase can be
enabled"?

9.
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
{
..
+ /* And update/set two_phase ENABLED */
+ values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+ replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
..
}

The above comment seems wrong to me as we are updating the state as
passed by the caller.

All the above reported issues 2-9 are addressed in the latest 2PC patch set v82

------
[1]: /messages/by-id/CAHut+PuTjTp_WERO=3Ybp8snTgDpiZeNaxzZhN8ky8XMo4KFVQ@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

#343Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#341)

On Wed, Jun 2, 2021 at 4:34 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v82*

Few comments on 0001:
====================
1.
+ /*
+ * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+ * called within the PrepareTransactionBlock below.
+ */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data.end_lsn;
+ replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+ PrepareTransactionBlock(gid);
+ CommitTransactionCommand();

Here, the call to CommitTransactionCommand() twice looks a bit odd.
Before the first call, can we write a comment like "This is to
complete the Begin command started by the previous call"?

2.
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
bool streaming;

  /*
- * Does the output plugin support two-phase decoding, and is it enabled?
+ * Does the output plugin support two-phase decoding.
  */
  bool twophase;
  /*
+ * Is two-phase option given by output plugin?
+ */
+ bool twophase_opt_given;
+
+ /*
  * State for writing output.

I think we can add a few comments on why we need a separate
twophase parameter here. The description of twophase_opt_given can be
changed to: "Is two-phase option given by output plugin? This is to
allow output plugins to enable two_phase at the start of streaming. We
can't rely on twophase parameter that tells whether the plugin
provides all the necessary two_phase APIs for this purpose." Feel free
to add more to it.

3.
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
MemoryContextSwitchTo(old_context);

  /*
- * We allow decoding of prepared transactions iff the two_phase option is
- * enabled at the time of slot creation.
+ * We allow decoding of prepared transactions when the two_phase is
+ * enabled at the time of slot creation, or when the two_phase option is
+ * given at the streaming start.
  */
- ctx->twophase &= MyReplicationSlot->data.two_phase;
+ ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+ /* Mark slot to allow two_phase decoding if not already marked */
+ if (ctx->twophase && !slot->data.two_phase)
+ {
+ slot->data.two_phase = true;
+ ReplicationSlotMarkDirty();
+ ReplicationSlotSave();
+ }

Why do we need to change this during CreateInitDecodingContext which
is called at create_slot time? At that time, we don't need to consider
any options and there is no need to toggle slot's two_phase value.
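
To illustrate the question, the quoted hunk can be reduced to plain boolean logic. This is only a sketch under the assumption that the three inputs are independent flags; the function names are hypothetical and nothing here is PostgreSQL API.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of: ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase) */
bool
decode_two_phase(bool plugin_supports, bool opt_given, bool slot_two_phase)
{
    bool        twophase = plugin_supports;

    twophase &= (opt_given || slot_two_phase);
    return twophase;
}

/* Model of: if (ctx->twophase && !slot->data.two_phase) mark the slot */
bool
slot_needs_marking(bool twophase, bool slot_two_phase)
{
    return twophase && !slot_two_phase;
}

/*
 * With no option given (as at slot creation time), the marking branch
 * can never fire in this model, whatever the other two flags are.
 */
void
check_create_time_invariant(void)
{
    for (int plugin = 0; plugin <= 1; plugin++)
        for (int slot = 0; slot <= 1; slot++)
            assert(!slot_needs_marking(decode_two_phase(plugin, false, slot),
                                       slot));
}
```

In this model the slot is only ever marked when the option is given, which would only matter at the start of streaming, not at create_slot time.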

4.
- /* Binary mode and streaming are only supported in v14 and higher */
+ /*
+ * Binary, streaming, and two_phase are only supported in v14 and
+ * higher
+ */

We can say v15 for two_phase.

5.
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3

Isn't it better to define LOGICALREP_PROTO_MAX_VERSION_NUM as
LOGICALREP_PROTO_TWOPHASE_VERSION_NUM instead of specifying directly
the number?
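
A sketch of what that would look like, together with the kind of version check it enables. The value 2 for the streaming version is an assumption here (only the 3 appears in the quoted hunk), and proto_supports_two_phase is a made-up helper, not something in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2   /* assumed value */
#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
/* Defined in terms of the newest feature version, not a bare number. */
#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM

/* Hypothetical helper: can a peer at this protocol version use two_phase? */
bool
proto_supports_two_phase(uint32_t proto_version)
{
    /* Sanity check: the maximum must cover the two-phase version. */
    assert(LOGICALREP_PROTO_MAX_VERSION_NUM >=
           LOGICALREP_PROTO_TWOPHASE_VERSION_NUM);

    return proto_version >= LOGICALREP_PROTO_TWOPHASE_VERSION_NUM;
}
```

The benefit is that the next protocol bump only needs a new feature macro plus a one-token change to the MAX definition.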

6.
+/* Commit (and abort) information */
typedef struct LogicalRepCommitData
{
XLogRecPtr commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
TimestampTz committime;
} LogicalRepCommitData;

Is there a reason for the above comment addition? If so, how is it
related to this patch?

7.
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,299 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;

In the nearby test files, we have a copyright notice like "# Copyright
(c) 2021, PostgreSQL Global Development Group". We should add one to
the new test files in this patch as well.

8.
+# Also wait for two-phase to be enabled
+my $twophase_query =
+ "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT
IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";

Isn't it better to write this query as: "SELECT count(1) = 1 FROM
pg_subscription WHERE subtwophasestate ='e';"? It looks a bit odd to
use the NOT IN operator here. Similarly, change the same query used at
another place in the patch.

9.
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+ 'postgres', qq[
+ SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+ WHERE slot_name = 'tap_sub'
+ AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";

I don't see the need to check for stats in this test. If we really
want to test stats then we can add a separate test in
contrib/test_decoding/sql/stats but I suggest leaving it. Please do
the same for other stats tests in the patch.

10. I think you missed updating LogicalRepRollbackPreparedTxnData in
typedefs.list.

--
With Regards,
Amit Kapila.

#344Greg Nancarrow
gregn4422@gmail.com
In reply to: Peter Smith (#341)

On Wed, Jun 2, 2021 at 9:04 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v82*

Some suggested changes to the 0001 patch comments (and note also the
typo "doumentation"):
diff of before and after follows:

8c8
< built-in logical replication, we need to do the below things:
---
> built-in logical replication, we need to do the following things:
16,17c16,17
< * Add a new SUBSCRIPTION option "two_phase" to allow users to enable it.
< We enable the two_phase once the initial data sync is over.
---
> * Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
> transactions. We enable the two_phase once the initial data sync is over.
23c23
< * Adds new subscription TAP tests, and new subscription.sql regression tests.
---
> * Add new subscription TAP tests, and new subscription.sql regression tests.
25c25
< * Updates PG doumentation.
---
> * Update PG documentation.
33c33
< * Prepare API for in-progress transactions is not supported.
---
> * Prepare API for in-progress transactions.

Regards,
Greg Nancarrow
Fujitsu Australia

#345Peter Smith
smithpb2250@gmail.com
In reply to: Greg Nancarrow (#344)
3 attachment(s)

Please find attached the latest patch set v83*

Differences from v82* are:

* Rebased to HEAD @ yesterday. This was necessary because some recent
HEAD pushes broke the v82.

* Adds a 2PC copy_data=false test case for [1].

* Addresses most of Amit's recent feedback comments from [2]. I will
reply to that mail separately with the details.

* Addresses Greg's feedback [3] about the patch 0001 commit comment.

----
[1]: /messages/by-id/CAA4eK1K7qhqigORdEgqFTOPfj4r2+jV-uLc4-RCtgyDZwvbF8w@mail.gmail.com
[2]: /messages/by-id/CAA4eK1+8L8h9qUQ6sS48EY0osfN7zs=ZPqR6sE4eQxFhgwBxRw@mail.gmail.com
[3]: /messages/by-id/CAJcOf-cvn4EpSo4cD_9Awop72roKL1vnMtpURn1FnXv+gX5VPA@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v83-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From d3c4adc43b2391a0736e927aae8faf34e540a3d1 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 8 Jun 2021 12:42:51 +1000
Subject: [PATCH v83] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
transactions. We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the following operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 307 +++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 148 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 ++++++++++--
 src/backend/replication/logical/worker.c           | 343 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/backend/replication/walsender.c                |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  17 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |   7 +-
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 349 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 45 files changed, 2429 insertions(+), 202 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 1649320..c5e078f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7644,6 +7644,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..8d4fdf3 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7365,6 +7386,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
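The Begin Prepare layout documented above can be checked with a small standalone decoder. This is only a sketch for review purposes and is not part of the patch: the struct and function names are hypothetical, and it assumes the integers arrive in network byte order and that the GID is a NUL-terminated string, as for the other logical replication messages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for a decoded Begin Prepare message. */
typedef struct BeginPrepareMsg
{
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* transaction id */
	const char *gid;			/* points into the caller's buffer */
} BeginPrepareMsg;

/* Read a big-endian 64-bit integer (assumed wire byte order). */
static uint64_t
read_be64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian 32-bit integer (assumed wire byte order). */
static uint32_t
read_be32(const unsigned char *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
		((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/*
 * Decode Byte1('b'), Int64, Int64, Int64, Int32, String.
 * Returns 0 on success, -1 if the buffer is not a well-formed message.
 */
static int
parse_begin_prepare(const unsigned char *buf, size_t len, BeginPrepareMsg *out)
{
	const size_t fixed = 1 + 8 + 8 + 8 + 4; /* bytes before the GID */

	assert(buf != NULL && out != NULL);

	if (len <= fixed || buf[0] != 'b')
		return -1;

	/* The GID must be terminated inside the buffer. */
	if (memchr(buf + fixed, '\0', len - fixed) == NULL)
		return -1;

	out->prepare_lsn = read_be64(buf + 1);
	out->end_lsn = read_be64(buf + 9);
	out->prepare_time = (int64_t) read_be64(buf + 17);
	out->xid = read_be32(buf + 25);
	out->gid = (const char *) (buf + fixed);
	return 0;
}
```

Prepare, Commit Prepared and Rollback Prepared differ only in the type byte, the extra Int8 flags field after it, and (for Rollback Prepared) the second timestamp, so the same approach should extend to them by shifting the offsets.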
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction prepared
+          on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..776295c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		is around.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
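The matching rule in LookupGXact can be illustrated without the shared-memory and 2PC file machinery. The following is a toy model, not the patch's code: ToyPreparedXact and toy_lookup_gxact are hypothetical names, and an in-memory array stands in for TwoPhaseState and the on-disk header. It only shows why the GID alone is not treated as a match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a prepared xact plus its 2PC file header. */
typedef struct ToyPreparedXact
{
	const char *gid;
	uint64_t	origin_lsn;			/* end LSN of the remote PREPARE */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
	bool		valid;
} ToyPreparedXact;

/*
 * Return true only if an entry matches on GID, origin LSN and origin
 * timestamp together. Entries that share the GID but come from a
 * different origin position are skipped.
 */
static bool
toy_lookup_gxact(const ToyPreparedXact *xacts, size_t n,
				 const char *gid, uint64_t prepare_end_lsn,
				 int64_t origin_prepare_timestamp)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		const ToyPreparedXact *x = &xacts[i];

		/* Ignore not-yet-valid entries and other GIDs. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```

With two entries that share a GID, only the one whose origin LSN and timestamp agree with the incoming prepare counts as already present.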
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..f8826fb 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * missing of transactions and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -505,10 +556,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables are in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -814,7 +889,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +924,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +953,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +999,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1016,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1058,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1076,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1116,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
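The subtwophasestate handling in CreateSubscription and the REFRESH restriction in AlterSubscription can be summarised as two small pure functions. This is a sketch of the rules as they read in this patch, with hypothetical function names and a single slot_created flag standing in for the connect/create_slot path. The later PENDING to ENABLED transition after tablesync is described in the comments atop worker.c and is not modelled here.

```c
#include <assert.h>		/* used by the checks in the usage example */
#include <stdbool.h>

/* State codes as documented for pg_subscription.subtwophasestate. */
#define TWOPHASE_STATE_DISABLED	'd'
#define TWOPHASE_STATE_PENDING	'p'
#define TWOPHASE_STATE_ENABLED	'e'

/*
 * State stored at CREATE SUBSCRIPTION time. two_phase = true normally
 * starts as PENDING. It is ENABLED up front only when a slot is created,
 * copy_data is false and the subscription already has tables, since those
 * tables then start out in READY state.
 */
static char
initial_twophase_state(bool twophase, bool copy_data,
					   bool has_tables, bool slot_created)
{
	if (!twophase)
		return TWOPHASE_STATE_DISABLED;

	if (slot_created && !copy_data && has_tables)
		return TWOPHASE_STATE_ENABLED;

	return TWOPHASE_STATE_PENDING;
}

/*
 * ALTER SUBSCRIPTION ... REFRESH (or SET/ADD/DROP PUBLICATION with
 * refresh) is rejected only when two_phase is already ENABLED and
 * copy_data is true.
 */
static bool
refresh_allowed(char twophase_state, bool copy_data)
{
	return !(twophase_state == TWOPHASE_STATE_ENABLED && copy_data);
}
```

For example, a subscription created with two_phase = true, copy_data = false and no tables stays in the pending state, which is what keeps a later REFRESH PUBLICATION usable.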
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..ccde3bc 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The below code is a temporary hack to check
+		 * for server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +847,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +861,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..b106588 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparing of transactions that have locked [user] catalog
+				 * tables exclusively, but as of now we ask users not to perform
+				 * such an operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c387997 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..da0e5e8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2552,7 +2552,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2643,7 +2643,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2690,7 +2690,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2710,7 +2710,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2729,19 +2729,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2759,12 +2760,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions, that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot, need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..f4290f6 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old list. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY.
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6ba447e..3e987ca 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider, initially, it was on and we
+ * have received some prepare then we turn it off, now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we can allow changing this option from off to on but
+ * not sure if that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if the user
+ * has set synchronous_standby_names for all the subscriptions on subscriber.
+ * To avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -794,6 +870,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from same node pointing to publications
+	 * on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received the prepare because it occurred
+	 * before the walsender reached a consistent point, or because two_phase
+	 * was not yet enabled at that time. In such cases, we need to skip the
+	 * rollback prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2032,6 +2282,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2509,6 +2775,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2995,6 +3264,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(subid != InvalidOid);
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3161,15 +3444,69 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when two_phase mode is requested by the user, the tri-state
+		 * remains PENDING until all tablesyncs have reached READY state.
+		 * Only then can it become ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s.",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
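The subscriber-side GID scheme used by TwoPhaseTransactionGid() in the
worker.c hunks above can be tried outside the server. The following is
only a minimal standalone sketch: the typedefs, the SKETCH_GIDSIZE value
(assumed to mirror GIDSIZE in xact.h) and the function name are
hypothetical stand-ins, and the truncation check is an addition that the
patch itself does not have.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for PostgreSQL's Oid and TransactionId. */
typedef uint32_t Oid;
typedef uint32_t TransactionId;

/* Assumption: mirrors GIDSIZE from PostgreSQL's access/xact.h. */
#define SKETCH_GIDSIZE 200

/*
 * Format "pg_gid_<subid>_<xid>" into the supplied buffer, as the patch
 * does.  Returns the number of characters written (excluding the
 * terminator), or -1 if the buffer was too small.  The return value is
 * an addition for this sketch; the patch's function returns void.
 */
static int
sketch_two_phase_gid(Oid subid, TransactionId xid, char *gid, size_t szgid)
{
	int			n;

	assert(subid != 0);			/* stands in for the validity Asserts */
	assert(xid != 0);

	n = snprintf(gid, szgid, "pg_gid_%u_%u",
				 (unsigned int) subid, (unsigned int) xid);

	return (n < 0 || (size_t) n >= szgid) ? -1 : n;
}
```

Because the subscription OID is part of the identifier, two subscriptions
applying the same remote xid end up with different GIDs, which is what
the comment in apply_handle_prepare() relies on to avoid the deadlock.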
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index fe12d08..85ba1dc 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -836,18 +945,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1266,3 +1365,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
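The option checks that this pgoutput.c diff spreads over
parse_output_parameters() and pgoutput_startup() can be read as one
decision function. The sketch below is not part of the patch; the enum,
the function name and the SKETCH_ constant are hypothetical, and it only
models the order in which the errors would be raised.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: mirrors LOGICALREP_PROTO_TWOPHASE_VERSION_NUM in this patch. */
#define SKETCH_PROTO_TWOPHASE_VERSION 3

typedef enum SketchTwoPhaseResult
{
	SKETCH_TWOPHASE_NOT_GIVEN,		/* twophase_opt_given = false */
	SKETCH_TWOPHASE_GIVEN,			/* twophase_opt_given = true */
	SKETCH_ERR_WITH_STREAMING,		/* two_phase and streaming together */
	SKETCH_ERR_PROTO_TOO_OLD,		/* proto_version below the 2PC version */
	SKETCH_ERR_PLUGIN_UNSUPPORTED	/* ctx->twophase is false */
} SketchTwoPhaseResult;

static SketchTwoPhaseResult
sketch_check_two_phase(bool two_phase, bool streaming,
					   int proto_version, bool ctx_twophase)
{
	/* Checked first, while the options are still being parsed. */
	if (two_phase && streaming)
		return SKETCH_ERR_WITH_STREAMING;

	/* The remaining checks happen at plugin startup. */
	if (!two_phase)
		return SKETCH_TWOPHASE_NOT_GIVEN;
	if (proto_version < SKETCH_PROTO_TWOPHASE_VERSION)
		return SKETCH_ERR_PROTO_TOO_OLD;
	if (!ctx_twophase)
		return SKETCH_ERR_PLUGIN_UNSUPPORTED;

	return SKETCH_TWOPHASE_GIVEN;
}
```

For example, a client that asks for two_phase with proto_version 2 is
rejected before the plugin capability is even looked at.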
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 109c723..d18b19b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -939,7 +939,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 	{
 		ReplicationSlotCreate(cmd->slotname, false,
 							  cmd->temporary ? RS_TEMPORARY : RS_PERSISTENT,
-							  false);
+							  cmd->two_phase);
 	}
 	else
 	{
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 8f53cc7..8141311 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4304,6 +4305,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4347,9 +4349,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The below code is a temporary hack to check for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4370,6 +4388,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4395,6 +4414,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4422,6 +4443,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4463,6 +4485,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
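One consequence of the dumpSubscription() change above is that pg_dump
only distinguishes "disabled" from "anything else", so a subscription
whose state is still PENDING is dumped with two_phase = on, just like an
ENABLED one. A small standalone sketch of that comparison follows; the
SKETCH_ macros are assumed to mirror the LOGICALREP_TWOPHASE_STATE_*
values added to pg_subscription.h, and both function names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumption: same values as LOGICALREP_TWOPHASE_STATE_* in this patch. */
#define SKETCH_TWOPHASE_STATE_DISABLED 'd'
#define SKETCH_TWOPHASE_STATE_PENDING  'p'
#define SKETCH_TWOPHASE_STATE_ENABLED  'e'

/*
 * Should the dump contain ", two_phase = on"?  The argument is the
 * catalog column as returned in text form, e.g. "d", "p" or "e".
 */
static bool
sketch_dump_two_phase(const char *subtwophasestate)
{
	const char	disabled[] = {SKETCH_TWOPHASE_STATE_DISABLED, '\0'};

	assert(subtwophasestate != NULL);
	return strcmp(subtwophasestate, disabled) != 0;
}

/* Human-readable name, in the style of the DEBUG1 message in worker.c. */
static const char *
sketch_two_phase_state_name(char state)
{
	switch (state)
	{
		case SKETCH_TWOPHASE_STATE_DISABLED:
			return "DISABLED";
		case SKETCH_TWOPHASE_STATE_PENDING:
			return "PENDING";
		case SKETCH_TWOPHASE_STATE_ENABLED:
			return "ENABLED";
		default:
			return "?";
	}
}
```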
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..6caa701 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,7 +6415,9 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary and streaming are only supported in v14 and higher.
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
@@ -6423,6 +6425,17 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/*
+		 * Two_phase is only supported in v15 and higher.
+		 *
+		 * FIXME: When PG15 development starts, change the following
+		 * 140000 to 150000
+		 */
+		if (pset.sversion >= 140000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 109b22a..d01aa75 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0060ebf..e84353e 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -94,6 +104,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index faa3a25..ebc43a0 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
+	bool		two_phase;
 	List	   *options;
 } CreateReplicationSlotCmd;
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..413a5ce 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 	bool		streaming;
 
 	/*
-	 * Does the output plugin support two-phase decoding, and is it enabled?
+	 * Does the output plugin support two-phase decoding?
 	 */
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..e20f2da 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..109000d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -296,7 +296,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -635,7 +639,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..1ea42c4
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,349 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 23;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+# create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Now do a single 2PC insert on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';
+    COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..cabc0bb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,12 +1388,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

v83-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From b0dd924be40e212b37e27a6c3512d7ee97f8e932 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 8 Jun 2021 15:21:15 +1000
Subject: [PATCH v83] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1016 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
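
The Stream Prepare layout documented in the protocol.sgml hunk of this patch (type byte 'p', flags, prepare LSN, end LSN, prepare timestamp, xid, GID) can be tried outside the server with a small sketch like the one below. It is an illustration in Python, not code from the patch: the names StreamPrepare, encode_stream_prepare and decode_stream_prepare are hypothetical, and treating the GID as UTF-8 text is an assumption. Integers are in network byte order, as with the other logical replication messages.

```python
import struct
from typing import NamedTuple

# Fixed-size fields that follow the 'p' type byte, in network byte order:
# flags (Int8), prepare LSN (Int64), end LSN (Int64),
# prepare timestamp (Int64, signed), xid (Int32).
_FIXED = struct.Struct("!BQQqI")


class StreamPrepare(NamedTuple):
    """Hypothetical container mirroring the documented message fields."""
    prepare_lsn: int
    end_lsn: int
    prepare_time: int  # microseconds since 2000-01-01
    xid: int
    gid: str


def encode_stream_prepare(msg: StreamPrepare) -> bytes:
    """Build the wire bytes: type byte, fixed fields, NUL-terminated GID."""
    fixed = _FIXED.pack(0, msg.prepare_lsn, msg.end_lsn,
                        msg.prepare_time, msg.xid)
    # Assumption: the GID is encoded as UTF-8.
    return b"p" + fixed + msg.gid.encode("utf-8") + b"\x00"


def decode_stream_prepare(buf: bytes) -> StreamPrepare:
    """Parse a whole message, including the leading type byte."""
    if buf[:1] != b"p":
        raise ValueError("not a Stream Prepare message")
    flags, prepare_lsn, end_lsn, prepare_time, xid = _FIXED.unpack_from(buf, 1)
    if flags != 0:
        # Flags are currently unused and must be 0.
        raise ValueError(f"unrecognized flags {flags} in stream prepare message")
    start = 1 + _FIXED.size
    end = buf.index(b"\x00", start)  # the GID is a NUL-terminated string
    return StreamPrepare(prepare_lsn, end_lsn, prepare_time, xid,
                         buf[start:end].decode("utf-8"))
```

One difference from the patch: logicalrep_read_stream_prepare() starts at the flags byte because apply_dispatch() has already consumed the message type, whereas this sketch checks the type byte itself.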

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 8d4fdf3..5a38433 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7386,7 +7386,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7649,6 +7649,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f8826fb..894a1b3 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -924,12 +909,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3e987ca..c33d1ba 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1044,6 +1044,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See the comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1241,30 +1321,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1272,7 +1343,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1287,7 +1358,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1362,6 +1433,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2298,6 +2394,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 85ba1dc..ccf801a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1027,6 +1018,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index e20f2da..7a4804f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v83-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From 6bab9c549e79e29d3fd1eeb9cad0dcf0c59d7dd7 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 8 Jun 2021 15:55:05 +1000
Subject: [PATCH v83] Skip empty transactions for logical replication.

The current logical replication behavior is to send every transaction to
the subscriber even if the transaction is empty (because it does not
contain changes from the selected publications). It is a waste of CPU
cycles and network bandwidth to build/transmit these empty transactions.

This patch addresses the above problem by postponing the BEGIN / BEGIN PREPARE
message until the first change. While processing a COMMIT or PREPARE message,
if the transaction has no other change, do not send the COMMIT or PREPARE
message. This means that pgoutput skips the BEGIN / COMMIT or
BEGIN PREPARE / PREPARE messages for transactions that are empty.
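
As an illustration only (this is not code from the patch), the
postponed-BEGIN idea described above can be sketched in standalone C.
TxnState, Stream, emit, on_begin, on_change and on_commit are hypothetical
names invented for this sketch, and the single-character messages only
loosely mirror the B / I / C message bytes of the replication protocol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-transaction state; stands in for the plugin's private data. */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been emitted for this transaction? */
} TxnState;

/* Hypothetical output stream: one character per emitted message. */
typedef struct Stream
{
	char		buf[64];
	size_t		len;
} Stream;

static void
emit(Stream *out, char msg)
{
	assert(out->len + 1 < sizeof(out->buf));
	out->buf[out->len++] = msg;
	out->buf[out->len] = '\0';
}

/* BEGIN callback: only record that nothing has been sent yet. */
static void
on_begin(TxnState *txn)
{
	txn->sent_begin = false;
}

/* Change callback: emit the postponed BEGIN before the first published change. */
static void
on_change(Stream *out, TxnState *txn, bool published)
{
	if (!published)
		return;					/* change is not part of the publication */

	if (!txn->sent_begin)
	{
		emit(out, 'B');
		txn->sent_begin = true;
	}
	emit(out, 'I');
}

/* COMMIT callback: skip COMMIT when BEGIN was never sent. */
static void
on_commit(Stream *out, TxnState *txn)
{
	if (!txn->sent_begin)
		return;
	emit(out, 'C');
}
```

In this sketch, a transaction that only touches unpublished tables leaves
the stream empty, while a transaction with two published inserts produces
B, I, I, C in that order.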

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  36 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 141 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  41 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 266 insertions(+), 35 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index d2c6e15..940f80c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -865,11 +865,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check whether the
+      plugin has received this <command>PREPARE TRANSACTION</command>; if so,
+      it can commit the transaction, otherwise it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 5a38433..0add083 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7552,6 +7559,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c387997..ed60719 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -940,7 +941,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -975,7 +977,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index da0e5e8..282da49 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2770,7 +2770,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c33d1ba..d13c0c8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -971,26 +971,38 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* there is no transaction when COMMIT PREPARED is called */
-	ensure_transaction();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In that case,
+	 * the walsender drops the empty transaction and the accompanying
+	 * prepare. Silently ignore the commit if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	CommitTransactionCommand();
+		/* there is no transaction when COMMIT PREPARED is called */
+		ensure_transaction();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ccf801a..a3e2bae 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -132,6 +134,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -401,10 +408,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -419,8 +448,18 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -432,10 +471,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -450,8 +507,15 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+		return;
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -462,12 +526,28 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -480,8 +560,21 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+			return;
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -630,11 +723,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -668,6 +766,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -767,6 +874,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -774,6 +882,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -810,6 +922,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -830,6 +951,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -842,6 +964,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 7a4804f..2fa60b5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 109000d..7cf4499 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -441,7 +441,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 1ea42c4..93944d8 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 23;
+use Test::More tests => 24;
 
 ###############################
 # Setup
@@ -327,6 +327,45 @@ is($result, qq(2), 'replicated data in subscriber table');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+   "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot
+$node_publisher->safe_psql('postgres', "
+   BEGIN;
+   INSERT INTO tab_nopub SELECT generate_series(1,10);
+   PREPARE TRANSACTION 'empty_transaction';
+   COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+   'postgres', qq(
+       SELECT get_byte(data, 0)
+       FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+           'proto_version', '1',
+           'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+   'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cabc0bb..ad62bbe 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1597,6 +1597,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1

#346Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#343)

On Thu, Jun 3, 2021 at 7:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 2, 2021 at 4:34 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v82*

Few comments on 0001:
====================
1.
+ /*
+ * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+ * called within the PrepareTransactionBlock below.
+ */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data.end_lsn;
+ replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+ PrepareTransactionBlock(gid);
+ CommitTransactionCommand();

Here, calling CommitTransactionCommand() twice looks a bit odd.
Before the first call, can we write a comment like "This is to
complete the Begin command started by the previous call"?

Fixed in v83-0001 and v83-0002

2.
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
bool streaming;

/*
- * Does the output plugin support two-phase decoding, and is it enabled?
+ * Does the output plugin support two-phase decoding.
*/
bool twophase;
/*
+ * Is two-phase option given by output plugin?
+ */
+ bool twophase_opt_given;
+
+ /*
* State for writing output.

I think we can write a few comments on why we need a separate
twophase parameter here. The description of twophase_opt_given can be
changed to: "Is two-phase option given by output plugin? This is to
allow output plugins to enable two_phase at the start of streaming. We
can't rely on twophase parameter that tells whether the plugin
provides all the necessary two_phase APIs for this purpose." Feel free
to add more to it.

TODO

3.
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
MemoryContextSwitchTo(old_context);

/*
- * We allow decoding of prepared transactions iff the two_phase option is
- * enabled at the time of slot creation.
+ * We allow decoding of prepared transactions when the two_phase is
+ * enabled at the time of slot creation, or when the two_phase option is
+ * given at the streaming start.
*/
- ctx->twophase &= MyReplicationSlot->data.two_phase;
+ ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+ /* Mark slot to allow two_phase decoding if not already marked */
+ if (ctx->twophase && !slot->data.two_phase)
+ {
+ slot->data.two_phase = true;
+ ReplicationSlotMarkDirty();
+ ReplicationSlotSave();
+ }

Why do we need to change this during CreateInitDecodingContext which
is called at create_slot time? At that time, we don't need to consider
any options and there is no need to toggle slot's two_phase value.

TODO

4.
- /* Binary mode and streaming are only supported in v14 and higher */
+ /*
+ * Binary, streaming, and two_phase are only supported in v14 and
+ * higher
+ */

We can say v15 for two_phase.

Fixed in v83-0001

5.
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3

Isn't it better to define LOGICALREP_PROTO_MAX_VERSION_NUM as
LOGICALREP_PROTO_TWOPHASE_VERSION_NUM instead of specifying directly
the number?

Fixed in v83-0001
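
The aliasing suggested in comment 5 can be sketched as below. This is only an illustration: the values 1 and 2 for the earlier protocol versions are assumptions (only the value 3 for two-phase appears in the quoted hunk), and proto_version_supported() is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Values 1 and 2 are assumptions for this sketch; 3 comes from the patch. */
#define LOGICALREP_PROTO_MIN_VERSION_NUM 1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3

/* Alias the maximum to the newest named version instead of repeating "3". */
#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM

/* Compile-time check that the alias still covers the streaming version. */
static_assert(LOGICALREP_PROTO_MAX_VERSION_NUM >= LOGICALREP_PROTO_STREAM_VERSION_NUM,
			  "max protocol version must cover streaming");

/* Hypothetical helper: is the requested protocol version within range? */
static bool
proto_version_supported(int version)
{
	return version >= LOGICALREP_PROTO_MIN_VERSION_NUM &&
		version <= LOGICALREP_PROTO_MAX_VERSION_NUM;
}
```

With the alias, adding a later protocol version only requires pointing the MAX macro at the new named constant.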

6.
+/* Commit (and abort) information */
typedef struct LogicalRepCommitData
{
XLogRecPtr commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
TimestampTz committime;
} LogicalRepCommitData;

Is there a reason for the above comment addition? If so, how is it
related to this patch?

LogicalRepCommitData is used by the 0002 patch, and during
implementation it was not clear what this struct was for, so I added the
missing comment (all other nearby typedefs except this one were
commented). But it is not strictly related to anything in patch 0001,
so I have moved this change into the v83-0002 patch.

7.
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,299 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;

In the nearby test files, we have Copyright notice like "# Copyright
(c) 2021, PostgreSQL Global Development Group". We should add one to
the new test files in this patch as well.

Fixed in v83-0001 and v83-0002

8.
+# Also wait for two-phase to be enabled
+my $twophase_query =
+ "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT
IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";

Isn't it better to write this query as: "SELECT count(1) = 1 FROM
pg_subscription WHERE subtwophasestate ='e';"? It looks a bit odd to
use the NOT IN operator here. Similarly, change the same query used at
another place in the patch.

Not changed. This way keeps all the test parts more independent of
each other, doesn't it? E.g. without NOT, if there were other
subscriptions in the same test file then the expected count of 'e'
rows may be 1 or 2 or 3 or whatever. Using NOT means you don't have to
worry about any other test part. I think we had been bitten by similar
state checks before, which is why it was written like this.
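
To make the difference between the two query shapes concrete, here is a rough model in C rather than SQL. It assumes each subscription's subtwophasestate is one char ('d', 'p' or 'e'); both function names are hypothetical and only mirror the two predicates discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Models "SELECT count(1) = 1 ... WHERE subtwophasestate = 'e'". */
static bool
exactly_one_enabled(const char *states, size_t n)
{
	size_t		enabled = 0;

	assert(states != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (states[i] == 'e')
			enabled++;
	return enabled == 1;
}

/* Models "SELECT count(1) = 0 ... WHERE subtwophasestate NOT IN ('e')". */
static bool
none_not_enabled(const char *states, size_t n)
{
	assert(states != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (states[i] != 'e')
			return false;
	return true;
}
```

With two enabled subscriptions the first form fails while the second passes, so the NOT form does not depend on how many subscriptions the test file has created. The first form also passes when one subscription is enabled and another is still pending.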

9.
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+ 'postgres', qq[
+ SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+ WHERE slot_name = 'tap_sub'
+ AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";

I don't see the need to check for stats in this test. If we really
want to test stats then we can add a separate test in
contrib\test_decoding\sql\stats but I suggest leaving it. Please do
the same for other stats tests in the patch.

Removed statistics tests from v83-0001 and v83-0002

10. I think you missed updating LogicalRepRollbackPreparedTxnData in
typedefs.list.

Fixed in v83-0001.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#347Greg Nancarrow
gregn4422@gmail.com
In reply to: Peter Smith (#345)

On Tue, Jun 8, 2021 at 4:12 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v83*

Some feedback for the v83 patch set:

v83-0001:

(1) doc/src/sgml/protocol.sgml

(i) Remove extra space:

BEFORE:
+         The transaction will be  decoded and transmitted at
AFTER:
+         The transaction will be decoded and transmitted at
(ii)
BEFORE:
+   contains Stream Commit or Stream Abort message.
AFTER:
+   contains a Stream Commit or Stream Abort message.
(iii)
BEFORE:
+                The LSN of the commit prepared.
AFTER:
+                The LSN of the commit prepared transaction.

(iv) Should documentation say "prepared transaction" as opposed to
"prepare transaction" ???

BEFORE:
+                The end LSN of the prepare transaction.
AFTER:
+                The end LSN of the prepared transaction.

(2) doc/src/sgml/ref/create_subscription.sgml

(i)
BEFORE:
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
AFTER:
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.

(3) doc/src/sgml/ref/create_subscription.sgml

(i)
BEFORE:
+          prepared on publisher is decoded as normal transaction at commit.
AFTER:
+          prepared on the publisher is decoded as a normal
transaction at commit.
(ii)
BEFORE:
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
AFTER:
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.

(4) src/backend/access/transam/twophase.c

(i)
BEFORE:
+ * Check if the prepared transaction with the given GID, lsn and timestamp
+ * is around.
AFTER:
+ * Check if the prepared transaction with the given GID, lsn and timestamp
+ * exists.

(5) src/backend/access/transam/twophase.c

Question:

Is:

+ * do this optimization if we encounter many collisions in GID

meant to be:

+ * do this optimization if we encounter any collisions in GID

???

(6) src/backend/replication/logical/decode.c

Grammar:

BEFORE:
+ * distributed 2PC. This can be avoided by disallowing to
+ * prepare transactions that have locked [user] catalog tables
+ * exclusively but as of now we ask users not to do such
+ * operation.
AFTER:
+ * distributed 2PC. This can be avoided by disallowing
+ * prepared transactions that have locked [user] catalog tables
+ * exclusively but as of now we ask users not to do such an
+ * operation.

(7) src/backend/replication/logical/logical.c

From the comment above it, it's not clear if the "&=" in the following
line is intentional:

+ ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);

Also, the boolean conditions tested are in the reverse order of what
is mentioned in that comment.
Based on the comment, I would expect the following code:

+ ctx->twophase = (slot->data.two_phase || ctx->twophase_opt_given);

Please check it, and maybe update the comment if "&=" is really intended.

There are TWO places where this same code is used.
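
The behavioural difference between the two forms can be sketched in isolation. This assumes that, before the line in question, ctx->twophase records whether the output plugin provides the two-phase callbacks (as the struct comment quoted earlier in the thread suggests); the function names are hypothetical and the booleans stand in for the real context and slot fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Patch form: ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase) */
static bool
twophase_with_and_assign(bool plugin_supports, bool opt_given, bool slot_two_phase)
{
	bool		twophase = plugin_supports;

	twophase &= (opt_given || slot_two_phase);
	return twophase;
}

/* Plain assignment form: the plugin capability is overwritten, not combined. */
static bool
twophase_with_assign(bool plugin_supports, bool opt_given, bool slot_two_phase)
{
	(void) plugin_supports;
	return slot_two_phase || opt_given;
}

/* The two forms disagree only when the plugin lacks two-phase support. */
static void
check_forms_differ(void)
{
	assert(!twophase_with_and_assign(false, true, false));
	assert(twophase_with_assign(false, true, false));
}
```

Under that assumption, "&=" keeps two-phase decoding off for a plugin without the callbacks even if the option is given, whereas "=" would turn it on.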

(8) src/backend/replication/logical/tablesync.c

In the following code, "has_subrels" should be a bool, not an int.

+static bool
+FetchTableStates(bool *started_tx)
+{
+ static int has_subrels = false;

(9) src/backend/replication/logical/worker.c

Mixed current/past tense:

BEFORE:
+ * was still busy (see the condition of should_apply_changes_for_rel). The
AFTER:
+ * is still busy (see the condition of should_apply_changes_for_rel). The

(10)

2 places:

BEFORE:
+ /* there is no transaction when COMMIT PREPARED is called */
AFTER:
+ /* There is no transaction when COMMIT PREPARED is called */

v83-0002:

1) doc/src/sgml/protocol.sgml

BEFORE:
+   contains Stream Prepare or Stream Commit or Stream Abort message.
AFTER:
+   contains a Stream Prepare or Stream Commit or Stream Abort message.

v83-0003:

1) src/backend/replication/pgoutput/pgoutput.c

i) In pgoutput_commit_txn(), the following code that pfree()s a
pointer in a struct, without then NULLing it out, seems dangerous to
me (because what is to stop other code, either now or in the future,
from subsequently referencing that freed data or perhaps trying to
pfree() again?):

+ PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;
+ bool            skip;
+
+ Assert(data);
+ skip = !data->sent_begin_txn;
+ pfree(data);

I suggest adding the following line of code after the pfree():
+ txn->output_plugin_private = NULL;

ii) In pgoutput_commit_prepared_txn(), there's the same type of code:

+ if (data)
+ {
+ bool skip = !data->sent_begin_txn;
+ pfree(data);
+ if (skip)
+ return;
+ }

I suggest adding the following line after the pfree() above:

+ txn->output_plugin_private = NULL;

iii) Again, same thing in pgoutput_rollback_prepared_txn():

I suggest adding the following line after the pfree() above:

+ txn->output_plugin_private = NULL;
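
A standalone sketch of the free-then-NULL pattern suggested above. It uses malloc()/free() in place of palloc()/pfree(), and TxnData and Txn are simplified stand-ins for PGOutputTxnData and ReorderBufferTXN; release_txn_data() is a hypothetical helper, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-in for PGOutputTxnData. */
typedef struct TxnData
{
	bool		sent_begin_txn;
} TxnData;

/* Simplified stand-in for ReorderBufferTXN. */
typedef struct Txn
{
	void	   *output_plugin_private;
} Txn;

/*
 * Free the per-transaction data and clear the pointer so that later code
 * cannot dereference or double-free it. Returns true if the commit should
 * be skipped because BEGIN was never sent.
 */
static bool
release_txn_data(Txn *txn)
{
	TxnData    *data = (TxnData *) txn->output_plugin_private;
	bool		skip;

	assert(data != NULL);
	skip = !data->sent_begin_txn;
	free(data);
	txn->output_plugin_private = NULL;
	return skip;
}
```

After the call, any later check of output_plugin_private sees NULL rather than a dangling pointer.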

Regards,
Greg Nancarrow
Fujitsu Australia

#348Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#346)
3 attachment(s)

On Tue, Jun 8, 2021 at 4:19 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Jun 3, 2021 at 7:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 2, 2021 at 4:34 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v82*

Attaching patchset-v84 that addresses some of Amit's and Vignesh's comments.
This patch-set also modifies the test case added for copy_data = false
to check that two-phase transactions are decoded correctly.

2.
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
bool streaming;

/*
- * Does the output plugin support two-phase decoding, and is it enabled?
+ * Does the output plugin support two-phase decoding.
*/
bool twophase;
/*
+ * Is two-phase option given by output plugin?
+ */
+ bool twophase_opt_given;
+
+ /*
* State for writing output.

I think we can write a few comments on why we need a separate
twophase parameter here. The description of twophase_opt_given can be
changed to: "Is two-phase option given by output plugin? This is to
allow output plugins to enable two_phase at the start of streaming. We
can't rely on twophase parameter that tells whether the plugin
provides all the necessary two_phase APIs for this purpose." Feel free
to add more to it.

TODO

Added comments here.

3.
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
MemoryContextSwitchTo(old_context);

/*
- * We allow decoding of prepared transactions iff the two_phase option is
- * enabled at the time of slot creation.
+ * We allow decoding of prepared transactions when the two_phase is
+ * enabled at the time of slot creation, or when the two_phase option is
+ * given at the streaming start.
*/
- ctx->twophase &= MyReplicationSlot->data.two_phase;
+ ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+ /* Mark slot to allow two_phase decoding if not already marked */
+ if (ctx->twophase && !slot->data.two_phase)
+ {
+ slot->data.two_phase = true;
+ ReplicationSlotMarkDirty();
+ ReplicationSlotSave();
+ }

Why do we need to change this during CreateInitDecodingContext which
is called at create_slot time? At that time, we don't need to consider
any options and there is no need to toggle slot's two_phase value.

TODO

As part of the recent changes, we do turn on two_phase at create_slot time when
the subscription is created with (copy_data = false, two_phase = on).
So, this code is required.

Amit:

"1.
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable
class="parameter">slot_name</replaceable> [
<literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [
<literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal>
<replaceable class="parameter">output_plugin</replaceable> [
<literal>EXPORT_SNAPSHOT</literal> |
<literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal>
] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable
class="parameter">slot_name</replaceable> [
<literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] {
<literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] |
<literal>LOGICAL</literal> <replaceable
class="parameter">output_plugin</replaceable> [
<literal>EXPORT_SNAPSHOT</literal> |
<literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal>
] }

Can we do some testing of the code related to this in some way? One
random idea could be to change the current subscriber-side code just
for testing purposes to see if this works. Can we enhance and use
pg_recvlogical to test this? It is possible that if you address
comment number 13 below, this can be tested with Create Subscription
command."

Actually, this is tested by the test case added for CREATE
SUBSCRIPTION with (copy_data = false), because in that case
the slot is created with the two-phase option.

Vignesh's comment:

"We could add some debug level log messages for the transaction that
will be skipped."

Updated debug messages.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v84-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 68e6bcc99e3f078551ac25608c146115a3fa6187 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Jun 2021 22:43:03 -0400
Subject: [PATCH v84] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
transactions. We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the following operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 307 +++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 148 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  37 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 +++++++++--
 src/backend/replication/logical/worker.c           | 343 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/backend/replication/walsender.c                |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  17 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 45 files changed, 2443 insertions(+), 201 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 1649320..c5e078f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7644,6 +7644,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..8d4fdf3 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decode of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be  decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction 
+   contains Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7365,6 +7386,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..bbef613 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used along with
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used along with
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..776295c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match origin_lsn and origin_timestamp
+ * of the prepared xact to avoid a false match against a prepared xact that
+ * came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
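
The matching rule that LookupGXact applies can be shown in isolation. What follows is only a minimal sketch under stated assumptions, not the patch's code: PreparedXactSketch and lookup_gxact_sketch are hypothetical names, the array stands in for TwoPhaseState->prepXacts, and the origin fields are held in memory instead of being read from the 2PC file or WAL under TwoPhaseStateLock. It shows why a GID match alone is not accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one prepared transaction; not a PostgreSQL type. */
typedef struct PreparedXactSketch
{
	bool		valid;				/* false while the GID is not yet valid */
	const char *gid;
	uint64_t	origin_lsn;			/* end LSN of the remote PREPARE */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
} PreparedXactSketch;

/*
 * Return true only if a valid entry matches the GID and also the origin LSN
 * and origin timestamp. A GID-only match is deliberately not enough.
 */
static bool
lookup_gxact_sketch(const PreparedXactSketch *xacts, size_t n,
					const char *gid, uint64_t prepare_end_lsn,
					int64_t origin_prepare_timestamp)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		const PreparedXactSketch *x = &xacts[i];

		/* Ignore not-yet-valid entries and entries with a different GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```

A local prepared transaction that happens to reuse the GID but was prepared at a different LSN or time is therefore not reported as found.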
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..f8826fb 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -505,10 +556,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. It will be enabled once all the tables
+				 * are synced and ready. This avoids race conditions such as
+				 * prepared transactions being skipped because their changes
+				 * are not applied due to the checks in
+				 * should_apply_changes_for_rel() while tablesync for the
+				 * corresponding tables is in progress. See comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -814,7 +889,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +924,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +953,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +999,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1016,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1058,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1076,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1116,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..ccde3bc 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The below code is a temporary hack to check for
+		 * server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +847,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +861,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
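
To make the interaction between CreateSubscription() and libpqrcv_create_slot() easier to follow, here is a small standalone sketch. It is an illustration under assumptions, not the patch's code: slot_twophase_upfront and build_create_slot_cmd are hypothetical helpers, the "CREATE_REPLICATION_SLOT" prefix is assumed because it sits outside the hunk, snprintf replaces StringInfo, and the snapshot-action suffix and the conn->logical check are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Mirrors the condition in CreateSubscription(): the slot is created with
 * two-phase enabled up-front only when two_phase was requested, copy_data is
 * false, and the subscription has tables (tables != NIL in the patch).
 */
static bool
slot_twophase_upfront(bool twophase, bool copy_data, bool has_tables)
{
	return twophase && !copy_data && has_tables;
}

/*
 * Build the slot creation command in the order the patch appends the
 * keywords: slot name, TEMPORARY, TWO_PHASE, then the logical plugin.
 * Returns what snprintf returns.
 */
static int
build_create_slot_cmd(char *buf, size_t buflen, const char *slotname,
					  bool temporary, bool two_phase)
{
	assert(buf != NULL && slotname != NULL);

	return snprintf(buf, buflen,
					"CREATE_REPLICATION_SLOT \"%s\"%s%s LOGICAL pgoutput",
					slotname,
					temporary ? " TEMPORARY" : "",
					two_phase ? " TWO_PHASE" : "");
}
```

With two_phase requested and copy_data left at true, the helper returns false, so the command carries no TWO_PHASE keyword and the subscription stays in the pending state until tablesync completes.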
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..b106588 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing
+				 * preparing transactions that have locked [user] catalog
+				 * tables exclusively but as of now, we ask users not to do
+				 * such an operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c387997 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +547,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +622,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
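
For readers checking the wire format against the protocol documentation hunk, the Begin Prepare layout (Byte1('b'), Int64 prepare LSN, Int64 end LSN, Int64 prepare timestamp, Int32 xid, null-terminated GID) can be decoded from a raw buffer roughly as follows. This is a sketch only: BeginPrepareSketch, GID_MAX_SKETCH and parse_begin_prepare are hypothetical names, integers are assumed to be in network byte order as sent by pq_sendint64/pq_sendint32, and the bounded GID copy is a choice of this sketch and not what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GID_MAX_SKETCH 200		/* hypothetical buffer size for the GID */

/* Hypothetical decoded form of a Begin Prepare message. */
typedef struct BeginPrepareSketch
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	char		gid[GID_MAX_SKETCH];
} BeginPrepareSketch;

/* Read a big-endian 64-bit integer. */
static uint64_t
read_be64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian 32-bit integer. */
static uint32_t
read_be32(const unsigned char *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
		((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Returns false on a malformed message or an unset (zero) LSN. */
static bool
parse_begin_prepare(const unsigned char *msg, size_t len,
					BeginPrepareSketch *out)
{
	const size_t fixed = 1 + 8 + 8 + 8 + 4;	/* type byte + fixed fields */
	const unsigned char *gid;
	const unsigned char *nul;
	size_t		gidlen;

	assert(msg != NULL && out != NULL);

	if (len <= fixed || msg[0] != 'b')
		return false;

	out->prepare_lsn = read_be64(msg + 1);
	out->end_lsn = read_be64(msg + 9);
	out->prepare_time = (int64_t) read_be64(msg + 17);
	out->xid = read_be32(msg + 25);

	/* The GID must be null-terminated inside the message and fit the buffer. */
	gid = msg + fixed;
	nul = memchr(gid, '\0', len - fixed);
	if (nul == NULL)
		return false;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= sizeof(out->gid))
		return false;
	memcpy(out->gid, gid, gidlen + 1);

	/* The read function in proto.c rejects an unset prepare_lsn or end_lsn. */
	return out->prepare_lsn != 0 && out->end_lsn != 0;
}
```

The Prepare, Commit Prepared and Rollback Prepared messages differ mainly in the type byte, the extra Int8 flags field after it, and (for Rollback Prepared) the additional timestamp, so the same pattern would extend to them.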
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
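For anyone reading the proto.c hunks above without the backend at hand: as written, BEGIN PREPARE is a tag byte followed by two 64-bit LSNs, a 64-bit timestamp, a 32-bit xid and a NUL-terminated gid, in network byte order. Below is a minimal standalone sketch of that layout, not the patch's code. The Demo* names, the buffer helpers, the tag value 'b' and the 200-byte gid limit are assumptions made for illustration. Unlike the strcpy() in the read functions, the sketch bounds the gid copy.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_GIDSIZE 200			/* assumption: mirrors GIDSIZE */
#define DEMO_MSG_BEGIN_PREPARE 'b'	/* assumption: tag byte of the message */

/* Hypothetical stand-in for LogicalRepPreparedTxnData. */
typedef struct DemoBeginPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[DEMO_GIDSIZE];
} DemoBeginPrepare;

/* Store the low nbytes of v in big-endian order; returns nbytes. */
static size_t
demo_put_be(unsigned char *buf, uint64_t v, int nbytes)
{
	assert(nbytes >= 1 && nbytes <= 8);
	for (int i = 0; i < nbytes; i++)
		buf[i] = (unsigned char) (v >> (8 * (nbytes - 1 - i)));
	return (size_t) nbytes;
}

/* Read nbytes of big-endian data. */
static uint64_t
demo_get_be(const unsigned char *buf, int nbytes)
{
	uint64_t	v = 0;

	assert(nbytes >= 1 && nbytes <= 8);
	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Serialize the message; returns bytes written, or 0 if buf is too small. */
static size_t
demo_write_begin_prepare(unsigned char *buf, size_t buflen,
						 const DemoBeginPrepare *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (1 + 8 + 8 + 8 + 4 + gidlen > buflen)
		return 0;

	buf[off++] = DEMO_MSG_BEGIN_PREPARE;
	off += demo_put_be(buf + off, m->prepare_lsn, 8);
	off += demo_put_be(buf + off, m->end_lsn, 8);
	off += demo_put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += demo_put_be(buf + off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/*
 * Parse the message; returns 0 on success, -1 on malformed input. The sketch
 * checks the tag byte itself, whereas the backend reader is called after the
 * dispatcher has already consumed it.
 */
static int
demo_read_begin_prepare(const unsigned char *buf, size_t len,
						DemoBeginPrepare *m)
{
	size_t		off = 1;
	size_t		gidlen;
	const unsigned char *nul;

	if (len < 1 + 28 + 1 || buf[0] != DEMO_MSG_BEGIN_PREPARE)
		return -1;

	m->prepare_lsn = demo_get_be(buf + off, 8);
	off += 8;
	m->end_lsn = demo_get_be(buf + off, 8);
	off += 8;
	m->prepare_time = (int64_t) demo_get_be(buf + off, 8);
	off += 8;
	m->xid = (uint32_t) demo_get_be(buf + off, 4);
	off += 4;

	/* Same sanity checks as the patch: both LSNs must be set. */
	if (m->prepare_lsn == 0 || m->end_lsn == 0)
		return -1;

	/* Bounded gid copy: require a NUL inside the message and a gid that fits. */
	nul = memchr(buf + off, '\0', len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off));
	if (gidlen >= DEMO_GIDSIZE)
		return -1;
	memcpy(m->gid, buf + off, gidlen + 1);
	return 0;
}
```

PREPARE, COMMIT PREPARED and ROLLBACK PREPARED follow the same pattern with an extra flags byte after the tag. ROLLBACK PREPARED also carries a second timestamp.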
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..da0e5e8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2552,7 +2552,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2643,7 +2643,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2690,7 +2690,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2710,7 +2710,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2729,19 +2729,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2759,12 +2760,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
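To spell out the condition changed in ReorderBufferFinishPrepared above: the prepare is replayed at COMMIT PREPARED time only if it was recorded before the two_phase_at point and the transaction is being committed. For a rollback there is nothing to replay. A tiny sketch of that predicate follows. DemoLsn and the function name are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t DemoLsn;		/* hypothetical stand-in for XLogRecPtr */

/*
 * True when the PREPARE was not sent earlier and therefore has to be sent
 * together with COMMIT PREPARED. Mirrors the patch's test
 * (txn->final_lsn < two_phase_at) && is_commit.
 */
static bool
demo_must_replay_prepare(DemoLsn prepare_lsn, DemoLsn two_phase_at,
						 bool is_commit)
{
	return prepare_lsn < two_phase_at && is_commit;
}
```

A prepare at or after two_phase_at is assumed to have been decoded at prepare time, so only the commit itself is sent.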
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..f4290f6 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY.
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
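To make the tri-state handling in tablesync.c easier to follow, here is a small standalone sketch of the decision AllTablesyncsReady() feeds into. It is only a model under stated assumptions: the enum, the function names and the plain table counts stand in for the catalog state and are not part of the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum DemoTwoPhaseState
{
	DEMO_TWOPHASE_DISABLED,
	DEMO_TWOPHASE_PENDING,
	DEMO_TWOPHASE_ENABLED
} DemoTwoPhaseState;

/*
 * Model of AllTablesyncsReady(): false when the subscription has no tables,
 * otherwise true only when no table is still syncing.
 */
static bool
demo_all_tablesyncs_ready(int total_tables, int not_ready_tables)
{
	bool		has_subrels = total_tables > 0;

	return has_subrels && not_ready_tables == 0;
}

/*
 * State the apply worker would end up with at startup: PENDING becomes
 * ENABLED only once every tablesync is READY; anything else is unchanged.
 */
static DemoTwoPhaseState
demo_state_at_worker_start(DemoTwoPhaseState current,
						   int total_tables, int not_ready_tables)
{
	if (current == DEMO_TWOPHASE_PENDING &&
		demo_all_tablesyncs_ready(total_tables, not_ready_tables))
		return DEMO_TWOPHASE_ENABLED;

	return current;
}
```

Note that a subscription with zero tables stays PENDING, which matches the comment about keeping ALTER SUBSCRIPTION ... REFRESH PUBLICATION usable.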
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6ba447e..3e987ca 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * was still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ...
+ * REFRESH PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that initially it was on and
+ * we have received some prepare; if we then turn it off, at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we could allow changing this option from off to on, but
+ * it is not clear that this alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -794,6 +870,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* there is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* there is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2032,6 +2282,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2509,6 +2775,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2995,6 +3264,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3161,15 +3444,69 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
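
The three-way version selection that the FIXME in ApplyWorkerMain describes could look roughly like the following once PG_VERSION_NUM reaches 150000. This is only a sketch: choose_proto_version is a hypothetical helper that does not exist in the patch, and the three constants are copied from the logicalproto.h hunk of this patch.

```c
#include <assert.h>

/* Values copied from the logicalproto.h hunk in this patch. */
#define LOGICALREP_PROTO_VERSION_NUM 1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3

/*
 * Hypothetical helper (not part of the patch): pick the highest protocol
 * version that a publisher reporting server_version is expected to
 * understand, following the "e.g." outline in the FIXME comment.
 */
int
choose_proto_version(int server_version)
{
	assert(server_version > 0);

	if (server_version >= 150000)
		return LOGICALREP_PROTO_TWOPHASE_VERSION_NUM;
	if (server_version >= 140000)
		return LOGICALREP_PROTO_STREAM_VERSION_NUM;
	return LOGICALREP_PROTO_VERSION_NUM;
}
```

In the apply worker, the argument would presumably come from walrcv_server_version(LogRepWorkerWalRcvConn), as in the existing code.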
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index fe12d08..85ba1dc 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we only record whether the two-phase option was passed in;
+		 * whether to enable it is decided at a later point. It remains
+		 * enabled if a previous start-up has done so. We only allow the
+		 * option to be passed in with a sufficient version of the protocol,
+		 * and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -836,18 +945,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1266,3 +1365,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 109c723..e94069c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -954,7 +954,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 */
 		ReplicationSlotCreate(cmd->slotname, true,
 							  cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL,
-							  false);
+							  cmd->two_phase);
 	}
 
 	if (cmd->kind == REPLICATION_KIND_LOGICAL)
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 8f53cc7..8141311 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4304,6 +4305,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4347,9 +4349,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The below code is a temporary hack to check for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4370,6 +4388,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4395,6 +4414,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4422,6 +4443,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4463,6 +4485,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..6caa701 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,7 +6415,9 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary and streaming are only supported in v14 and higher.
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
@@ -6423,6 +6425,17 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/*
+		 * Two_phase is only supported in v15 and higher.
+		 *
+		 * FIXME: When PG15 development starts, change the following
+		 * 140000 to 150000
+		 */
+		if (pset.sversion >= 140000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 32c1bdf..79114e1 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0060ebf..e84353e 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -94,6 +104,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
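
The tri-state defined above, and the transition that ApplyWorkerMain performs on it in this patch, can be summarized in a small standalone sketch. The function next_twophase_state is hypothetical and exists only for illustration. The all_tablesyncs_ready argument stands in for the result of AllTablesyncsReady(), whose implementation is not shown in this part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the pg_subscription.h hunk in this patch. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'

/*
 * Hypothetical helper (not part of the patch): return the state the
 * subscription should have after the apply worker starts streaming.
 * Only PENDING can change, and only once all tablesyncs are READY.
 */
char
next_twophase_state(char state, bool all_tablesyncs_ready)
{
	assert(state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
		   state == LOGICALREP_TWOPHASE_STATE_PENDING ||
		   state == LOGICALREP_TWOPHASE_STATE_ENABLED);

	if (state == LOGICALREP_TWOPHASE_STATE_PENDING && all_tablesyncs_ready)
		return LOGICALREP_TWOPHASE_STATE_ENABLED;

	return state;
}
```

A DISABLED subscription never moves to ENABLED through this path, which matches the worker code: it only tests for the PENDING state before calling UpdateTwoPhaseState().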
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index faa3a25..ebc43a0 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
+	bool		two_phase;
 	List	   *options;
 } CreateReplicationSlotCmd;
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..0b071a6 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_STREAMING command. We can't rely solely on the twophase
+	 * flag which only tells whether the plugin provided all the necessary
+	 * two-phase callbacks.
+	 *
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..e20f2da 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
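
The comment on LogicalRepRollbackPreparedTxnData explains why the gid alone cannot identify the prepared transaction on the subscriber. A minimal sketch of such a check follows, assuming the subscriber keeps the prepare end LSN and prepare timestamp next to each gid. The struct and function names are hypothetical, plain integer types stand in for XLogRecPtr and TimestampTz, and SKETCH_GIDSIZE is assumed to mirror GIDSIZE. This is not the LookupGXact() implementation, which is not part of this hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumption: mirrors GIDSIZE */

/* Hypothetical record of a prepared transaction known locally. */
typedef struct SketchPreparedXact
{
	char		gid[SKETCH_GIDSIZE];
	uint64_t	prepare_end_lsn;	/* stands in for XLogRecPtr */
	int64_t		prepare_time;		/* stands in for TimestampTz */
} SketchPreparedXact;

/*
 * Return true only if the gid, the prepare end LSN, and the prepare time
 * all match. A match on the gid alone is not enough, because an unrelated
 * local prepared transaction may use the same identifier.
 */
bool
sketch_prepared_matches(const SketchPreparedXact *local, const char *gid,
						uint64_t prepare_end_lsn, int64_t prepare_time)
{
	assert(local != NULL && gid != NULL);

	return strcmp(local->gid, gid) == 0 &&
		local->prepare_end_lsn == prepare_end_lsn &&
		local->prepare_time == prepare_time;
}
```

When the check fails, the downstream can skip the rollback operation, as the comment describes.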
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..109000d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -296,7 +296,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -635,7 +639,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+# create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber. There will be 2
+# prepared transactions for the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..cabc0bb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,12 +1388,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

Attachment: v84-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 8fab36a0e4ce07c2f0ea1472cf144876d114a9a2 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 8 Jun 2021 23:09:40 -0400
Subject: [PATCH v84] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  11 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1016 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
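
The Stream Prepare message documented in the protocol.sgml hunk of this patch
is a type byte 'p', an Int8 flags field, three Int64 fields (prepare LSN, end
LSN, prepare timestamp), an Int32 xid, and a null-terminated GID string. A
minimal standalone sketch of that layout follows, assuming integers travel in
network byte order as elsewhere in the protocol. It does not use the backend's
pq_* routines; the struct, the function names, and the 200-byte GID buffer are
hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory form of a Stream Prepare message (illustration only). */
typedef struct StreamPrepareMsg
{
	uint8_t		flags;			/* currently unused, must be 0 */
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* transaction id */
	char		gid[200];		/* assumed buffer size for this sketch */
} StreamPrepareMsg;

/* Fixed-size part: type byte + flags + 3 * Int64 + Int32. */
#define STREAM_PREPARE_FIXED (1 + 1 + 8 + 8 + 8 + 4)

/* Store a 64-bit value most significant byte first. */
static void
put_u64(uint8_t *buf, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t) (v >> (56 - 8 * i));
}

/* Store a 32-bit value most significant byte first. */
static void
put_u32(uint8_t *buf, uint32_t v)
{
	for (int i = 0; i < 4; i++)
		buf[i] = (uint8_t) (v >> (24 - 8 * i));
}

/* Read n bytes (n <= 8) as a big-endian unsigned integer. */
static uint64_t
get_be(const uint8_t *buf, int n)
{
	uint64_t	v = 0;

	for (int i = 0; i < n; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Serialize m into buf. Returns bytes written, or 0 if buf is too small. */
static size_t
encode_stream_prepare(const StreamPrepareMsg *m, uint8_t *buf, size_t cap)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the terminator */

	assert(m->flags == 0);
	if (STREAM_PREPARE_FIXED + gidlen > cap)
		return 0;

	buf[0] = 'p';
	buf[1] = m->flags;
	put_u64(buf + 2, m->prepare_lsn);
	put_u64(buf + 10, m->end_lsn);
	put_u64(buf + 18, (uint64_t) m->prepare_time);
	put_u32(buf + 26, m->xid);
	memcpy(buf + STREAM_PREPARE_FIXED, m->gid, gidlen);
	return STREAM_PREPARE_FIXED + gidlen;
}

/* Parse buf into m. Returns 0 on success, -1 on malformed input. */
static int
decode_stream_prepare(const uint8_t *buf, size_t len, StreamPrepareMsg *m)
{
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	if (len <= STREAM_PREPARE_FIXED || buf[0] != 'p' || buf[1] != 0)
		return -1;

	m->flags = buf[1];
	m->prepare_lsn = get_be(buf + 2, 8);
	m->end_lsn = get_be(buf + 10, 8);
	m->prepare_time = (int64_t) get_be(buf + 18, 8);
	m->xid = (uint32_t) get_be(buf + 26, 4);

	/* The GID must be terminated inside the message and fit the buffer. */
	gid = buf + STREAM_PREPARE_FIXED;
	nul = memchr(gid, '\0', len - STREAM_PREPARE_FIXED);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= sizeof(m->gid))
		return -1;
	memcpy(m->gid, gid, gidlen + 1);
	return 0;
}
```

A caller would fill a StreamPrepareMsg, pass it to encode_stream_prepare()
with a sufficiently large buffer, and hand the resulting bytes to
decode_stream_prepare() on the other side. The decoder rejects non-zero flags
and an unterminated or oversized GID rather than copying blindly.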

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 8d4fdf3..5a38433 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7386,7 +7386,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7649,6 +7649,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index bbef613..a985e0d 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,12 +237,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
-
-         <para>
-          The <literal>streaming</literal> option cannot be used along with
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +263,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used along with
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f8826fb..894a1b3 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -924,12 +909,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3e987ca..c33d1ba 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1044,6 +1044,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1241,30 +1321,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1272,7 +1343,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1287,7 +1358,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1362,6 +1433,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2298,6 +2394,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 85ba1dc..ccf801a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1027,6 +1018,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index e20f2da..7a4804f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
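The expected count of 3334 that the tests above assert after COMMIT PREPARED can be checked by hand. The following is a minimal standalone sketch, not part of the patch; the helper name expected_row_count is hypothetical, and it only mimics the row-count effect of the INSERT/UPDATE/DELETE statements used in the prepared transaction.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the patch): replay the row-count effect of the
# statements that the test runs inside the prepared transaction.
sub expected_row_count
{
	# Rows present before the transaction: a = 1 and a = 2.
	my %rows = map { $_ => 1 } (1, 2);

	# INSERT ... generate_series(3, 5000) adds a = 3 .. 5000.
	$rows{$_} = 1 for 3 .. 5000;

	# UPDATE ... WHERE mod(a,2) = 0 only rewrites column b, so the
	# number of rows does not change.

	# DELETE ... WHERE mod(a,3) = 0 removes every multiple of 3.
	delete $rows{$_} for grep { $_ % 3 == 0 } keys %rows;

	return scalar keys %rows;
}

print expected_row_count(), "\n";
```

There are 5000 rows after the insert, of which 1666 are multiples of 3, leaving 3334 rows.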
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v84-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From 48d597be65576458e2d44566681f73115446f446 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 9 Jun 2021 00:40:12 -0400
Subject: [PATCH v84] Skip empty transactions for logical replication.

The current logical replication behaviour is to send every transaction to
the subscriber even if the transaction is empty (because it does not
contain changes from the selected publications). Building and transmitting
these empty transactions wastes CPU cycles and network bandwidth.

This patch addresses the problem by postponing the BEGIN / BEGIN PREPARE
message until the first change. While processing a COMMIT or PREPARE
message, if no change has been sent for that transaction, the COMMIT or
PREPARE message is not sent either. This means that pgoutput skips the
BEGIN / COMMIT or BEGIN PREPARE / PREPARE messages for empty transactions.

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  36 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 155 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  46 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 282 insertions(+), 38 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index d2c6e15..940f80c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -865,11 +865,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command> in which case
+      it can commit the transaction, otherwise, it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 5a38433..0add083 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7552,6 +7559,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c387997..ed60719 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -940,7 +941,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -975,7 +977,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index da0e5e8..282da49 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2770,7 +2770,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c33d1ba..d13c0c8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -971,26 +971,38 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* there is no transaction when COMMIT PREPARED is called */
-	ensure_transaction();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In which case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	CommitTransactionCommand();
+		/* there is no transaction when COMMIT PREPARED is called */
+		ensure_transaction();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ccf801a..6186b24 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -132,6 +134,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -401,10 +408,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -419,8 +448,21 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -432,10 +474,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -450,8 +510,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty prepared transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -462,12 +532,32 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of COMMIT PREPARED of an empty transaction");
+			return;
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -480,8 +570,25 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of ROLLBACK of an empty transaction");
+			return;
+		}
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -630,11 +737,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -668,6 +780,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -767,6 +888,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -774,6 +896,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -810,6 +936,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -830,6 +965,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -842,6 +978,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 7a4804f..2fa60b5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 109000d..7cf4499 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -441,7 +441,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 4c372a6..8a33641 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 25;
 
 ###############################
 # Setup
@@ -318,10 +318,9 @@ $node_publisher->safe_psql('postgres', "
 
 $node_publisher->wait_for_catchup($appname_copy);
 
-# Check that the transaction has been prepared on the subscriber, there will be 2
-# prepared transactions for the 2 subscriptions.
+# Check that the transaction has been prepared on the subscriber
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;;");
-is($result, qq(2), 'transaction is prepared on subscriber');
+is($result, qq(1), 'transaction is prepared on subscriber');
 
 # Now commit the insert and verify that it IS replicated
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
@@ -337,6 +336,45 @@ is($result, qq(2), 'replicated data in subscriber table');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+   "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot
+$node_publisher->safe_psql('postgres', "
+   BEGIN;
+   INSERT INTO tab_nopub SELECT generate_series(1,10);
+   PREPARE TRANSACTION 'empty_transaction';
+   COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+   'postgres', qq(
+       SELECT get_byte(data, 0)
+       FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+           'proto_version', '1',
+           'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+   'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cabc0bb..ad62bbe 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1597,6 +1597,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
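
For readers following the v84-0003 patch above: the deferred-BEGIN idea from
its commit message can be sketched on its own, outside PostgreSQL. This is
only an illustration under stated assumptions, not the patch's code. Every
name in it (DemoStream, DemoTxn, demo_begin, demo_change, demo_commit,
demo_emit) is hypothetical, and the "stream" is just a byte array recording
which messages would have been sent ('B' = BEGIN, 'I' = change, 'C' = COMMIT).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; none of these are PostgreSQL APIs. */
#define DEMO_MAX_MSGS 16

typedef struct DemoStream
{
	char		msgs[DEMO_MAX_MSGS];	/* one byte per emitted message */
	size_t		nmsgs;					/* number of messages emitted so far */
} DemoStream;

typedef struct DemoTxn
{
	bool		sent_begin;		/* has BEGIN been emitted for this txn? */
} DemoTxn;

/* Append one message kind to the output stream. */
void
demo_emit(DemoStream *out, char kind)
{
	assert(out->nmsgs < DEMO_MAX_MSGS);
	out->msgs[out->nmsgs++] = kind;
}

/* BEGIN callback: only reset the per-transaction flag, emit nothing yet. */
void
demo_begin(DemoTxn *txn)
{
	txn->sent_begin = false;
}

/*
 * Change callback: changes to unpublished relations are dropped. The first
 * published change emits the postponed BEGIN before the change itself.
 */
void
demo_change(DemoStream *out, DemoTxn *txn, bool published)
{
	if (!published)
		return;

	if (!txn->sent_begin)
	{
		demo_emit(out, 'B');
		txn->sent_begin = true;
	}
	demo_emit(out, 'I');
}

/*
 * COMMIT callback: if BEGIN was never sent, the transaction was empty for
 * this consumer, so COMMIT is skipped too. Returns true if COMMIT was sent.
 */
bool
demo_commit(DemoStream *out, DemoTxn *txn)
{
	if (!txn->sent_begin)
		return false;

	demo_emit(out, 'C');
	return true;
}
```

With this shape, a transaction that touches only unpublished tables leaves
the stream untouched, while a transaction with at least one published change
produces BEGIN, the changes, and COMMIT in order. The sketch leaves out the
prepared-transaction path and streaming mode, both of which the patch handles
separately.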

#349Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#348)

On Wed, Jun 9, 2021 at 10:34 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Jun 8, 2021 at 4:19 PM Peter Smith <smithpb2250@gmail.com> wrote:

3.
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
MemoryContextSwitchTo(old_context);

/*
- * We allow decoding of prepared transactions iff the two_phase option is
- * enabled at the time of slot creation.
+ * We allow decoding of prepared transactions when the two_phase is
+ * enabled at the time of slot creation, or when the two_phase option is
+ * given at the streaming start.
*/
- ctx->twophase &= MyReplicationSlot->data.two_phase;
+ ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+ /* Mark slot to allow two_phase decoding if not already marked */
+ if (ctx->twophase && !slot->data.two_phase)
+ {
+ slot->data.two_phase = true;
+ ReplicationSlotMarkDirty();
+ ReplicationSlotSave();
+ }

Why do we need to change this during CreateInitDecodingContext which
is called at create_slot time? At that time, we don't need to consider
any options and there is no need to toggle slot's two_phase value.

TODO

As part of the recent changes, we do turn on two_phase at create_slot time when
the subscription is created with (copy_data = false, two_phase = on).
So, this code is required.

But in that case, won't we deal with it using the value passed in
CreateReplicationSlotCmd? It should be enabled after we call
ReplicationSlotCreate.
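
For readers following this exchange, the decision made by the quoted hunk can be modelled in isolation. This is only a minimal sketch, assuming simplified stand-in structs: SlotModel, CtxModel and resolve_two_phase are hypothetical names, not PostgreSQL APIs, and the "dirty" flag merely stands in for the ReplicationSlotMarkDirty()/ReplicationSlotSave() calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the persistent slot data (slot->data.two_phase). */
typedef struct SlotModel
{
	bool		two_phase;	/* slot already marked for two-phase decoding? */
	bool		dirty;		/* stands in for MarkDirty + Save */
} SlotModel;

/* Hypothetical stand-in for the relevant LogicalDecodingContext fields. */
typedef struct CtxModel
{
	bool		twophase;			/* plugin provides the two-phase callbacks */
	bool		twophase_opt_given; /* two_phase option passed by the caller */
} CtxModel;

/*
 * Mirror of the quoted logic: decoding at prepare time is allowed only if
 * the plugin supports it AND either the option was given or the slot was
 * already marked. A newly enabled slot is marked so the setting persists.
 */
static bool
resolve_two_phase(CtxModel *ctx, SlotModel *slot)
{
	assert(ctx != NULL && slot != NULL);

	ctx->twophase &= (ctx->twophase_opt_given || slot->two_phase);

	if (ctx->twophase && !slot->two_phase)
	{
		slot->two_phase = true;
		slot->dirty = true;
	}

	return ctx->twophase;
}
```

In this model, a slot created without two_phase is only ever marked when the option is given and the plugin supports two-phase decoding; without the option, the slot is left untouched.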

--
With Regards,
Amit Kapila.

#350Amit Kapila
amit.kapila16@gmail.com
In reply to: Greg Nancarrow (#347)

On Wed, Jun 9, 2021 at 9:58 AM Greg Nancarrow <gregn4422@gmail.com> wrote:

(5) src/backend/access/transam/twophase.c

Question:

Is:

+ * do this optimization if we encounter many collisions in GID

meant to be:

+ * do this optimization if we encounter any collisions in GID

No, it should be fine if there are very few collisions.
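
For context, the comment under discussion sits in the patch's LookupGXact(), which first matches on GID and only then compares the origin LSN and origin timestamp, so the extra work under the lock is paid only for GID matches. The idea can be sketched roughly as follows, assuming an in-memory array in place of TwoPhaseState and of the 2PC file/WAL reads; PreparedXactModel and lookup_prepared are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, flattened view of one prepared transaction. */
typedef struct PreparedXactModel
{
	const char *gid;			/* user-defined global identifier */
	uint64_t	origin_lsn;		/* LSN where the remote prepare ended */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
	bool		valid;			/* entry fully set up? */
} PreparedXactModel;

/*
 * Return true only if an entry matches on GID *and* on origin LSN and
 * timestamp. A GID-only match is not enough, because a different prepared
 * transaction with the same GID may exist on the same node.
 */
static bool
lookup_prepared(const PreparedXactModel *xacts, size_t n,
				const char *gid, uint64_t prepare_end_lsn,
				int64_t prepare_timestamp)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		const PreparedXactModel *x = &xacts[i];

		/* Cheap filter first: skip invalid entries and other GIDs. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		/* Only GID matches reach the (in reality more expensive) check. */
		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == prepare_timestamp)
			return true;
	}

	return false;
}
```

With few GID collisions the second comparison runs rarely, which is why doing the I/O under the lock is considered acceptable here.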

--
With Regards,
Amit Kapila.

#351Peter Smith
smithpb2250@gmail.com
In reply to: Greg Nancarrow (#347)
3 attachment(s)

Please find attached the latest patch set v85*

Differences from v84* are:

* Rebased to HEAD @ 10/June.

* This addresses all of Greg's feedback comments [1], except:
- Skipped (1).iii. I think this line in the documentation is OK as-is.
- Skipped (5). Amit wrote [2] that this comment is OK as-is.
- All other feedback has been addressed as suggested (or close to it).

KNOWN ISSUES: This v85 patch was built and tested using yesterday's
master, but due to lots of recent activity in the replication area I
expect it will be broken for HEAD very soon (if not already). I'll
rebase it again ASAP to try to keep it in working order.

----
[1]: /messages/by-id/CAJcOf-fPcpe21RciPRn_56FwO6K_B+VcTZ2prAv4xvAk4cqYiQ@mail.gmail.com
[2]: /messages/by-id/CAA4eK1J2XBSbWXcf9P0z30op+GL-cUrrqJuy-kFVmbjS1fx-eQ@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v85-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 1c4ee848fd0b66a9c13e315bbd37dbd28277100c Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 11 Jun 2021 13:22:32 +1000
Subject: [PATCH v85] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
transactions. We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the following operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 307 +++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 148 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  39 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 +++++++++--
 src/backend/replication/logical/worker.c           | 343 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/backend/replication/walsender.c                |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  17 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 45 files changed, 2445 insertions(+), 201 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 1649320..c5e078f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7644,6 +7644,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 2f4dde3..23d0422 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7365,6 +7386,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepared transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..3bcef78 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..76eba34 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..f8826fb 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * missing of transactions and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -505,10 +556,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables are in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -814,7 +889,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +924,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +953,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +999,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1016,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1058,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1076,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1116,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user has requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
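For anyone following along, the state recorded by CreateSubscription() and the REFRESH restriction above can be modelled in isolation. This is only a sketch under stated assumptions: the enum and both function names are made up for illustration (the patch itself uses the LOGICALREP_TWOPHASE_STATE_* values and UpdateTwoPhaseState()), and it assumes the create_slot path is taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum
{
	TWOPHASE_DISABLED,
	TWOPHASE_PENDING,
	TWOPHASE_ENABLED
} TwoPhaseState;

/*
 * State left in the catalog once CREATE SUBSCRIPTION has created the slot.
 * ntables is the number of tables found in the subscribed publications.
 */
static TwoPhaseState
initial_twophase_state(bool twophase, bool copy_data, int ntables)
{
	if (!twophase)
		return TWOPHASE_DISABLED;

	/* No initial copy and at least one table: tables start out READY. */
	if (!copy_data && ntables > 0)
		return TWOPHASE_ENABLED;

	/* Otherwise wait for tablesync (or for tables to appear). */
	return TWOPHASE_PENDING;
}

/* Only the combination ENABLED plus copy_data is rejected on REFRESH. */
static bool
refresh_allowed(TwoPhaseState state, bool copy_data)
{
	return !(state == TWOPHASE_ENABLED && copy_data);
}
```

Note that a subscription with zero tables stays PENDING even with copy_data = false, which is what keeps a later REFRESH PUBLICATION with copy_data usable.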
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..ccde3bc 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The below code is a temporary hack to check for
+		 * server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +847,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +861,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..26420dd 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing preparing
+				 * transactions that have locked [user] catalog tables
+				 * exclusively but as of now we ask users not to do such an
+				 * operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c421c9d 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,20 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +548,22 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +624,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
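The enable-at-START_REPLICATION rule in the CreateDecodingContext() hunk can be sketched on its own. Everything named here (Slot, resolve_twophase, plugin_supports) is a hypothetical stand-in, and plugin_supports is assumed to be the value ctx->twophase holds before the `&=`; the spinlock, dirty marking and save of the real slot are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of the persistent slot data used by the hunk. */
typedef struct
{
	bool		two_phase;
	uint64_t	two_phase_at;
} Slot;

/*
 * Decide whether this decoding session handles prepared transactions.
 * If the option was given and the slot was not yet marked, mark it and
 * remember the LSN from which two-phase decoding is in effect.
 */
static bool
resolve_twophase(Slot *slot, bool plugin_supports, bool opt_given,
				 uint64_t start_lsn)
{
	bool		ctx_twophase;

	ctx_twophase = plugin_supports && (slot->two_phase || opt_given);

	if (ctx_twophase && !slot->two_phase)
	{
		slot->two_phase = true;
		slot->two_phase_at = start_lsn;
	}

	return ctx_twophase;
}
```

Once set, two_phase_at is not moved by later sessions, so prepares that were skipped before that LSN can still be recognised at commit prepared time.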
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index b955f43..f5d1bca 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
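To make the BEGIN PREPARE layout above easier to check by eye, here is a standalone sketch of the field order after the message-type byte: two 8-byte LSNs, an 8-byte timestamp, a 4-byte xid, then a NUL-terminated gid, all in network byte order. The struct, the helper names and GID_BUFSIZE are assumptions for illustration, not the patch's definitions. Unlike the strcpy() in the patch, this version refuses a gid that does not fit the buffer; that is a deliberate variation, not what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GID_BUFSIZE 200			/* assumed capacity, not taken from the patch */
#define FIXED_LEN (8 + 8 + 8 + 4)

typedef struct
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[GID_BUFSIZE];
} BeginPrepare;

/* Store the low nbytes of v in big-endian order. */
static size_t
put_be(uint8_t *buf, uint64_t v, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return nbytes;
}

/* Read nbytes of big-endian data. */
static uint64_t
get_be(const uint8_t *buf, size_t nbytes)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
static size_t
encode_begin_prepare(uint8_t *buf, size_t cap, const BeginPrepare *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (cap < FIXED_LEN + gidlen)
		return 0;
	off += put_be(buf + off, m->prepare_lsn, 8);
	off += put_be(buf + off, m->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += put_be(buf + off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Returns 0 on success, -1 if the message is malformed. */
static int
decode_begin_prepare(const uint8_t *buf, size_t len, BeginPrepare *m)
{
	const uint8_t *nul;
	size_t		gidlen;

	if (len < FIXED_LEN + 1)
		return -1;
	m->prepare_lsn = get_be(buf, 8);
	m->end_lsn = get_be(buf + 8, 8);
	m->prepare_time = (int64_t) get_be(buf + 16, 8);
	m->xid = (uint32_t) get_be(buf + 24, 4);

	/* An LSN of 0 is treated as "not set", as in the reader above. */
	if (m->prepare_lsn == 0 || m->end_lsn == 0)
		return -1;

	/* The gid must be terminated inside the message and fit the buffer. */
	nul = memchr(buf + FIXED_LEN, '\0', len - FIXED_LEN);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + FIXED_LEN)) + 1;
	if (gidlen > sizeof(m->gid))
		return -1;
	memcpy(m->gid, buf + FIXED_LEN, gidlen);
	return 0;
}
```

The PREPARE and COMMIT PREPARED messages follow the same pattern with an extra flags byte in front, and ROLLBACK PREPARED carries two timestamps instead of one.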
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..da0e5e8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2552,7 +2552,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2643,7 +2643,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2690,7 +2690,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2710,7 +2710,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2729,19 +2729,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2759,12 +2760,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
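The condition in ReorderBufferFinishPrepared() reduces to a small predicate. A minimal sketch with a hypothetical function name and plain integers in place of XLogRecPtr:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Should the transaction's changes be replayed (as a prepare) when its
 * COMMIT PREPARED is decoded?  Only if the prepare record lies before the
 * point from which two-phase decoding is in effect, since such a prepare
 * was not sent at prepare time.  Rollbacks are never replayed here.
 */
static bool
must_replay_prepare(uint64_t prepare_lsn, uint64_t two_phase_at,
					bool is_commit)
{
	return is_commit && prepare_lsn < two_phase_at;
}
```

A prepare located exactly at two_phase_at is treated as already decoded, because the comparison is strict.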
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * Prepared transactions that were skipped, either because two-phase was
+	 * not yet enabled or because they are not covered by the initial
+	 * snapshot, need to be sent later along with commit prepared, and they
+	 * must be before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..75f4e16 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
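
As a reading aid for the tablesync.c hunk above, here is a minimal standalone sketch of the rule AllTablesyncsReady() applies: a subscription with no tables is never "all ready", otherwise it is ready once the not-READY list is empty. The SyncSnapshot struct and both function names are hypothetical stand-ins for the cached lists in the patch, not PostgreSQL types.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the cached sync state (not a PostgreSQL type). */
typedef struct SyncSnapshot
{
	int		n_tables;		/* all tables known to the subscription */
	int		n_not_ready;	/* tables whose tablesync is not yet READY */
} SyncSnapshot;

/* Mirrors has_subrels: not-READY entries imply tables exist, else re-check. */
static bool
sketch_has_subrels(const SyncSnapshot *s)
{
	return s->n_not_ready > 0 || s->n_tables > 0;
}

/* Mirrors the return expression of AllTablesyncsReady() in the patch. */
static bool
sketch_all_tablesyncs_ready(const SyncSnapshot *s)
{
	assert(s->n_not_ready <= s->n_tables);
	return sketch_has_subrels(s) && s->n_not_ready == 0;
}
```

With this rule an empty subscription reports "not ready", which is what keeps the tri-state at PENDING when there are no tables.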
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6ba447e..48b7df8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that it was initially on and
+ * we have received some prepare; if we then turn it off, at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we could allow changing this option from off to on, but
+ * it is not clear that this alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -794,6 +870,180 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from same node point to publications on
+	 * the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	ensure_transaction();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* There is no transaction when COMMIT PREPARED is called */
+	ensure_transaction();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
+		ensure_transaction();
+		FinishPreparedTransaction(gid, false);
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2032,6 +2282,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2509,6 +2775,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -2995,6 +3264,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3161,15 +3444,69 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s.",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
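
To illustrate the subscriber-side GID scheme described in the worker.c comments above, here is a minimal standalone sketch. It reuses the "pg_gid_%u_%u" format from TwoPhaseTransactionGid() in the patch; the function name, the plain unsigned parameters (standing in for Oid and TransactionId), and SKETCH_GIDSIZE are assumptions made only so the example compiles outside the backend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>		/* strcmp, for callers comparing GIDs */

/* Assumption: a buffer size comfortably larger than the formatted GID. */
#define SKETCH_GIDSIZE 200

/*
 * Build a per-subscription GID from the subscription id and the remote xid.
 * Returns true if the GID fit into the supplied buffer.
 */
static bool
sketch_two_phase_gid(unsigned int subid, unsigned int xid,
					 char *gid, size_t szgid)
{
	int		n;

	assert(subid != 0);		/* stands in for a valid subscription oid */
	assert(xid != 0);		/* stands in for TransactionIdIsValid(xid) */

	n = snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
	return n > 0 && (size_t) n < szgid;
}
```

Because the subscription id is part of the string, two subscriptions applying the same remote transaction get different GIDs, so neither PREPARE collides with the other.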
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index fe12d08..85ba1dc 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by
+		 * plugin and decide whether to enable it at later point of time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -836,18 +945,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1266,3 +1365,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
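
The option checks added to pgoutput.c above can be summarised in a small standalone sketch. The order follows the patch: two_phase together with streaming is rejected while parsing, then pgoutput_startup checks the protocol version and the plugin capability. The enum, the function name, and SKETCH_TWOPHASE_PROTO (set to 3, following the "protocol version 3 (for two_phase)" remark in the worker.c FIXME) are assumptions for illustration; the real code reports these conditions with ereport(ERROR).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: protocol version 3 is the first one carrying two_phase. */
#define SKETCH_TWOPHASE_PROTO 3

typedef enum TwoPhaseCheck
{
	TP_NOT_REQUESTED,		/* two_phase option absent or false */
	TP_ENABLED,				/* request accepted */
	TP_ERR_WITH_STREAMING,	/* two_phase and streaming both requested */
	TP_ERR_PROTO_TOO_OLD,	/* proto_version below the two_phase version */
	TP_ERR_PLUGIN			/* decoding context does not support two_phase */
} TwoPhaseCheck;

static TwoPhaseCheck
sketch_check_two_phase(bool two_phase, bool streaming,
					   int proto_version, bool ctx_twophase)
{
	if (!two_phase)
		return TP_NOT_REQUESTED;
	if (streaming)
		return TP_ERR_WITH_STREAMING;
	if (proto_version < SKETCH_TWOPHASE_PROTO)
		return TP_ERR_PROTO_TOO_OLD;
	if (!ctx_twophase)
		return TP_ERR_PLUGIN;
	return TP_ENABLED;
}
```

Only the TP_ENABLED outcome corresponds to setting ctx->twophase_opt_given = true in the patch.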
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c88b803..6a172d3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -285,6 +285,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 109c723..e94069c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -954,7 +954,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 */
 		ReplicationSlotCreate(cmd->slotname, true,
 							  cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL,
-							  false);
+							  cmd->two_phase);
 	}
 
 	if (cmd->kind == REPLICATION_KIND_LOGICAL)
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 8f53cc7..8141311 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4304,6 +4305,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4347,9 +4349,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The below code is a temporary hack to check for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4370,6 +4388,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4395,6 +4414,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4422,6 +4443,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4463,6 +4485,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..6caa701 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,7 +6415,9 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary and streaming are only supported in v14 and higher.
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
@@ -6423,6 +6425,17 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/*
+		 * Two_phase is only supported in v15 and higher.
+		 *
+		 * FIXME: When PG15 development starts, change the following
+		 * 140000 to 150000
+		 */
+		if (pset.sversion >= 140000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 32c1bdf..79114e1 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2759,7 +2759,7 @@ psql_completion(const char *text, int start, int end)
 	/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("copy_data", "connect", "create_slot", "enabled",
-					  "slot_name", "synchronous_commit");
+					  "slot_name", "synchronous_commit", "streaming", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0060ebf..e84353e 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -94,6 +104,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index faa3a25..ebc43a0 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
+	bool		two_phase;
 	List	   *options;
 } CreateReplicationSlotCmd;
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..0b071a6 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_STREAMING command. We can't rely solely on the twophase
+	 * flag which only tells whether the plugin provided all the necessary
+	 * two-phase callbacks.
+	 *
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..e20f2da 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..109000d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -296,7 +296,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -635,7 +639,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 3570684..71638a3 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -92,11 +92,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+#create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber. There will be
+# 2 prepared transactions, one for each of the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..cabc0bb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,12 +1388,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
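
For readers following the v85-0002 patch attached next: its protocol.sgml hunk documents the Stream Prepare message as Byte1('p'), Int8 flags, Int64 prepare LSN, Int64 end LSN, Int64 prepare timestamp, Int32 xid, and a null-terminated GID string. A minimal standalone sketch of decoding that layout might look like the following. It assumes network (big-endian) byte order for the integer fields, and the struct and function names are hypothetical illustrations, not PostgreSQL APIs and not the patch's logicalrep_read_* code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the decoded fields; not a PostgreSQL struct. */
typedef struct StreamPrepareSketch
{
	uint8_t		flags;			/* currently unused, documented as 0 */
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* transaction id */
	const char *gid;			/* points into the caller's buffer */
} StreamPrepareSketch;

/* Read an nbytes-wide big-endian unsigned integer (nbytes <= 8). */
static uint64_t
read_be(const unsigned char *p, size_t nbytes)
{
	uint64_t	v = 0;
	size_t		i;

	for (i = 0; i < nbytes; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Decode one Stream Prepare message from buf[0..len).
 * Returns 0 on success, -1 if the buffer is too short, has the wrong
 * message type byte, or lacks a terminated GID string.
 */
int
parse_stream_prepare_sketch(const unsigned char *buf, size_t len,
							StreamPrepareSketch *out)
{
	/* type + flags + two LSNs + timestamp + xid */
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;

	if (len <= fixed || buf[0] != 'p')
		return -1;

	/* the GID must be null-terminated inside the buffer */
	if (memchr(buf + fixed, '\0', len - fixed) == NULL)
		return -1;

	out->flags = buf[1];
	out->prepare_lsn = read_be(buf + 2, 8);
	out->end_lsn = read_be(buf + 10, 8);
	out->prepare_time = (int64_t) read_be(buf + 18, 8);
	out->xid = (uint32_t) read_be(buf + 26, 4);
	out->gid = (const char *) (buf + fixed);
	return 0;
}
```

The sketch only borrows a pointer to the GID rather than copying it, so the decoded struct is valid only as long as the input buffer is. An empty GID (as exercised by the "empty GID" TAP test in 021_twophase.pl) is accepted, since it still carries its terminating null byte.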

Attachment: v85-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 99e86acaa933296835393b76add778e052f4c692 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 11 Jun 2021 15:55:04 +1000
Subject: [PATCH v85] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  10 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 132 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1016 insertions(+), 78 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 23d0422..1d77434 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7386,7 +7386,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7649,6 +7649,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 3bcef78..4238baa 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f8826fb..894a1b3 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -924,12 +909,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (bounded copy into the pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
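For readers following the wire format: the writer above emits the message type byte 'p', a flags byte (currently 0), three 64-bit fields (prepare LSN, end LSN, prepare timestamp), a 32-bit xid and a NUL-terminated GID. The standalone sketch below mirrors that layout outside the backend. It is only an illustration, not code from the patch: StreamPrepare, GID_MAX, encode_stream_prepare and decode_stream_prepare are hypothetical names, GID_MAX = 200 is an assumption standing in for GIDSIZE, and the integers are written big-endian on the assumption that the pq_sendint* routines use network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GID_MAX 200             /* assumption: stands in for GIDSIZE */
#define HDR_LEN 30              /* type + flags + 3 * int64 + int32 */

/* Hypothetical mirror of the fields carried by STREAM PREPARE. */
typedef struct StreamPrepare
{
    uint64_t    prepare_lsn;
    uint64_t    end_lsn;
    int64_t     prepare_time;   /* microseconds since 2000-01-01 */
    uint32_t    xid;
    char        gid[GID_MAX];
} StreamPrepare;

/* Store the low nbytes of v at p, most significant byte first. */
static size_t
put_be(uint8_t *p, uint64_t v, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        p[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
    return nbytes;
}

/* Load nbytes from p, most significant byte first. */
static uint64_t
get_be(const uint8_t *p, size_t nbytes)
{
    uint64_t    v = 0;

    for (size_t i = 0; i < nbytes; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
size_t
encode_stream_prepare(uint8_t *buf, size_t cap, const StreamPrepare *m)
{
    size_t      gidlen = strlen(m->gid) + 1;    /* include the NUL */
    size_t      off = 0;

    if (cap < HDR_LEN + gidlen)
        return 0;

    buf[off++] = 'p';           /* message type */
    buf[off++] = 0;             /* flags, none defined yet */
    off += put_be(buf + off, m->prepare_lsn, 8);
    off += put_be(buf + off, m->end_lsn, 8);
    off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
    off += put_be(buf + off, m->xid, 4);
    memcpy(buf + off, m->gid, gidlen);
    off += gidlen;

    assert(off <= cap);
    return off;
}

/* Returns 0 on success, -1 if the message is malformed. */
int
decode_stream_prepare(const uint8_t *buf, size_t len, StreamPrepare *m)
{
    size_t      off = 0;
    size_t      gidlen;
    const uint8_t *nul;

    if (len < HDR_LEN + 1)
        return -1;
    if (buf[off++] != 'p')
        return -1;
    if (buf[off++] != 0)        /* reject unrecognized flags */
        return -1;

    m->prepare_lsn = get_be(buf + off, 8);
    off += 8;
    m->end_lsn = get_be(buf + off, 8);
    off += 8;
    m->prepare_time = (int64_t) get_be(buf + off, 8);
    off += 8;
    m->xid = (uint32_t) get_be(buf + off, 4);
    off += 4;

    /* The GID must be NUL-terminated inside the message and fit the buffer. */
    nul = memchr(buf + off, 0, len - off);
    if (nul == NULL)
        return -1;
    gidlen = (size_t) (nul - (buf + off)) + 1;
    if (gidlen > GID_MAX)
        return -1;
    memcpy(m->gid, buf + off, gidlen);
    return 0;
}
```

The decoder here rejects a GID that is unterminated or longer than the destination buffer rather than copying it blindly. That is a choice made for the sketch; how the backend should treat an oversized GID is a separate question for the patch itself.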
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 48b7df8..5594a4c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1044,6 +1044,86 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	Assert(!in_streamed_transaction);
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1241,30 +1321,21 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
 	bool		found;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	Assert(!in_streamed_transaction);
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
-	ensure_transaction();
-
 	/*
 	 * Allocate file handle and memory required to process all the messages in
 	 * TopTransactionContext to avoid them getting reset after each message is
@@ -1272,7 +1343,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 	ent = (StreamXidHash *) hash_search(xidhash,
@@ -1287,7 +1358,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1362,6 +1433,31 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	Assert(!in_streamed_transaction);
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	ensure_transaction();
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2298,6 +2394,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 85ba1dc..ccf801a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1027,6 +1018,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index e20f2da..7a4804f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC with streaming
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v85-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From b701b0b8a235b661365a66fe21f7df26cf863563 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 11 Jun 2021 17:57:44 +1000
Subject: [PATCH v85] Skip empty transactions for logical replication.

The current logical replication behaviour is to send every transaction to the
subscriber even if the transaction is empty (because it does not
contain changes from the selected publications). It is a waste of CPU
cycles and network bandwidth to build/transmit these empty transactions.

This patch addresses the above problem by postponing the BEGIN / BEGIN PREPARE
message until the first change. When processing a COMMIT or PREPARE message,
if the transaction has no other change, the COMMIT or PREPARE message is not
sent. This means that pgoutput will skip the BEGIN / COMMIT or
BEGIN PREPARE / PREPARE messages for transactions that are empty.
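
The postponement described above can be illustrated with a small standalone toy model. This is only a sketch, not the patch's code: ToyTxn, ToyStream and the toy_* functions are hypothetical names invented here, and the "stream" simply records one character per message ('B' for BEGIN, 'I' for a change, 'C' for COMMIT). The only idea borrowed from the patch is a per-transaction flag recording whether BEGIN has already been sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-transaction state, loosely modelled on a "begin sent" flag. */
typedef struct ToyTxn
{
	bool		sent_begin;		/* has BEGIN been emitted for this txn? */
} ToyTxn;

/* Hypothetical output stream: one character per emitted message. */
typedef struct ToyStream
{
	char		msgs[64];
	size_t		len;
} ToyStream;

/* Append one message kind to the stream, keeping it NUL-terminated. */
static void
toy_emit(ToyStream *out, char kind)
{
	assert(out->len + 1 < sizeof(out->msgs));
	out->msgs[out->len++] = kind;
	out->msgs[out->len] = '\0';
}

/* Begin callback: emit nothing, only reset the flag. */
static void
toy_begin(ToyTxn *txn)
{
	txn->sent_begin = false;
}

/* Change callback: emit the deferred BEGIN before the first published change. */
static void
toy_change(ToyStream *out, ToyTxn *txn, bool published)
{
	if (!published)
		return;					/* change is filtered out, nothing to send */

	if (!txn->sent_begin)
	{
		toy_emit(out, 'B');
		txn->sent_begin = true;
	}
	toy_emit(out, 'I');
}

/* Commit callback: returns false (and emits nothing) for an empty txn. */
static bool
toy_commit(ToyStream *out, ToyTxn *txn)
{
	if (!txn->sent_begin)
		return false;			/* BEGIN was never sent, so skip COMMIT too */

	toy_emit(out, 'C');
	return true;
}
```

In this model, a transaction whose changes are all unpublished leaves the stream untouched, while a transaction with at least one published change produces BEGIN, its changes, and COMMIT in order.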

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  36 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 158 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  46 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 285 insertions(+), 38 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index d2c6e15..940f80c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -865,11 +865,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check whether the plugin
+      has received this <command>PREPARE TRANSACTION</command>. If it has,
+      it can commit the transaction; otherwise, it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 1d77434..a68d61e 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7552,6 +7559,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c421c9d..88334ec 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -942,7 +943,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -977,7 +979,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index da0e5e8..282da49 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2770,7 +2770,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 5594a4c..2758315 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -971,26 +971,38 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* There is no transaction when COMMIT PREPARED is called */
-	ensure_transaction();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In that case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	CommitTransactionCommand();
+		/* There is no transaction when COMMIT PREPARED is called */
+		ensure_transaction();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index ccf801a..c54fb0f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -132,6 +134,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -401,10 +408,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -419,8 +448,22 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
+	txn->output_plugin_private = NULL;
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -432,10 +475,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -450,8 +511,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty prepared transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -462,12 +533,33 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of COMMIT PREPARED of an empty transaction");
+			return;
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -480,8 +572,26 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of ROLLBACK of an empty transaction");
+			return;
+		}
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -630,11 +740,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -668,6 +783,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -767,6 +891,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -774,6 +899,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -810,6 +939,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -830,6 +968,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -842,6 +981,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 7a4804f..2fa60b5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 109000d..7cf4499 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -441,7 +441,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 4c372a6..8a33641 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 25;
 
 ###############################
 # Setup
@@ -318,10 +318,9 @@ $node_publisher->safe_psql('postgres', "
 
 $node_publisher->wait_for_catchup($appname_copy);
 
-# Check that the transaction has been prepared on the subscriber, there will be 2
-# prepared transactions for the 2 subscriptions.
+# Check that the transaction has been prepared on the subscriber
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;;");
-is($result, qq(2), 'transaction is prepared on subscriber');
+is($result, qq(1), 'transaction is prepared on subscriber');
 
 # Now commit the insert and verify that it IS replicated
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
@@ -337,6 +336,45 @@ is($result, qq(2), 'replicated data in subscriber table');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+   "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot
+$node_publisher->safe_psql('postgres', "
+   BEGIN;
+   INSERT INTO tab_nopub SELECT generate_series(1,10);
+   PREPARE TRANSACTION 'empty_transaction';
+   COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+   'postgres', qq(
+       SELECT get_byte(data, 0)
+       FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+           'proto_version', '1',
+           'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+   'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cabc0bb..ad62bbe 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1597,6 +1597,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
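
The core idea of the pgoutput changes in the patch above is to defer BEGIN until the first change that is actually published, and to skip COMMIT when no BEGIN was ever sent. A minimal standalone sketch of that control flow follows. It is not pgoutput code: TxnState, emit(), the on_* callbacks and the one-letter message tags are hypothetical stand-ins for PGOutputTxnData, the logicalrep_write_* calls and the real plugin callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-transaction state, mirroring the sent_begin_txn flag. */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been written to the stream? */
} TxnState;

/* Stand-in for the output stream; a real plugin writes to ctx->out. */
static char stream[256];

static void
emit(const char *msg)
{
	strncat(stream, msg, sizeof(stream) - strlen(stream) - 1);
}

/* Begin callback: remember the transaction, but write nothing yet. */
static void
on_begin(TxnState *txn)
{
	assert(txn != NULL);
	txn->sent_begin = false;
}

/*
 * Change callback: changes to unpublished tables are dropped. The first
 * published change emits the postponed BEGIN ("B") before the change ("I").
 */
static void
on_change(TxnState *txn, bool published)
{
	assert(txn != NULL);
	if (!published)
		return;
	if (!txn->sent_begin)
	{
		emit("B");
		txn->sent_begin = true;
	}
	emit("I");
}

/* Commit callback: returns true only if COMMIT ("C") was written. */
static bool
on_commit(TxnState *txn)
{
	assert(txn != NULL);
	if (!txn->sent_begin)
		return false;			/* empty transaction: nothing was sent */
	emit("C");
	return true;
}
```

With this shape, a transaction that touches only unpublished tables leaves the stream empty, which is the behaviour the new 'empty transaction dropped on slot' TAP test checks for.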

#352Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#351)
3 attachment(s)

On Fri, Jun 11, 2021 at 6:34 PM Peter Smith <smithpb2250@gmail.com> wrote:

KNOWN ISSUES: This v85 patch was built and tested using yesterday's
master, but due to lots of recent activity in the replication area I
expect it will be broken for HEAD very soon (if not already). I'll
rebase it again ASAP to try to keep it in working order.

Please find attached the latest patch set v86*

Differences from v85* are:

* Rebased to HEAD @ today.

* Some recent pushes (e.g. [1][2][3]) in the replication area had
broken the v85* patch. v86 is now working for the current HEAD.

NOTE: I only changed what was necessary to get the 2PC patches working
again. Specifically, one of the pushes [3] changed a number of
protocol Asserts into ereports, but this 2PC patch set also introduces
a number of new Asserts. If you find that any of these new Asserts are
of the same kind which should be changed to ereports (in keeping with
[3]: https://github.com/postgres/postgres/commit/fe6a20ce54cbbb6fcfe9f6675d563af836ae799a

----
[1]: https://github.com/postgres/postgres/commit/3a09d75b4f6cabc8331e228b6988dbfcd9afdfbe
[2]: https://github.com/postgres/postgres/commit/d08237b5b494f96e72220bcef36a14a642969f16
[3]: https://github.com/postgres/postgres/commit/fe6a20ce54cbbb6fcfe9f6675d563af836ae799a

Kind Regards,
Peter Smith.
Fujitsu Australia
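
For readers following the protocol changes: the protocol.sgml hunk in the attached v86-0001 patch documents the Begin Prepare message as Byte1('b'), Int64 prepare LSN, Int64 end LSN, Int64 prepare timestamp (microseconds since 2000-01-01), Int32 xid, then the GID as a null-terminated string. A rough sketch of decoding that layout on the receiving side might look as follows. The BeginPrepare struct, read_be() and parse_begin_prepare() are hypothetical names for illustration only (the patch's own reader lives in proto.c), and the 200-byte GID buffer and big-endian integer encoding are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of a Begin Prepare message. */
typedef struct BeginPrepare
{
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	char		gid[200];		/* buffer size is an assumption */
} BeginPrepare;

/* Read an nbytes-wide big-endian (network byte order) unsigned integer. */
static uint64_t
read_be(const unsigned char *p, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Returns 0 on success, -1 if the buffer is too short or malformed. */
static int
parse_begin_prepare(const unsigned char *buf, size_t len, BeginPrepare *out)
{
	const size_t fixed = 1 + 8 + 8 + 8 + 4;	/* tag + fixed-width fields */
	const unsigned char *nul;
	size_t		gidlen;

	if (len <= fixed || buf[0] != 'b')
		return -1;

	out->prepare_lsn = read_be(buf + 1, 8);
	out->end_lsn = read_be(buf + 9, 8);
	out->prepare_time = (int64_t) read_be(buf + 17, 8);
	out->xid = (uint32_t) read_be(buf + 25, 4);

	/* The GID must be null-terminated within the message and fit the buffer. */
	nul = memchr(buf + fixed, '\0', len - fixed);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + fixed));
	if (gidlen >= sizeof(out->gid))
		return -1;
	memcpy(out->gid, buf + fixed, gidlen + 1);
	return 0;
}
```

The Prepare, Commit Prepared and Rollback Prepared messages described in the same hunk carry an extra Int8 flags byte after the tag, so a reader for those would need to skip one more byte before the first LSN.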

Attachments:

v86-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 7b82819807457ebb619c5efff56906ce34370054 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 15 Jun 2021 15:56:59 +1000
Subject: [PATCH v86] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
transactions. We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the following operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 307 +++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 148 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  39 ++-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 +++++++++--
 src/backend/replication/logical/worker.c           | 346 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |  16 +-
 src/backend/replication/repl_scanner.l             |   1 +
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/backend/replication/walsender.c                |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  17 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 45 files changed, 2448 insertions(+), 201 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d..0235639 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7643,6 +7643,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index bc2a2fe..f812976 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decode of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2797,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2857,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction 
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7365,6 +7386,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepared transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
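To make the field layout of the Commit Prepared message concrete, here is a
minimal sketch of how a client might decode one from a raw buffer. It assumes
the usual protocol conventions (integers in network byte order, a
NUL-terminated string). The struct, the function names and the 200-byte GID
buffer are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of a Commit Prepared ('K') message. */
typedef struct CommitPreparedMsg
{
	uint8_t		flags;			/* currently unused, must be 0 */
	uint64_t	commit_lsn;		/* LSN of the commit prepared */
	uint64_t	end_lsn;		/* end LSN of the commit prepared */
	int64_t		commit_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	char		gid[200];		/* assumed upper bound on GID length */
} CommitPreparedMsg;

/* Read a big-endian (network order) 64-bit integer. */
static uint64_t
read_be64(const uint8_t *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian (network order) 32-bit integer. */
static uint32_t
read_be32(const uint8_t *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
		((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Returns 0 on success, -1 if the buffer is not a valid message. */
int
parse_commit_prepared(const uint8_t *buf, size_t len, CommitPreparedMsg *msg)
{
	/* Byte1 + Int8 + 3 * Int64 + Int32 precede the GID string. */
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	assert(buf != NULL && msg != NULL);

	if (len <= fixed || buf[0] != 'K')
		return -1;

	msg->flags = buf[1];
	if (msg->flags != 0)
		return -1;

	msg->commit_lsn = read_be64(buf + 2);
	msg->end_lsn = read_be64(buf + 10);
	msg->commit_time = (int64_t) read_be64(buf + 18);
	msg->xid = read_be32(buf + 26);

	/* The GID must be NUL-terminated inside the buffer and fit in gid[]. */
	gid = buf + fixed;
	nul = memchr(gid, '\0', len - fixed);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= sizeof(msg->gid))
		return -1;
	memcpy(msg->gid, gid, gidlen + 1);
	return 0;
}
```

The Rollback Prepared message could be decoded the same way, with one extra
Int64 (the prepare timestamp) ahead of the rollback timestamp.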
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..3bcef78 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, prepared transactions are decoded and
+          sent to the subscriber at the time of PREPARE TRANSACTION. By default, a
+          transaction prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that replication has
+          successfully passed the initial table synchronization phase. This means
+          that even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          for the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..76eba34 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
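As a rough illustration of why LookupGXact compares more than the GID, here is
a minimal standalone sketch of the same matching rule over an in-memory table.
The PreparedXactEntry struct and lookup_prepared() are hypothetical stand-ins
for the shared-memory state and the 2PC file header. Only the comparison logic
mirrors the function above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a prepared xact plus its origin information. */
typedef struct PreparedXactEntry
{
	bool		valid;				/* entry is fully set up */
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* LSN where the remote prepare ended */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
} PreparedXactEntry;

/*
 * Return true only if a valid entry matches on GID *and* on the origin LSN
 * and origin timestamp. A GID match alone is not treated as a hit.
 */
bool
lookup_prepared(const PreparedXactEntry *entries, size_t n,
				const char *gid, uint64_t prepare_end_lsn,
				int64_t prepare_timestamp)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		const PreparedXactEntry *e = &entries[i];

		/* Skip not-yet-valid entries and entries with a different GID. */
		if (!e->valid || strcmp(e->gid, gid) != 0)
			continue;

		if (e->origin_lsn == prepare_end_lsn &&
			e->origin_timestamp == prepare_timestamp)
			return true;
	}
	return false;
}
```

With this rule, a locally prepared transaction that happens to reuse a GID from
the publisher is not mistaken for the replicated one, because its origin LSN
and timestamp differ.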
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8aa6de1..f8826fb 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -505,10 +556,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. It will be enabled once all the tables
+				 * are synced and ready. This avoids race conditions, such as
+				 * prepared transactions being skipped because their changes
+				 * are not applied due to the checks in
+				 * should_apply_changes_for_rel() while tablesync for the
+				 * corresponding tables is in progress. See comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -814,7 +889,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -848,6 +924,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -871,7 +953,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -916,7 +999,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -932,6 +1016,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details on why this
+					 * is not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1058,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -980,6 +1076,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details on why this
+					 * is not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1009,7 +1116,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
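The subscriptioncmds.c changes above amount to a small state machine. The
sketch below restates it as two pure functions: one for the state a new
subscription starts in, one for whether a refresh is permitted. The enum and
function names are hypothetical, and the sketch assumes the slot is created as
part of CREATE SUBSCRIPTION (create_slot is true), which is the only path on
which the patch enables two-phase up-front.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum TwoPhaseState
{
	TWOPHASE_DISABLED,
	TWOPHASE_PENDING,
	TWOPHASE_ENABLED
} TwoPhaseState;

/*
 * State right after CREATE SUBSCRIPTION. Two-phase is enabled up-front only
 * when no initial copy is needed and at least one table exists; otherwise it
 * stays pending until table synchronization has finished.
 */
TwoPhaseState
initial_twophase_state(bool twophase, bool copy_data, int ntables)
{
	assert(ntables >= 0);

	if (!twophase)
		return TWOPHASE_DISABLED;
	if (!copy_data && ntables > 0)
		return TWOPHASE_ENABLED;
	return TWOPHASE_PENDING;
}

/*
 * A refresh is rejected only when two-phase is already enabled and the
 * refresh would copy data, since that would start a new tablesync.
 */
bool
refresh_allowed(TwoPhaseState state, bool copy_data)
{
	return !(state == TWOPHASE_ENABLED && copy_data);
}
```

Note that a subscription with no tables stays pending even with
copy_data = false, which keeps a later ALTER SUBSCRIPTION ... REFRESH
PUBLICATION usable.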
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 021c1b3..ccde3bc 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -433,6 +434,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The code below is a temporary hack that checks
+		 * for server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -833,7 +847,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -847,6 +861,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (temporary)
 		appendStringInfoString(&cmd, " TEMPORARY");
 
+	if (two_phase)
+		appendStringInfoString(&cmd, " TWO_PHASE");
+
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..26420dd 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparing of transactions that have locked [user] catalog
+				 * tables exclusively, but as of now we ask users not to do
+				 * such an operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..c421c9d 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,20 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +548,22 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +624,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index cb42fcb..2c191de 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
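For readers following the proto.c hunks above, here is a minimal standalone sketch of the PREPARE message body as the read/write functions lay it out: a flags byte, three 64-bit fields and a 32-bit xid in network byte order, then a NUL-terminated gid. It does not use the backend's pq_* routines. All sketch_* names, the SketchPrepare struct and SKETCH_GIDSIZE are hypothetical and exist only for illustration (the value 200 is assumed to match GIDSIZE). The message-type byte, which the dispatcher consumes, is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumption: stands in for GIDSIZE */

/* Hypothetical mirror of the fields carried by a PREPARE message. */
typedef struct SketchPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} SketchPrepare;

/* Append the low nbytes of v in big-endian order; return the new offset. */
static size_t
sketch_put_be(unsigned char *buf, size_t off, uint64_t v, int nbytes)
{
	for (int i = nbytes - 1; i >= 0; i--)
		buf[off++] = (unsigned char) (v >> (8 * i));
	return off;
}

/* Read nbytes in big-endian order, advancing *off. */
static uint64_t
sketch_get_be(const unsigned char *buf, size_t *off, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[(*off)++];
	return v;
}

/* Encode flags, fields and gid; return bytes written, or 0 if buf is too small. */
static size_t
sketch_write_prepare(unsigned char *buf, size_t cap, const SketchPrepare *p)
{
	size_t		gidlen = strlen(p->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (cap < 1 + 8 + 8 + 8 + 4 + gidlen)
		return 0;
	buf[off++] = 0;				/* flags, currently always zero */
	off = sketch_put_be(buf, off, p->prepare_lsn, 8);
	off = sketch_put_be(buf, off, p->end_lsn, 8);
	off = sketch_put_be(buf, off, (uint64_t) p->prepare_time, 8);
	off = sketch_put_be(buf, off, p->xid, 4);
	memcpy(buf + off, p->gid, gidlen);
	return off + gidlen;
}

/* Decode and validate; return 0 on success, -1 on a malformed message. */
static int
sketch_read_prepare(const unsigned char *buf, size_t len, SketchPrepare *p)
{
	size_t		off = 1;
	const unsigned char *nul;

	if (len < 1 + 8 + 8 + 8 + 4 + 1 || buf[0] != 0)
		return -1;				/* too short, or unrecognized flags */
	p->prepare_lsn = sketch_get_be(buf, &off, 8);
	p->end_lsn = sketch_get_be(buf, &off, 8);
	p->prepare_time = (int64_t) sketch_get_be(buf, &off, 8);
	p->xid = (uint32_t) sketch_get_be(buf, &off, 4);
	if (p->prepare_lsn == 0 || p->end_lsn == 0)
		return -1;				/* LSNs must be set */

	/* Bounded gid copy: reject an unterminated or oversized string. */
	nul = memchr(buf + off, '\0', len - off);
	if (nul == NULL || (size_t) (nul - (buf + off)) >= SKETCH_GIDSIZE)
		return -1;
	memcpy(p->gid, buf + off, (size_t) (nul - (buf + off)) + 1);
	return 0;
}
```

The bounded gid copy at the end is the point of the sketch: the reader should not trust the sender to keep the string within the destination buffer.
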
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f96029f..e4530e2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2552,7 +2552,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2643,7 +2643,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2690,7 +2690,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2710,7 +2710,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2729,19 +2729,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2759,12 +2760,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * Prepared transactions that were skipped, either because two-phase was
+	 * not yet enabled or because they are not covered by the initial
+	 * snapshot, need to be sent later along with commit prepared; they must
+	 * have been prepared before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 67f907c..75f4e16 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1065,7 +1061,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1151,3 +1148,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But
+		 * if table_states_not_ready was empty we still need to check again to
+		 * see if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, return false. Otherwise, when no tablesyncs
+	 * are busy, all are READY.
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
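The interplay between AllTablesyncsReady() and the PENDING check in process_syncing_tables_for_apply() can be summarized in a few lines. The sketch below is a restatement under assumed names: the enum and both functions are hypothetical, and the catalog lookups are replaced by plain parameters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum SketchTwoPhaseState
{
	SKETCH_TWOPHASE_DISABLED,
	SKETCH_TWOPHASE_PENDING,
	SKETCH_TWOPHASE_ENABLED
} SketchTwoPhaseState;

/*
 * A subscription with no tables is never "all ready"; otherwise it is ready
 * once the list of not-READY relations is empty.
 */
static bool
sketch_all_tablesyncs_ready(bool has_subrels, int not_ready_count)
{
	return has_subrels && not_ready_count == 0;
}

/*
 * The apply worker exits (to be restarted with two_phase enabled) only while
 * the state is PENDING and every tablesync has reached READY.
 */
static bool
sketch_should_restart_apply_worker(SketchTwoPhaseState state,
								   bool has_subrels, int not_ready_count)
{
	return state == SKETCH_TWOPHASE_PENDING &&
		sketch_all_tablesyncs_ready(has_subrels, not_ready_count);
}
```

Note the empty-subscription case: with no tables the state stays PENDING, which is what keeps ALTER SUBSCRIPTION ... REFRESH PUBLICATION usable.
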
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4b11259..ee95ac8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ... REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that initially it was on and
+ * we have received some prepare, then we turn it off; now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we can allow changing this option from off to on but not
+ * sure if that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get a prepare of the same GID more than once in genuine cases where
+ * we have defined multiple subscriptions for publications on the same server
+ * and the prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by the publisher, one
+ * of the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if the user
+ * has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting of
+ * the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -804,6 +880,183 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even if no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	begin_replication_step();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* There is no transaction when COMMIT PREPARED is called */
+	begin_replication_step();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
+		begin_replication_step();
+		FinishPreparedTransaction(gid, false);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2082,6 +2335,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2560,6 +2829,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3061,6 +3333,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3228,15 +3514,69 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
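To make the subscriber-side GID scheme from the worker.c header comment concrete, here is a standalone sketch of the formatting that TwoPhaseTransactionGid() does. The function name sketch_two_phase_gid and the OID/xid values in the usage note are made up for illustration. Only the "pg_gid_%u_%u" format is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a GID from the subscription OID and the remote xid.  Two
 * subscriptions that receive the same publisher transaction get different
 * GIDs, so their prepares cannot collide on the subscriber.
 */
static void
sketch_two_phase_gid(unsigned int subid, unsigned int xid,
					 char *gid, size_t szgid)
{
	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}
```

For example, subscription OIDs 16394 and 16395 applying remote xid 735 would yield "pg_gid_16394_735" and "pg_gid_16395_735" respectively.
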
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 63f108f..7a1d42a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -839,18 +948,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1269,3 +1368,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..8c1f353 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -241,16 +243,17 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
-			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
+			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary opt_two_phase K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
 					cmd = makeNode(CreateReplicationSlotCmd);
 					cmd->kind = REPLICATION_KIND_LOGICAL;
 					cmd->slotname = $2;
 					cmd->temporary = $3;
-					cmd->plugin = $5;
-					cmd->options = $6;
+					cmd->two_phase = $4;
+					cmd->plugin = $6;
+					cmd->options = $7;
 					$$ = (Node *) cmd;
 				}
 			;
@@ -365,6 +368,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8c18b4e..33b85d8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -283,6 +283,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b94910b..285a321 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -365,7 +365,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3224536..c691eda 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -954,7 +954,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 */
 		ReplicationSlotCreate(cmd->slotname, true,
 							  cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL,
-							  false);
+							  cmd->two_phase);
 	}
 
 	if (cmd->kind == REPLICATION_KIND_LOGICAL)
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 8f53cc7..8141311 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4304,6 +4305,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4347,9 +4349,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The below code is a temporary hack to check for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4370,6 +4388,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4395,6 +4414,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4422,6 +4443,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4463,6 +4485,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..6caa701 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,7 +6415,9 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary and streaming are only supported in v14 and higher.
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
@@ -6423,6 +6425,17 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/*
+		 * Two_phase is only supported in v15 and higher.
+		 *
+		 * FIXME: When PG15 development starts, change the following
+		 * 140000 to 150000
+		 */
+		if (pset.sversion >= 140000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index bd8e9ea..d2f2727 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2760,7 +2760,7 @@ psql_completion(const char *text, int start, int end)
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
 					  "enabled", "slot_name", "streaming",
-					  "synchronous_commit");
+					  "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0060ebf..e84353e 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -94,6 +104,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index faa3a25..ebc43a0 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
+	bool		two_phase;
 	List	   *options;
 } CreateReplicationSlotCmd;
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..0b071a6 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 *
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_STREAMING command. We can't rely solely on the twophase
+	 * flag, which only tells whether the plugin provided all the necessary
+	 * two-phase callbacks.
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..e20f2da 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..109000d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -296,7 +296,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -635,7 +639,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 2eb7e3a..34d95ea 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -84,11 +84,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+# Create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber. There will be 2
+# prepared transactions, one for each of the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..cabc0bb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,12 +1388,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

Attachment: v86-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From a5c90734a099a4f64cdd5e60f86ea11383904fd9 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 15 Jun 2021 17:58:09 +1000
Subject: [PATCH v86] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  10 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 135 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1018 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
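
For readers following the protocol.sgml and proto.c hunks below, here is a
minimal standalone sketch of the Stream Prepare wire layout this patch
documents: Byte1('p'), Int8 flags (must be 0), Int64 prepare LSN, Int64 end
LSN, Int64 prepare timestamp, Int32 xid, then the NUL-terminated GID. It is
not the patch's code. The struct StreamPrepareMsg, the SKETCH_* macros and the
encode/decode helpers are hypothetical names made up for illustration, and the
200-byte GID limit is an assumption meant to mirror PostgreSQL's GIDSIZE. Unlike
logicalrep_read_stream_prepare(), which is called after the message type byte
has already been consumed, the decoder here checks the type byte itself, and it
bounds the GID copy instead of using strcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE   200                    /* assumption: mirrors GIDSIZE */
#define SKETCH_FIXED_LEN (1 + 1 + 8 + 8 + 8 + 4) /* type, flags, 3 x Int64, Int32 */

/* Hypothetical container for the decoded fields (illustration only). */
typedef struct StreamPrepareMsg
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} StreamPrepareMsg;

/* Store the low nbytes of v at dst in network (big-endian) byte order. */
static void
put_be(uint8_t *dst, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		dst[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
}

/* Read nbytes from src as a big-endian unsigned integer. */
static uint64_t
get_be(const uint8_t *src, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | src[i];
	return v;
}

/* Returns the number of bytes written, or 0 if the message does not fit. */
size_t
encode_stream_prepare(const StreamPrepareMsg *m, uint8_t *buf, size_t cap)
{
	const char *nul;
	size_t		gidlen;
	size_t		off = 0;

	assert(m != NULL && buf != NULL);

	/* The GID must be NUL-terminated within its fixed-size buffer. */
	nul = memchr(m->gid, '\0', SKETCH_GIDSIZE);
	if (nul == NULL)
		return 0;
	gidlen = (size_t) (nul - m->gid);
	if (cap < SKETCH_FIXED_LEN + gidlen + 1)
		return 0;

	buf[off++] = 'p';			/* message type: Stream Prepare */
	buf[off++] = 0;				/* flags, currently unused */
	put_be(buf + off, m->prepare_lsn, 8);
	off += 8;
	put_be(buf + off, m->end_lsn, 8);
	off += 8;
	put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += 8;
	put_be(buf + off, m->xid, 4);
	off += 4;
	memcpy(buf + off, m->gid, gidlen + 1);	/* include the terminator */
	off += gidlen + 1;
	return off;
}

/* Returns 0 on success, -1 if the buffer is not a valid Stream Prepare. */
int
decode_stream_prepare(const uint8_t *buf, size_t len, StreamPrepareMsg *out)
{
	const uint8_t *gid;
	const uint8_t *nul;

	assert(buf != NULL && out != NULL);

	if (len < SKETCH_FIXED_LEN + 1)
		return -1;
	if (buf[0] != 'p' || buf[1] != 0)	/* wrong type or nonzero flags */
		return -1;

	out->prepare_lsn = get_be(buf + 2, 8);
	out->end_lsn = get_be(buf + 10, 8);
	out->prepare_time = (int64_t) get_be(buf + 18, 8);
	out->xid = (uint32_t) get_be(buf + 26, 4);

	/* Reject a GID that is unterminated or too long for the buffer. */
	gid = buf + SKETCH_FIXED_LEN;
	nul = memchr(gid, '\0', len - SKETCH_FIXED_LEN);
	if (nul == NULL || (size_t) (nul - gid) >= SKETCH_GIDSIZE)
		return -1;
	memcpy(out->gid, gid, (size_t) (nul - gid) + 1);
	return 0;
}
```

A caller would fill a StreamPrepareMsg, pass it to encode_stream_prepare()
with a buffer of at least SKETCH_FIXED_LEN + strlen(gid) + 1 bytes, and hand
the result to decode_stream_prepare() to get the same fields back. Rejecting a
nonzero flags byte matches the "unrecognized flags" error in the patch's read
function.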

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index f812976..9cc7192 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7386,7 +7386,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7649,6 +7649,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 3bcef78..4238baa 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f8826fb..894a1b3 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -924,12 +909,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ee95ac8..684fec2 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1057,6 +1057,87 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See the comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1272,30 +1353,20 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	if (in_streamed_transaction)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROTOCOL_VIOLATION),
-				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
 	/* Make sure we have an open transaction */
 	begin_replication_step();
 
@@ -1306,7 +1377,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -1327,7 +1398,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1406,6 +1477,32 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2351,6 +2448,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7a1d42a..d5a284d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index e20f2da..7a4804f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v86-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From b77cad523237955e7a42c2a6e6723cd83ce722e9 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 15 Jun 2021 18:29:44 +1000
Subject: [PATCH v86] Skip empty transactions for logical replication.

The current logical replication behaviour is to send every transaction to
the subscriber even if the transaction is empty (because it does not
contain changes from the selected publications). Building and transmitting
these empty transactions wastes CPU cycles and network bandwidth.

This patch addresses the problem by postponing the BEGIN / BEGIN PREPARE
message until the first change. While processing a COMMIT or PREPARE
message, if the transaction has no other change, do not send the COMMIT or
PREPARE message. This means that pgoutput skips the BEGIN / COMMIT or
BEGIN PREPARE / PREPARE messages for empty transactions.

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  38 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 158 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  46 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 286 insertions(+), 39 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index d2c6e15..940f80c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -865,11 +865,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command> in which case
+      it can commit the transaction, otherwise, it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 9cc7192..6cd3b17 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7552,6 +7559,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c421c9d..88334ec 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -942,7 +943,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -977,7 +979,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e4530e2..016bc79 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2770,7 +2770,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 684fec2..ba31740 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -982,27 +982,39 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* There is no transaction when COMMIT PREPARED is called */
-	begin_replication_step();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In which case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	end_replication_step();
-	CommitTransactionCommand();
+		/* There is no transaction when COMMIT PREPARED is called */
+		begin_replication_step();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index d5a284d..43679a2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -132,6 +134,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -401,10 +408,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -419,8 +448,22 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
+	txn->output_plugin_private = NULL;
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -432,10 +475,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -450,8 +511,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty prepared transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -462,12 +533,33 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of COMMIT PREPARED of an empty transaction");
+			return;
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -480,8 +572,26 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of ROLLBACK of an empty transaction");
+			return;
+		}
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -630,11 +740,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -668,6 +783,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -770,6 +894,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -777,6 +902,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -813,6 +942,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -833,6 +971,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -845,6 +984,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 7a4804f..2fa60b5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 109000d..7cf4499 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -441,7 +441,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 4c372a6..8a33641 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 25;
 
 ###############################
 # Setup
@@ -318,10 +318,9 @@ $node_publisher->safe_psql('postgres', "
 
 $node_publisher->wait_for_catchup($appname_copy);
 
-# Check that the transaction has been prepared on the subscriber, there will be 2
-# prepared transactions for the 2 subscriptions.
+# Check that the transaction has been prepared on the subscriber
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;;");
-is($result, qq(2), 'transaction is prepared on subscriber');
+is($result, qq(1), 'transaction is prepared on subscriber');
 
 # Now commit the insert and verify that it IS replicated
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
@@ -337,6 +336,45 @@ is($result, qq(2), 'replicated data in subscriber table');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+   "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot
+$node_publisher->safe_psql('postgres', "
+   BEGIN;
+   INSERT INTO tab_nopub SELECT generate_series(1,10);
+   PREPARE TRANSACTION 'empty_transaction';
+   COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+   'postgres', qq(
+       SELECT get_byte(data, 0)
+       FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+           'proto_version', '1',
+           'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+   'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cabc0bb..ad62bbe 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1597,6 +1597,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
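
For readers skimming the v86-0003 patch above, its deferred-BEGIN idea can be sketched in isolation. This is only an illustration under stated assumptions: TxnState, Stream, emit() and the txn_* functions are hypothetical stand-ins invented here, not pgoutput APIs, and the single-letter messages are placeholders for the real protocol messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-transaction state, loosely modelled on PGOutputTxnData. */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been emitted for this txn? */
} TxnState;

/* Hypothetical output sink: records one letter per message sent. */
typedef struct Stream
{
	char		buf[64];
	size_t		len;
} Stream;

static void
emit(Stream *out, char msg)
{
	assert(out->len + 1 < sizeof(out->buf));
	out->buf[out->len++] = msg;
	out->buf[out->len] = '\0';
}

/* BEGIN callback: send nothing yet, only reset the flag. */
static void
txn_begin(TxnState *txn)
{
	txn->sent_begin = false;
}

/* Change callback: emit the postponed BEGIN before the first published change. */
static void
txn_change(Stream *out, TxnState *txn, bool published)
{
	if (!published)
		return;					/* filtered out, transaction may stay empty */

	if (!txn->sent_begin)
	{
		emit(out, 'B');
		txn->sent_begin = true;
	}
	emit(out, 'I');
}

/* COMMIT callback: skip COMMIT entirely if BEGIN was never sent. */
static void
txn_commit(Stream *out, TxnState *txn)
{
	if (!txn->sent_begin)
		return;
	emit(out, 'C');
}
```

With this shape, a transaction that touches only unpublished tables produces no output at all, while a transaction with at least one published change still produces a normal BEGIN ... COMMIT sequence.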

#353Greg Nancarrow
gregn4422@gmail.com
In reply to: Peter Smith (#352)

On Wed, Jun 16, 2021 at 9:08 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v86*

A couple of comments:

(1) I think one of my suggested changes was missed (or was that intentional?):

BEFORE:
+                The LSN of the commit prepared.
AFTER:
+                The LSN of the commit prepared transaction.

(2) In light of Tom Lane's recent changes in:

fe6a20ce54cbbb6fcfe9f6675d563af836ae799a (Don't use Asserts to check
for violations of replication protocol)

there appear to be some instances of such code in these patches.

For example, in the v86-0001 patch:

+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepPreparedTxnData prepare_data;
+ char gid[GIDSIZE];
+
+ logicalrep_read_prepare(s, &prepare_data);
+
+ Assert(prepare_data.prepare_lsn == remote_final_lsn);

The above Assert() should be changed to something like:

+    if (prepare_data.prepare_lsn != remote_final_lsn)
+        ereport(ERROR,
+                (errcode(ERRCODE_PROTOCOL_VIOLATION),
+                 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+                                 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+                                 LSN_FORMAT_ARGS(remote_final_lsn))));

Without being more familiar with this code, it's difficult for me to
judge exactly how many of such cases are in these patches.
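
To illustrate the general pattern outside the server code, here is a minimal standalone sketch. It assumes a hypothetical check_prepare_lsn() helper and a plain uint64_t standing in for XLogRecPtr; none of this is from the patches. The point is only that values read off the wire get a runtime check that survives non-assert builds, while assert() is kept for internal programming errors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr (assumption) */

/*
 * Hypothetical validator for an LSN received from the remote side.
 * Returns true when it matches the expected value; otherwise writes an
 * error message into errbuf and returns false.  Unlike assert(), this
 * check is still performed when the program is built with NDEBUG.
 */
static bool
check_prepare_lsn(Lsn received, Lsn expected, char *errbuf, size_t errlen)
{
	/* Caller bug, not a protocol problem: an assert is appropriate here. */
	assert(errbuf != NULL && errlen > 0);

	if (received == expected)
		return true;

	/* Print each LSN as high/low 32-bit halves in hex. */
	snprintf(errbuf, errlen,
			 "incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
			 (unsigned int) (received >> 32),
			 (unsigned int) (received & 0xFFFFFFFFu),
			 (unsigned int) (expected >> 32),
			 (unsigned int) (expected & 0xFFFFFFFFu));
	return false;
}
```

In the real apply worker the failure branch would raise an ERROR via ereport() as shown above rather than return a flag; the boolean return is just what keeps this sketch self-contained.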

Regards,
Greg Nancarrow
Fujitsu Australia

#354Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#352)
5 attachment(s)

On Wed, Jun 16, 2021 at 9:08 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Jun 11, 2021 at 6:34 PM Peter Smith <smithpb2250@gmail.com> wrote:

KNOWN ISSUES: This v85 patch was built and tested using yesterday's
master, but due to lots of recent activity in the replication area I
expect it will be broken for HEAD very soon (if not already). I'll
rebase it again ASAP to try to keep it in working order.

Please find attached the latest patch set v86*

I've modified the patchset based on comments received on thread [1]
for the CREATE_REPLICATION_SLOT changes. As requested on that thread,
I've split those changes out into two new patches (patch-1 and patch-2),
so the set now has 5 patches. I've also changed the logic to align
with the changes in the command syntax.

I've also addressed one pending comment from Amit about
CreateInitDecodingContext: I've taken out the logic that
sets slot->data.two_phase, and only kept the logic that sets ctx->twophase.

Before:

- ctx->twophase &= MyReplicationSlot->data.two_phase;
+ ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+ /* Mark slot to allow two_phase decoding if not already marked */
+ if (ctx->twophase && !slot->data.two_phase)
+ {
+ slot->data.two_phase = true;
+ ReplicationSlotMarkDirty();
+ ReplicationSlotSave();
+ }

After:

- ctx->twophase &= MyReplicationSlot->data.two_phase;
+ ctx->twophase &= slot->data.two_phase;
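
To make the difference between the two versions concrete, here is a minimal standalone sketch in C. The types and function names (Slot, Ctx, init_before, init_after, compare_versions) are hypothetical stand-ins, not the real ReplicationSlot or LogicalDecodingContext; only the boolean logic is taken from the two snippets above, and the slot persistence calls are reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the slot and decoding-context fields. */
typedef struct { bool two_phase; } Slot;
typedef struct { bool twophase; bool twophase_opt_given; } Ctx;

/* "Before": creating the context may switch the slot's flag on. */
static void init_before(Ctx *ctx, Slot *slot)
{
    ctx->twophase &= (ctx->twophase_opt_given || slot->two_phase);

    /* The real code also marks the slot dirty and saves it here. */
    if (ctx->twophase && !slot->two_phase)
        slot->two_phase = true;
}

/* "After": the slot's flag is only read, never written. */
static void init_after(Ctx *ctx, const Slot *slot)
{
    ctx->twophase &= slot->two_phase;
}

/* Walk through the one case where the two versions differ. */
static void compare_versions(void)
{
    Slot s1 = { false };
    Ctx  c1 = { true, true };   /* plugin supports 2PC, option was given */

    init_before(&c1, &s1);
    assert(c1.twophase && s1.two_phase);    /* slot was upgraded */

    Slot s2 = { false };
    Ctx  c2 = { true, true };

    init_after(&c2, &s2);
    assert(!c2.twophase && !s2.two_phase);  /* slot left untouched */
}
```

With the "after" logic, a slot that was not created with two_phase stays that way, and the context simply does not decode at prepare time.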

[1]: /messages/by-id/64b9f783c6e125f18f88fbc0c0234e34e71d8639.camel@j-davis.com

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v87-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch (application/octet-stream)
From 47e160e12c12f321fd2254eec834e09745f0e9f8 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 17 Jun 2021 09:24:19 -0400
Subject: [PATCH v87] Add option to set two-phase in CREATE_REPLICATION_SLOT
 command.

CREATE_REPLICATION_SLOT modified to support two-phase decoding in the slot.
This will allow the decoding of commands like PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED for slots created with this option.
---
 doc/src/sgml/protocol.sgml             | 16 +++++++++++++++-
 src/backend/replication/repl_gram.y    | 12 ++++++++++++
 src/backend/replication/repl_scanner.l |  1 +
 src/backend/replication/walsender.c    | 18 +++++++++++++++---
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index bc2a2fe..205fbd2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> | <literal>TWO_PHASE</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this logical replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..eead144 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -283,6 +285,11 @@ create_slot_opt:
 				  $$ = makeDefElem("reserve_wal",
 								   (Node *)makeInteger(true), -1);
 				}
+			| K_TWO_PHASE
+				{
+				  $$ = makeDefElem("two_phase",
+								   (Node *)makeInteger(true), -1);
+				}
 			;
 
 /* DROP_REPLICATION_SLOT slot */
@@ -365,6 +372,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3224536..92c755f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -863,11 +863,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 static void
 parseCreateReplSlotOptions(CreateReplicationSlotCmd *cmd,
 						   bool *reserve_wal,
-						   CRSSnapshotAction *snapshot_action)
+						   CRSSnapshotAction *snapshot_action,
+						   bool *two_phase)
 {
 	ListCell   *lc;
 	bool		snapshot_action_given = false;
 	bool		reserve_wal_given = false;
+	bool		two_phase_given = false;
 
 	/* Parse options */
 	foreach(lc, cmd->options)
@@ -905,6 +907,15 @@ parseCreateReplSlotOptions(CreateReplicationSlotCmd *cmd,
 			reserve_wal_given = true;
 			*reserve_wal = true;
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_given || cmd->kind != REPLICATION_KIND_LOGICAL)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_given = true;
+			*two_phase = true;
+		}
 		else
 			elog(ERROR, "unrecognized option: %s", defel->defname);
 	}
@@ -920,6 +931,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 	char		xloc[MAXFNAMELEN];
 	char	   *slot_name;
 	bool		reserve_wal = false;
+	bool		two_phase = false;
 	CRSSnapshotAction snapshot_action = CRS_EXPORT_SNAPSHOT;
 	DestReceiver *dest;
 	TupOutputState *tstate;
@@ -929,7 +941,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	Assert(!MyReplicationSlot);
 
-	parseCreateReplSlotOptions(cmd, &reserve_wal, &snapshot_action);
+	parseCreateReplSlotOptions(cmd, &reserve_wal, &snapshot_action, &two_phase);
 
 	/* setup state for WalSndSegmentOpen */
 	sendTimeLineIsHistoric = false;
@@ -954,7 +966,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 */
 		ReplicationSlotCreate(cmd->slotname, true,
 							  cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL,
-							  false);
+							  two_phase);
 	}
 
 	if (cmd->kind == REPLICATION_KIND_LOGICAL)
-- 
1.8.3.1
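
The two_phase branch that the patch above adds to parseCreateReplSlotOptions can be sketched in isolation as follows. This is a simplified illustration, not the walsender code: SlotKind, KIND_PHYSICAL, KIND_LOGICAL and parse_slot_options are hypothetical names, options are plain strings rather than DefElem nodes, the other options (reserve_wal, the snapshot actions) are left out, and a false return stands in for the places where the real code raises an ERROR.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the replication slot kind. */
typedef enum { KIND_PHYSICAL, KIND_LOGICAL } SlotKind;

/*
 * Returns true and sets *two_phase on success. Returns false where the
 * real code would report "conflicting or redundant options" or
 * "unrecognized option".
 */
static bool parse_slot_options(SlotKind kind,
                               const char *const *opts, int nopts,
                               bool *two_phase)
{
    bool two_phase_given = false;

    *two_phase = false;

    for (int i = 0; i < nopts; i++)
    {
        if (strcmp(opts[i], "two_phase") == 0)
        {
            /* Reject a repeated option, or any use on a physical slot. */
            if (two_phase_given || kind != KIND_LOGICAL)
                return false;
            two_phase_given = true;
            *two_phase = true;
        }
        else
            return false;   /* this sketch knows no other options */
    }
    return true;
}
```

Under these assumptions, a single "two_phase" option on a logical slot is accepted, while a repeated option or a physical slot is rejected, mirroring the check on two_phase_given and cmd->kind in the patch.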

v87-0003-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From e7f5e0ae807dd09ecb50a42fb06485b686e0f602 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 17 Jun 2021 09:34:29 -0400
Subject: [PATCH v87] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
transactions. We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the following operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 305 ++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 148 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  31 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 +++++++++--
 src/backend/replication/logical/worker.c           | 346 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |   8 +-
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  17 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 43 files changed, 2426 insertions(+), 202 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d..0235639 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7643,6 +7643,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 205fbd2..94e60e0 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1970,6 +1970,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2811,11 +2825,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2871,10 +2891,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction 
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7379,6 +7400,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepared transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..3bcef78 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..76eba34 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 75e195f..08d0295 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause missed transactions and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -506,10 +557,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -816,7 +891,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -850,6 +926,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -873,7 +955,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -918,7 +1001,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -934,6 +1018,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details on why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -965,7 +1060,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -982,6 +1078,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details on why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1011,7 +1118,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eaa84a..2838b89 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -436,6 +437,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The below code is a temporary hack to check for
+		 * server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -851,7 +865,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -868,6 +882,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
+		if (two_phase)
+			appendStringInfoString(&cmd, " TWO_PHASE");
+
 		switch (snapshot_action)
 		{
 			case CRS_EXPORT_SNAPSHOT:
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 453efc5..74df75e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing preparing
+				 * transactions that have locked [user] catalog tables
+				 * exclusively but as of now we ask users not to do such an
+				 * operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..89d91c2 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,12 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= slot->data.two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +540,22 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +616,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index cb42fcb..2c191de 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 19e96f3..48239c0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2574,7 +2574,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2665,7 +2665,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2712,7 +2712,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2732,7 +2732,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2751,19 +2751,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2781,12 +2782,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because two-phase was
+	 * not previously enabled, or that are not covered by the initial
+	 * snapshot, need to be sent later along with commit prepared, and they
+	 * must be before this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index cc50eb8..b923c95 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1071,7 +1067,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1158,3 +1155,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY.
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index bbb659d..8ba1ad1 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions the subscription two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ...
+ * REFRESH PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider the case where it was
+ * initially on and we have received some prepare, and then we turn it off: at
+ * commit time the server will send the entire transaction data along with the
+ * commit. With some more analysis, we could allow changing this option from
+ * off to on, but it is not clear that this alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -804,6 +880,183 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	Assert(prepare_data.prepare_lsn == remote_final_lsn);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from the same node that point to
+	 * publications on the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	begin_replication_step();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* There is no transaction when COMMIT PREPARED is called */
+	begin_replication_step();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
+		begin_replication_step();
+		FinishPreparedTransaction(gid, false);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2082,6 +2335,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2561,6 +2830,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3062,6 +3334,20 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+	Assert(TransactionIdIsValid(xid));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3230,15 +3516,69 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 63f108f..7a1d42a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -839,18 +948,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1269,3 +1368,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eead144..0910546 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -103,7 +103,6 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
-%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -243,7 +242,7 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
 			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
@@ -372,11 +371,6 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
-opt_two_phase:
-			K_TWO_PHASE						{ $$ = true; }
-			| /* EMPTY */					{ $$ = false; }
-			;
-
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8c18b4e..33b85d8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -283,6 +283,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index faeea9f..9f0b13f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -370,7 +370,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 8f53cc7..8141311 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4304,6 +4305,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4347,9 +4349,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The code below is a temporary hack that checks for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4370,6 +4388,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4395,6 +4414,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4422,6 +4443,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4463,6 +4485,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..6caa701 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,7 +6415,9 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary and streaming are only supported in v14 and higher.
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
@@ -6423,6 +6425,17 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/*
+		 * Two_phase is only supported in v15 and higher.
+		 *
+		 * FIXME: When PG15 development starts, change the following
+		 * 140000 to 150000
+		 */
+		if (pset.sversion >= 140000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index bd8e9ea..d2f2727 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2760,7 +2760,7 @@ psql_completion(const char *text, int start, int end)
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
 					  "enabled", "slot_name", "streaming",
-					  "synchronous_commit");
+					  "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0060ebf..e84353e 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -94,6 +104,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index faa3a25..ebc43a0 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
+	bool		two_phase;
 	List	   *options;
 } CreateReplicationSlotCmd;
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..0b071a6 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_STREAMING command. We can't rely solely on the twophase
+	 * flag which only tells whether the plugin provided all the necessary
+	 * two-phase callbacks.
+	 *
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..e20f2da 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare, and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d8..d7c785b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -297,7 +297,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -636,7 +640,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 2eb7e3a..34d95ea 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -84,11 +84,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+# Create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber. There will be
+# 2 prepared transactions, one for each of the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..cabc0bb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,12 +1388,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1
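
The TAP tests in the patch above wait for `subtwophasestate` to reach 'e' by polling a query until it returns true, rather than sleeping for a fixed time. A rough Python sketch of that poll-until-true pattern follows. It is not the PostgresNode implementation: the function name `poll_until` and the default timeout and interval are hypothetical values chosen for illustration.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool],
               timeout_s: float = 180.0,   # hypothetical default
               interval_s: float = 0.1) -> bool:  # hypothetical default
    """Call check() repeatedly until it returns True or the timeout expires.

    Returns True on success and False on timeout, so the caller decides
    whether a timeout is fatal (the TAP tests use "or die").
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A caller would wrap the actual query in `check`, for example a function that runs the `pg_subscription` query and compares the result with `t`, and raise an error when `poll_until` returns False.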

v87-0004-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 5661755324ec0a0a8e88003a26dce35fa4da665c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 17 Jun 2021 09:44:01 -0400
Subject: [PATCH v87] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
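
For readers of the protocol change: the Stream Prepare message documented by this patch consists of Byte1('p'), an Int8 flags field, two Int64 LSNs, an Int64 prepare timestamp in microseconds since 2000-01-01, an Int32 xid, and the GID as a String. A minimal Python sketch of decoding such a message is shown here. It assumes the usual protocol conventions (integers in network byte order, String null-terminated); the names `StreamPrepare` and `parse_stream_prepare` are hypothetical, and decoding the GID as UTF-8 is an assumption, not something the patch states.

```python
import struct
from datetime import datetime, timedelta, timezone
from typing import NamedTuple

# PostgreSQL timestamps count microseconds from this epoch.
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

# Int8 flags, Int64 prepare LSN, Int64 end LSN, Int64 timestamp, Int32 xid.
_FIXED = struct.Struct("!BQQqI")


class StreamPrepare(NamedTuple):  # hypothetical container type
    flags: int
    prepare_lsn: int
    end_lsn: int
    prepare_time: datetime
    xid: int
    gid: str


def parse_stream_prepare(buf: bytes) -> StreamPrepare:
    """Decode one Stream Prepare message, including its leading type byte."""
    if buf[:1] != b"p":
        raise ValueError("not a Stream Prepare message")
    flags, prepare_lsn, end_lsn, ts_us, xid = _FIXED.unpack_from(buf, 1)
    if flags != 0:
        raise ValueError("flags are currently unused and must be 0")
    start = 1 + _FIXED.size
    end = buf.index(b"\x00", start)       # String is null-terminated
    gid = buf[start:end].decode("utf-8")  # assumption: GID is UTF-8
    return StreamPrepare(flags, prepare_lsn, end_lsn,
                         PG_EPOCH + timedelta(microseconds=ts_us), xid, gid)
```

The fixed-width fields are unpacked in one call so the field order stays visibly in step with the documented message layout.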
---
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  10 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 135 +++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1018 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 94e60e0..ae549db 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2895,7 +2895,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7400,7 +7400,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7663,6 +7663,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 3bcef78..4238baa 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 08d0295..7df8742 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -926,12 +911,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 8ba1ad1..aea0224 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1057,6 +1057,87 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	Assert(!am_tablesync_worker());
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions from the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1272,30 +1353,20 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	if (in_streamed_transaction)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROTOCOL_VIOLATION),
-				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
 	/* Make sure we have an open transaction */
 	begin_replication_step();
 
@@ -1306,7 +1377,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -1327,7 +1398,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1406,6 +1477,32 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2351,6 +2448,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7a1d42a..d5a284d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index e20f2da..7a4804f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC with streaming
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is rolled back.
+# 3. After the servers are restarted, the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then both servers crash before the 2PC transaction is committed.
+# 3. After the servers are restarted, the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted, the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted, the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
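
A note on the expected counts in the tests above: after the prepared
transaction, keys 1..5000 all exist (1 and 2 from the initial data, 3..5000
from the INSERT), the UPDATE does not change the row count, and the DELETE
removes every key that is a multiple of 3. That leaves 5000 - 1666 = 3334
rows, and 3335 once the extra row 99999 is inserted outside the transaction.
The following is only a sketch to check that arithmetic; rows_after_2pc() is
a hypothetical helper and is not part of the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): count the rows left in
 * test_tab once the prepared transaction used by these tests commits.
 * Keys 1..last all exist at that point, and the DELETE removes every
 * key that is a multiple of 3.
 */
static int
rows_after_2pc(int last)
{
	int			count = 0;

	assert(last >= 2);			/* rows 1 and 2 are the initial data */

	for (int a = 1; a <= last; a++)
	{
		if (a % 3 != 0)
			count++;
	}
	return count;
}
```

Calling rows_after_2pc(5000) gives the 3334 that the tests compare against.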

Attachment: v87-0005-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From 01343b4a30ed1c79aca301b6d8f66221a590d2b8 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 17 Jun 2021 09:56:42 -0400
Subject: [PATCH v87] Skip empty transactions for logical replication.

The current logical replication behaviour is to send every transaction to
the subscriber even if the transaction is empty (because it does not
contain changes from the selected publications). Building and transmitting
these empty transactions wastes CPU cycles and network bandwidth.

This patch addresses the problem by postponing the BEGIN / BEGIN PREPARE
message until the first change. When processing a COMMIT or PREPARE message,
if no change has been sent for that transaction, the COMMIT or PREPARE
message is not sent either. As a result, pgoutput skips the BEGIN / COMMIT
or BEGIN PREPARE / PREPARE messages for transactions that are empty.

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
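
For readers skimming the thread, the postponed-BEGIN idea from the commit
message can be sketched in a few lines of standalone C. This is not the
pgoutput code: SketchTxn, SketchOutput and the sketch_* functions are
hypothetical names, and the "stream" is just a string with one character
per message (B = begin, I = change, C = commit). The only thing taken from
the patch is the idea of a per-transaction flag like sent_begin_txn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_LOG_MAX 32

/* Hypothetical per-transaction state, modelled on PGOutputTxnData. */
typedef struct SketchTxn
{
	bool		sent_begin;		/* has BEGIN been sent downstream yet? */
} SketchTxn;

/* Hypothetical stand-in for the output stream: one char per message. */
typedef struct SketchOutput
{
	char		log[SKETCH_LOG_MAX];
	size_t		len;
} SketchOutput;

static void
sketch_output_init(SketchOutput *out)
{
	memset(out, 0, sizeof(*out));
}

static void
sketch_emit(SketchOutput *out, char kind)
{
	assert(out->len + 1 < SKETCH_LOG_MAX);
	out->log[out->len++] = kind;
	out->log[out->len] = '\0';
}

/* BEGIN callback: remember that nothing was sent, but emit nothing. */
static void
sketch_begin(SketchTxn *txn)
{
	txn->sent_begin = false;
}

/* Change callback: send the postponed BEGIN first, then the change. */
static void
sketch_change(SketchOutput *out, SketchTxn *txn)
{
	if (!txn->sent_begin)
	{
		sketch_emit(out, 'B');
		txn->sent_begin = true;
	}
	sketch_emit(out, 'I');
}

/* COMMIT callback: skip the message if BEGIN was never sent. */
static void
sketch_commit(SketchOutput *out, SketchTxn *txn)
{
	if (!txn->sent_begin)
		return;
	sketch_emit(out, 'C');
}
```

With this shape, a transaction that goes straight from sketch_begin() to
sketch_commit() leaves the output untouched, while one with two changes
produces B, I, I, C. The PREPARE path would presumably follow the same
pattern with BEGIN PREPARE / PREPARE in place of BEGIN / COMMIT.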
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  38 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 158 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  46 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 286 insertions(+), 39 deletions(-)
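
The proto.c hunk in this patch changes the field order of the commit
prepared message, so here is a small standalone sketch of that layout as it
appears in the hunk: flags, prepare end LSN, commit LSN, commit end LSN,
prepare time, commit time, xid, with integers in network byte order. The
struct and function names are hypothetical, and anything the message carries
outside the lines shown in the hunk is deliberately left out. A zero LSN is
rejected on read, mirroring the InvalidXLogRecPtr checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the fields visible in the proto.c hunk. */
typedef struct SketchCommitPrepared
{
	uint8_t		flags;
	uint64_t	prepare_end_lsn;
	uint64_t	commit_lsn;
	uint64_t	commit_end_lsn;
	int64_t		prepare_time;
	int64_t		commit_time;
	uint32_t	xid;
} SketchCommitPrepared;

#define SKETCH_CP_SIZE (1 + 5 * 8 + 4)

static void
put_be64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t) (v >> (56 - 8 * i));
}

static uint64_t
get_be64(const uint8_t *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

static void
put_be32(uint8_t *p, uint32_t v)
{
	for (int i = 0; i < 4; i++)
		p[i] = (uint8_t) (v >> (24 - 8 * i));
}

static uint32_t
get_be32(const uint8_t *p)
{
	uint32_t	v = 0;

	for (int i = 0; i < 4; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Serialize in the order used by the patched writer; returns bytes written. */
static size_t
sketch_write_commit_prepared(uint8_t *buf, const SketchCommitPrepared *msg)
{
	size_t		off = 0;

	buf[off++] = msg->flags;
	put_be64(buf + off, msg->prepare_end_lsn); off += 8;
	put_be64(buf + off, msg->commit_lsn); off += 8;
	put_be64(buf + off, msg->commit_end_lsn); off += 8;
	put_be64(buf + off, (uint64_t) msg->prepare_time); off += 8;
	put_be64(buf + off, (uint64_t) msg->commit_time); off += 8;
	put_be32(buf + off, msg->xid); off += 4;
	assert(off == SKETCH_CP_SIZE);
	return off;
}

/* Parse in the same order; returns 0 on success, -1 on a rejected message. */
static int
sketch_read_commit_prepared(const uint8_t *buf, SketchCommitPrepared *msg)
{
	size_t		off = 0;

	msg->flags = buf[off++];
	msg->prepare_end_lsn = get_be64(buf + off); off += 8;
	msg->commit_lsn = get_be64(buf + off); off += 8;
	msg->commit_end_lsn = get_be64(buf + off); off += 8;
	msg->prepare_time = (int64_t) get_be64(buf + off); off += 8;
	msg->commit_time = (int64_t) get_be64(buf + off); off += 8;
	msg->xid = get_be32(buf + off);

	/* Nonzero flags and unset (zero) LSNs are errors in the real reader. */
	if (msg->flags != 0 || msg->prepare_end_lsn == 0 ||
		msg->commit_lsn == 0 || msg->commit_end_lsn == 0)
		return -1;
	return 0;
}
```

The point of the sketch is only that writer and reader must agree on the new
order: a subscriber built without this patch would read prepare_end_lsn where
it expects commit_lsn.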

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index ffa5ca7..d499d63 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -879,11 +879,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check whether the
+      plugin has received this <command>PREPARE TRANSACTION</command>; if so,
+      it can commit the transaction, and otherwise it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index ae549db..40d3011 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7552,6 +7552,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7566,6 +7573,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 89d91c2..97ca648 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -934,7 +935,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -969,7 +971,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 48239c0..6cdca07 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2792,7 +2792,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index aea0224..807561c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -982,27 +982,39 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* There is no transaction when COMMIT PREPARED is called */
-	begin_replication_step();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In that case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	end_replication_step();
-	CommitTransactionCommand();
+		/* There is no transaction when COMMIT PREPARED is called */
+		begin_replication_step();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index d5a284d..43679a2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -132,6 +134,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -401,10 +408,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -419,8 +448,22 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
+	txn->output_plugin_private = NULL;
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -432,10 +475,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -450,8 +511,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty prepared transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -462,12 +533,33 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of COMMIT PREPARED of an empty transaction");
+			return;
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -480,8 +572,26 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of ROLLBACK of an empty transaction");
+			return;
+		}
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -630,11 +740,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -668,6 +783,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -770,6 +894,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -777,6 +902,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -813,6 +942,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -833,6 +971,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -845,6 +984,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 7a4804f..2fa60b5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d7c785b..ffc0b56 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -442,7 +442,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 4c372a6..8a33641 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 25;
 
 ###############################
 # Setup
@@ -318,10 +318,9 @@ $node_publisher->safe_psql('postgres', "
 
 $node_publisher->wait_for_catchup($appname_copy);
 
-# Check that the transaction has been prepared on the subscriber, there will be 2
-# prepared transactions for the 2 subscriptions.
+# Check that the transaction has been prepared on the subscriber
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;;");
-is($result, qq(2), 'transaction is prepared on subscriber');
+is($result, qq(1), 'transaction is prepared on subscriber');
 
 # Now commit the insert and verify that it IS replicated
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
@@ -337,6 +336,45 @@ is($result, qq(2), 'replicated data in subscriber table');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+   "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot
+$node_publisher->safe_psql('postgres', "
+   BEGIN;
+   INSERT INTO tab_nopub SELECT generate_series(1,10);
+   PREPARE TRANSACTION 'empty_transaction';
+   COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+   'postgres', qq(
+       SELECT get_byte(data, 0)
+       FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+           'proto_version', '1',
+           'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+   'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cabc0bb..ad62bbe 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1597,6 +1597,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
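
The deferred-BEGIN logic in the pgoutput hunks above can be modelled outside PostgreSQL. The following is a minimal standalone sketch, not the patch's code: `TxnState`, `Stream`, `emit`, `txn_begin`, `txn_change`, `txn_commit` and `run_txn` are hypothetical names, and single letters stand in for the real protocol messages. It only illustrates the idea that BEGIN is postponed until the first published change and that COMMIT is skipped when BEGIN was never sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for PGOutputTxnData; not PostgreSQL code. */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been written to the stream? */
} TxnState;

/* Collects what would go over the wire, one letter per message. */
typedef struct Stream
{
	char		buf[64];
	size_t		len;
} Stream;

/* Append one message letter, keeping the buffer NUL-terminated. */
static void
emit(Stream *out, char msg)
{
	assert(out != NULL);
	if (out->len + 1 < sizeof(out->buf))
	{
		out->buf[out->len++] = msg;
		out->buf[out->len] = '\0';
	}
}

/* BEGIN callback: only reset the per-transaction flag, send nothing yet. */
static void
txn_begin(TxnState *txn)
{
	txn->sent_begin = false;
}

/* Change callback: ignore unpublished changes, send BEGIN lazily. */
static void
txn_change(Stream *out, TxnState *txn, bool published)
{
	if (!published)
		return;
	if (!txn->sent_begin)
	{
		emit(out, 'B');
		txn->sent_begin = true;
	}
	emit(out, 'I');
}

/* COMMIT callback: returns false when the transaction was empty. */
static bool
txn_commit(Stream *out, TxnState *txn)
{
	if (!txn->sent_begin)
		return false;			/* nothing was sent, so skip COMMIT too */
	emit(out, 'C');
	return true;
}

/* Replay one transaction whose changes are described by published[]. */
static void
run_txn(Stream *out, const bool *published, int nchanges)
{
	TxnState	txn;

	txn_begin(&txn);
	for (int i = 0; i < nchanges; i++)
		txn_change(out, &txn, published[i]);
	txn_commit(out, &txn);
}
```

In this model a transaction touching only unpublished tables leaves the stream empty, which mirrors what the updated 020_messages.pl and 021_twophase.pl tests expect from the slot.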

Attachment: v87-0002-Add-support-for-two-phase-decoding-in-pg_recvlog.patch (application/octet-stream)
From aab62834643814dcffc344bcf1228d4d6b2a766a Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Thu, 17 Jun 2021 09:29:05 -0400
Subject: [PATCH v87] Add support for two-phase decoding in pg_recvlogical.

Modified streamutils to pass in two-phase option when calling
CREATE_REPLICATION_SLOT. Added new option --two-phase in pg_recvlogical
to allow decoding of two-phase transactions.
---
 doc/src/sgml/logicaldecoding.sgml             | 20 ++++++++++--
 doc/src/sgml/ref/pg_recvlogical.sgml          | 16 ++++++++++
 src/bin/pg_basebackup/pg_basebackup.c         |  2 +-
 src/bin/pg_basebackup/pg_receivewal.c         |  2 +-
 src/bin/pg_basebackup/pg_recvlogical.c        | 19 +++++++++--
 src/bin/pg_basebackup/streamutil.c            |  6 +++-
 src/bin/pg_basebackup/streamutil.h            |  2 +-
 src/bin/pg_basebackup/t/030_pg_recvlogical.pl | 45 ++++++++++++++++++++++++++-
 8 files changed, 102 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 1765ea6..ffa5ca7 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -39,7 +39,6 @@
 
   <sect1 id="logicaldecoding-example">
    <title>Logical Decoding Examples</title>
-
    <para>
     The following example demonstrates controlling logical decoding using the
     SQL interface.
@@ -144,14 +143,14 @@ postgres=# SELECT pg_drop_replication_slot('regression_slot');
 </programlisting>
 
    <para>
-    The following example shows how logical decoding is controlled over the
+    The following examples show how logical decoding is controlled over the
     streaming replication protocol, using the
     program <xref linkend="app-pgrecvlogical"/> included in the PostgreSQL
     distribution.  This requires that client authentication is set up to allow
     replication connections
     (see <xref linkend="streaming-replication-authentication"/>) and
     that <varname>max_wal_senders</varname> is set sufficiently high to allow
-    an additional connection.
+    an additional connection. The second example enables two-phase decoding.
    </para>
 <programlisting>
 $ pg_recvlogical -d postgres --slot=test --create-slot
@@ -164,6 +163,21 @@ table public.data: INSERT: id[integer]:4 data[text]:'4'
 COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
+
+$ pg_recvlogical -d postgres --slot=test --create-slot --two-phase
+$ pg_recvlogical -d postgres --slot=test --start -f -
+<keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
+$ psql -d postgres -c "BEGIN;INSERT INTO data(data) VALUES('5');PREPARE TRANSACTION 'test';"
+$ fg
+BEGIN 694
+table public.data: INSERT: id[integer]:5 data[text]:'5'
+PREPARE TRANSACTION 'test', txid 694
+<keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
+$ psql -d postgres -c "COMMIT PREPARED 'test';"
+$ fg
+COMMIT PREPARED 'test', txid 694
+<keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
+$ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
 
   <para>
diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index 6b1d98d..57c7e1b 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -65,6 +65,11 @@ PostgreSQL documentation
         <option>--plugin</option>, for the database specified
         by <option>--dbname</option>.
        </para>
+
+       <para>
+        The <option>--two-phase</option> option can be specified with
+        <option>--create-slot</option> to enable two-phase decoding.
+       </para>
       </listitem>
      </varlistentry>
 
@@ -265,6 +270,17 @@ PostgreSQL documentation
        </para>
        </listitem>
      </varlistentry>
+
+     <varlistentry>
+       <term><option>-t</option></term>
+       <term><option>--two-phase</option></term>
+       <listitem>
+       <para>
+        Enables two-phase decoding. This option should only be used with
+        <option>--create-slot</option>.
+       </para>
+       </listitem>
+     </varlistentry>
     </variablelist>
    </para>
 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 16d8929..8bb0acf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -646,7 +646,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 	if (temp_replication_slot || create_slot)
 	{
 		if (!CreateReplicationSlot(param->bgconn, replication_slot, NULL,
-								   temp_replication_slot, true, true, false))
+								   temp_replication_slot, true, true, false, false))
 			exit(1);
 
 		if (verbose)
diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c
index 0d15012..c1334fa 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -741,7 +741,7 @@ main(int argc, char **argv)
 			pg_log_info("creating replication slot \"%s\"", replication_slot);
 
 		if (!CreateReplicationSlot(conn, replication_slot, NULL, false, true, false,
-								   slot_exists_ok))
+								   slot_exists_ok, false))
 			exit(1);
 		exit(0);
 	}
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 5efec16..729082b 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -35,6 +35,7 @@
 /* Global Options */
 static char *outfile = NULL;
 static int	verbose = 0;
+static bool two_phase = false;
 static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;	/* 10 sec = default */
 static int	fsync_interval = 10 * 1000; /* 10 sec = default */
@@ -94,6 +95,7 @@ usage(void)
 			 "                         time between status packets sent to server (default: %d)\n"), (standby_message_timeout / 1000));
 	printf(_("  -S, --slot=SLOTNAME    name of the logical replication slot\n"));
 	printf(_("  -v, --verbose          output verbose messages\n"));
+	printf(_("  -t, --two-phase        enable two-phase decoding when creating a slot\n"));
 	printf(_("  -V, --version          output version information, then exit\n"));
 	printf(_("  -?, --help             show this help, then exit\n"));
 	printf(_("\nConnection options:\n"));
@@ -678,6 +680,7 @@ main(int argc, char **argv)
 		{"fsync-interval", required_argument, NULL, 'F'},
 		{"no-loop", no_argument, NULL, 'n'},
 		{"verbose", no_argument, NULL, 'v'},
+		{"two-phase", no_argument, NULL, 't'},
 		{"version", no_argument, NULL, 'V'},
 		{"help", no_argument, NULL, '?'},
 /* connection options */
@@ -726,7 +729,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "E:f:F:nvd:h:p:U:wWI:o:P:s:S:",
+	while ((c = getopt_long(argc, argv, "E:f:F:nvtd:h:p:U:wWI:o:P:s:S:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -749,6 +752,9 @@ main(int argc, char **argv)
 			case 'v':
 				verbose++;
 				break;
+			case 't':
+				two_phase = true;
+				break;
 /* connection options */
 			case 'd':
 				dbname = pg_strdup(optarg);
@@ -920,6 +926,15 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (two_phase && !do_create_slot)
+	{
+		pg_log_error("--two-phase may only be specified with --create-slot");
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}
+
+
 #ifndef WIN32
 	pqsignal(SIGINT, sigint_handler);
 	pqsignal(SIGHUP, sighup_handler);
@@ -976,7 +991,7 @@ main(int argc, char **argv)
 			pg_log_info("creating replication slot \"%s\"", replication_slot);
 
 		if (!CreateReplicationSlot(conn, replication_slot, plugin, false,
-								   false, false, slot_exists_ok))
+								   false, false, slot_exists_ok, two_phase))
 			exit(1);
 		startpos = InvalidXLogRecPtr;
 	}
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 99daf0e..1f99aae 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -486,7 +486,7 @@ RunIdentifySystem(PGconn *conn, char **sysid, TimeLineID *starttli,
 bool
 CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 					  bool is_temporary, bool is_physical, bool reserve_wal,
-					  bool slot_exists_ok)
+					  bool slot_exists_ok, bool two_phase)
 {
 	PQExpBuffer query;
 	PGresult   *res;
@@ -495,6 +495,7 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 
 	Assert((is_physical && plugin == NULL) ||
 		   (!is_physical && plugin != NULL));
+	Assert(!(two_phase && is_physical));
 	Assert(slot_name != NULL);
 
 	/* Build query */
@@ -510,6 +511,9 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 	else
 	{
 		appendPQExpBuffer(query, " LOGICAL \"%s\"", plugin);
+		if (two_phase && PQserverVersion(conn) >= 140000)
+			appendPQExpBufferStr(query, " TWO_PHASE");
+
 		if (PQserverVersion(conn) >= 100000)
 			/* pg_recvlogical doesn't use an exported snapshot, so suppress */
 			appendPQExpBufferStr(query, " NOEXPORT_SNAPSHOT");
diff --git a/src/bin/pg_basebackup/streamutil.h b/src/bin/pg_basebackup/streamutil.h
index 10f87ad..504803b 100644
--- a/src/bin/pg_basebackup/streamutil.h
+++ b/src/bin/pg_basebackup/streamutil.h
@@ -34,7 +34,7 @@ extern PGconn *GetConnection(void);
 extern bool CreateReplicationSlot(PGconn *conn, const char *slot_name,
 								  const char *plugin, bool is_temporary,
 								  bool is_physical, bool reserve_wal,
-								  bool slot_exists_ok);
+								  bool slot_exists_ok, bool two_phase);
 extern bool DropReplicationSlot(PGconn *conn, const char *slot_name);
 extern bool RunIdentifySystem(PGconn *conn, char **sysid,
 							  TimeLineID *starttli,
diff --git a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
index 53f4181..bbbf9e2 100644
--- a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
+++ b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use TestLib;
 use PostgresNode;
-use Test::More tests => 15;
+use Test::More tests => 20;
 
 program_help_ok('pg_recvlogical');
 program_version_ok('pg_recvlogical');
@@ -22,6 +22,7 @@ max_replication_slots = 4
 max_wal_senders = 4
 log_min_messages = 'debug1'
 log_error_verbosity = verbose
+max_prepared_transactions = 10
 });
 $node->dump_info;
 $node->start;
@@ -63,3 +64,45 @@ $node->command_ok(
 		'--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'
 	],
 	'replayed a transaction');
+
+$node->command_ok(
+	[
+		'pg_recvlogical',           '-S',
+		'test',                     '-d',
+		$node->connstr('postgres'), '--drop-slot'
+	],
+	'slot dropped');
+
+#test with two-phase option enabled
+$node->command_ok(
+	[
+		'pg_recvlogical',           '-S',
+		'test',                     '-d',
+		$node->connstr('postgres'), '--create-slot', '--two-phase'
+	],
+	'slot with two-phase created');
+
+$slot = $node->slot('test');
+isnt($slot->{'restart_lsn'}, '', 'restart lsn is defined for new slot');
+
+$node->safe_psql('postgres',
+	"BEGIN; INSERT INTO test_table values (11); PREPARE TRANSACTION 'test'");
+$node->safe_psql('postgres',
+	"COMMIT PREPARED 'test'");
+$nextlsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn()');
+chomp($nextlsn);
+
+$node->command_fails(
+	[
+		'pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'),
+		'--start', '--endpos', "$nextlsn", '--two-phase', '--no-loop', '-f', '-'
+	],
+	'incorrect usage');
+
+$node->command_ok(
+	[
+		'pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'),
+		'--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'
+	],
+	'replayed a two-phase transaction');
-- 
1.8.3.1
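
The streamutil.c hunk above decides which keywords are appended to the CREATE_REPLICATION_SLOT command for a logical slot. A minimal standalone sketch of that decision follows. It is an illustration under assumptions, not the patch's code: `build_logical_slot_cmd` is a hypothetical helper that uses `snprintf` instead of the PQExpBuffer calls, and it covers only the logical-slot branch shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of streamutil.c): format the replication
 * command for a logical slot into buf.  Returns the command length, or -1
 * if the buffer is too small.
 */
static int
build_logical_slot_cmd(char *buf, size_t buflen,
					   const char *slot_name, const char *plugin,
					   bool two_phase, int server_version)
{
	int			n;

	assert(buf != NULL && slot_name != NULL && plugin != NULL);

	n = snprintf(buf, buflen,
				 "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"%s%s",
				 slot_name, plugin,
	/* TWO_PHASE is only sent to servers that understand it */
				 (two_phase && server_version >= 140000) ? " TWO_PHASE" : "",
	/* snapshot export is suppressed on version 10 and later */
				 (server_version >= 100000) ? " NOEXPORT_SNAPSHOT" : "");

	if (n < 0 || (size_t) n >= buflen)
		return -1;
	return n;
}
```

As in the hunk, the two-phase request is silently dropped when the server version is below 140000, so a caller that needs two-phase decoding may want to check the version separately.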

#355 Peter Smith
smithpb2250@gmail.com
In reply to: Greg Nancarrow (#353)

On Thu, Jun 17, 2021 at 6:22 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Wed, Jun 16, 2021 at 9:08 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v86*

A couple of comments:

(1) I think one of my suggested changes was missed (or was that intentional?):

BEFORE:
+                The LSN of the commit prepared.
AFTER:
+                The LSN of the commit prepared transaction.

No, not missed. I already dismissed that one and wrote about it when I
posted v85 [1].

(2) In light of Tom Lane's recent changes in:

fe6a20ce54cbbb6fcfe9f6675d563af836ae799a (Don't use Asserts to check
for violations of replication protocol)

there appear to be some instances of such code in these patches.

Yes, I already noted [2] that there are likely to be such cases which need
to be fixed.

For example, in the v86-0001 patch:

+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepPreparedTxnData prepare_data;
+ char gid[GIDSIZE];
+
+ logicalrep_read_prepare(s, &prepare_data);
+
+ Assert(prepare_data.prepare_lsn == remote_final_lsn);

The above Assert() should be changed to something like:

+    if (prepare_data.prepare_lsn != remote_final_lsn)
+        ereport(ERROR,
+                (errcode(ERRCODE_PROTOCOL_VIOLATION),
+                 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+                                 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+                                 LSN_FORMAT_ARGS(remote_final_lsn))));

Without being more familiar with this code, it's difficult for me to
judge exactly how many of such cases are in these patches.

Thanks for the above example. I will fix this one later, after
receiving some more reviews and reports of other Assert cases just
like this one.

------
[1]: /messages/by-id/CAHut+PvOVkiVBf4P5chdVSoVs5=a=F_GtTSHHoXDb4LiOM_8Qw@mail.gmail.com
[2]: /messages/by-id/CAHut+Pvdio4=OE6cz5pr8VcJNcAgt5uGBPdKf-tnGEMa1mANGg@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

#356 vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#354)

On Thu, Jun 17, 2021 at 7:40 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Wed, Jun 16, 2021 at 9:08 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Fri, Jun 11, 2021 at 6:34 PM Peter Smith <smithpb2250@gmail.com> wrote:

KNOWN ISSUES: This v85 patch was built and tested using yesterday's
master, but due to lots of recent activity in the replication area I
expect it will be broken for HEAD very soon (if not already). I'll
rebase it again ASAP to try to keep it in working order.

Please find attached the latest patch set v86*

I've modified the patchset based on comments received on thread [1]
for the CREATE_REPLICATION_SLOT
changes. Based on the request from that thread, I've taken out those
changes as two new patches (patch-1 and patch-2)
and made this into 5 patches. I've also changed the logic to align
with the changes in the command syntax.

A few comments:
1) This content is present in both
v87-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch and
v87-0003-Add-support-for-prepared-transactions-to-built-i.patch; it
can be removed from one of them.
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this logical replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>

2) This change is not required; it can be removed:
<sect1 id="logicaldecoding-example">
<title>Logical Decoding Examples</title>
-
<para>
The following example demonstrates controlling logical decoding using the
SQL interface.

3) We could add a comment labelling the existing example as example 1
and the newly added one as example 2, each with a short description,
so that the examples are clearly marked.
COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
+
+$ pg_recvlogical -d postgres --slot=test --create-slot --two-phase
+$ pg_recvlogical -d postgres --slot=test --start -f -
4) You could mention "Before you use two-phase commit commands, you
must set max_prepared_transactions to at least 1" for example 2.
 $ pg_recvlogical -d postgres --slot=test --drop-slot
+
+$ pg_recvlogical -d postgres --slot=test --create-slot --two-phase
+$ pg_recvlogical -d postgres --slot=test --start -f -
5) This should come before --verbose; the options are documented alphabetically:
+     <varlistentry>
+       <term><option>-t</option></term>
+       <term><option>--two-phase</option></term>
+       <listitem>
+       <para>
+        Enables two-phase decoding. This option should only be used with
+        <option>--create-slot</option>
+       </para>
+       </listitem>
+     </varlistentry>

6) This should come before --verbose; the options are printed alphabetically:
printf(_(" -v, --verbose output verbose messages\n"));
+ printf(_(" -t, --two-phase enable two-phase decoding when creating a slot\n"));
printf(_(" -V, --version output version information, then exit\n"));

Regards,
Vignesh

#357Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#355)

On Fri, Jun 18, 2021 at 7:43 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Jun 17, 2021 at 6:22 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

For example, in the v86-0001 patch:

+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepPreparedTxnData prepare_data;
+ char gid[GIDSIZE];
+
+ logicalrep_read_prepare(s, &prepare_data);
+
+ Assert(prepare_data.prepare_lsn == remote_final_lsn);

The above Assert() should be changed to something like:

+    if (prepare_data.prepare_lsn != remote_final_lsn)
+        ereport(ERROR,
+                (errcode(ERRCODE_PROTOCOL_VIOLATION),
+                 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+                                 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+                                 LSN_FORMAT_ARGS(remote_final_lsn))));

Without being more familiar with this code, it's difficult for me to
judge exactly how many of such cases are in these patches.

Thanks for the above example. I will fix this one later, after
receiving some more reviews and reports of other Assert cases just
like this one.

I think the asserts below also need to be changed along similar lines.

1.
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+ LogicalRepPreparedTxnData begin_data;
+ char gid[GIDSIZE];
+
+ /* Tablesync should never receive prepare. */
+ Assert(!am_tablesync_worker());
2.
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
..
+ Assert(TransactionIdIsValid(xid));
3.
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+ int nchanges = 0;
+ LogicalRepPreparedTxnData prepare_data;
+ TransactionId xid;
+ char gid[GIDSIZE];
+
..
..
+
+ /* Tablesync should never receive prepare. */
+ Assert(!am_tablesync_worker());

--
With Regards,
Amit Kapila.

#358Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#357)
5 attachment(s)

On Fri, Jun 18, 2021 at 3:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 18, 2021 at 7:43 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Jun 17, 2021 at 6:22 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

For example, in the v86-0001 patch:

+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepPreparedTxnData prepare_data;
+ char gid[GIDSIZE];
+
+ logicalrep_read_prepare(s, &prepare_data);
+
+ Assert(prepare_data.prepare_lsn == remote_final_lsn);

The above Assert() should be changed to something like:

+    if (prepare_data.prepare_lsn != remote_final_lsn)
+        ereport(ERROR,
+                (errcode(ERRCODE_PROTOCOL_VIOLATION),
+                 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+                                 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+                                 LSN_FORMAT_ARGS(remote_final_lsn))));

Without being more familiar with this code, it's difficult for me to
judge exactly how many of such cases are in these patches.

Thanks for the above example. I will fix this one later, after
receiving some more reviews and reports of other Assert cases just
like this one.

I think the asserts below also need to be changed along similar lines.

1.
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+ LogicalRepPreparedTxnData begin_data;
+ char gid[GIDSIZE];
+
+ /* Tablesync should never receive prepare. */
+ Assert(!am_tablesync_worker());
2.
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
..
+ Assert(TransactionIdIsValid(xid));
3.
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+ int nchanges = 0;
+ LogicalRepPreparedTxnData prepare_data;
+ TransactionId xid;
+ char gid[GIDSIZE];
+
..
..
+
+ /* Tablesync should never receive prepare. */
+ Assert(!am_tablesync_worker());

Please find attached the latest patch set v88*

Differences from v87* are:

* Rebased to HEAD @ today.

* Replaces several protocol Asserts with ereports
(ERRCODE_PROTOCOL_VIOLATION) in patch 0003 and 0004, as reported by
Greg [1] and Amit [2]. This is in keeping with the commit [3].
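
To illustrate the Assert-to-ereport change described above, here is a minimal standalone sketch. It does not use the PostgreSQL headers: DemoLsn and demo_check_prepare_lsn are hypothetical names standing in for XLogRecPtr and the check in apply_handle_prepare, and the buffer-based error reporting is only an assumption in place of ereport(). The idea is that data arriving over the wire gets a runtime check that still works in non-assert builds, while an assertion is kept only for caller bugs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t DemoLsn;

/*
 * Hypothetical helper: validate the prepare LSN carried by a protocol
 * message against the LSN the receiver expects. Returns true when they
 * match; otherwise writes a diagnostic into errbuf and returns false.
 */
static bool
demo_check_prepare_lsn(DemoLsn prepare_lsn, DemoLsn expected_lsn,
                       char *errbuf, size_t errbuf_len)
{
    /* A bad buffer is a caller bug, not bad input, so assert on it. */
    assert(errbuf != NULL && errbuf_len > 0);

    /* Remote data can be wrong in production builds: check at run time. */
    if (prepare_lsn != expected_lsn)
    {
        snprintf(errbuf, errbuf_len,
                 "incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
                 (unsigned int) (prepare_lsn >> 32), (unsigned int) prepare_lsn,
                 (unsigned int) (expected_lsn >> 32), (unsigned int) expected_lsn);
        return false;
    }

    errbuf[0] = '\0';
    return true;
}
```

In the real patches the failure path raises an ERROR with ERRCODE_PROTOCOL_VIOLATION rather than returning false; the sketch only shows which kind of condition gets which kind of check.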

----
[1]: /messages/by-id/CAHut+PuJKTNRjFre0VBufWMz9BEScC__nT+PUhbSaUNW2biPow@mail.gmail.com
[2]: /messages/by-id/CAA4eK1JO3HsOurS988=Jarej=AK6ChE1tLuMNP=AZCt6--hVrw@mail.gmail.com
[3]: https://github.com/postgres/postgres/commit/fe6a20ce54cbbb6fcfe9f6675d563af836ae799a

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v88-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch (application/octet-stream)
From 76193315c822fcabf25f6ff732cbd8b6d56d6492 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 21 Jun 2021 11:47:21 +1000
Subject: [PATCH v88] Add option to set two-phase in CREATE_REPLICATION_SLOT
 command.

CREATE_REPLICATION_SLOT modified to support two-phase decoding in the slot.
This will allow the decoding of commands like PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED for slots created with this option.
---
 doc/src/sgml/protocol.sgml             | 16 +++++++++++++++-
 src/backend/replication/repl_gram.y    | 12 ++++++++++++
 src/backend/replication/repl_scanner.l |  1 +
 src/backend/replication/walsender.c    | 18 +++++++++++++++---
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index bc2a2fe..205fbd2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> | <literal>TWO_PHASE</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this logical replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..eead144 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -283,6 +285,11 @@ create_slot_opt:
 				  $$ = makeDefElem("reserve_wal",
 								   (Node *)makeInteger(true), -1);
 				}
+			| K_TWO_PHASE
+				{
+				  $$ = makeDefElem("two_phase",
+								   (Node *)makeInteger(true), -1);
+				}
 			;
 
 /* DROP_REPLICATION_SLOT slot */
@@ -365,6 +372,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3224536..92c755f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -863,11 +863,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 static void
 parseCreateReplSlotOptions(CreateReplicationSlotCmd *cmd,
 						   bool *reserve_wal,
-						   CRSSnapshotAction *snapshot_action)
+						   CRSSnapshotAction *snapshot_action,
+						   bool *two_phase)
 {
 	ListCell   *lc;
 	bool		snapshot_action_given = false;
 	bool		reserve_wal_given = false;
+	bool		two_phase_given = false;
 
 	/* Parse options */
 	foreach(lc, cmd->options)
@@ -905,6 +907,15 @@ parseCreateReplSlotOptions(CreateReplicationSlotCmd *cmd,
 			reserve_wal_given = true;
 			*reserve_wal = true;
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_given || cmd->kind != REPLICATION_KIND_LOGICAL)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_given = true;
+			*two_phase = true;
+		}
 		else
 			elog(ERROR, "unrecognized option: %s", defel->defname);
 	}
@@ -920,6 +931,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 	char		xloc[MAXFNAMELEN];
 	char	   *slot_name;
 	bool		reserve_wal = false;
+	bool		two_phase = false;
 	CRSSnapshotAction snapshot_action = CRS_EXPORT_SNAPSHOT;
 	DestReceiver *dest;
 	TupOutputState *tstate;
@@ -929,7 +941,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	Assert(!MyReplicationSlot);
 
-	parseCreateReplSlotOptions(cmd, &reserve_wal, &snapshot_action);
+	parseCreateReplSlotOptions(cmd, &reserve_wal, &snapshot_action, &two_phase);
 
 	/* setup state for WalSndSegmentOpen */
 	sendTimeLineIsHistoric = false;
@@ -954,7 +966,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 */
 		ReplicationSlotCreate(cmd->slotname, true,
 							  cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL,
-							  false);
+							  two_phase);
 	}
 
 	if (cmd->kind == REPLICATION_KIND_LOGICAL)
-- 
1.8.3.1
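
The option handling that the 0001 patch adds to parseCreateReplSlotOptions() can be sketched in isolation roughly as follows. This is only an illustration under stated assumptions: the Demo* names are hypothetical, the options are plain strings instead of DefElem nodes, errors are return codes instead of ereport(), and the rule that reserve_wal is rejected for logical slots is an assumption of the sketch (the hunk above only shows the two_phase rule).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum DemoParseResult
{
    DEMO_PARSE_OK,
    DEMO_PARSE_CONFLICT,        /* option repeated, or invalid for slot kind */
    DEMO_PARSE_UNRECOGNIZED     /* unknown option name */
} DemoParseResult;

/*
 * Hypothetical stand-in for the option loop: each option may appear at
 * most once, and two_phase is accepted only for logical slots.
 */
static DemoParseResult
demo_parse_slot_options(const char *const *names, size_t n, bool is_logical,
                        bool *reserve_wal, bool *two_phase)
{
    bool        reserve_wal_given = false;
    bool        two_phase_given = false;

    *reserve_wal = false;
    *two_phase = false;

    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(names[i], "reserve_wal") == 0)
        {
            /* Assumption of this sketch: physical slots only. */
            if (reserve_wal_given || is_logical)
                return DEMO_PARSE_CONFLICT;
            reserve_wal_given = true;
            *reserve_wal = true;
        }
        else if (strcmp(names[i], "two_phase") == 0)
        {
            /* Mirrors the check in the patch: no repeats, logical only. */
            if (two_phase_given || !is_logical)
                return DEMO_PARSE_CONFLICT;
            two_phase_given = true;
            *two_phase = true;
        }
        else
            return DEMO_PARSE_UNRECOGNIZED;
    }

    return DEMO_PARSE_OK;
}
```

The per-option "given" flag is what lets a repeated option be reported as "conflicting or redundant options" instead of being silently accepted.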

v88-0002-Add-support-for-two-phase-decoding-in-pg_recvlog.patch (application/octet-stream)
From 1c7e05264e1eef7ce75e6f6e0a955b086bf8ae1c Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 21 Jun 2021 12:27:03 +1000
Subject: [PATCH v88] Add support for two-phase decoding in pg_recvlogical.

Modified streamutils to pass in two-phase option when calling
CREATE_REPLICATION_SLOT. Added new option --two-phase in pg_recvlogical
to allow decoding of two-phase transactions.
---
 doc/src/sgml/logicaldecoding.sgml             | 20 ++++++++++--
 doc/src/sgml/ref/pg_recvlogical.sgml          | 16 ++++++++++
 src/bin/pg_basebackup/pg_basebackup.c         |  2 +-
 src/bin/pg_basebackup/pg_receivewal.c         |  2 +-
 src/bin/pg_basebackup/pg_recvlogical.c        | 19 +++++++++--
 src/bin/pg_basebackup/streamutil.c            |  6 +++-
 src/bin/pg_basebackup/streamutil.h            |  2 +-
 src/bin/pg_basebackup/t/030_pg_recvlogical.pl | 45 ++++++++++++++++++++++++++-
 8 files changed, 102 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 1765ea6..ffa5ca7 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -39,7 +39,6 @@
 
   <sect1 id="logicaldecoding-example">
    <title>Logical Decoding Examples</title>
-
    <para>
     The following example demonstrates controlling logical decoding using the
     SQL interface.
@@ -144,14 +143,14 @@ postgres=# SELECT pg_drop_replication_slot('regression_slot');
 </programlisting>
 
    <para>
-    The following example shows how logical decoding is controlled over the
+    The following examples show how logical decoding is controlled over the
     streaming replication protocol, using the
     program <xref linkend="app-pgrecvlogical"/> included in the PostgreSQL
     distribution.  This requires that client authentication is set up to allow
     replication connections
     (see <xref linkend="streaming-replication-authentication"/>) and
     that <varname>max_wal_senders</varname> is set sufficiently high to allow
-    an additional connection.
+    an additional connection. The second example enables two-phase decoding.
    </para>
 <programlisting>
 $ pg_recvlogical -d postgres --slot=test --create-slot
@@ -164,6 +163,21 @@ table public.data: INSERT: id[integer]:4 data[text]:'4'
 COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
+
+$ pg_recvlogical -d postgres --slot=test --create-slot --two-phase
+$ pg_recvlogical -d postgres --slot=test --start -f -
+<keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
+$ psql -d postgres -c "BEGIN;INSERT INTO data(data) VALUES('5');PREPARE TRANSACTION 'test';"
+$ fg
+BEGIN 694
+table public.data: INSERT: id[integer]:5 data[text]:'5'
+PREPARE TRANSACTION 'test', txid 694
+<keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
+$ psql -d postgres -c "COMMIT PREPARED 'test';"
+$ fg
+COMMIT PREPARED 'test', txid 694
+<keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
+$ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
 
   <para>
diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index 6b1d98d..57c7e1b 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -65,6 +65,11 @@ PostgreSQL documentation
         <option>--plugin</option>, for the database specified
         by <option>--dbname</option>.
        </para>
+
+       <para>
+        The <option>--two-phase</option> can be specified with
+        <option>--create-slot</option> to enable two-phase decoding.
+       </para>
       </listitem>
      </varlistentry>
 
@@ -265,6 +270,17 @@ PostgreSQL documentation
        </para>
        </listitem>
      </varlistentry>
+
+     <varlistentry>
+       <term><option>-t</option></term>
+       <term><option>--two-phase</option></term>
+       <listitem>
+       <para>
+        Enables two-phase decoding. This option should only be used with
+        <option>--create-slot</option>
+       </para>
+       </listitem>
+     </varlistentry>
     </variablelist>
    </para>
 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 16d8929..8bb0acf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -646,7 +646,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 	if (temp_replication_slot || create_slot)
 	{
 		if (!CreateReplicationSlot(param->bgconn, replication_slot, NULL,
-								   temp_replication_slot, true, true, false))
+								   temp_replication_slot, true, true, false, false))
 			exit(1);
 
 		if (verbose)
diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c
index 0d15012..c1334fa 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -741,7 +741,7 @@ main(int argc, char **argv)
 			pg_log_info("creating replication slot \"%s\"", replication_slot);
 
 		if (!CreateReplicationSlot(conn, replication_slot, NULL, false, true, false,
-								   slot_exists_ok))
+								   slot_exists_ok, false))
 			exit(1);
 		exit(0);
 	}
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 5efec16..729082b 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -35,6 +35,7 @@
 /* Global Options */
 static char *outfile = NULL;
 static int	verbose = 0;
+static bool two_phase = false;
 static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;	/* 10 sec = default */
 static int	fsync_interval = 10 * 1000; /* 10 sec = default */
@@ -94,6 +95,7 @@ usage(void)
 			 "                         time between status packets sent to server (default: %d)\n"), (standby_message_timeout / 1000));
 	printf(_("  -S, --slot=SLOTNAME    name of the logical replication slot\n"));
 	printf(_("  -v, --verbose          output verbose messages\n"));
+	printf(_("  -t, --two-phase        enable two-phase decoding when creating a slot\n"));
 	printf(_("  -V, --version          output version information, then exit\n"));
 	printf(_("  -?, --help             show this help, then exit\n"));
 	printf(_("\nConnection options:\n"));
@@ -678,6 +680,7 @@ main(int argc, char **argv)
 		{"fsync-interval", required_argument, NULL, 'F'},
 		{"no-loop", no_argument, NULL, 'n'},
 		{"verbose", no_argument, NULL, 'v'},
+		{"two-phase", no_argument, NULL, 't'},
 		{"version", no_argument, NULL, 'V'},
 		{"help", no_argument, NULL, '?'},
 /* connection options */
@@ -726,7 +729,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "E:f:F:nvd:h:p:U:wWI:o:P:s:S:",
+	while ((c = getopt_long(argc, argv, "E:f:F:nvtd:h:p:U:wWI:o:P:s:S:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -749,6 +752,9 @@ main(int argc, char **argv)
 			case 'v':
 				verbose++;
 				break;
+			case 't':
+				two_phase = true;
+				break;
 /* connection options */
 			case 'd':
 				dbname = pg_strdup(optarg);
@@ -920,6 +926,15 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (two_phase && !do_create_slot)
+	{
+		pg_log_error("--two-phase may only be specified with --create-slot");
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}
+
+
 #ifndef WIN32
 	pqsignal(SIGINT, sigint_handler);
 	pqsignal(SIGHUP, sighup_handler);
@@ -976,7 +991,7 @@ main(int argc, char **argv)
 			pg_log_info("creating replication slot \"%s\"", replication_slot);
 
 		if (!CreateReplicationSlot(conn, replication_slot, plugin, false,
-								   false, false, slot_exists_ok))
+								   false, false, slot_exists_ok, two_phase))
 			exit(1);
 		startpos = InvalidXLogRecPtr;
 	}
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 99daf0e..1f99aae 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -486,7 +486,7 @@ RunIdentifySystem(PGconn *conn, char **sysid, TimeLineID *starttli,
 bool
 CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 					  bool is_temporary, bool is_physical, bool reserve_wal,
-					  bool slot_exists_ok)
+					  bool slot_exists_ok, bool two_phase)
 {
 	PQExpBuffer query;
 	PGresult   *res;
@@ -495,6 +495,7 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 
 	Assert((is_physical && plugin == NULL) ||
 		   (!is_physical && plugin != NULL));
+	Assert(!(two_phase && is_physical));
 	Assert(slot_name != NULL);
 
 	/* Build query */
@@ -510,6 +511,9 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 	else
 	{
 		appendPQExpBuffer(query, " LOGICAL \"%s\"", plugin);
+		if (two_phase && PQserverVersion(conn) >= 140000)
+			appendPQExpBufferStr(query, " TWO_PHASE");
+
 		if (PQserverVersion(conn) >= 100000)
 			/* pg_recvlogical doesn't use an exported snapshot, so suppress */
 			appendPQExpBufferStr(query, " NOEXPORT_SNAPSHOT");
diff --git a/src/bin/pg_basebackup/streamutil.h b/src/bin/pg_basebackup/streamutil.h
index 10f87ad..504803b 100644
--- a/src/bin/pg_basebackup/streamutil.h
+++ b/src/bin/pg_basebackup/streamutil.h
@@ -34,7 +34,7 @@ extern PGconn *GetConnection(void);
 extern bool CreateReplicationSlot(PGconn *conn, const char *slot_name,
 								  const char *plugin, bool is_temporary,
 								  bool is_physical, bool reserve_wal,
-								  bool slot_exists_ok);
+								  bool slot_exists_ok, bool two_phase);
 extern bool DropReplicationSlot(PGconn *conn, const char *slot_name);
 extern bool RunIdentifySystem(PGconn *conn, char **sysid,
 							  TimeLineID *starttli,
diff --git a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
index 53f4181..bbbf9e2 100644
--- a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
+++ b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use TestLib;
 use PostgresNode;
-use Test::More tests => 15;
+use Test::More tests => 20;
 
 program_help_ok('pg_recvlogical');
 program_version_ok('pg_recvlogical');
@@ -22,6 +22,7 @@ max_replication_slots = 4
 max_wal_senders = 4
 log_min_messages = 'debug1'
 log_error_verbosity = verbose
+max_prepared_transactions = 10
 });
 $node->dump_info;
 $node->start;
@@ -63,3 +64,45 @@ $node->command_ok(
 		'--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'
 	],
 	'replayed a transaction');
+
+$node->command_ok(
+	[
+		'pg_recvlogical',           '-S',
+		'test',                     '-d',
+		$node->connstr('postgres'), '--drop-slot'
+	],
+	'slot dropped');
+
+#test with two-phase option enabled
+$node->command_ok(
+	[
+		'pg_recvlogical',           '-S',
+		'test',                     '-d',
+		$node->connstr('postgres'), '--create-slot', '--two-phase'
+	],
+	'slot with two-phase created');
+
+$slot = $node->slot('test');
+isnt($slot->{'restart_lsn'}, '', 'restart lsn is defined for new slot');
+
+$node->safe_psql('postgres',
+	"BEGIN; INSERT INTO test_table values (11); PREPARE TRANSACTION 'test'");
+$node->safe_psql('postgres',
+	"COMMIT PREPARED 'test'");
+$nextlsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn()');
+chomp($nextlsn);
+
+$node->command_fails(
+	[
+		'pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'),
+		'--start', '--endpos', "$nextlsn", '--two-phase', '--no-loop', '-f', '-'
+	],
+	'incorrect usage');
+
+$node->command_ok(
+	[
+		'pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'),
+		'--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'
+	],
+	'replayed a two-phase transaction');
-- 
1.8.3.1
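
As a rough illustration of what the 0002 patch does on the client side, the sketch below builds the command string for a logical slot and applies the same two rules: TWO_PHASE is appended only when the server is version 14 or later, and --two-phase is accepted only together with --create-slot. The function names are hypothetical, a fixed buffer with snprintf() replaces PQExpBuffer, and the leading CREATE_REPLICATION_SLOT "name" part of the command is an assumption since that line is outside the hunk shown above.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: build the replication command for a logical slot.
 * Returns false if the command does not fit into buf.
 */
static bool
demo_build_create_slot_cmd(char *buf, size_t buflen, const char *slot_name,
                           const char *plugin, int server_version,
                           bool two_phase)
{
    int         n;

    n = snprintf(buf, buflen,
                 "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"%s%s",
                 slot_name, plugin,
                 /* TWO_PHASE is only sent to servers that understand it */
                 (two_phase && server_version >= 140000) ? " TWO_PHASE" : "",
                 /* pg_recvlogical does not use an exported snapshot */
                 (server_version >= 100000) ? " NOEXPORT_SNAPSHOT" : "");

    return n >= 0 && (size_t) n < buflen;
}

/* --two-phase only makes sense when the slot is being created. */
static bool
demo_two_phase_usage_ok(bool two_phase, bool do_create_slot)
{
    return !(two_phase && !do_create_slot);
}
```

Note that, as in the patch as posted, a two-phase request against an older server is silently dropped rather than reported; whether that should be an error instead is a design question this sketch does not try to settle.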

v88-0005-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From f7a087345188fac945d0114b66238f8f84ca9d8d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 21 Jun 2021 16:24:40 +1000
Subject: [PATCH v88] Skip empty transactions for logical replication.

The current logical replication behaviour is to send every transaction to
the subscriber even if the transaction is empty (because it does not
contain changes from the selected publications). It is a waste of CPU
cycles and network bandwidth to build/transmit these empty transactions.

This patch addresses the above problem by postponing the BEGIN / BEGIN
PREPARE message until the first change. While processing a COMMIT or
PREPARE message, if the transaction contains no changes, the COMMIT or
PREPARE message is not sent. As a result, pgoutput skips the BEGIN / COMMIT
or BEGIN PREPARE / PREPARE messages for empty transactions.

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
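
The deferred-BEGIN idea described in the commit message above can be sketched on its own as follows. This is not the pgoutput code: DemoTxnState, DemoOutput and the demo_* functions are hypothetical, and messages are appended to a string buffer purely so the behaviour is easy to observe.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Per-transaction state: has BEGIN been written to the output yet? */
typedef struct DemoTxnState
{
    bool        sent_begin;
} DemoTxnState;

/* Toy output stream; each message is appended followed by ';'. */
typedef struct DemoOutput
{
    char        buf[256];
} DemoOutput;

static void
demo_emit(DemoOutput *out, const char *msg)
{
    size_t      used = strlen(out->buf);

    snprintf(out->buf + used, sizeof(out->buf) - used, "%s;", msg);
}

/* BEGIN callback: remember that nothing has been sent, emit nothing. */
static void
demo_begin(DemoTxnState *txn)
{
    txn->sent_begin = false;
}

/* Change callback: the first published change triggers the delayed BEGIN. */
static void
demo_change(DemoOutput *out, DemoTxnState *txn, const char *change,
            bool published)
{
    if (!published)
        return;                 /* filtered out, transaction stays empty */

    if (!txn->sent_begin)
    {
        demo_emit(out, "BEGIN");
        txn->sent_begin = true;
    }
    demo_emit(out, change);
}

/* COMMIT callback: skip the message if BEGIN was never sent. */
static void
demo_commit(DemoOutput *out, DemoTxnState *txn)
{
    if (!txn->sent_begin)
        return;
    demo_emit(out, "COMMIT");
}
```

A transaction whose changes are all filtered out produces no output at all, while a transaction with at least one published change produces the usual BEGIN / change / COMMIT sequence. The same flag would gate BEGIN PREPARE / PREPARE in the two-phase case.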
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  38 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 158 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  46 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 286 insertions(+), 39 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index ffa5ca7..d499d63 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -879,11 +879,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command> in which case
+      it can commit the transaction, otherwise, it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index ae549db..40d3011 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7552,6 +7552,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7566,6 +7573,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 89d91c2..97ca648 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -934,7 +935,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -969,7 +971,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 48239c0..6cdca07 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2792,7 +2792,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e6f1276..47d5a53 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -990,27 +990,39 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* There is no transaction when COMMIT PREPARED is called */
-	begin_replication_step();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In that case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	end_replication_step();
-	CommitTransactionCommand();
+		/* There is no transaction when COMMIT PREPARED is called */
+		begin_replication_step();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index d5a284d..43679a2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -132,6 +134,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -401,10 +408,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -419,8 +448,22 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
+	txn->output_plugin_private = NULL;
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -432,10 +475,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -450,8 +511,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty prepared transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -462,12 +533,33 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of COMMIT PREPARED of an empty transaction");
+			return;
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -480,8 +572,26 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of ROLLBACK of an empty transaction");
+			return;
+		}
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -630,11 +740,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -668,6 +783,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -770,6 +894,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -777,6 +902,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -813,6 +942,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -833,6 +971,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -845,6 +984,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 7a4804f..2fa60b5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d7c785b..ffc0b56 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -442,7 +442,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 4c372a6..8a33641 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 25;
 
 ###############################
 # Setup
@@ -318,10 +318,9 @@ $node_publisher->safe_psql('postgres', "
 
 $node_publisher->wait_for_catchup($appname_copy);
 
-# Check that the transaction has been prepared on the subscriber, there will be 2
-# prepared transactions for the 2 subscriptions.
+# Check that the transaction has been prepared on the subscriber
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;;");
-is($result, qq(2), 'transaction is prepared on subscriber');
+is($result, qq(1), 'transaction is prepared on subscriber');
 
 # Now commit the insert and verify that it IS replicated
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
@@ -337,6 +336,45 @@ is($result, qq(2), 'replicated data in subscriber table');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+   "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot
+$node_publisher->safe_psql('postgres', "
+   BEGIN;
+   INSERT INTO tab_nopub SELECT generate_series(1,10);
+   PREPARE TRANSACTION 'empty_transaction';
+   COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+   'postgres', qq(
+       SELECT get_byte(data, 0)
+       FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+           'proto_version', '1',
+           'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+   'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cabc0bb..ad62bbe 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1597,6 +1597,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
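
For anyone trying to follow the empty-transaction skipping in the pgoutput.c hunks of the patch above, here is a minimal standalone sketch of the idea: BEGIN is not written in the begin callback, it is written lazily just before the first published change, and COMMIT is skipped if BEGIN never went out. This is only an illustration under stated assumptions. SketchTxn and the sketch_* functions are hypothetical names invented for this note, messages are just counted rather than written to a stream, and none of it is the actual pgoutput code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the per-transaction plugin state
 * (PGOutputTxnData in the patch). Not a PostgreSQL type.
 */
typedef struct SketchTxn
{
	bool		sent_begin;		/* has BEGIN been emitted yet? */
	int			messages_sent;	/* messages written downstream so far */
} SketchTxn;

/* BEGIN callback: only reset the state, write nothing. */
void
sketch_begin(SketchTxn *txn)
{
	assert(txn != NULL);
	txn->sent_begin = false;
	txn->messages_sent = 0;
}

/* Change callback: emit the postponed BEGIN before the first published change. */
void
sketch_change(SketchTxn *txn, bool relation_is_published)
{
	assert(txn != NULL);

	if (!relation_is_published)
		return;					/* filtered out, still nothing to send */

	if (!txn->sent_begin)
	{
		txn->messages_sent++;	/* the deferred BEGIN */
		txn->sent_begin = true;
	}
	txn->messages_sent++;		/* the change itself */
}

/* COMMIT callback: returns true if COMMIT was written, false if skipped. */
bool
sketch_commit(SketchTxn *txn)
{
	assert(txn != NULL);

	if (!txn->sent_begin)
		return false;			/* empty transaction: skip COMMIT as well */

	txn->messages_sent++;		/* the COMMIT */
	return true;
}
```

With this sketch, a transaction that only touches unpublished relations writes nothing at all, while one with two published changes writes four messages (BEGIN, two changes, COMMIT). The prepared-transaction path in the patch follows the same pattern with BEGIN PREPARE and PREPARE in place of BEGIN and COMMIT.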

v88-0004-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 03d7957862a30a8fed618be84ae55ada3da18a8e Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 21 Jun 2021 16:10:11 +1000
Subject: [PATCH v88] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  10 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 138 ++++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1021 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
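
As a reading aid for the Stream Prepare message that this patch documents in protocol.sgml (Byte1('p'), an Int8 flags byte that must be 0, three Int64 fields for the prepare LSN, the end LSN and the prepare timestamp, an Int32 xid, then the GID as a NUL-terminated string), here is a rough standalone sketch of encoding and decoding that layout. The integers are written in network byte order, which is what pq_sendint64 and pq_sendint32 produce. Everything named sketch_* or Sketch* is hypothetical and made up for this note, SKETCH_GIDSIZE is an assumed buffer size, and the leading 'p' byte is handled here in the reader for simplicity, whereas in the patch the message type byte is consumed before logicalrep_read_stream_prepare is called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumption: illustrative GID buffer size */

/* Hypothetical mirror of the fields carried by a Stream Prepare message. */
typedef struct SketchStreamPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} SketchStreamPrepare;

/* Store the low nbytes of v in big-endian (network) order. */
static size_t
put_be(unsigned char *buf, uint64_t v, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		buf[i] = (unsigned char) (v >> (8 * (nbytes - 1 - i)));
	return nbytes;
}

/* Load an nbytes-wide big-endian integer. */
static uint64_t
get_be(const unsigned char *buf, size_t nbytes)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
size_t
sketch_write_stream_prepare(unsigned char *buf, size_t buflen,
							const SketchStreamPrepare *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (buflen < 1 + 1 + 8 + 8 + 8 + 4 + gidlen)
		return 0;

	buf[off++] = 'p';			/* message type */
	buf[off++] = 0;				/* flags, currently unused */
	off += put_be(buf + off, m->prepare_lsn, 8);
	off += put_be(buf + off, m->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += put_be(buf + off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Returns 0 on success, -1 if the message is malformed. */
int
sketch_read_stream_prepare(const unsigned char *buf, size_t len,
						   SketchStreamPrepare *m)
{
	const unsigned char *nul;
	size_t		gidlen;
	size_t		off = 0;

	if (len < 1 + 1 + 8 + 8 + 8 + 4 + 1)
		return -1;
	if (buf[off++] != 'p' || buf[off++] != 0)
		return -1;

	m->prepare_lsn = get_be(buf + off, 8);
	off += 8;
	m->end_lsn = get_be(buf + off, 8);
	off += 8;
	m->prepare_time = (int64_t) get_be(buf + off, 8);
	off += 8;
	m->xid = (uint32_t) get_be(buf + off, 4);
	off += 4;

	/* The GID must be NUL-terminated within the message and fit the buffer. */
	nul = memchr(buf + off, 0, len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off)) + 1;
	if (gidlen > SKETCH_GIDSIZE)
		return -1;
	memcpy(m->gid, buf + off, gidlen);
	return 0;
}
```

One deliberate difference from the patch: the reader here checks the GID length before copying, whereas logicalrep_read_stream_prepare copies with strcpy into the pre-allocated gid buffer. That is a choice made for the sketch, not a statement about what the patch should do.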

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 94e60e0..ae549db 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2895,7 +2895,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7400,7 +7400,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7663,6 +7663,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 3bcef78..4238baa 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 08d0295..7df8742 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -926,12 +911,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 31bce6b..e6f1276 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1065,6 +1065,90 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to deadlock
+	 * when multiple subscriptions from the same node point to publications on
+	 * the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1280,30 +1364,20 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	if (in_streamed_transaction)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROTOCOL_VIOLATION),
-				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
 	/* Make sure we have an open transaction */
 	begin_replication_step();
 
@@ -1314,7 +1388,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -1335,7 +1409,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1414,6 +1488,32 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2359,6 +2459,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7a1d42a..d5a284d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index e20f2da..7a4804f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
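
The expected result 3334 used throughout the TAP tests above can be checked by hand. The following is only a sketch of that arithmetic, modelling test_tab as a set of primary-key values; the function name rows_after_2pc is made up for illustration and is not part of the patch.

```python
# Model test_tab as the set of values in its primary-key column "a".
# rows_after_2pc is a hypothetical helper, not something from the patch.
def rows_after_2pc(initial_keys, first=3, last=5000):
    keys = set(initial_keys)
    # INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000)
    keys.update(range(first, last + 1))
    # UPDATE ... WHERE mod(a,2) = 0 only rewrites column b; no rows are added or removed.
    # DELETE FROM test_tab WHERE mod(a,3) = 0
    return {a for a in keys if a % 3 != 0}


remaining = rows_after_2pc({1, 2})
print(len(remaining))
```

Starting from the two pre-existing rows, the insert brings the table to 5000 rows and the delete removes the 1666 multiples of 3, which leaves the 3334 rows the tests expect once COMMIT PREPARED has been replicated.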

Attachment: v88-0003-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 59cf8da510c49f336c2c28465967fc77489f9f56 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 21 Jun 2021 15:02:42 +1000
Subject: [PATCH v88] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
transactions. We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. The patch itself
does not use this option, but it allows external replication solutions that
use the streaming replication protocol to enable it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the following operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* Prepare API for in-progress transactions.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
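
For anyone writing a client against protocol version 3, here is a rough sketch of how the Begin Prepare message documented in the protocol.sgml hunk of this patch could be decoded. It follows the field order given there (Byte1('b'), Int64 prepare LSN, Int64 end LSN, Int64 prepare timestamp, Int32 xid, String GID). The names parse_begin_prepare and format_lsn are invented for this example, and decoding the GID as UTF-8 is an assumption; the patch does not state an encoding.

```python
import struct
from datetime import datetime, timedelta, timezone

# Timestamps are microseconds since the PostgreSQL epoch (2000-01-01).
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

# Fixed-size prefix in network byte order:
# Byte1 tag, Int64 prepare LSN, Int64 end LSN, Int64 timestamp, Int32 xid.
_FIXED = struct.Struct("!cQQqI")


def format_lsn(lsn: int) -> str:
    """Render a 64-bit LSN in the usual hi/lo hexadecimal form."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def parse_begin_prepare(payload: bytes) -> dict:
    """Decode one Begin Prepare message (hypothetical helper, not from the patch)."""
    tag, prepare_lsn, end_lsn, ts_us, xid = _FIXED.unpack_from(payload, 0)
    if tag != b"b":
        raise ValueError(f"not a Begin Prepare message: tag {tag!r}")
    # String fields are null-terminated; UTF-8 is assumed here.
    gid_end = payload.index(b"\x00", _FIXED.size)
    gid = payload[_FIXED.size:gid_end].decode("utf-8")
    return {
        "prepare_lsn": format_lsn(prepare_lsn),
        "end_lsn": format_lsn(end_lsn),
        "prepare_time": PG_EPOCH + timedelta(microseconds=ts_us),
        "xid": xid,
        "gid": gid,
    }
```

Prepare, Commit Prepared and Rollback Prepared differ in that they carry an Int8 flags field right after the tag byte, and Rollback Prepared carries two timestamps, so each would need its own struct layout.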
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 305 ++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 148 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  31 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 +++++++++--
 src/backend/replication/logical/worker.c           | 358 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |   8 +-
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  17 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 43 files changed, 2438 insertions(+), 202 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d..0235639 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7643,6 +7643,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 205fbd2..94e60e0 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1970,6 +1970,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2811,11 +2825,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2871,10 +2891,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7379,6 +7400,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepared transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..3bcef78 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. Otherwise, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that replication has
+          successfully passed the initial table synchronization phase. This means
+          that even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          for the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
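
The "pending" versus "enabled" behaviour documented above can be sketched as a small decision function. This is only an illustration that mirrors the checks in the subscriptioncmds.c hunks of this patch, assuming the slot is created at CREATE SUBSCRIPTION time. The enum and the function names are hypothetical stand-ins, not the LOGICALREP_TWOPHASE_STATE_* definitions themselves.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum
{
	TWOPHASE_DISABLED,
	TWOPHASE_PENDING,
	TWOPHASE_ENABLED
} TwoPhaseStateSketch;

/*
 * State recorded by CREATE SUBSCRIPTION (assumption: a slot is created).
 * two_phase only becomes ENABLED up-front when there are tables and no
 * initial copy is requested; otherwise it waits in PENDING.
 */
TwoPhaseStateSketch
initial_twophase_state(bool two_phase, bool copy_data, bool has_tables)
{
	if (!two_phase)
		return TWOPHASE_DISABLED;
	if (!copy_data && has_tables)
		return TWOPHASE_ENABLED;
	return TWOPHASE_PENDING;
}

/*
 * ALTER SUBSCRIPTION ... REFRESH is rejected only when two_phase is
 * already ENABLED and the refresh would copy data.
 */
bool
refresh_allowed(TwoPhaseStateSketch state, bool copy_data)
{
	assert(state >= TWOPHASE_DISABLED && state <= TWOPHASE_ENABLED);
	return !(state == TWOPHASE_ENABLED && copy_data);
}
```

For example, a subscription created with two_phase = true and copy_data = true starts out PENDING, so a later refresh with copy_data is still permitted until the state flips to ENABLED.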
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..76eba34 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match origin_lsn and origin_timestamp
+ * of the prepared xact, to avoid matching a prepared xact that came from a
+ * different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
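
To illustrate why LookupGXact matches on more than the GID, here is a standalone sketch of the same comparison over an in-memory array. The struct and function names are hypothetical, and the real function reads origin_lsn and origin_timestamp from the 2PC file header rather than from an array like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of one prepared transaction. */
typedef struct
{
	bool		valid;				/* entry is fully set up */
	const char *gid;				/* global transaction identifier */
	uint64_t	origin_lsn;			/* end LSN of the remote PREPARE */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
} PreparedXactSketch;

/*
 * Return true only if a valid entry matches on GID, prepare end LSN and
 * prepare timestamp. A GID match alone is not treated as a hit, because
 * a locally prepared xact may happen to use the same GID.
 */
bool
lookup_prepared(const PreparedXactSketch *xacts, size_t n,
				const char *gid, uint64_t prepare_end_lsn,
				int64_t prepare_timestamp)
{
	assert(gid != NULL);

	for (size_t i = 0; i < n; i++)
	{
		const PreparedXactSketch *x = &xacts[i];

		/* Skip not-yet-valid entries and entries with a different GID. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == prepare_timestamp)
			return true;
	}
	return false;
}
```

A caller on the apply side would use the result to decide whether the prepared transaction named in an incoming message already exists locally.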
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function when only a true/false answer is needed and there is no
+ * need for the List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 75e195f..08d0295 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -506,10 +557,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. It will be enabled once all the tables
+				 * are synced and ready. This avoids race conditions such as
+				 * prepared transactions being skipped because their changes
+				 * are not applied by should_apply_changes_for_rel() while
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -816,7 +891,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -850,6 +926,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -873,7 +955,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -918,7 +1001,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -934,6 +1018,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details of why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -965,7 +1060,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -982,6 +1078,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details of why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1011,7 +1118,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eaa84a..2838b89 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -436,6 +437,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The below code is a temporary hack to check for
+		 * server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -851,7 +865,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -868,6 +882,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
+		if (two_phase)
+			appendStringInfoString(&cmd, " TWO_PHASE");
+
 		switch (snapshot_action)
 		{
 			case CRS_EXPORT_SNAPSHOT:
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 453efc5..74df75e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing preparing
+				 * transactions that have locked [user] catalog tables
+				 * exclusively but as of now we ask users not to do such an
+				 * operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..89d91c2 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,12 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= slot->data.two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +540,22 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +616,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index cb42fcb..2c191de 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
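
For readers following the wire format, here is a rough standalone sketch of the BEGIN PREPARE layout that logicalrep_write_begin_prepare produces: a tag byte, three 64-bit fields, a 32-bit xid, then the NUL-terminated GID. It assumes the pq_sendint* helpers emit network byte order. The tag value 'b' and all names below are assumptions for illustration, not taken from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed tag byte; the real LOGICAL_REP_MSG_BEGIN_PREPARE may differ. */
#define MSG_BEGIN_PREPARE_SKETCH 'b'

/* Append a 64-bit value in big-endian order; returns the new offset. */
size_t
put_u64(uint8_t *buf, size_t off, uint64_t v)
{
	for (int i = 7; i >= 0; i--)
		buf[off++] = (uint8_t) (v >> (8 * i));
	return off;
}

/* Append a 32-bit value in big-endian order; returns the new offset. */
size_t
put_u32(uint8_t *buf, size_t off, uint32_t v)
{
	for (int i = 3; i >= 0; i--)
		buf[off++] = (uint8_t) (v >> (8 * i));
	return off;
}

/*
 * Encode tag, prepare_lsn, end_lsn, prepare_time, xid and gid.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
size_t
encode_begin_prepare(uint8_t *buf, size_t cap,
					 uint64_t prepare_lsn, uint64_t end_lsn,
					 int64_t prepare_time, uint32_t xid, const char *gid)
{
	size_t		gidlen;
	size_t		off = 0;

	assert(buf != NULL && gid != NULL);
	gidlen = strlen(gid) + 1;	/* include the terminating NUL */

	if (1 + 8 + 8 + 8 + 4 + gidlen > cap)
		return 0;

	buf[off++] = MSG_BEGIN_PREPARE_SKETCH;
	off = put_u64(buf, off, prepare_lsn);
	off = put_u64(buf, off, end_lsn);
	off = put_u64(buf, off, (uint64_t) prepare_time);
	off = put_u32(buf, off, xid);
	memcpy(buf + off, gid, gidlen);
	return off + gidlen;
}
```

The reader side consumes the fields in the same order, which is why logicalrep_read_begin_prepare reads prepare_lsn, end_lsn, prepare_time and xid before the GID.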
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 19e96f3..48239c0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2574,7 +2574,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2665,7 +2665,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2712,7 +2712,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2732,7 +2732,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2751,19 +2751,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2781,12 +2782,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..2500954 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index cc50eb8..b923c95 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1071,7 +1067,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1158,3 +1155,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index bbb659d..31bce6b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because, say, it
+ * was prior to the initial consistent point, but it might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares, but we ensure that such prepares are sent along with commit
+ * prepared; see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ... REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider: initially it was on and we
+ * received some prepare, then we turn it off; now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we can allow changing this option from off to on but not sure if
+ * that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -804,6 +880,191 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a BEGIN PREPARE message")));
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (prepare_data.prepare_lsn != remote_final_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+								 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+								 LSN_FORMAT_ARGS(remote_final_lsn))));
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * multiple subscriptions on the same node point to publications on the
+	 * same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	begin_replication_step();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* There is no transaction when COMMIT PREPARED is called */
+	begin_replication_step();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
+		begin_replication_step();
+		FinishPreparedTransaction(gid, false);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2082,6 +2343,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2561,6 +2838,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3062,6 +3342,24 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+
+	if (!TransactionIdIsValid(xid))
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("invalid two-phase transaction ID")));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3230,15 +3528,69 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 63f108f..7a1d42a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option was passed in by
+		 * the plugin and decide whether to enable it at a later point. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -839,18 +948,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1269,3 +1368,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eead144..0910546 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -103,7 +103,6 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
-%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -243,7 +242,7 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
 			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
@@ -372,11 +371,6 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
-opt_two_phase:
-			K_TWO_PHASE						{ $$ = true; }
-			| /* EMPTY */					{ $$ = false; }
-			;
-
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8c18b4e..33b85d8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -283,6 +283,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index faeea9f..9f0b13f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -370,7 +370,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 8f53cc7..8141311 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4304,6 +4305,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4347,9 +4349,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The code below is a temporary hack that checks for
+	 * server version 140000, even though this two-phase feature did not
+	 * make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4370,6 +4388,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4395,6 +4414,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4422,6 +4443,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4463,6 +4485,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..6caa701 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,7 +6415,9 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary and streaming are only supported in v14 and higher.
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
@@ -6423,6 +6425,17 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/*
+		 * Two_phase is only supported in v15 and higher.
+		 *
+		 * FIXME: When PG15 development starts, change the following
+		 * 140000 to 150000
+		 */
+		if (pset.sversion >= 140000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 38af568..8f0a921 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2760,7 +2760,7 @@ psql_completion(const char *text, int start, int end)
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
 					  "enabled", "slot_name", "streaming",
-					  "synchronous_commit");
+					  "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0060ebf..e84353e 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -94,6 +104,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index faa3a25..ebc43a0 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
+	bool		two_phase;
 	List	   *options;
 } CreateReplicationSlotCmd;
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..0b071a6 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is the two-phase option given by the output plugin?
+	 *
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_REPLICATION command. We can't rely solely on the
+	 * twophase flag, which only tells whether the plugin provided all the
+	 * necessary two-phase callbacks.
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..e20f2da 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * (prepare_end_lsn and prepare_time) is used to check whether the downstream
+ * has received this prepared transaction. If it has, it can apply the
+ * rollback; otherwise, it can skip the rollback operation. The gid alone is
+ * not sufficient because the downstream node can have a prepared transaction
+ * with the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d8..d7c785b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -297,7 +297,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -636,7 +640,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 2eb7e3a..34d95ea 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -84,11 +84,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+# Create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data was not copied to subscriber table');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber. There will be
+# 2 prepared transactions, one for each of the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..cabc0bb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,12 +1388,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

#359Greg Nancarrow
gregn4422@gmail.com
In reply to: Peter Smith (#358)

On Mon, Jun 21, 2021 at 4:37 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v88*

Some minor comments:

(1)
v88-0002

doc/src/sgml/logicaldecoding.sgml

"examples shows" is not correct.
I think there is only ONE example being referred to.

BEFORE:
+    The following examples shows how logical decoding is controlled over the
AFTER:
+    The following example shows how logical decoding is controlled over the

(2)
v88 - 0003

doc/src/sgml/ref/create_subscription.sgml

(i)

BEFORE:
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on publisher is decoded as a normal transaction at commit.
AFTER:
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on the publisher is decoded as a normal transaction at commit time.

(ii)

src/backend/access/transam/twophase.c

The double-bracketing is unnecessary:

BEFORE:
+ if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
AFTER:
+ if (gxact->valid && strcmp(gxact->gid, gid) == 0)

(iii)

src/backend/replication/logical/snapbuild.c

Need to add some commas to make the following easier to read, and
change "needs" to "need":

BEFORE:
+ * The prepared transactions that were skipped because previously
+ * two-phase was not enabled or are not covered by initial snapshot needs
+ * to be sent later along with commit prepared and they must be before
+ * this point.
AFTER:
+ * The prepared transactions, that were skipped because previously
+ * two-phase was not enabled or are not covered by initial snapshot, need
+ * to be sent later along with commit prepared and they must be before
+ * this point.

(iv)

src/backend/replication/logical/tablesync.c

I think the convention used in Postgres code is to check for empty
Lists using "== NIL" and non-empty Lists using "!= NIL".

BEFORE:
+ if (table_states_not_ready && !last_start_times)
AFTER:
+ if (table_states_not_ready != NIL && !last_start_times)
BEFORE:
+ else if (!table_states_not_ready && last_start_times)
AFTER:
+ else if (table_states_not_ready == NIL && last_start_times)
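
To illustrate the convention, here is a minimal standalone sketch. The List struct and NIL macro below are simplified stand-ins (in PostgreSQL, NIL is defined as ((List *) NULL) in pg_list.h), and should_start_tracking / should_stop_tracking are hypothetical names for the two conditions quoted above. Both spellings test the same thing, so the suggestion is only about readability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified stand-ins for PostgreSQL's List and NIL (assumption: the real
 * definitions live in nodes/pg_list.h, where NIL is ((List *) NULL)).
 */
typedef struct List
{
	int			length;
} List;

#define NIL ((List *) NULL)

/* Hypothetical helper: the list is non-empty and the other pointer is unset. */
static bool
should_start_tracking(const List *table_states_not_ready,
					  const void *last_start_times)
{
	/* An empty list is always NIL, so a non-NIL list has elements. */
	assert(table_states_not_ready == NIL || table_states_not_ready->length > 0);
	return table_states_not_ready != NIL && !last_start_times;
}

/* Hypothetical helper: the list is empty but the other pointer is still set. */
static bool
should_stop_tracking(const List *table_states_not_ready,
					 const void *last_start_times)
{
	assert(table_states_not_ready == NIL || table_states_not_ready->length > 0);
	return table_states_not_ready == NIL && last_start_times;
}
```

Comparing against NIL makes it obvious at the call site that the variable is a List, while the plain truth test is kept for the ordinary pointer.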

Regards,
Greg Nancarrow
Fujitsu Australia

#360Ajin Cherian
itsajin@gmail.com
In reply to: Greg Nancarrow (#359)
5 attachment(s)

On Tue, Jun 22, 2021 at 3:36 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

Some minor comments:

(1)
v88-0002

doc/src/sgml/logicaldecoding.sgml

"examples shows" is not correct.
I think there is only ONE example being referred to.

BEFORE:
+    The following examples shows how logical decoding is controlled over the
AFTER:
+    The following example shows how logical decoding is controlled over the

fixed.

(2)
v88 - 0003

doc/src/sgml/ref/create_subscription.sgml

(i)

BEFORE:
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on publisher is decoded as a normal transaction at commit.
AFTER:
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on the publisher is decoded as a normal transaction at commit time.

fixed.

(ii)

src/backend/access/transam/twophase.c

The double-bracketing is unnecessary:

BEFORE:
+ if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
AFTER:
+ if (gxact->valid && strcmp(gxact->gid, gid) == 0)

fixed.

(iii)

src/backend/replication/logical/snapbuild.c

Need to add some commas to make the following easier to read, and
change "needs" to "need":

BEFORE:
+ * The prepared transactions that were skipped because previously
+ * two-phase was not enabled or are not covered by initial snapshot needs
+ * to be sent later along with commit prepared and they must be before
+ * this point.
AFTER:
+ * The prepared transactions, that were skipped because previously
+ * two-phase was not enabled or are not covered by initial snapshot, need
+ * to be sent later along with commit prepared and they must be before
+ * this point.

fixed.

(iv)

src/backend/replication/logical/tablesync.c

I think the convention used in Postgres code is to check for empty
Lists using "== NIL" and non-empty Lists using "!= NIL".

BEFORE:
+ if (table_states_not_ready && !last_start_times)
AFTER:
+ if (table_states_not_ready != NIL && !last_start_times)
BEFORE:
+ else if (!table_states_not_ready && last_start_times)
AFTER:
+ else if (table_states_not_ready == NIL && last_start_times)

fixed.

Also fixed comments from Vignesh:

1) This content is present in both
v87-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch and
v87-0003-Add-support-for-prepared-transactions-to-built-i.patch; it
can be removed from one of them:
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this logical replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>

I don't see this duplicate content.

2) This change is not required, it can be removed:
<sect1 id="logicaldecoding-example">
<title>Logical Decoding Examples</title>
-
<para>
The following example demonstrates controlling logical decoding using the
SQL interface.

fixed this.

3) We could add a comment labelling the existing example as example 1
and the newly added one as example 2, each with a short description;
that will clearly mark the examples.

added this.

5) This should come before verbose; the options are documented alphabetically.

fixed this.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v89-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch (application/octet-stream)
From ed58853986db2efdde1d7b767522f258504289e1 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 22 Jun 2021 08:05:23 -0400
Subject: [PATCH v89] Add option to set two-phase in CREATE_REPLICATION_SLOT
 command.

CREATE_REPLICATION_SLOT is modified to support two-phase decoding in the slot.
This will allow the decoding of commands like PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED for slots created with this option.
---
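
Note for reviewers: the option check added to parseCreateReplSlotOptions() can be read as the small standalone sketch below. Everything in it is a stand-in for illustration (SlotKind, ParseResult and parse_slot_options are hypothetical names, and only the two_phase option is modelled); the real function walks a list of DefElem nodes and reports errors through ereport/elog rather than returning a code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the walsender types (assumptions, not the real definitions). */
typedef enum
{
	KIND_PHYSICAL,
	KIND_LOGICAL
} SlotKind;

typedef enum
{
	PARSE_OK,
	PARSE_CONFLICT,				/* "conflicting or redundant options" */
	PARSE_UNRECOGNIZED			/* "unrecognized option" */
} ParseResult;

/* Hypothetical simplified parser that only knows the two_phase option. */
static ParseResult
parse_slot_options(SlotKind kind, const char *const *opts, int nopts,
				   bool *two_phase)
{
	bool		two_phase_given = false;

	assert(two_phase != NULL);
	*two_phase = false;

	for (int i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i], "two_phase") == 0)
		{
			/* Reject a repeated option, or two_phase on a physical slot. */
			if (two_phase_given || kind != KIND_LOGICAL)
				return PARSE_CONFLICT;
			two_phase_given = true;
			*two_phase = true;
		}
		else
			return PARSE_UNRECOGNIZED;
	}
	return PARSE_OK;
}
```

The point of the sketch is that a single error path covers both misuse cases: giving TWO_PHASE twice, and giving it for a slot that is not LOGICAL.
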
 doc/src/sgml/protocol.sgml             | 16 +++++++++++++++-
 src/backend/replication/repl_gram.y    | 12 ++++++++++++
 src/backend/replication/repl_scanner.l |  1 +
 src/backend/replication/walsender.c    | 18 +++++++++++++++---
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index bc2a2fe..205fbd2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> | <literal>TWO_PHASE</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this logical replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..eead144 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -283,6 +285,11 @@ create_slot_opt:
 				  $$ = makeDefElem("reserve_wal",
 								   (Node *)makeInteger(true), -1);
 				}
+			| K_TWO_PHASE
+				{
+				  $$ = makeDefElem("two_phase",
+								   (Node *)makeInteger(true), -1);
+				}
 			;
 
 /* DROP_REPLICATION_SLOT slot */
@@ -365,6 +372,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3224536..92c755f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -863,11 +863,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 static void
 parseCreateReplSlotOptions(CreateReplicationSlotCmd *cmd,
 						   bool *reserve_wal,
-						   CRSSnapshotAction *snapshot_action)
+						   CRSSnapshotAction *snapshot_action,
+						   bool *two_phase)
 {
 	ListCell   *lc;
 	bool		snapshot_action_given = false;
 	bool		reserve_wal_given = false;
+	bool		two_phase_given = false;
 
 	/* Parse options */
 	foreach(lc, cmd->options)
@@ -905,6 +907,15 @@ parseCreateReplSlotOptions(CreateReplicationSlotCmd *cmd,
 			reserve_wal_given = true;
 			*reserve_wal = true;
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_given || cmd->kind != REPLICATION_KIND_LOGICAL)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_given = true;
+			*two_phase = true;
+		}
 		else
 			elog(ERROR, "unrecognized option: %s", defel->defname);
 	}
@@ -920,6 +931,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 	char		xloc[MAXFNAMELEN];
 	char	   *slot_name;
 	bool		reserve_wal = false;
+	bool		two_phase = false;
 	CRSSnapshotAction snapshot_action = CRS_EXPORT_SNAPSHOT;
 	DestReceiver *dest;
 	TupOutputState *tstate;
@@ -929,7 +941,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	Assert(!MyReplicationSlot);
 
-	parseCreateReplSlotOptions(cmd, &reserve_wal, &snapshot_action);
+	parseCreateReplSlotOptions(cmd, &reserve_wal, &snapshot_action, &two_phase);
 
 	/* setup state for WalSndSegmentOpen */
 	sendTimeLineIsHistoric = false;
@@ -954,7 +966,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 */
 		ReplicationSlotCreate(cmd->slotname, true,
 							  cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL,
-							  false);
+							  two_phase);
 	}
 
 	if (cmd->kind == REPLICATION_KIND_LOGICAL)
-- 
1.8.3.1

v89-0005-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From 1d508fadc8b64c66b64b6bdf0cb1e6c65d72a5d5 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 22 Jun 2021 23:03:02 -0400
Subject: [PATCH v89] Skip empty transactions for logical replication.

The current logical replication behaviour is to send every transaction to
the subscriber even if the transaction is empty (because it does not
contain changes from the selected publications). It is a waste of CPU
cycles and network bandwidth to build/transmit these empty transactions.

This patch addresses the above problem by postponing the BEGIN / BEGIN PREPARE
message until the first change. While processing a COMMIT or PREPARE message,
if the transaction has no changes, the COMMIT or PREPARE message is not sent.
This means that pgoutput will skip the BEGIN / COMMIT or BEGIN PREPARE / PREPARE
messages for transactions that are empty.

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
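
Note for reviewers: the deferred-BEGIN idea described above can be sketched in isolation as follows. This is only an illustration under stated assumptions: TxnState, on_begin, on_change and on_commit are hypothetical names, a counter stands in for writing messages to the output stream, and the "published" flag stands in for the publication filtering that pgoutput performs on each change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-transaction state, playing the role of PGOutputTxnData (stand-in). */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been emitted yet? */
	int			messages_sent;	/* stand-in for writes to the output stream */
} TxnState;

/* BEGIN callback: remember the transaction, but send nothing yet. */
static void
on_begin(TxnState *txn)
{
	assert(txn != NULL);
	txn->sent_begin = false;
	txn->messages_sent = 0;
}

/* Change callback: emit the postponed BEGIN just before the first change. */
static void
on_change(TxnState *txn, bool published)
{
	assert(txn != NULL);
	if (!published)
		return;					/* filtered out, nothing to send */
	if (!txn->sent_begin)
	{
		txn->messages_sent++;	/* the deferred BEGIN */
		txn->sent_begin = true;
	}
	txn->messages_sent++;		/* the change itself */
}

/* COMMIT callback: returns true only if a COMMIT message was sent. */
static bool
on_commit(TxnState *txn)
{
	assert(txn != NULL);
	if (!txn->sent_begin)
		return false;			/* empty transaction: skip COMMIT as well */
	txn->messages_sent++;
	return true;
}
```

With this shape, a transaction whose changes are all filtered out produces no messages at all, while a transaction with two published changes produces four (BEGIN, two changes, COMMIT).
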
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  38 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 158 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  46 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 286 insertions(+), 39 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 70869a6..74df301 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -881,11 +881,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command> in which case
+      it can commit the transaction, otherwise, it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index ae549db..40d3011 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7552,6 +7552,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7566,6 +7573,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 89d91c2..97ca648 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -934,7 +935,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -969,7 +971,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 48239c0..6cdca07 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2792,7 +2792,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e6f1276..47d5a53 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -990,27 +990,39 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* There is no transaction when COMMIT PREPARED is called */
-	begin_replication_step();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In which case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	end_replication_step();
-	CommitTransactionCommand();
+		/* There is no transaction when COMMIT PREPARED is called */
+		begin_replication_step();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index d5a284d..43679a2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -132,6 +134,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -401,10 +408,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -419,8 +448,22 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
+	txn->output_plugin_private = NULL;
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -432,10 +475,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -450,8 +511,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty prepared transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -462,12 +533,33 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of COMMIT PREPARED of an empty transaction");
+			return;
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -480,8 +572,26 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of ROLLBACK PREPARED of an empty transaction");
+			return;
+		}
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -630,11 +740,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -668,6 +783,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -770,6 +894,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -777,6 +902,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -813,6 +942,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -833,6 +971,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -845,6 +984,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 7a4804f..2fa60b5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d7c785b..ffc0b56 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -442,7 +442,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 4c372a6..8a33641 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 25;
 
 ###############################
 # Setup
@@ -318,10 +318,9 @@ $node_publisher->safe_psql('postgres', "
 
 $node_publisher->wait_for_catchup($appname_copy);
 
-# Check that the transaction has been prepared on the subscriber, there will be 2
-# prepared transactions for the 2 subscriptions.
+# Check that the transaction has been prepared on the subscriber
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;;");
-is($result, qq(2), 'transaction is prepared on subscriber');
+is($result, qq(1), 'transaction is prepared on subscriber');
 
 # Now commit the insert and verify that it IS replicated
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
@@ -337,6 +336,45 @@ is($result, qq(2), 'replicated data in subscriber table');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+   "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot
+$node_publisher->safe_psql('postgres', "
+   BEGIN;
+   INSERT INTO tab_nopub SELECT generate_series(1,10);
+   PREPARE TRANSACTION 'empty_transaction';
+   COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+   'postgres', qq(
+       SELECT get_byte(data, 0)
+       FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+           'proto_version', '1',
+           'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+   'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cabc0bb..ad62bbe 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1597,6 +1597,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
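
The deferred-BEGIN idea in the patch above can be illustrated outside the server. The following is only a minimal sketch, not the pgoutput code: `TxnState`, `Sink`, `emit`, `txn_begin`, `txn_change` and `txn_commit` are hypothetical stand-ins for `PGOutputTxnData`, the output stream, and the begin/change/commit callbacks. It assumes a single non-streamed transaction at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for PGOutputTxnData: per-transaction plugin state. */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been written yet? */
} TxnState;

/* Hypothetical stand-in for the output stream: counts messages written. */
typedef struct Sink
{
	int			nmessages;
} Sink;

static void
emit(Sink *sink, const char *msg)
{
	sink->nmessages++;
	printf("%s\n", msg);
}

/* BEGIN callback: allocate zeroed state, but write nothing yet. */
static TxnState *
txn_begin(void)
{
	TxnState   *txn = calloc(1, sizeof(TxnState));

	if (txn == NULL)
		abort();
	return txn;
}

/*
 * Change callback: unpublished changes are dropped; BEGIN is written
 * lazily, just before the first published change.
 */
static void
txn_change(Sink *sink, TxnState *txn, bool published, const char *change)
{
	assert(txn != NULL);

	if (!published)
		return;

	if (!txn->sent_begin)
	{
		emit(sink, "BEGIN");
		txn->sent_begin = true;
	}
	emit(sink, change);
}

/*
 * Commit callback: free the state, and skip COMMIT if BEGIN was never
 * written.  Returns true if COMMIT was sent.
 */
static bool
txn_commit(Sink *sink, TxnState *txn)
{
	bool		skip;

	assert(txn != NULL);
	skip = !txn->sent_begin;
	free(txn);

	if (skip)
		return false;

	emit(sink, "COMMIT");
	return true;
}
```

With this shape, a transaction that only touches unpublished tables writes no messages at all, while a transaction with one published change writes BEGIN, the change, and COMMIT. Reading the flag before freeing the state mirrors the ordering used in the commit callback of the patch.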

Attachment: v89-0004-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 87450f22d7ab6f666e6cb5975cd3a996d70fe0e9 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 22 Jun 2021 22:17:39 -0400
Subject: [PATCH v89] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
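
For readers who want to see the Stream Prepare message that this patch documents in protocol.sgml as bytes on the wire, here is a rough standalone sketch. It is not the proto.c code: `StreamPrepareMsg`, `sp_encode`, `sp_decode`, `put_be`, `get_be` and `SP_GIDSIZE` are hypothetical names, and `SP_GIDSIZE` is assumed to mirror PostgreSQL's `GIDSIZE` of 200. The field order (type byte 'p', flags, prepare LSN, end LSN, prepare time, xid, NUL-terminated GID) follows the patch; integers are written in network byte order, as the pq_sendint helpers do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SP_GIDSIZE 200			/* assumption: mirrors PostgreSQL's GIDSIZE */
#define SP_FIXED_LEN (1 + 1 + 8 + 8 + 8 + 4)	/* bytes before the GID */

/* Hypothetical in-memory form of a Stream Prepare message. */
typedef struct StreamPrepareMsg
{
	uint8_t		flags;			/* currently must be 0 */
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	char		gid[SP_GIDSIZE];
} StreamPrepareMsg;

/* Write the low nbytes of v in big-endian order; returns nbytes. */
static size_t
put_be(uint8_t *buf, uint64_t v, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return nbytes;
}

/* Read nbytes of big-endian data. */
static uint64_t
get_be(const uint8_t *buf, size_t nbytes)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Encode m into buf; returns bytes written, or 0 if buf is too small. */
static size_t
sp_encode(const StreamPrepareMsg *m, uint8_t *buf, size_t buflen)
{
	size_t		gidlen;
	size_t		off = 0;

	assert(m != NULL && buf != NULL);
	gidlen = strlen(m->gid) + 1;	/* include the terminating NUL */
	if (buflen < SP_FIXED_LEN + gidlen)
		return 0;

	buf[off++] = 'p';			/* message type */
	buf[off++] = m->flags;
	off += put_be(buf + off, m->prepare_lsn, 8);
	off += put_be(buf + off, m->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += put_be(buf + off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Decode buf into m; returns 0 on success, -1 if the message is malformed. */
static int
sp_decode(const uint8_t *buf, size_t len, StreamPrepareMsg *m)
{
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	assert(m != NULL && buf != NULL);
	if (len <= SP_FIXED_LEN || buf[0] != 'p' || buf[1] != 0)
		return -1;

	m->flags = buf[1];
	m->prepare_lsn = get_be(buf + 2, 8);
	m->end_lsn = get_be(buf + 10, 8);
	m->prepare_time = (int64_t) get_be(buf + 18, 8);
	m->xid = (uint32_t) get_be(buf + 26, 4);

	/* The GID must be NUL-terminated and fit the fixed-size buffer. */
	gid = buf + SP_FIXED_LEN;
	nul = memchr(gid, '\0', len - SP_FIXED_LEN);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= SP_GIDSIZE)
		return -1;
	memcpy(m->gid, gid, gidlen + 1);
	return 0;
}
```

The decoder rejects nonzero flags, as the read function in the patch does, and additionally bounds-checks the GID before copying it. That check is a choice made for this sketch rather than something taken from the patch.
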
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  10 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 138 ++++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1021 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 94e60e0..ae549db 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2895,7 +2895,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7400,7 +7400,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7663,6 +7663,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress (streamed) transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1433905..702934e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 08d0295..7df8742 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -926,12 +911,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 31bce6b..e6f1276 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1065,6 +1065,90 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1280,30 +1364,20 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	if (in_streamed_transaction)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROTOCOL_VIOLATION),
-				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
 	/* Make sure we have an open transaction */
 	begin_replication_step();
 
@@ -1314,7 +1388,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -1335,7 +1409,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1414,6 +1488,32 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2359,6 +2459,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7a1d42a..d5a284d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index e20f2da..7a4804f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
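
For readers following the protocol changes: the protocol.sgml hunk in the v89-0003 patch in this thread describes a Begin Prepare message as Byte1('b'), three Int64 fields (prepare LSN, end LSN, prepare timestamp), an Int32 xid, and a String GID. A minimal standalone sketch of decoding that layout follows. It assumes integers arrive in network byte order and that the String is NUL-terminated, as elsewhere in the protocol. The struct and function names (BeginPrepareMsg, read_be, decode_begin_prepare) are illustrative only and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative container; field names are not taken from the patch. */
typedef struct BeginPrepareMsg
{
	uint64_t	prepare_lsn;	/* Int64: LSN of the prepare */
	uint64_t	end_lsn;		/* Int64: end LSN of the prepared transaction */
	int64_t		prepare_time;	/* Int64: microseconds since 2000-01-01 */
	uint32_t	xid;			/* Int32: xid of the transaction */
	const char *gid;			/* String: points into the input buffer */
} BeginPrepareMsg;

/* Read a big-endian unsigned integer of nbytes (at most 8) bytes. */
static uint64_t
read_be(const unsigned char *p, size_t nbytes)
{
	uint64_t	v = 0;
	size_t		i;

	for (i = 0; i < nbytes; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Decode one Begin Prepare message. Returns 0 on success, or -1 if the
 * buffer is too short, has the wrong type byte, or the GID is not
 * NUL-terminated within the buffer.
 */
static int
decode_begin_prepare(const unsigned char *buf, size_t len, BeginPrepareMsg *out)
{
	const size_t fixed = 1 + 8 + 8 + 8 + 4;	/* type byte + fixed-width fields */

	assert(buf != NULL && out != NULL);

	if (len <= fixed || buf[0] != 'b')
		return -1;
	if (memchr(buf + fixed, '\0', len - fixed) == NULL)
		return -1;

	out->prepare_lsn = read_be(buf + 1, 8);
	out->end_lsn = read_be(buf + 9, 8);
	out->prepare_time = (int64_t) read_be(buf + 17, 8);
	out->xid = (uint32_t) read_be(buf + 25, 4);
	out->gid = (const char *) (buf + fixed);
	return 0;
}
```

The GID pointer aliases the caller's buffer rather than copying, so the buffer must outlive the decoded struct. A real apply worker would go through the server's own StringInfo-based readers instead of hand-rolled byte shuffling.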

Attachment: v89-0003-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 98232089a834a547a037b47a37ecf7929257e7e6 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 22 Jun 2021 09:09:26 -0400
Subject: [PATCH v89] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
transactions. We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the following operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase is enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase is enabled.

* CREATE/ALTER SUBSCRIPTION which tries to set options two_phase=true and streaming=true at the same time.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 305 ++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 148 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  31 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 +++++++++--
 src/backend/replication/logical/worker.c           | 358 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |   8 +-
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  17 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 43 files changed, 2438 insertions(+), 202 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d..0235639 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7643,6 +7643,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 205fbd2..94e60e0 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1970,6 +1970,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
@@ -2811,11 +2825,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2871,10 +2891,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7379,6 +7400,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepared transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1433905 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to
+          the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..6d3efb4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 75e195f..08d0295 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -506,10 +557,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -816,7 +891,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -850,6 +926,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -873,7 +955,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -918,7 +1001,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -934,6 +1018,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -965,7 +1060,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -982,6 +1078,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1011,7 +1118,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eaa84a..2838b89 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -436,6 +437,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The below code is a temporary hack to check for
+		 * server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -851,7 +865,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -868,6 +882,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
+		if (two_phase)
+			appendStringInfoString(&cmd, " TWO_PHASE");
+
 		switch (snapshot_action)
 		{
 			case CRS_EXPORT_SNAPSHOT:
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 453efc5..74df75e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing preparing
+				 * transactions that have locked [user] catalog tables
+				 * exclusively but as of now we ask users not to do such an
+				 * operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..89d91c2 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,12 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= slot->data.two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +540,22 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +616,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index cb42fcb..2c191de 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 19e96f3..48239c0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2574,7 +2574,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2665,7 +2665,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2712,7 +2712,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2732,7 +2732,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2751,19 +2751,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2781,12 +2782,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..a14a3d6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions, that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot, need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index cc50eb8..9272f75 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready != NIL && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (table_states_not_ready == NIL && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1071,7 +1067,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1158,3 +1155,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY.
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index bbb659d..31bce6b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because, say, it
+ * was prior to the initial consistent point, yet it might have got some later
+ * commits. The tablesync worker will then exit without doing anything for the
+ * prepared transaction skipped by the apply worker, as its sync location will
+ * already be ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares, but we ensure that such prepares are sent along with commit
+ * prepared; see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ...
+ * REFRESH PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider: initially it was on and we
+ * have received some prepare, then we turn it off; now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we could allow changing this option from off to on, but
+ * it is not clear that that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get a prepare of the same GID more than once in genuine cases where
+ * we have defined multiple subscriptions for publications on the same server
+ * and a prepared transaction has operations on tables subscribed to by those
+ * subscriptions. For such cases, if we use the GID sent by the publisher, one
+ * of the prepares will be successful and the others will fail, in which case
+ * the server will send them again. Now, this can lead to a deadlock if the
+ * user has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting of
+ * the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -804,6 +880,191 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a BEGIN PREPARE message")));
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (prepare_data.prepare_lsn != remote_final_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+								 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+								 LSN_FORMAT_ARGS(remote_final_lsn))));
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to deadlock
+	 * when multiple subscriptions on one subscriber point to publications on
+	 * the same publisher node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	begin_replication_step();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* There is no transaction when COMMIT PREPARED is called */
+	begin_replication_step();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
+		begin_replication_step();
+		FinishPreparedTransaction(gid, false);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2082,6 +2343,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2561,6 +2838,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3062,6 +3342,24 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(subid != InvalidRepOriginId);
+
+	if (!TransactionIdIsValid(xid))
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("invalid two-phase transaction ID")));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3230,15 +3528,69 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s.",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
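
For anyone reading the worker.c hunks above without the tree at hand: the subscriber-side GID that TwoPhaseTransactionGid() builds is simply the subscription OID and the remote xid formatted behind a fixed prefix, which is why two subscriptions replaying the same remote prepare cannot collide. Below is a rough standalone sketch of that formatting, written outside the PostgreSQL tree. The names demo_twophase_gid, DemoOid, DemoXid and DEMO_GIDSIZE are hypothetical stand-ins and are not part of the patch; the buffer size is an assumption, and the truncation check is an extra that the patch itself does not perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for PostgreSQL's Oid and TransactionId. */
typedef uint32_t DemoOid;
typedef uint32_t DemoXid;

/* Assumed buffer size for this sketch only. */
#define DEMO_GIDSIZE 200

/*
 * Format "pg_gid_<subid>_<xid>" into gid, as the patch's snprintf does.
 * Returns 0 on success, -1 if xid is invalid (0) or the buffer is too small.
 */
static int
demo_twophase_gid(DemoOid subid, DemoXid xid, char *gid, size_t szgid)
{
	int			n;

	assert(gid != NULL && szgid > 0);

	/* The patch raises a protocol-violation error here; we just report it. */
	if (xid == 0)
		return -1;

	n = snprintf(gid, szgid, "pg_gid_%u_%u",
				 (unsigned int) subid, (unsigned int) xid);

	/* Extra check (not in the patch): reject a truncated GID. */
	if (n < 0 || (size_t) n >= szgid)
		return -1;

	return 0;
}
```

Calling demo_twophase_gid() for two different subscription OIDs with the same xid yields two distinct strings, so each apply worker prepares under its own name, which is the property the header comment relies on to avoid the deadlock.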
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 63f108f..7a1d42a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by
+		 * plugin and decide whether to enable it at later point of time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -839,18 +948,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1269,3 +1368,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eead144..0910546 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -103,7 +103,6 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
-%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -243,7 +242,7 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
 			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
@@ -372,11 +371,6 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
-opt_two_phase:
-			K_TWO_PHASE						{ $$ = true; }
-			| /* EMPTY */					{ $$ = false; }
-			;
-
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8c18b4e..33b85d8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -283,6 +283,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index faeea9f..9f0b13f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -370,7 +370,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 8f53cc7..8141311 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4304,6 +4305,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4347,9 +4349,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The below code is a temporary hack to check for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4370,6 +4388,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4395,6 +4414,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4422,6 +4443,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4463,6 +4485,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..6caa701 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,7 +6415,9 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary and streaming are only supported in v14 and higher.
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
@@ -6423,6 +6425,17 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/*
+		 * Two_phase is only supported in v15 and higher.
+		 *
+		 * FIXME: When PG15 development starts, change the following
+		 * 140000 to 150000
+		 */
+		if (pset.sversion >= 140000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 38af568..8f0a921 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2760,7 +2760,7 @@ psql_completion(const char *text, int start, int end)
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
 					  "enabled", "slot_name", "streaming",
-					  "synchronous_commit");
+					  "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0060ebf..e84353e 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -94,6 +104,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index faa3a25..ebc43a0 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
+	bool		two_phase;
 	List	   *options;
 } CreateReplicationSlotCmd;
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..0b071a6 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_STREAMING command. We can't rely solely on the twophase
+	 * flag which only tells whether the plugin provided all the necessary
+	 * two-phase callbacks.
+	 *
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..e20f2da 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare, and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction
+ * with the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d8..d7c785b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -297,7 +297,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -636,7 +640,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 2eb7e3a..34d95ea 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -84,11 +84,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+# Create some test tables for the copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber. There will be
+# 2 prepared transactions, one for each of the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..cabc0bb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,12 +1388,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

Attachment: v89-0002-Add-support-for-two-phase-decoding-in-pg_recvlog.patch (application/octet-stream)
From e07bcd5f111a933486041d16252453cfbfdeba94 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 22 Jun 2021 08:37:47 -0400
Subject: [PATCH v89] Add support for two-phase decoding in pg_recvlogical.

Modified streamutil to pass in the two-phase option when calling
CREATE_REPLICATION_SLOT. Added a new option, --two-phase, to pg_recvlogical
to allow decoding of two-phase transactions.
---
 doc/src/sgml/logicaldecoding.sgml             | 20 ++++++++++--
 doc/src/sgml/ref/pg_recvlogical.sgml          | 16 ++++++++++
 src/bin/pg_basebackup/pg_basebackup.c         |  2 +-
 src/bin/pg_basebackup/pg_receivewal.c         |  2 +-
 src/bin/pg_basebackup/pg_recvlogical.c        | 19 +++++++++--
 src/bin/pg_basebackup/streamutil.c            |  6 +++-
 src/bin/pg_basebackup/streamutil.h            |  2 +-
 src/bin/pg_basebackup/t/030_pg_recvlogical.pl | 45 ++++++++++++++++++++++++++-
 8 files changed, 103 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 1765ea6..70869a6 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -39,7 +39,6 @@
 
   <sect1 id="logicaldecoding-example">
    <title>Logical Decoding Examples</title>
-
    <para>
     The following example demonstrates controlling logical decoding using the
     SQL interface.
@@ -151,9 +150,10 @@ postgres=# SELECT pg_drop_replication_slot('regression_slot');
     replication connections
     (see <xref linkend="streaming-replication-authentication"/>) and
     that <varname>max_wal_senders</varname> is set sufficiently high to allow
-    an additional connection.
+    an additional connection. The second example enables two-phase decoding.
    </para>
 <programlisting>
+Example 1:
 $ pg_recvlogical -d postgres --slot=test --create-slot
 $ pg_recvlogical -d postgres --slot=test --start -f -
 <keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
@@ -164,6 +164,22 @@ table public.data: INSERT: id[integer]:4 data[text]:'4'
 COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
+
+Example 2:
+$ pg_recvlogical -d postgres --slot=test --create-slot --two-phase
+$ pg_recvlogical -d postgres --slot=test --start -f -
+<keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
+$ psql -d postgres -c "BEGIN;INSERT INTO data(data) VALUES('5');PREPARE TRANSACTION 'test';"
+$ fg
+BEGIN 694
+table public.data: INSERT: id[integer]:5 data[text]:'5'
+PREPARE TRANSACTION 'test', txid 694
+<keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
+$ psql -d postgres -c "COMMIT PREPARED 'test';"
+$ fg
+COMMIT PREPARED 'test', txid 694
+<keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
+$ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
 
   <para>
diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index 6b1d98d..d0972a1 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -65,6 +65,11 @@ PostgreSQL documentation
         <option>--plugin</option>, for the database specified
         by <option>--dbname</option>.
        </para>
+
+       <para>
+        The <option>--two-phase</option> option can be specified with
+        <option>--create-slot</option> to enable two-phase decoding.
+       </para>
       </listitem>
      </varlistentry>
 
@@ -257,6 +262,17 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+       <term><option>-t</option></term>
+       <term><option>--two-phase</option></term>
+       <listitem>
+       <para>
+        Enables two-phase decoding. This option should only be used with
+        <option>--create-slot</option>.
+       </para>
+       </listitem>
+     </varlistentry>
+
+     <varlistentry>
        <term><option>-v</option></term>
        <term><option>--verbose</option></term>
        <listitem>
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 16d8929..8bb0acf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -646,7 +646,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 	if (temp_replication_slot || create_slot)
 	{
 		if (!CreateReplicationSlot(param->bgconn, replication_slot, NULL,
-								   temp_replication_slot, true, true, false))
+								   temp_replication_slot, true, true, false, false))
 			exit(1);
 
 		if (verbose)
diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c
index 0d15012..c1334fa 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -741,7 +741,7 @@ main(int argc, char **argv)
 			pg_log_info("creating replication slot \"%s\"", replication_slot);
 
 		if (!CreateReplicationSlot(conn, replication_slot, NULL, false, true, false,
-								   slot_exists_ok))
+								   slot_exists_ok, false))
 			exit(1);
 		exit(0);
 	}
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 5efec16..76bd153 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -35,6 +35,7 @@
 /* Global Options */
 static char *outfile = NULL;
 static int	verbose = 0;
+static bool two_phase = false;
 static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;	/* 10 sec = default */
 static int	fsync_interval = 10 * 1000; /* 10 sec = default */
@@ -93,6 +94,7 @@ usage(void)
 	printf(_("  -s, --status-interval=SECS\n"
 			 "                         time between status packets sent to server (default: %d)\n"), (standby_message_timeout / 1000));
 	printf(_("  -S, --slot=SLOTNAME    name of the logical replication slot\n"));
+	printf(_("  -t, --two-phase        enable two-phase decoding when creating a slot\n"));
 	printf(_("  -v, --verbose          output verbose messages\n"));
 	printf(_("  -V, --version          output version information, then exit\n"));
 	printf(_("  -?, --help             show this help, then exit\n"));
@@ -678,6 +680,7 @@ main(int argc, char **argv)
 		{"fsync-interval", required_argument, NULL, 'F'},
 		{"no-loop", no_argument, NULL, 'n'},
 		{"verbose", no_argument, NULL, 'v'},
+		{"two-phase", no_argument, NULL, 't'},
 		{"version", no_argument, NULL, 'V'},
 		{"help", no_argument, NULL, '?'},
 /* connection options */
@@ -726,7 +729,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "E:f:F:nvd:h:p:U:wWI:o:P:s:S:",
+	while ((c = getopt_long(argc, argv, "E:f:F:nvtd:h:p:U:wWI:o:P:s:S:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -749,6 +752,9 @@ main(int argc, char **argv)
 			case 'v':
 				verbose++;
 				break;
+			case 't':
+				two_phase = true;
+				break;
 /* connection options */
 			case 'd':
 				dbname = pg_strdup(optarg);
@@ -920,6 +926,15 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (two_phase && !do_create_slot)
+	{
+		pg_log_error("--two-phase may only be specified with --create-slot");
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}
+
+
 #ifndef WIN32
 	pqsignal(SIGINT, sigint_handler);
 	pqsignal(SIGHUP, sighup_handler);
@@ -976,7 +991,7 @@ main(int argc, char **argv)
 			pg_log_info("creating replication slot \"%s\"", replication_slot);
 
 		if (!CreateReplicationSlot(conn, replication_slot, plugin, false,
-								   false, false, slot_exists_ok))
+								   false, false, slot_exists_ok, two_phase))
 			exit(1);
 		startpos = InvalidXLogRecPtr;
 	}
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 99daf0e..1f99aae 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -486,7 +486,7 @@ RunIdentifySystem(PGconn *conn, char **sysid, TimeLineID *starttli,
 bool
 CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 					  bool is_temporary, bool is_physical, bool reserve_wal,
-					  bool slot_exists_ok)
+					  bool slot_exists_ok, bool two_phase)
 {
 	PQExpBuffer query;
 	PGresult   *res;
@@ -495,6 +495,7 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 
 	Assert((is_physical && plugin == NULL) ||
 		   (!is_physical && plugin != NULL));
+	Assert(!(two_phase && is_physical));
 	Assert(slot_name != NULL);
 
 	/* Build query */
@@ -510,6 +511,9 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 	else
 	{
 		appendPQExpBuffer(query, " LOGICAL \"%s\"", plugin);
+		if (two_phase && PQserverVersion(conn) >= 140000)
+			appendPQExpBufferStr(query, " TWO_PHASE");
+
 		if (PQserverVersion(conn) >= 100000)
 			/* pg_recvlogical doesn't use an exported snapshot, so suppress */
 			appendPQExpBufferStr(query, " NOEXPORT_SNAPSHOT");
diff --git a/src/bin/pg_basebackup/streamutil.h b/src/bin/pg_basebackup/streamutil.h
index 10f87ad..504803b 100644
--- a/src/bin/pg_basebackup/streamutil.h
+++ b/src/bin/pg_basebackup/streamutil.h
@@ -34,7 +34,7 @@ extern PGconn *GetConnection(void);
 extern bool CreateReplicationSlot(PGconn *conn, const char *slot_name,
 								  const char *plugin, bool is_temporary,
 								  bool is_physical, bool reserve_wal,
-								  bool slot_exists_ok);
+								  bool slot_exists_ok, bool two_phase);
 extern bool DropReplicationSlot(PGconn *conn, const char *slot_name);
 extern bool RunIdentifySystem(PGconn *conn, char **sysid,
 							  TimeLineID *starttli,
diff --git a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
index 53f4181..bbbf9e2 100644
--- a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
+++ b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use TestLib;
 use PostgresNode;
-use Test::More tests => 15;
+use Test::More tests => 20;
 
 program_help_ok('pg_recvlogical');
 program_version_ok('pg_recvlogical');
@@ -22,6 +22,7 @@ max_replication_slots = 4
 max_wal_senders = 4
 log_min_messages = 'debug1'
 log_error_verbosity = verbose
+max_prepared_transactions = 10
 });
 $node->dump_info;
 $node->start;
@@ -63,3 +64,45 @@ $node->command_ok(
 		'--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'
 	],
 	'replayed a transaction');
+
+$node->command_ok(
+	[
+		'pg_recvlogical',           '-S',
+		'test',                     '-d',
+		$node->connstr('postgres'), '--drop-slot'
+	],
+	'slot dropped');
+
+#test with two-phase option enabled
+$node->command_ok(
+	[
+		'pg_recvlogical',           '-S',
+		'test',                     '-d',
+		$node->connstr('postgres'), '--create-slot', '--two-phase'
+	],
+	'slot with two-phase created');
+
+$slot = $node->slot('test');
+isnt($slot->{'restart_lsn'}, '', 'restart lsn is defined for new slot');
+
+$node->safe_psql('postgres',
+	"BEGIN; INSERT INTO test_table values (11); PREPARE TRANSACTION 'test'");
+$node->safe_psql('postgres',
+	"COMMIT PREPARED 'test'");
+$nextlsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn()');
+chomp($nextlsn);
+
+$node->command_fails(
+	[
+		'pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'),
+		'--start', '--endpos', "$nextlsn", '--two-phase', '--no-loop', '-f', '-'
+	],
+	'incorrect usage');
+
+$node->command_ok(
+	[
+		'pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'),
+		'--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'
+	],
+	'replayed a two-phase transaction');
-- 
1.8.3.1
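
For readers skimming the streamutil.c hunk above, here is a minimal standalone sketch of how the logical-slot command text ends up being assembled. It is not the patch's code: build_create_slot_cmd is a hypothetical helper that uses snprintf instead of PQExpBuffer and takes the server version as a plain integer in PQserverVersion() numbering. The leading CREATE_REPLICATION_SLOT part is not shown in the hunk and is assumed here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the logical-slot branch of
 * CreateReplicationSlot(). server_version follows PQserverVersion()
 * numbering (e.g. 140000 for v14).
 *
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int
build_create_slot_cmd(char *buf, size_t buflen, const char *slot_name,
                      const char *plugin, bool two_phase, int server_version)
{
    int n = snprintf(buf, buflen,
                     "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"%s%s",
                     slot_name, plugin,
                     /* TWO_PHASE is only sent to servers that understand it */
                     (two_phase && server_version >= 140000) ? " TWO_PHASE" : "",
                     /* pg_recvlogical does not use an exported snapshot */
                     (server_version >= 100000) ? " NOEXPORT_SNAPSHOT" : "");

    return (n < 0 || (size_t) n >= buflen) ? -1 : n;
}
```

Note that, as in the hunk, a two_phase request against an older server is dropped silently rather than rejected; whether that is the desired behaviour is a separate question from this sketch.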

#361vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#360)

On Wed, Jun 23, 2021 at 9:10 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Jun 22, 2021 at 3:36 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

Some minor comments:

(1)
v88-0002

doc/src/sgml/logicaldecoding.sgml

"examples shows" is not correct.
I think there is only ONE example being referred to.

BEFORE:
+    The following examples shows how logical decoding is controlled over the
AFTER:
+    The following example shows how logical decoding is controlled over the

fixed.

(2)
v88 - 0003

doc/src/sgml/ref/create_subscription.sgml

(i)

BEFORE:
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on publisher is decoded as a normal transaction at commit.
AFTER:
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on the publisher is decoded as a normal transaction at commit time.

fixed.

(ii)

src/backend/access/transam/twophase.c

The double-bracketing is unnecessary:

BEFORE:
+ if ((gxact->valid && strcmp(gxact->gid, gid) == 0))
AFTER:
+ if (gxact->valid && strcmp(gxact->gid, gid) == 0)

fixed.

(iii)

src/backend/replication/logical/snapbuild.c

Need to add some commas to make the following easier to read, and
change "needs" to "need":

BEFORE:
+ * The prepared transactions that were skipped because previously
+ * two-phase was not enabled or are not covered by initial snapshot needs
+ * to be sent later along with commit prepared and they must be before
+ * this point.
AFTER:
+ * The prepared transactions, that were skipped because previously
+ * two-phase was not enabled or are not covered by initial snapshot, need
+ * to be sent later along with commit prepared and they must be before
+ * this point.

fixed.

(iv)

src/backend/replication/logical/tablesync.c

I think the convention used in Postgres code is to check for empty
Lists using "== NIL" and non-empty Lists using "!= NIL".

BEFORE:
+ if (table_states_not_ready && !last_start_times)
AFTER:
+ if (table_states_not_ready != NIL && !last_start_times)
BEFORE:
+ else if (!table_states_not_ready && last_start_times)
AFTER:
+ else if (table_states_not_ready == NIL && last_start_times)

fixed.
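
As an illustration of the List convention mentioned in (iv), here is a small self-contained sketch. The List type and NIL macro below are stand-ins written for this example (the real ones live in pg_list.h), and the SyncAction names are hypothetical labels for the two branches, not identifiers from tablesync.c.

```c
#include <stddef.h>

/* Stand-ins for PostgreSQL's List and NIL, for illustration only. */
typedef struct List
{
    int         length;
} List;

#define NIL ((List *) NULL)

/* Hypothetical labels for what each branch would go on to do. */
typedef enum
{
    ACTION_NONE,
    ACTION_FIRST_BRANCH,
    ACTION_SECOND_BRANCH
} SyncAction;

static SyncAction
choose_action(const List *table_states_not_ready, const void *last_start_times)
{
    /* Non-empty list: compare against NIL explicitly. */
    if (table_states_not_ready != NIL && !last_start_times)
        return ACTION_FIRST_BRANCH;
    /* Empty list: again spelled as a comparison with NIL. */
    else if (table_states_not_ready == NIL && last_start_times)
        return ACTION_SECOND_BRANCH;

    return ACTION_NONE;
}
```

The behaviour is the same as the "!list" spelling since NIL is a null pointer; the explicit comparison only makes it obvious to the reader that the variable is a List.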

Also fixed comments from Vignesh:

1) This content is present in
v87-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch and
v87-0003-Add-support-for-prepared-transactions-to-built-i.patch, it
can be removed from one of them
<varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this logical replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>

I don't see this duplicate content.

Thanks for the updated patch.
The patch v89-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch
has the following:
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this logical replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
The patch v89-0003-Add-support-for-prepared-transactions-to-built-i.patch
has the following:
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decode of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>

We can remove one of them.

Regards,
Vignesh

#362Ajin Cherian
itsajin@gmail.com
In reply to: vignesh C (#361)
5 attachment(s)

On Wed, Jun 23, 2021 at 3:18 PM vignesh C <vignesh21@gmail.com> wrote:

Thanks for the updated patch.
The patch v89-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch
has the following:
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this logical replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
The patch v89-0003-Add-support-for-prepared-transactions-to-built-i.patch
has the following:
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this replication slot supports decode of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>

We can remove one of them.

I missed this. Updated.

Also fixed this comment below which I had missed in my last patch:

4) You could mention "Before you use two-phase commit commands, you
must set max_prepared_transactions to at least 1" for example 2.
$ pg_recvlogical -d postgres --slot=test --drop-slot
+
+$ pg_recvlogical -d postgres --slot=test --create-slot --two-phase
+$ pg_recvlogical -d postgres --slot=test --start -f -

Comment 6:

6) This should be before verbose, the options are printed alphabetically
printf(_(" -v, --verbose output verbose messages\n"));
+ printf(_(" -t, --two-phase enable two-phase decoding when creating a slot\n"));
printf(_(" -V, --version output version information, then exit\n"));

This was also fixed in the last patch.
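
For anyone following along without the patch applied, the option under discussion is a plain flag that is only accepted together with --create-slot. A rough standalone sketch of that parsing and check, using getopt_long directly, could look like the following. SlotOptions and parse_slot_options are names made up for this example, and the option table is cut down to the two options that matter here.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical container for the two settings relevant to this check. */
typedef struct
{
    bool        do_create_slot;
    bool        two_phase;
} SlotOptions;

/* Returns 0 if the combination is acceptable, -1 otherwise. */
static int
parse_slot_options(int argc, char **argv, SlotOptions *opts)
{
    static const struct option long_options[] = {
        {"create-slot", no_argument, NULL, 1},
        {"two-phase", no_argument, NULL, 't'},
        {NULL, 0, NULL, 0}
    };
    int         c;

    opts->do_create_slot = false;
    opts->two_phase = false;
    optind = 1;                 /* restart scanning, so the helper is reusable */

    while ((c = getopt_long(argc, argv, "t", long_options, NULL)) != -1)
    {
        switch (c)
        {
            case 1:
                opts->do_create_slot = true;
                break;
            case 't':
                opts->two_phase = true;
                break;
            default:
                return -1;      /* unknown option */
        }
    }

    /* Same rule as in the patch: the flag is meaningless without an action. */
    if (opts->two_phase && !opts->do_create_slot)
    {
        fprintf(stderr, "--two-phase may only be specified with --create-slot\n");
        return -1;
    }

    return 0;
}
```

With this shape, "--create-slot --two-phase" is accepted and "-t" on its own is rejected, which matches the command_fails case in the TAP test.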

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v90-0004-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 0f090496796c6ef1f48a03514f8016173cf219bc Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Jun 2021 06:25:48 -0400
Subject: [PATCH v90] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  10 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 138 ++++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1021 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
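
Reviewer's note, not part of the patch: the Stream Prepare message added below has the layout Byte1('p'), Int8 flags, Int64 prepare LSN, Int64 end LSN, Int64 prepare timestamp, Int32 xid, then the null-terminated GID, with integers in network byte order. The following is a self-contained sketch of that layout without any PostgreSQL headers. StreamPrepare, put_be, get_be and the encode/decode functions are hypothetical names, and GID_MAX is an assumption meant to mirror GIDSIZE. Unlike logicalrep_read_stream_prepare, the decoder here also consumes the leading type byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GID_MAX 200             /* assumption: mirrors PostgreSQL's GIDSIZE */
#define FIXED_LEN (1 + 1 + 8 + 8 + 8 + 4)

typedef struct
{
    uint64_t    prepare_lsn;
    uint64_t    end_lsn;
    int64_t     prepare_time;   /* microseconds since 2000-01-01 */
    uint32_t    xid;
    char        gid[GID_MAX];
} StreamPrepare;

/* Store the low nbytes of v at p, most significant byte first. */
static size_t
put_be(uint8_t *p, uint64_t v, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        p[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
    return nbytes;
}

/* Read an nbytes-wide big-endian unsigned integer from p. */
static uint64_t
get_be(const uint8_t *p, size_t nbytes)
{
    uint64_t    v = 0;

    for (size_t i = 0; i < nbytes; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
static size_t
stream_prepare_encode(const StreamPrepare *m, uint8_t *buf, size_t buflen)
{
    size_t      gidlen = strlen(m->gid) + 1;    /* include the terminator */
    size_t      off = 0;

    if (buflen < FIXED_LEN + gidlen)
        return 0;

    buf[off++] = 'p';           /* message type */
    buf[off++] = 0;             /* flags, currently unused */
    off += put_be(buf + off, m->prepare_lsn, 8);
    off += put_be(buf + off, m->end_lsn, 8);
    off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
    off += put_be(buf + off, m->xid, 4);
    memcpy(buf + off, m->gid, gidlen);
    return off + gidlen;
}

/* Returns 0 on success, -1 if the message is malformed. */
static int
stream_prepare_decode(const uint8_t *buf, size_t len, StreamPrepare *m)
{
    const uint8_t *gid;
    const uint8_t *nul;
    size_t      gidlen;

    if (len <= FIXED_LEN || buf[0] != 'p' || buf[1] != 0)
        return -1;

    m->prepare_lsn = get_be(buf + 2, 8);
    m->end_lsn = get_be(buf + 10, 8);
    m->prepare_time = (int64_t) get_be(buf + 18, 8);
    m->xid = (uint32_t) get_be(buf + 26, 4);

    /* The GID must be terminated inside the message and fit the buffer. */
    gid = buf + FIXED_LEN;
    nul = memchr(gid, '\0', len - FIXED_LEN);
    if (nul == NULL)
        return -1;
    gidlen = (size_t) (nul - gid);
    if (gidlen >= GID_MAX)
        return -1;
    memcpy(m->gid, gid, gidlen + 1);
    return 0;
}
```

The sketch length-checks the GID before copying, whereas the patch uses strcpy into a pre-allocated buffer; that is a choice made for this example, not a claim about what the patch should do.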

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 97d7bdf..8b95aaa 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7386,7 +7386,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7649,6 +7649,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1433905..702934e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 08d0295..7df8742 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -926,12 +911,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 31bce6b..e6f1276 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1065,6 +1065,90 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to deadlock
+	 * when multiple subscriptions on the same node point to publications on
+	 * the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1280,30 +1364,20 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	if (in_streamed_transaction)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROTOCOL_VIOLATION),
-				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
 	/* Make sure we have an open transaction */
 	begin_replication_step();
 
@@ -1314,7 +1388,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -1335,7 +1409,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1414,6 +1488,32 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2359,6 +2459,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7a1d42a..d5a284d 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index e20f2da..7a4804f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index f054ac8..81d27f3 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

v90-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch (application/octet-stream)
From 67d5896dfaca9267f5c4f0e97a7afe975b1cb760 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Jun 2021 05:19:46 -0400
Subject: [PATCH v90] Add option to set two-phase in CREATE_REPLICATION_SLOT
 command.

CREATE_REPLICATION_SLOT modified to support two-phase decoding in the slot.
This will allow the decoding of commands like PREPARE TRANSACTION,
COMMIT PREPARED and ROLLBACK PREPARED for slots created with this option.
---
 doc/src/sgml/protocol.sgml             | 16 +++++++++++++++-
 src/backend/replication/repl_gram.y    | 12 ++++++++++++
 src/backend/replication/repl_scanner.l |  1 +
 src/backend/replication/walsender.c    | 18 +++++++++++++++---
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index bc2a2fe..205fbd2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> | <literal>TWO_PHASE</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1956,6 +1956,20 @@ The commands accepted in replication mode are:
       </varlistentry>
 
       <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this logical replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
         <para>
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8..eead144 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -102,6 +103,7 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
+%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -283,6 +285,11 @@ create_slot_opt:
 				  $$ = makeDefElem("reserve_wal",
 								   (Node *)makeInteger(true), -1);
 				}
+			| K_TWO_PHASE
+				{
+				  $$ = makeDefElem("two_phase",
+								   (Node *)makeInteger(true), -1);
+				}
 			;
 
 /* DROP_REPLICATION_SLOT slot */
@@ -365,6 +372,11 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
+opt_two_phase:
+			K_TWO_PHASE						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+			;
+
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3f..c038a63 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3224536..92c755f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -863,11 +863,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 static void
 parseCreateReplSlotOptions(CreateReplicationSlotCmd *cmd,
 						   bool *reserve_wal,
-						   CRSSnapshotAction *snapshot_action)
+						   CRSSnapshotAction *snapshot_action,
+						   bool *two_phase)
 {
 	ListCell   *lc;
 	bool		snapshot_action_given = false;
 	bool		reserve_wal_given = false;
+	bool		two_phase_given = false;
 
 	/* Parse options */
 	foreach(lc, cmd->options)
@@ -905,6 +907,15 @@ parseCreateReplSlotOptions(CreateReplicationSlotCmd *cmd,
 			reserve_wal_given = true;
 			*reserve_wal = true;
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_given || cmd->kind != REPLICATION_KIND_LOGICAL)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_given = true;
+			*two_phase = true;
+		}
 		else
 			elog(ERROR, "unrecognized option: %s", defel->defname);
 	}
@@ -920,6 +931,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 	char		xloc[MAXFNAMELEN];
 	char	   *slot_name;
 	bool		reserve_wal = false;
+	bool		two_phase = false;
 	CRSSnapshotAction snapshot_action = CRS_EXPORT_SNAPSHOT;
 	DestReceiver *dest;
 	TupOutputState *tstate;
@@ -929,7 +941,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	Assert(!MyReplicationSlot);
 
-	parseCreateReplSlotOptions(cmd, &reserve_wal, &snapshot_action);
+	parseCreateReplSlotOptions(cmd, &reserve_wal, &snapshot_action, &two_phase);
 
 	/* setup state for WalSndSegmentOpen */
 	sendTimeLineIsHistoric = false;
@@ -954,7 +966,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 */
 		ReplicationSlotCreate(cmd->slotname, true,
 							  cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL,
-							  false);
+							  two_phase);
 	}
 
 	if (cmd->kind == REPLICATION_KIND_LOGICAL)
-- 
1.8.3.1
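
The two_phase check in parseCreateReplSlotOptions() above can be sketched
outside the server roughly as follows. This is only an illustration under
stated assumptions: SlotKind, ParseResult and parse_slot_options() are
hypothetical names that do not exist in PostgreSQL, and a return code stands
in for ereport(ERROR, ...).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are not PostgreSQL's real definitions. */
typedef enum { SLOT_KIND_PHYSICAL, SLOT_KIND_LOGICAL } SlotKind;
typedef enum { PARSE_OK, PARSE_CONFLICT, PARSE_UNRECOGNIZED } ParseResult;

/*
 * Walk the option names once.  "two_phase" is accepted at most once and
 * only for logical slots, mirroring the condition in the patch.  Any other
 * name is reported as unrecognized (the real function knows more options).
 */
static ParseResult
parse_slot_options(SlotKind kind, const char *const *opts, size_t nopts,
				   bool *two_phase)
{
	bool		two_phase_given = false;

	assert(two_phase != NULL);
	*two_phase = false;

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i], "two_phase") == 0)
		{
			/* same test as the patch: repeated, or used on a physical slot */
			if (two_phase_given || kind != SLOT_KIND_LOGICAL)
				return PARSE_CONFLICT;
			two_phase_given = true;
			*two_phase = true;
		}
		else
			return PARSE_UNRECOGNIZED;
	}
	return PARSE_OK;
}
```

Note that, as in the patch, a physical slot with TWO_PHASE is reported through
the same "conflicting or redundant options" path as a duplicated option.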

v90-0002-Add-support-for-two-phase-decoding-in-pg_recvlog.patch (application/octet-stream)
From 427b1e35471e3fa66be55b576b7a52e3a476db4d Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Jun 2021 06:13:45 -0400
Subject: [PATCH v90] Add support for two-phase decoding in pg_recvlogical.

Modified streamutil to pass in the two-phase option when calling
CREATE_REPLICATION_SLOT. Added new option --two-phase in pg_recvlogical
to allow decoding of two-phase transactions.
---
 doc/src/sgml/logicaldecoding.sgml             | 22 +++++++++++--
 doc/src/sgml/ref/pg_recvlogical.sgml          | 16 ++++++++++
 src/bin/pg_basebackup/pg_basebackup.c         |  2 +-
 src/bin/pg_basebackup/pg_receivewal.c         |  2 +-
 src/bin/pg_basebackup/pg_recvlogical.c        | 19 +++++++++--
 src/bin/pg_basebackup/streamutil.c            |  6 +++-
 src/bin/pg_basebackup/streamutil.h            |  2 +-
 src/bin/pg_basebackup/t/030_pg_recvlogical.pl | 45 ++++++++++++++++++++++++++-
 8 files changed, 105 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 1765ea6..9628eb5 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -39,7 +39,6 @@
 
   <sect1 id="logicaldecoding-example">
    <title>Logical Decoding Examples</title>
-
    <para>
     The following example demonstrates controlling logical decoding using the
     SQL interface.
@@ -151,9 +150,12 @@ postgres=# SELECT pg_drop_replication_slot('regression_slot');
     replication connections
     (see <xref linkend="streaming-replication-authentication"/>) and
     that <varname>max_wal_senders</varname> is set sufficiently high to allow
-    an additional connection.
+    an additional connection. The second example enables two-phase decoding.
+    Before you use two-phase commands, you must set
+    <xref linkend="guc-max-prepared-transactions"/> to at least 1.
    </para>
 <programlisting>
+Example 1:
 $ pg_recvlogical -d postgres --slot=test --create-slot
 $ pg_recvlogical -d postgres --slot=test --start -f -
 <keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
@@ -164,6 +166,22 @@ table public.data: INSERT: id[integer]:4 data[text]:'4'
 COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
+
+Example 2:
+$ pg_recvlogical -d postgres --slot=test --create-slot --two-phase
+$ pg_recvlogical -d postgres --slot=test --start -f -
+<keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
+$ psql -d postgres -c "BEGIN;INSERT INTO data(data) VALUES('5');PREPARE TRANSACTION 'test';"
+$ fg
+BEGIN 694
+table public.data: INSERT: id[integer]:5 data[text]:'5'
+PREPARE TRANSACTION 'test', txid 694
+<keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
+$ psql -d postgres -c "COMMIT PREPARED 'test';"
+$ fg
+COMMIT PREPARED 'test', txid 694
+<keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
+$ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
 
   <para>
diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index 6b1d98d..d0972a1 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -65,6 +65,11 @@ PostgreSQL documentation
         <option>--plugin</option>, for the database specified
         by <option>--dbname</option>.
        </para>
+
+       <para>
+        The <option>--two-phase</option> option can be specified with
+        <option>--create-slot</option> to enable two-phase decoding.
+       </para>
       </listitem>
      </varlistentry>
 
@@ -257,6 +262,17 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+       <term><option>-t</option></term>
+       <term><option>--two-phase</option></term>
+       <listitem>
+       <para>
+        Enables two-phase decoding. This option should only be used with
+        <option>--create-slot</option>.
+       </para>
+       </listitem>
+     </varlistentry>
+
+     <varlistentry>
        <term><option>-v</option></term>
        <term><option>--verbose</option></term>
        <listitem>
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 16d8929..8bb0acf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -646,7 +646,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 	if (temp_replication_slot || create_slot)
 	{
 		if (!CreateReplicationSlot(param->bgconn, replication_slot, NULL,
-								   temp_replication_slot, true, true, false))
+								   temp_replication_slot, true, true, false, false))
 			exit(1);
 
 		if (verbose)
diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c
index 0d15012..c1334fa 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -741,7 +741,7 @@ main(int argc, char **argv)
 			pg_log_info("creating replication slot \"%s\"", replication_slot);
 
 		if (!CreateReplicationSlot(conn, replication_slot, NULL, false, true, false,
-								   slot_exists_ok))
+								   slot_exists_ok, false))
 			exit(1);
 		exit(0);
 	}
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 5efec16..76bd153 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -35,6 +35,7 @@
 /* Global Options */
 static char *outfile = NULL;
 static int	verbose = 0;
+static bool two_phase = false;
 static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;	/* 10 sec = default */
 static int	fsync_interval = 10 * 1000; /* 10 sec = default */
@@ -93,6 +94,7 @@ usage(void)
 	printf(_("  -s, --status-interval=SECS\n"
 			 "                         time between status packets sent to server (default: %d)\n"), (standby_message_timeout / 1000));
 	printf(_("  -S, --slot=SLOTNAME    name of the logical replication slot\n"));
+	printf(_("  -t, --two-phase        enable two-phase decoding when creating a slot\n"));
 	printf(_("  -v, --verbose          output verbose messages\n"));
 	printf(_("  -V, --version          output version information, then exit\n"));
 	printf(_("  -?, --help             show this help, then exit\n"));
@@ -678,6 +680,7 @@ main(int argc, char **argv)
 		{"fsync-interval", required_argument, NULL, 'F'},
 		{"no-loop", no_argument, NULL, 'n'},
 		{"verbose", no_argument, NULL, 'v'},
+		{"two-phase", no_argument, NULL, 't'},
 		{"version", no_argument, NULL, 'V'},
 		{"help", no_argument, NULL, '?'},
 /* connection options */
@@ -726,7 +729,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "E:f:F:nvd:h:p:U:wWI:o:P:s:S:",
+	while ((c = getopt_long(argc, argv, "E:f:F:nvtd:h:p:U:wWI:o:P:s:S:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -749,6 +752,9 @@ main(int argc, char **argv)
 			case 'v':
 				verbose++;
 				break;
+			case 't':
+				two_phase = true;
+				break;
 /* connection options */
 			case 'd':
 				dbname = pg_strdup(optarg);
@@ -920,6 +926,15 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (two_phase && !do_create_slot)
+	{
+		pg_log_error("--two-phase may only be specified with --create-slot");
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}
+
+
 #ifndef WIN32
 	pqsignal(SIGINT, sigint_handler);
 	pqsignal(SIGHUP, sighup_handler);
@@ -976,7 +991,7 @@ main(int argc, char **argv)
 			pg_log_info("creating replication slot \"%s\"", replication_slot);
 
 		if (!CreateReplicationSlot(conn, replication_slot, plugin, false,
-								   false, false, slot_exists_ok))
+								   false, false, slot_exists_ok, two_phase))
 			exit(1);
 		startpos = InvalidXLogRecPtr;
 	}
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 99daf0e..1f99aae 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -486,7 +486,7 @@ RunIdentifySystem(PGconn *conn, char **sysid, TimeLineID *starttli,
 bool
 CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 					  bool is_temporary, bool is_physical, bool reserve_wal,
-					  bool slot_exists_ok)
+					  bool slot_exists_ok, bool two_phase)
 {
 	PQExpBuffer query;
 	PGresult   *res;
@@ -495,6 +495,7 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 
 	Assert((is_physical && plugin == NULL) ||
 		   (!is_physical && plugin != NULL));
+	Assert(!(two_phase && is_physical));
 	Assert(slot_name != NULL);
 
 	/* Build query */
@@ -510,6 +511,9 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 	else
 	{
 		appendPQExpBuffer(query, " LOGICAL \"%s\"", plugin);
+		if (two_phase && PQserverVersion(conn) >= 140000)
+			appendPQExpBufferStr(query, " TWO_PHASE");
+
 		if (PQserverVersion(conn) >= 100000)
 			/* pg_recvlogical doesn't use an exported snapshot, so suppress */
 			appendPQExpBufferStr(query, " NOEXPORT_SNAPSHOT");
diff --git a/src/bin/pg_basebackup/streamutil.h b/src/bin/pg_basebackup/streamutil.h
index 10f87ad..504803b 100644
--- a/src/bin/pg_basebackup/streamutil.h
+++ b/src/bin/pg_basebackup/streamutil.h
@@ -34,7 +34,7 @@ extern PGconn *GetConnection(void);
 extern bool CreateReplicationSlot(PGconn *conn, const char *slot_name,
 								  const char *plugin, bool is_temporary,
 								  bool is_physical, bool reserve_wal,
-								  bool slot_exists_ok);
+								  bool slot_exists_ok, bool two_phase);
 extern bool DropReplicationSlot(PGconn *conn, const char *slot_name);
 extern bool RunIdentifySystem(PGconn *conn, char **sysid,
 							  TimeLineID *starttli,
diff --git a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
index 53f4181..bbbf9e2 100644
--- a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
+++ b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use TestLib;
 use PostgresNode;
-use Test::More tests => 15;
+use Test::More tests => 20;
 
 program_help_ok('pg_recvlogical');
 program_version_ok('pg_recvlogical');
@@ -22,6 +22,7 @@ max_replication_slots = 4
 max_wal_senders = 4
 log_min_messages = 'debug1'
 log_error_verbosity = verbose
+max_prepared_transactions = 10
 });
 $node->dump_info;
 $node->start;
@@ -63,3 +64,45 @@ $node->command_ok(
 		'--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'
 	],
 	'replayed a transaction');
+
+$node->command_ok(
+	[
+		'pg_recvlogical',           '-S',
+		'test',                     '-d',
+		$node->connstr('postgres'), '--drop-slot'
+	],
+	'slot dropped');
+
+#test with two-phase option enabled
+$node->command_ok(
+	[
+		'pg_recvlogical',           '-S',
+		'test',                     '-d',
+		$node->connstr('postgres'), '--create-slot', '--two-phase'
+	],
+	'slot with two-phase created');
+
+$slot = $node->slot('test');
+isnt($slot->{'restart_lsn'}, '', 'restart lsn is defined for new slot');
+
+$node->safe_psql('postgres',
+	"BEGIN; INSERT INTO test_table values (11); PREPARE TRANSACTION 'test'");
+$node->safe_psql('postgres',
+	"COMMIT PREPARED 'test'");
+$nextlsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn()');
+chomp($nextlsn);
+
+$node->command_fails(
+	[
+		'pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'),
+		'--start', '--endpos', "$nextlsn", '--two-phase', '--no-loop', '-f', '-'
+	],
+	'incorrect usage');
+
+$node->command_ok(
+	[
+		'pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'),
+		'--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'
+	],
+	'replayed a two-phase transaction');
-- 
1.8.3.1
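
For readers following the streamutil.c change above, here is a minimal
standalone sketch of how the command text for a logical slot ends up being
assembled. It assumes a plain snprintf() instead of PQExpBuffer, and
build_create_slot_cmd() is a hypothetical helper, not a function in the
tree. It also does no escaping of the names, so it is only an illustration
of the ordering and of the version gating.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a CREATE_REPLICATION_SLOT command for a logical slot.
 * TWO_PHASE is appended only for servers >= 14 (140000) and
 * NOEXPORT_SNAPSHOT only for servers >= 10 (100000), in that order,
 * as in the patch.  Returns the length written, or -1 if buf is too small.
 */
static int
build_create_slot_cmd(char *buf, size_t buflen, const char *slot_name,
					  const char *plugin, bool two_phase, int server_version)
{
	int			n;

	assert(buf != NULL && slot_name != NULL && plugin != NULL);
	assert(strlen(slot_name) > 0);

	n = snprintf(buf, buflen,
				 "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"%s%s",
				 slot_name, plugin,
				 (two_phase && server_version >= 140000) ? " TWO_PHASE" : "",
				 (server_version >= 100000) ? " NOEXPORT_SNAPSHOT" : "");

	return (n >= 0 && (size_t) n < buflen) ? n : -1;
}
```

One consequence worth noticing: against an older server the two_phase request
is dropped silently rather than rejected, because the keyword is simply not
appended.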

v90-0005-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From d276b9025f66605fd8548849dccf6fd2a0f46a2a Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Jun 2021 06:32:41 -0400
Subject: [PATCH v90] Skip empty transactions for logical replication.

The current logical replication behaviour is to send every transaction to
the subscriber even if the transaction is empty (because it does not
contain changes from the selected publications). It is a waste of CPU
cycles and network bandwidth to build/transmit these empty transactions.

This patch addresses the problem by postponing the BEGIN / BEGIN PREPARE
message until the first change is sent. When a COMMIT or PREPARE is
processed and the transaction has produced no changes, the COMMIT or
PREPARE message is not sent either. As a result, pgoutput skips the
BEGIN / COMMIT or BEGIN PREPARE / PREPARE messages for empty transactions.
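
The postponed-BEGIN idea described above can be sketched in isolation like
this. It is a toy model, not the pgoutput code: TxnState and the txn_*()
functions are hypothetical names, and a counter stands in for actually
writing protocol messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-transaction state; the patch's real struct may differ. */
typedef struct TxnState
{
	bool		begin_sent;		/* has the postponed BEGIN gone out yet? */
	int			messages_out;	/* messages emitted so far, for illustration */
} TxnState;

/* Start of a transaction: remember that BEGIN is still pending. */
static void
txn_start(TxnState *txn)
{
	assert(txn != NULL);
	txn->begin_sent = false;
	txn->messages_out = 0;
}

/* First change flushes the pending BEGIN, then the change itself is sent. */
static void
txn_change(TxnState *txn)
{
	assert(txn != NULL);
	if (!txn->begin_sent)
	{
		txn->messages_out++;	/* the postponed BEGIN */
		txn->begin_sent = true;
	}
	txn->messages_out++;		/* the change */
}

/* COMMIT is sent only if BEGIN was; returns false for a skipped empty txn. */
static bool
txn_commit(TxnState *txn)
{
	assert(txn != NULL);
	if (!txn->begin_sent)
		return false;
	txn->messages_out++;		/* the COMMIT */
	return true;
}
```

An empty transaction therefore emits nothing at all, while a transaction with
two changes emits four messages (BEGIN, two changes, COMMIT).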

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  38 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 158 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  46 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 286 insertions(+), 39 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 9628eb5..27abdaa 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -883,11 +883,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command> in which case
+      it can commit the transaction, otherwise, it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 8b95aaa..6ef2a6b 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7538,6 +7538,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7552,6 +7559,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 89d91c2..97ca648 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -934,7 +935,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -969,7 +971,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 				 errmsg("logical replication at prepare time requires commit_prepared_cb callback")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 48239c0..6cdca07 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2792,7 +2792,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e6f1276..47d5a53 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -990,27 +990,39 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* There is no transaction when COMMIT PREPARED is called */
-	begin_replication_step();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In which case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	end_replication_step();
-	CommitTransactionCommand();
+		/* There is no transaction when COMMIT PREPARED is called */
+		begin_replication_step();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index d5a284d..43679a2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -132,6 +134,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -401,10 +408,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -419,8 +448,22 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
+	txn->output_plugin_private = NULL;
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -432,10 +475,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -450,8 +511,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty prepared transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -462,12 +533,33 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of COMMIT PREPARED of an empty transaction");
+			return;
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -480,8 +572,26 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of ROLLBACK PREPARED of an empty transaction");
+			return;
+		}
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -630,11 +740,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, txndata should have been set up by BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -668,6 +783,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -770,6 +894,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -777,6 +902,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, txndata should have been set up by BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -813,6 +942,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -833,6 +971,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -845,6 +984,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 7a4804f..2fa60b5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d7c785b..ffc0b56 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -442,7 +442,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 52bd92d..2b43ae0 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -86,9 +86,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 4c372a6..8a33641 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 25;
 
 ###############################
 # Setup
@@ -318,10 +318,9 @@ $node_publisher->safe_psql('postgres', "
 
 $node_publisher->wait_for_catchup($appname_copy);
 
-# Check that the transaction has been prepared on the subscriber, there will be 2
-# prepared transactions for the 2 subscriptions.
+# Check that the transaction has been prepared on the subscriber
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;;");
-is($result, qq(2), 'transaction is prepared on subscriber');
+is($result, qq(1), 'transaction is prepared on subscriber');
 
 # Now commit the insert and verify that it IS replicated
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
@@ -337,6 +336,45 @@ is($result, qq(2), 'replicated data in subscriber table');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+   "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot
+$node_publisher->safe_psql('postgres', "
+   BEGIN;
+   INSERT INTO tab_nopub SELECT generate_series(1,10);
+   PREPARE TRANSACTION 'empty_transaction';
+   COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+   'postgres', qq(
+       SELECT get_byte(data, 0)
+       FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+           'proto_version', '1',
+           'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+   'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cabc0bb..ad62bbe 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1597,6 +1597,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
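
For anyone skimming the patch above, the empty-transaction optimization comes down to deferring BEGIN until the first change that is actually sent, and dropping COMMIT when no BEGIN went out. Below is a minimal standalone sketch of that pattern. It is not the pgoutput code: TxnState, Txn, Stream, emit() and the on_* functions are hypothetical stand-ins invented for illustration, and the single-letter messages are placeholders rather than the real protocol bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-transaction plugin state (plays the role of PGOutputTxnData). */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been written for this txn? */
} TxnState;

/* Hypothetical transaction handle; priv mimics output_plugin_private. */
typedef struct Txn
{
	unsigned	xid;
	TxnState   *priv;
} Txn;

/* Hypothetical output stream: collects one letter per message sent. */
typedef struct Stream
{
	char		buf[256];
} Stream;

static void
emit(Stream *s, const char *msg)
{
	strncat(s->buf, msg, sizeof(s->buf) - strlen(s->buf) - 1);
}

/* BEGIN callback: only allocate state, send nothing yet. */
static void
on_begin(Txn *txn)
{
	txn->priv = calloc(1, sizeof(TxnState));
	assert(txn->priv != NULL);
}

/* Change callback: filter first, then send the deferred BEGIN if needed. */
static void
on_change(Stream *s, Txn *txn, bool published)
{
	if (!published)
		return;					/* change is not relevant to the subscriber */

	if (!txn->priv->sent_begin)
	{
		emit(s, "B");
		txn->priv->sent_begin = true;
	}
	emit(s, "I");
}

/* COMMIT callback: free state, and skip COMMIT if BEGIN was never sent. */
static void
on_commit(Stream *s, Txn *txn)
{
	bool		skip = !txn->priv->sent_begin;

	free(txn->priv);
	txn->priv = NULL;

	if (skip)
		return;
	emit(s, "C");
}
```

With this shape, a transaction that only touches unpublished tables produces no output at all, while a transaction with two published inserts produces B, I, I, C. The flag is read before the state is freed, which is the same ordering the patch uses in its commit callbacks.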

Attachment: v90-0003-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 3fbbdcafc35b958187ebb096769d032cfdd3aa8c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 23 Jun 2021 06:23:14 -0400
Subject: [PATCH v90] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
transactions. We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch, but it allows external replication solutions
that use the streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the following operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase is enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase is enabled.

* CREATE/ALTER SUBSCRIPTION which tries to set options two_phase=true and streaming=true at the same time.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
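
A small sketch of the two_phase state handling described in the commit message above, for readers who want the rules in one place. The state codes 'd', 'p' and 'e' are the ones this patch documents for subtwophasestate. Everything else (the enum name, the function names, and the idea of expressing the rules as three small functions) is a hypothetical illustration, not the code in subscriptioncmds.c or tablesync.c.

```c
#include <assert.h>
#include <stdbool.h>

/* State codes as documented for subtwophasestate; the names are illustrative. */
typedef enum TwoPhaseState
{
	TWOPHASE_DISABLED = 'd',	/* two_phase was not requested */
	TWOPHASE_PENDING = 'p',		/* requested, but pending enablement */
	TWOPHASE_ENABLED = 'e'		/* requested, and enabled */
} TwoPhaseState;

/* Assumed initial state when a subscription is created. */
static TwoPhaseState
twophase_initial_state(bool two_phase_requested)
{
	return two_phase_requested ? TWOPHASE_PENDING : TWOPHASE_DISABLED;
}

/* two_phase=true together with streaming=true is not supported. */
static bool
twophase_options_valid(bool two_phase, bool streaming)
{
	return !(two_phase && streaming);
}

/* A pending subscription becomes enabled once the initial data sync is over. */
static TwoPhaseState
twophase_next_state(TwoPhaseState cur, bool initial_sync_done)
{
	if (cur == TWOPHASE_PENDING && initial_sync_done)
		return TWOPHASE_ENABLED;
	return cur;
}
```

The point of the intermediate 'p' state is that prepared transactions are not replicated at prepare time while table synchronization is still running. A subscription that never asked for two_phase stays in 'd' regardless of sync progress.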
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 291 ++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 148 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  19 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  31 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 +++++++++--
 src/backend/replication/logical/worker.c           | 358 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |   8 +-
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  29 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  17 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 43 files changed, 2424 insertions(+), 202 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d..0235639 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7643,6 +7643,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 205fbd2..97d7bdf 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2811,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2871,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7379,6 +7386,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepared transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
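As a reading aid for the Begin Prepare layout documented above, here is a
minimal C sketch of how a client might decode that message from a raw buffer.
It assumes the integer fields arrive in network byte order and that the GID is
NUL-terminated. The struct, the helper names and the 200-byte GID bound are
illustrative assumptions and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the decoded fields; not a PostgreSQL struct. */
typedef struct BeginPrepareMsg
{
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	char		gid[200];		/* assumed upper bound for this sketch */
} BeginPrepareMsg;

/* Read a big-endian 64-bit value. */
static uint64_t
read_be64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian 32-bit value. */
static uint32_t
read_be32(const unsigned char *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
		((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Returns 0 on success, -1 if buf is not a well-formed Begin Prepare. */
static int
parse_begin_prepare(const unsigned char *buf, size_t len, BeginPrepareMsg *msg)
{
	const size_t fixed = 1 + 8 + 8 + 8 + 4; /* type byte + fixed fields */
	const unsigned char *gid;
	const unsigned char *nul;
	size_t		gidlen;

	assert(buf != NULL && msg != NULL);

	if (len <= fixed || buf[0] != 'b')
		return -1;

	msg->prepare_lsn = read_be64(buf + 1);
	msg->end_lsn = read_be64(buf + 9);
	msg->prepare_time = (int64_t) read_be64(buf + 17);
	msg->xid = read_be32(buf + 25);

	/* The GID must be terminated inside the buffer and fit our array. */
	gid = buf + fixed;
	nul = memchr(gid, '\0', len - fixed);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= sizeof(msg->gid))
		return -1;
	memcpy(msg->gid, gid, gidlen + 1);
	return 0;
}
```

The Prepare, Commit Prepared and Rollback Prepared messages could be handled
the same way, with the extra flags byte read right after the type byte.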
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 367ac81..e9691ef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1433905 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at <command>PREPARE TRANSACTION</command> time. By default, a
+          transaction prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that replication has
+          successfully completed the initial table synchronization phase. This means
+          that even when <literal>two_phase</literal> is enabled for the subscription,
+          the internal two-phase state remains temporarily <quote>pending</quote> until
+          the initialization phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          for the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..6d3efb4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid the possibility of
+ * matching a prepared xact from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
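To make the matching rule in LookupGXact easier to follow, here is a small
standalone C sketch of the same idea: a GID match alone is not accepted, the
origin LSN and origin timestamp must agree as well. The PreparedXact struct
and lookup_prepared() are hypothetical simplifications; the real function
reads the 2PC file header under TwoPhaseStateLock, which this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a prepared-transaction entry. */
typedef struct PreparedXact
{
	const char *gid;
	uint64_t	origin_lsn;			/* LSN where the remote prepare ended */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
	bool		valid;				/* false while the GID is not yet valid */
} PreparedXact;

/*
 * Return true only if an entry matches on GID, origin LSN and origin
 * timestamp together. Entries that share the GID but differ in origin
 * data are treated as unrelated prepared transactions.
 */
static bool
lookup_prepared(const PreparedXact *xacts, int n, const char *gid,
				uint64_t prepare_end_lsn, int64_t prepare_time)
{
	assert(gid != NULL);

	for (int i = 0; i < n; i++)
	{
		/* Ignore not-yet-valid entries and other GIDs. */
		if (!xacts[i].valid || strcmp(xacts[i].gid, gid) != 0)
			continue;

		if (xacts[i].origin_lsn == prepare_end_lsn &&
			xacts[i].origin_timestamp == prepare_time)
			return true;
	}
	return false;
}
```

With two entries sharing a GID, only the one whose origin LSN and timestamp
both match is reported as found.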
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 75e195f..08d0295 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of the two_phase option. Doing so could
+			 * cause transactions to be missed and lead to an inconsistent
+			 * replica. See comments atop worker.c.
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -506,10 +557,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool		twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. It will be enabled once all the tables
+				 * are synced and ready. This avoids race conditions such as
+				 * prepared transactions being skipped because their changes
+				 * were not applied, owing to the checks in
+				 * should_apply_changes_for_rel() while tablesync for the
+				 * corresponding tables is in progress. See comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -816,7 +891,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -850,6 +926,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -873,7 +955,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -918,7 +1001,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -934,6 +1018,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -965,7 +1060,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(publist);
@@ -982,6 +1078,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1011,7 +1118,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user has requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
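The subscription-side rules spread over the hunks above (streaming and
two_phase being mutually exclusive, the initial catalog state, and the
REFRESH restriction) can be summarised in a short C sketch. The enum and
function names are hypothetical; the patch itself stores the state in
subtwophasestate using the LOGICALREP_TWOPHASE_STATE_* values, and the
has_tables argument stands in for the "tables != NIL" test.

```c
#include <stdbool.h>

/* Hypothetical mirror of the three two-phase states used by the patch. */
typedef enum TwoPhaseState
{
	TWOPHASE_DISABLED,
	TWOPHASE_PENDING,
	TWOPHASE_ENABLED
} TwoPhaseState;

/* two_phase = true and streaming = true may not be combined. */
static bool
options_compatible(bool two_phase, bool streaming)
{
	return !(two_phase && streaming);
}

/*
 * State recorded at CREATE SUBSCRIPTION time. Two-phase is enabled
 * up-front only when a slot is created, no data copy is requested and
 * the subscription already has tables; otherwise it stays pending
 * until table synchronization has finished.
 */
static TwoPhaseState
initial_twophase_state(bool two_phase, bool create_slot,
					   bool copy_data, bool has_tables)
{
	if (!two_phase)
		return TWOPHASE_DISABLED;
	if (create_slot && !copy_data && has_tables)
		return TWOPHASE_ENABLED;
	return TWOPHASE_PENDING;
}

/* REFRESH with copy_data is rejected once two-phase is fully enabled. */
static bool
refresh_allowed(TwoPhaseState state, bool copy_data)
{
	return !(state == TWOPHASE_ENABLED && copy_data);
}
```

Note that a subscription in the pending state can still run REFRESH with
copy_data, which is what lets a subscription created without tables pick
them up later.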
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eaa84a..2838b89 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -436,6 +437,19 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		/*
+		 * FIXME - 21/May. The below code is a temporary hack to check
+		 * for server version 140000, even though this two-phase feature did
+		 * not make it into the PG 14 release.
+		 *
+		 * When the PG 15 development officially starts someone will update the
+		 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+		 * to revisit this code to remove this hack and write the code properly.
+		 */
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 140000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -851,7 +865,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -868,6 +882,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
+		if (two_phase)
+			appendStringInfoString(&cmd, " TWO_PHASE");
+
 		switch (snapshot_action)
 		{
 			case CRS_EXPORT_SNAPSHOT:
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 453efc5..74df75e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing the
+				 * preparing of transactions that have locked [user] catalog
+				 * tables exclusively, but as of now we ask users not to do
+				 * such an operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index ffc6160..89d91c2 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,12 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= slot->data.two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +540,22 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +616,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
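The CreateDecodingContext change above can be read as a small decision
function: two-phase decoding is on only if the plugin supports it and either
the slot was created with two_phase or the option was given at stream start,
and in the latter case the slot is marked and the start LSN remembered. The
following C sketch restates that under simplified, hypothetical types
(SlotData, resolve_twophase); persisting the slot and updating the snapshot
builder are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the persistent slot data touched by the patch. */
typedef struct SlotData
{
	bool		two_phase;		/* slot allows decoding of prepared xacts */
	uint64_t	two_phase_at;	/* LSN from which two-phase is in effect */
} SlotData;

/*
 * Decide whether prepared transactions are decoded for this stream.
 * If two-phase becomes enabled on a slot that was not yet marked, mark
 * it, remember start_lsn, and report that the slot needs saving.
 */
static bool
resolve_twophase(bool plugin_supports, bool opt_given,
				 SlotData *slot, uint64_t start_lsn, bool *slot_dirty)
{
	bool		enabled;

	assert(slot != NULL && slot_dirty != NULL);

	enabled = plugin_supports && (slot->two_phase || opt_given);
	*slot_dirty = false;

	if (enabled && !slot->two_phase)
	{
		slot->two_phase = true;
		slot->two_phase_at = start_lsn;
		*slot_dirty = true;
	}
	return enabled;
}
```

Once the slot is marked, later calls leave two_phase_at untouched, so the
point at which two-phase decoding began is recorded only once.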
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index cb42fcb..2c191de 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point in time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
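The guard in replorigin_advance that the comment above describes amounts to
"never move the remembered remote LSN backwards unless explicitly asked to".
A minimal C sketch of that rule follows; advance_remote_lsn() is a
hypothetical helper, not a function from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return the remote LSN to remember after seeing 'incoming'. An older
 * incoming value (for example from a prepare that is replayed late,
 * together with its commit prepared) is ignored unless go_backward is set.
 */
static uint64_t
advance_remote_lsn(uint64_t current, uint64_t incoming, bool go_backward)
{
	if (go_backward || current < incoming)
		return incoming;
	return current;
}
```

For instance, with a current value of 200 an incoming value of 100 is
dropped, while the same call with go_backward set rewinds to 100.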
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 19e96f3..48239c0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2574,7 +2574,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2665,7 +2665,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2712,7 +2712,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2732,7 +2732,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2751,19 +2751,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2781,12 +2782,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..a14a3d6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions, that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot, need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index cc50eb8..9272f75 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready != NIL && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (table_states_not_ready == NIL && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1071,7 +1067,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1158,3 +1155,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, return false. Otherwise, all tablesyncs
+	 * are READY exactly when the not-ready list is empty.
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index bbb659d..31bce6b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ...
+ * REFRESH PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider the case where it was
+ * initially on and we have received some prepare, and then we turn it off: at
+ * commit time the server will send the entire transaction data along with the
+ * commit. With some more analysis, we can allow changing this option from off
+ * to on, but it is not clear that this alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -804,6 +880,191 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a BEGIN PREPARE message")));
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (prepare_data.prepare_lsn != remote_final_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+								 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+								 LSN_FORMAT_ARGS(remote_final_lsn))));
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here we always prepare the transaction even if no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	begin_replication_step();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* There is no transaction when COMMIT PREPARED is called */
+	begin_replication_step();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
+		begin_replication_step();
+		FinishPreparedTransaction(gid, false);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2082,6 +2343,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2561,6 +2838,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3062,6 +3342,24 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+
+	if (!TransactionIdIsValid(xid))
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("invalid two-phase transaction ID")));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3230,15 +3528,69 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+	/*
+	 * FIXME - 9/April. The below code is a temporary hack to set the protocol
+	 * version 3 (for two_phase) for server version 140000, even though this
+	 * feature did not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 *
+	 * e.g.
+	 * if >= 150000 use LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
+	 * else if >= 140000 use LOGICALREP_PROTO_STREAM_VERSION_NUM
+	 * else use LOGICALREP_PROTO_VERSION_NUM
+	 */
 	options.proto.logical.proto_version =
 		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		LOGICALREP_PROTO_TWOPHASE_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
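
To illustrate the GID format that TwoPhaseTransactionGid() builds in the worker.c hunk above, here is a minimal standalone sketch. It is not the patch's code: the typedefs, the `sketch_` names and the integer return convention are assumptions made so that it compiles without the PostgreSQL headers. Only the "pg_gid_%u_%u" format string is taken from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for PostgreSQL's Oid and TransactionId. */
typedef unsigned int SketchOid;
typedef unsigned int SketchXid;

#define SKETCH_INVALID_OID 0u
#define SKETCH_INVALID_XID 0u

/*
 * Write "pg_gid_<subid>_<xid>" into gid.
 *
 * Returns 0 on success and -1 if the xid is invalid or the buffer is too
 * small. (The patch instead raises an ERROR with
 * ERRCODE_PROTOCOL_VIOLATION for an invalid xid; the return code here is
 * an assumption of this sketch.)
 */
static int
sketch_two_phase_gid(SketchOid subid, SketchXid xid, char *gid, size_t szgid)
{
	int			n;

	assert(subid != SKETCH_INVALID_OID);

	if (xid == SKETCH_INVALID_XID)
		return -1;

	n = snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);

	/* snprintf reports the length it wanted; treat truncation as failure. */
	return (n < 0 || (size_t) n >= szgid) ? -1 : 0;
}
```

Because the subscription OID and the remote xid are both part of the name, two subscriptions applying the same remote transaction would get different GIDs on the subscriber.
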
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 63f108f..7a1d42a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -839,18 +948,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1269,3 +1368,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
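
The order of the two_phase checks in pgoutput_startup() above can be summarised with a small standalone sketch. The enum, the `sketch_` names and the idea of returning a result code are assumptions for illustration; the patch itself reports the two failure cases through ereport(ERROR) and records the outcome in ctx->twophase_opt_given.

```c
#include <assert.h>
#include <stdbool.h>

/* Value of LOGICALREP_PROTO_TWOPHASE_VERSION_NUM in this patch. */
#define SKETCH_PROTO_TWOPHASE_VERSION 3

/* Hypothetical result codes; the patch uses ereport(ERROR) for the failures. */
typedef enum
{
	TWOPHASE_NOT_REQUESTED,
	TWOPHASE_ACCEPTED,
	TWOPHASE_ERR_PROTO_TOO_OLD,
	TWOPHASE_ERR_NO_PLUGIN_SUPPORT
} SketchTwoPhaseCheck;

/*
 * Mirror the if / else-if chain: option absent, then protocol version,
 * then plugin support, and only then accept the option.
 */
static SketchTwoPhaseCheck
sketch_check_two_phase(bool requested, int proto_version, bool plugin_supports)
{
	/* Protocol versions start at 1 (LOGICALREP_PROTO_MIN_VERSION_NUM). */
	assert(proto_version >= 1);

	if (!requested)
		return TWOPHASE_NOT_REQUESTED;
	if (proto_version < SKETCH_PROTO_TWOPHASE_VERSION)
		return TWOPHASE_ERR_PROTO_TOO_OLD;
	if (!plugin_supports)
		return TWOPHASE_ERR_NO_PLUGIN_SUPPORT;
	return TWOPHASE_ACCEPTED;
}
```

Note that the protocol version is tested before plugin support, so a client asking for two_phase with proto_version 2 sees the version error even when the plugin lacks the callbacks.
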
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eead144..0910546 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -103,7 +103,6 @@ static SQLCmd *make_sqlcmd(void);
 %type <node>	plugin_opt_arg
 %type <str>		opt_slot var_name
 %type <boolval>	opt_temporary
-%type <boolval>	opt_two_phase
 %type <list>	create_slot_opt_list
 %type <defelt>	create_slot_opt
 
@@ -243,7 +242,7 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
 			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
@@ -372,11 +371,6 @@ opt_temporary:
 			| /* EMPTY */					{ $$ = false; }
 			;
 
-opt_two_phase:
-			K_TWO_PHASE						{ $$ = true; }
-			| /* EMPTY */					{ $$ = false; }
-			;
-
 opt_slot:
 			K_SLOT IDENT
 				{ $$ = $2; }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8c18b4e..33b85d8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -283,6 +283,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index faeea9f..9f0b13f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -370,7 +370,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 8f53cc7..8141311 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -50,6 +50,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4304,6 +4305,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4347,9 +4349,25 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	/*
+	 * FIXME - 21/May. The below code is a temporary hack to check for
+	 * server version 140000, even though this two-phase feature did
+	 * not make it into the PG 14 release.
+	 *
+	 * When the PG 15 development officially starts someone will update the
+	 * PG_VERSION_NUM (pg_config.h) to be 150000, and when that happens we need
+	 * to revisit this code to remove this hack and write the code properly.
+	 */
+	if (fout->remoteVersion >= 140000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4370,6 +4388,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4395,6 +4414,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4422,6 +4443,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4463,6 +4485,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 49e1b0a..d2fded5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -637,6 +637,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..6caa701 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,7 +6415,9 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary and streaming are only supported in v14 and higher.
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
@@ -6423,6 +6425,17 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/*
+		 * Two_phase is only supported in v15 and higher.
+		 *
+		 * FIXME: When PG15 development starts, change the following
+		 * 140000 to 150000
+		 */
+		if (pset.sversion >= 140000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 38af568..8f0a921 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2760,7 +2760,7 @@ psql_completion(const char *text, int start, int end)
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
 					  "enabled", "slot_name", "streaming",
-					  "synchronous_commit");
+					  "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 0060ebf..e84353e 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -94,6 +104,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
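
The tri-state introduced in pg_subscription.h above, together with the transition described in the ApplyWorkerMain() comment, can be sketched as a tiny state function. This is a sketch under assumptions: the `sketch_` names and the explicit has_tables parameter are hypothetical, and the patch performs the transition via AllTablesyncsReady() and UpdateTwoPhaseState() rather than a pure function.

```c
#include <assert.h>
#include <stdbool.h>

/* Same characters as the LOGICALREP_TWOPHASE_STATE_* macros in the patch. */
#define SKETCH_TWOPHASE_DISABLED 'd'
#define SKETCH_TWOPHASE_PENDING 'p'
#define SKETCH_TWOPHASE_ENABLED 'e'

/*
 * Return the state the apply worker would move to at start-up.
 *
 * Only PENDING can change, and only to ENABLED, once every tablesync has
 * reached READY. A subscription without tables stays PENDING, which is
 * what the comment in ApplyWorkerMain() describes.
 */
static char
sketch_next_two_phase_state(char state, bool has_tables,
							bool all_tablesyncs_ready)
{
	assert(state == SKETCH_TWOPHASE_DISABLED ||
		   state == SKETCH_TWOPHASE_PENDING ||
		   state == SKETCH_TWOPHASE_ENABLED);

	if (state == SKETCH_TWOPHASE_PENDING &&
		has_tables && all_tablesyncs_ready)
		return SKETCH_TWOPHASE_ENABLED;

	return state;
}
```
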
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..765e9b5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -88,6 +88,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index faa3a25..ebc43a0 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
+	bool		two_phase;
 	List	   *options;
 } CreateReplicationSlotCmd;
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..0b071a6 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_STREAMING command. We can't rely solely on the twophase
+	 * flag which only tells whether the plugin provided all the necessary
+	 * two-phase callbacks.
+	 *
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..e20f2da 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare, and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
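
The comment on LogicalRepRollbackPreparedTxnData above says the gid alone is not enough to decide whether a ROLLBACK PREPARED applies to a transaction the subscriber actually received. A minimal sketch of such a check follows. The body of LookupGXact() is not part of this portion of the patch, so the exact comparison below (gid, prepare end LSN and prepare time must all match) is an assumption, as are the struct and the `sketch_` names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of a prepared transaction known to the subscriber. */
typedef struct
{
	const char *gid;
	uint64_t	prepare_end_lsn;	/* stand-in for XLogRecPtr */
	int64_t		prepare_time;		/* stand-in for TimestampTz */
} SketchPreparedXact;

/*
 * True only if the local prepared transaction is the one the publisher is
 * rolling back: same gid AND same origin prepare LSN and timestamp.
 */
static bool
sketch_matches_prepared(const SketchPreparedXact *local, const char *gid,
						uint64_t prepare_end_lsn, int64_t prepare_time)
{
	assert(local != NULL && gid != NULL);

	return strcmp(local->gid, gid) == 0 &&
		local->prepare_end_lsn == prepare_end_lsn &&
		local->prepare_time == prepare_time;
}
```

With a check of this shape, a locally prepared transaction that merely happens to reuse the same identifier would not be rolled back by the apply worker.
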
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d8..d7c785b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -297,7 +297,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -636,7 +640,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 2eb7e3a..34d95ea 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -84,11 +84,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 09576c1..f054ac8 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+# Create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber. There will be 2
+# prepared transactions, one for each of the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index abdb083..cabc0bb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1388,12 +1388,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

#363Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#362)
1 attachment(s)

On Wed, Jun 23, 2021 at 4:10 PM Ajin Cherian <itsajin@gmail.com> wrote:

The first two patches look mostly good to me. I have combined them
into one and made some minor changes. (a) Removed opt_two_phase and
related code from repl_gram.y as that is not required for this version
of the patch. (b) made some changes in docs. Kindly check the attached
and let me know if you have any comments? I am planning to push this
first patch in the series tomorrow unless you or others have any
comments.

--
With Regards,
Amit Kapila.

Attachments:

0001-Allow-enabling-two-phase-option-via-replication-prot.patch (application/octet-stream)
From f64afb5a80c8dc12f7522909e690331536ec590b Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 29 Jun 2021 12:02:19 +0530
Subject: [PATCH] Allow enabling two-phase option via replication protocol.

Extend the replication command CREATE_REPLICATION_SLOT to support the
TWO_PHASE option. This will allow decoding commands like PREPARE
TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED for slots created with
this option. The decoding of the transaction happens at prepare command.

We have also added support of two-phase in pg_recvlogical via a new option
--two-phase.

This option will also be used by future patches that allow streaming of
two-phase transactions in subscribers. With this, the out-of-core logical
replication solutions can enable replication of two-phase transactions via
replication protocol.

Author: Ajin Cherian
Reviewed-By: Jeff Davis, Vignesh C, Amit Kapila
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/64b9f783c6e125f18f88fbc0c0234e34e71d8639.camel@j-davis.com
---
 doc/src/sgml/logicaldecoding.sgml             | 23 +++++++++-
 doc/src/sgml/protocol.sgml                    | 16 ++++++-
 doc/src/sgml/ref/pg_recvlogical.sgml          | 16 +++++++
 src/backend/replication/repl_gram.y           |  6 +++
 src/backend/replication/repl_scanner.l        |  1 +
 src/backend/replication/walsender.c           | 18 ++++++--
 src/bin/pg_basebackup/pg_basebackup.c         |  2 +-
 src/bin/pg_basebackup/pg_receivewal.c         |  2 +-
 src/bin/pg_basebackup/pg_recvlogical.c        | 19 +++++++-
 src/bin/pg_basebackup/streamutil.c            |  6 ++-
 src/bin/pg_basebackup/streamutil.h            |  2 +-
 src/bin/pg_basebackup/t/030_pg_recvlogical.pl | 45 ++++++++++++++++++-
 12 files changed, 143 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 5b8065901a..985db5ca11 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -144,16 +144,19 @@ postgres=# SELECT pg_drop_replication_slot('regression_slot');
 </programlisting>
 
    <para>
-    The following example shows how logical decoding is controlled over the
+    The following examples show how logical decoding is controlled over the
     streaming replication protocol, using the
     program <xref linkend="app-pgrecvlogical"/> included in the PostgreSQL
     distribution.  This requires that client authentication is set up to allow
     replication connections
     (see <xref linkend="streaming-replication-authentication"/>) and
     that <varname>max_wal_senders</varname> is set sufficiently high to allow
-    an additional connection.
+    an additional connection.  The second example shows how to stream two-phase
+    transactions.  Before you use two-phase commands, you must set
+    <xref linkend="guc-max-prepared-transactions"/> to at least 1.
    </para>
 <programlisting>
+Example 1:
 $ pg_recvlogical -d postgres --slot=test --create-slot
 $ pg_recvlogical -d postgres --slot=test --start -f -
 <keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
@@ -164,6 +167,22 @@ table public.data: INSERT: id[integer]:4 data[text]:'4'
 COMMIT 693
 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
 $ pg_recvlogical -d postgres --slot=test --drop-slot
+
+Example 2:
+$ pg_recvlogical -d postgres --slot=test --create-slot --two-phase
+$ pg_recvlogical -d postgres --slot=test --start -f -
+<keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
+$ psql -d postgres -c "BEGIN;INSERT INTO data(data) VALUES('5');PREPARE TRANSACTION 'test';"
+$ fg
+BEGIN 694
+table public.data: INSERT: id[integer]:5 data[text]:'5'
+PREPARE TRANSACTION 'test', txid 694
+<keycombo action="simul"><keycap>Control</keycap><keycap>Z</keycap></keycombo>
+$ psql -d postgres -c "COMMIT PREPARED 'test';"
+$ fg
+COMMIT PREPARED 'test', txid 694
+<keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo>
+$ pg_recvlogical -d postgres --slot=test --drop-slot
 </programlisting>
 
   <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 01e87617f4..a3562f3d08 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1914,7 +1914,7 @@ The commands accepted in replication mode are:
   </varlistentry>
 
   <varlistentry id="protocol-replication-create-slot" xreflabel="CREATE_REPLICATION_SLOT">
-   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+   <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> | <literal>TWO_PHASE</literal> ] }
      <indexterm><primary>CREATE_REPLICATION_SLOT</primary></indexterm>
     </term>
     <listitem>
@@ -1955,6 +1955,20 @@ The commands accepted in replication mode are:
        </listitem>
       </varlistentry>
 
+      <varlistentry>
+       <term><literal>TWO_PHASE</literal></term>
+       <listitem>
+        <para>
+         Specify that this logical replication slot supports decoding of two-phase
+         transactions. With this option, two-phase commands like
+         <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal>
+         and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted.
+         The transaction will be decoded and transmitted at
+         <literal>PREPARE TRANSACTION</literal> time.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry>
        <term><literal>RESERVE_WAL</literal></term>
        <listitem>
diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index 6b1d98d06e..d0972a1e43 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -65,6 +65,11 @@ PostgreSQL documentation
         <option>--plugin</option>, for the database specified
         by <option>--dbname</option>.
        </para>
+
+       <para>
+        The <option>--two-phase</option> option can be specified with
+        <option>--create-slot</option> to enable two-phase decoding.
+       </para>
       </listitem>
      </varlistentry>
 
@@ -256,6 +261,17 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+       <term><option>-t</option></term>
+       <term><option>--two-phase</option></term>
+       <listitem>
+       <para>
+        Enables two-phase decoding. This option should only be used with
+        <option>--create-slot</option>
+       </para>
+       </listitem>
+     </varlistentry>
+
      <varlistentry>
        <term><option>-v</option></term>
        <term><option>--verbose</option></term>
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index eb283a8632..e1e8ec29cc 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -84,6 +84,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_SLOT
 %token K_RESERVE_WAL
 %token K_TEMPORARY
+%token K_TWO_PHASE
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
@@ -283,6 +284,11 @@ create_slot_opt:
 				  $$ = makeDefElem("reserve_wal",
 								   (Node *)makeInteger(true), -1);
 				}
+			| K_TWO_PHASE
+				{
+				  $$ = makeDefElem("two_phase",
+								   (Node *)makeInteger(true), -1);
+				}
 			;
 
 /* DROP_REPLICATION_SLOT slot */
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index dcc3c3fc51..c038a636c3 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -103,6 +103,7 @@ RESERVE_WAL			{ return K_RESERVE_WAL; }
 LOGICAL				{ return K_LOGICAL; }
 SLOT				{ return K_SLOT; }
 TEMPORARY			{ return K_TEMPORARY; }
+TWO_PHASE			{ return K_TWO_PHASE; }
 EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3224536356..92c755f346 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -863,11 +863,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 static void
 parseCreateReplSlotOptions(CreateReplicationSlotCmd *cmd,
 						   bool *reserve_wal,
-						   CRSSnapshotAction *snapshot_action)
+						   CRSSnapshotAction *snapshot_action,
+						   bool *two_phase)
 {
 	ListCell   *lc;
 	bool		snapshot_action_given = false;
 	bool		reserve_wal_given = false;
+	bool		two_phase_given = false;
 
 	/* Parse options */
 	foreach(lc, cmd->options)
@@ -905,6 +907,15 @@ parseCreateReplSlotOptions(CreateReplicationSlotCmd *cmd,
 			reserve_wal_given = true;
 			*reserve_wal = true;
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_given || cmd->kind != REPLICATION_KIND_LOGICAL)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_given = true;
+			*two_phase = true;
+		}
 		else
 			elog(ERROR, "unrecognized option: %s", defel->defname);
 	}
@@ -920,6 +931,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 	char		xloc[MAXFNAMELEN];
 	char	   *slot_name;
 	bool		reserve_wal = false;
+	bool		two_phase = false;
 	CRSSnapshotAction snapshot_action = CRS_EXPORT_SNAPSHOT;
 	DestReceiver *dest;
 	TupOutputState *tstate;
@@ -929,7 +941,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 	Assert(!MyReplicationSlot);
 
-	parseCreateReplSlotOptions(cmd, &reserve_wal, &snapshot_action);
+	parseCreateReplSlotOptions(cmd, &reserve_wal, &snapshot_action, &two_phase);
 
 	/* setup state for WalSndSegmentOpen */
 	sendTimeLineIsHistoric = false;
@@ -954,7 +966,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		 */
 		ReplicationSlotCreate(cmd->slotname, true,
 							  cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL,
-							  false);
+							  two_phase);
 	}
 
 	if (cmd->kind == REPLICATION_KIND_LOGICAL)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 16d8929b23..8bb0acf498 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -646,7 +646,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 	if (temp_replication_slot || create_slot)
 	{
 		if (!CreateReplicationSlot(param->bgconn, replication_slot, NULL,
-								   temp_replication_slot, true, true, false))
+								   temp_replication_slot, true, true, false, false))
 			exit(1);
 
 		if (verbose)
diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c
index 0d15012c29..c1334fad35 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -741,7 +741,7 @@ main(int argc, char **argv)
 			pg_log_info("creating replication slot \"%s\"", replication_slot);
 
 		if (!CreateReplicationSlot(conn, replication_slot, NULL, false, true, false,
-								   slot_exists_ok))
+								   slot_exists_ok, false))
 			exit(1);
 		exit(0);
 	}
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 5efec160e8..76bd153fac 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -35,6 +35,7 @@
 /* Global Options */
 static char *outfile = NULL;
 static int	verbose = 0;
+static bool two_phase = false;
 static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;	/* 10 sec = default */
 static int	fsync_interval = 10 * 1000; /* 10 sec = default */
@@ -93,6 +94,7 @@ usage(void)
 	printf(_("  -s, --status-interval=SECS\n"
 			 "                         time between status packets sent to server (default: %d)\n"), (standby_message_timeout / 1000));
 	printf(_("  -S, --slot=SLOTNAME    name of the logical replication slot\n"));
+	printf(_("  -t, --two-phase        enable two-phase decoding when creating a slot\n"));
 	printf(_("  -v, --verbose          output verbose messages\n"));
 	printf(_("  -V, --version          output version information, then exit\n"));
 	printf(_("  -?, --help             show this help, then exit\n"));
@@ -678,6 +680,7 @@ main(int argc, char **argv)
 		{"fsync-interval", required_argument, NULL, 'F'},
 		{"no-loop", no_argument, NULL, 'n'},
 		{"verbose", no_argument, NULL, 'v'},
+		{"two-phase", no_argument, NULL, 't'},
 		{"version", no_argument, NULL, 'V'},
 		{"help", no_argument, NULL, '?'},
 /* connection options */
@@ -726,7 +729,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "E:f:F:nvd:h:p:U:wWI:o:P:s:S:",
+	while ((c = getopt_long(argc, argv, "E:f:F:nvtd:h:p:U:wWI:o:P:s:S:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -749,6 +752,9 @@ main(int argc, char **argv)
 			case 'v':
 				verbose++;
 				break;
+			case 't':
+				two_phase = true;
+				break;
 /* connection options */
 			case 'd':
 				dbname = pg_strdup(optarg);
@@ -920,6 +926,15 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (two_phase && !do_create_slot)
+	{
+		pg_log_error("--two-phase may only be specified with --create-slot");
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}
+
+
 #ifndef WIN32
 	pqsignal(SIGINT, sigint_handler);
 	pqsignal(SIGHUP, sighup_handler);
@@ -976,7 +991,7 @@ main(int argc, char **argv)
 			pg_log_info("creating replication slot \"%s\"", replication_slot);
 
 		if (!CreateReplicationSlot(conn, replication_slot, plugin, false,
-								   false, false, slot_exists_ok))
+								   false, false, slot_exists_ok, two_phase))
 			exit(1);
 		startpos = InvalidXLogRecPtr;
 	}
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 99daf0e972..f5b3b476e5 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -486,7 +486,7 @@ RunIdentifySystem(PGconn *conn, char **sysid, TimeLineID *starttli,
 bool
 CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 					  bool is_temporary, bool is_physical, bool reserve_wal,
-					  bool slot_exists_ok)
+					  bool slot_exists_ok, bool two_phase)
 {
 	PQExpBuffer query;
 	PGresult   *res;
@@ -495,6 +495,7 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 
 	Assert((is_physical && plugin == NULL) ||
 		   (!is_physical && plugin != NULL));
+	Assert(!(two_phase && is_physical));
 	Assert(slot_name != NULL);
 
 	/* Build query */
@@ -510,6 +511,9 @@ CreateReplicationSlot(PGconn *conn, const char *slot_name, const char *plugin,
 	else
 	{
 		appendPQExpBuffer(query, " LOGICAL \"%s\"", plugin);
+		if (two_phase && PQserverVersion(conn) >= 150000)
+			appendPQExpBufferStr(query, " TWO_PHASE");
+
 		if (PQserverVersion(conn) >= 100000)
 			/* pg_recvlogical doesn't use an exported snapshot, so suppress */
 			appendPQExpBufferStr(query, " NOEXPORT_SNAPSHOT");
diff --git a/src/bin/pg_basebackup/streamutil.h b/src/bin/pg_basebackup/streamutil.h
index 10f87ad0c1..504803b976 100644
--- a/src/bin/pg_basebackup/streamutil.h
+++ b/src/bin/pg_basebackup/streamutil.h
@@ -34,7 +34,7 @@ extern PGconn *GetConnection(void);
 extern bool CreateReplicationSlot(PGconn *conn, const char *slot_name,
 								  const char *plugin, bool is_temporary,
 								  bool is_physical, bool reserve_wal,
-								  bool slot_exists_ok);
+								  bool slot_exists_ok, bool two_phase);
 extern bool DropReplicationSlot(PGconn *conn, const char *slot_name);
 extern bool RunIdentifySystem(PGconn *conn, char **sysid,
 							  TimeLineID *starttli,
diff --git a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
index 53f41814b0..bbbf9e21db 100644
--- a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
+++ b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use TestLib;
 use PostgresNode;
-use Test::More tests => 15;
+use Test::More tests => 20;
 
 program_help_ok('pg_recvlogical');
 program_version_ok('pg_recvlogical');
@@ -22,6 +22,7 @@ max_replication_slots = 4
 max_wal_senders = 4
 log_min_messages = 'debug1'
 log_error_verbosity = verbose
+max_prepared_transactions = 10
 });
 $node->dump_info;
 $node->start;
@@ -63,3 +64,45 @@ $node->command_ok(
 		'--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'
 	],
 	'replayed a transaction');
+
+$node->command_ok(
+	[
+		'pg_recvlogical',           '-S',
+		'test',                     '-d',
+		$node->connstr('postgres'), '--drop-slot'
+	],
+	'slot dropped');
+
+#test with two-phase option enabled
+$node->command_ok(
+	[
+		'pg_recvlogical',           '-S',
+		'test',                     '-d',
+		$node->connstr('postgres'), '--create-slot', '--two-phase'
+	],
+	'slot with two-phase created');
+
+$slot = $node->slot('test');
+isnt($slot->{'restart_lsn'}, '', 'restart lsn is defined for new slot');
+
+$node->safe_psql('postgres',
+	"BEGIN; INSERT INTO test_table values (11); PREPARE TRANSACTION 'test'");
+$node->safe_psql('postgres',
+	"COMMIT PREPARED 'test'");
+$nextlsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn()');
+chomp($nextlsn);
+
+$node->command_fails(
+	[
+		'pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'),
+		'--start', '--endpos', "$nextlsn", '--two-phase', '--no-loop', '-f', '-'
+	],
+	'incorrect usage');
+
+$node->command_ok(
+	[
+		'pg_recvlogical', '-S', 'test', '-d', $node->connstr('postgres'),
+		'--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-'
+	],
+	'replayed a two-phase transaction');
-- 
2.28.0.windows.1
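
For readers skimming the patch, the streamutil.c hunk above can be summarized with a small standalone sketch. This is not the committed code: `build_create_slot_cmd` is a hypothetical helper that uses `snprintf` in place of the `PQExpBuffer` routines, takes the server version as a plain integer (in the `PQserverVersion()` encoding) instead of a `PGconn`, and leaves out the `TEMPORARY` and `RESERVE_WAL` handling. It only illustrates the ordering of the keywords and the version gates visible in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): format a
 * CREATE_REPLICATION_SLOT command into buf.
 *
 * server_version uses the PQserverVersion() encoding, e.g. 150000 for v15.
 * Returns whatever snprintf returns (the length the full command needs).
 */
int
build_create_slot_cmd(char *buf, size_t buflen, const char *slot_name,
                      const char *plugin, bool is_physical,
                      bool two_phase, int server_version)
{
    /* Same sanity checks as the Asserts in the patched function. */
    assert(slot_name != NULL);
    assert((is_physical && plugin == NULL) ||
           (!is_physical && plugin != NULL));
    assert(!(two_phase && is_physical));

    if (is_physical)
        return snprintf(buf, buflen,
                        "CREATE_REPLICATION_SLOT \"%s\" PHYSICAL",
                        slot_name);

    /*
     * TWO_PHASE is appended only when requested and only for servers that
     * understand it; NOEXPORT_SNAPSHOT is appended for v10 and later.
     */
    return snprintf(buf, buflen,
                    "CREATE_REPLICATION_SLOT \"%s\" LOGICAL \"%s\"%s%s",
                    slot_name, plugin,
                    (two_phase && server_version >= 150000) ? " TWO_PHASE" : "",
                    (server_version >= 100000) ? " NOEXPORT_SNAPSHOT" : "");
}
```

One consequence of gating on the version inside the builder, as the hunk does, is that a request for two-phase against an older server produces a command without TWO_PHASE and no error from this function.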

#364Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#363)

On Tue, Jun 29, 2021 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 23, 2021 at 4:10 PM Ajin Cherian <itsajin@gmail.com> wrote:

The first two patches look mostly good to me. I have combined them
into one and made some minor changes. (a) Removed opt_two_phase and
related code from repl_gram.y as that is not required for this version
of the patch. (b) made some changes in docs. Kindly check the attached
and let me know if you have any comments? I am planning to push this
first patch in the series tomorrow unless you or others have any
comments.

The patch applies cleanly and the tests pass. I reviewed the patch and have
no comments; it looks good.

regards,
Ajin Cherian
Fujitsu Australia

#365vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#363)

On Tue, Jun 29, 2021 at 12:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 23, 2021 at 4:10 PM Ajin Cherian <itsajin@gmail.com> wrote:

The first two patches look mostly good to me. I have combined them
into one and made some minor changes. (a) Removed opt_two_phase and
related code from repl_gram.y as that is not required for this version
of the patch. (b) made some changes in docs. Kindly check the attached
and let me know if you have any comments? I am planning to push this
first patch in the series tomorrow unless you or others have any
comments.

Thanks for the updated patch. It applies cleanly and the tests pass.
If you are OK with it, one of the documentation changes could be
adjusted slightly while committing:
+       <para>
+        Enables two-phase decoding. This option should only be used with
+        <option>--create-slot</option>
+       </para>
to:
+       <para>
+        Enables two-phase decoding. This option should only be specified with
+        <option>--create-slot</option> option.
+       </para>

Regards,
Vignesh
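
The documentation wording discussed above corresponds to a check in pg_recvlogical.c that rejects --two-phase unless --create-slot is also given. A minimal standalone sketch of that rule follows. The names `RecvOptions`, `parse_options` and `check_options` are hypothetical and exist only for illustration; the patch itself uses `getopt_long` and file-level variables, and reports the problem through `pg_log_error`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the flags this sketch cares about. */
typedef struct RecvOptions
{
    bool        do_create_slot;
    bool        do_start_slot;
    bool        two_phase;
} RecvOptions;

/* Scan argv for the flags of interest; other arguments are ignored here. */
void
parse_options(int argc, char **argv, RecvOptions *opts)
{
    assert(opts != NULL);
    memset(opts, 0, sizeof(*opts));

    for (int i = 1; i < argc; i++)
    {
        if (strcmp(argv[i], "--create-slot") == 0)
            opts->do_create_slot = true;
        else if (strcmp(argv[i], "--start") == 0)
            opts->do_start_slot = true;
        else if (strcmp(argv[i], "-t") == 0 ||
                 strcmp(argv[i], "--two-phase") == 0)
            opts->two_phase = true;
    }
}

/* Returns NULL when the combination is acceptable, else an error message. */
const char *
check_options(const RecvOptions *opts)
{
    if (opts->two_phase && !opts->do_create_slot)
        return "--two-phase may only be specified with --create-slot";
    return NULL;
}
```

Under these assumptions, `--create-slot --two-phase` passes the check, while `--start --two-phase` is rejected, which is the case the new `command_fails` test in 030_pg_recvlogical.pl exercises.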

#366Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#365)

On Tue, Jun 29, 2021 at 5:31 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, Jun 29, 2021 at 12:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 23, 2021 at 4:10 PM Ajin Cherian <itsajin@gmail.com> wrote:

The first two patches look mostly good to me. I have combined them
into one and made some minor changes. (a) Removed opt_two_phase and
related code from repl_gram.y as that is not required for this version
of the patch. (b) made some changes in docs. Kindly check the attached
and let me know if you have any comments? I am planning to push this
first patch in the series tomorrow unless you or others have any
comments.

Thanks for the updated patch. It applies cleanly and the tests pass.
If you are OK with it, one of the documentation changes could be
reworded slightly while committing:

Pushed the patch after taking care of your suggestion. Now, the next
step is to rebase the remaining patches and adapt some of the checks
to PG-15.

--
With Regards,
Amit Kapila.

#367Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#366)
3 attachment(s)

On Wed, Jun 30, 2021 at 6:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Pushed the patch after taking care of your suggestion. Now, the next
step is to rebase the remaining patches and adapt some of the checks
to PG-15.

Please find attached the latest patch set v91*

Differences from v90* are:

* This is the first patch set for PG15

* Rebased to HEAD @ today.

* Now the patch set has only 3 patches again because v90-0001,
v90-0002 are already pushed [1]

* Bumped all relevant server version checks to 150000
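
As a rough illustration of what a check against 150000 means (a sketch
only; the helper names below are hypothetical and are not taken from the
patch): for PostgreSQL 10 and later the numeric server version is
major * 10000 + minor, so 150000 is the first version number of the PG15
series.

```c
#include <stdbool.h>

/* First version number of the PostgreSQL 15 series (15.0 -> 150000). */
#define TWO_PHASE_MIN_SERVER_VERSION 150000

/*
 * Hypothetical helper: compose a numeric version from major and minor.
 * Valid for PostgreSQL 10 and later only.
 */
static int
make_server_version(int major, int minor)
{
	return major * 10000 + minor;
}

/*
 * Hypothetical helper: true if a server reporting this version number is
 * new enough for a feature gated on 150000.
 */
static bool
supports_two_phase(int server_version)
{
	return server_version >= TWO_PHASE_MIN_SERVER_VERSION;
}
```

With these helpers, a 14.4 server (140004) fails the check and any 15.x
server passes it.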

----
[1]: https://github.com/postgres/postgres/commit/cda03cfed6b8bd5f64567bccbc9578fba035691e
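
For anyone reviewing the protocol.sgml part of v91-0001, the Begin
Prepare layout it documents (Byte1('b'), three Int64 fields, an Int32
xid, then the GID string) can be sketched as a small standalone decoder.
This is only an illustration: the struct and function names are
hypothetical, it is not the code in proto.c, and it assumes integers are
sent in network byte order and the GID is null-terminated, as for other
protocol messages.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the documented Begin Prepare fields. */
typedef struct BeginPrepareMsg
{
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* transaction id */
	const char *gid;			/* points into the input buffer */
} BeginPrepareMsg;

/* Read a big-endian unsigned integer of nbytes bytes (nbytes <= 8). */
static uint64_t
read_be(const unsigned char *p, size_t nbytes)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Decode a Begin Prepare message. Returns 0 on success, or -1 if the
 * buffer is too short, has the wrong type byte, or the GID is not
 * null-terminated within the buffer.
 */
static int
decode_begin_prepare(const unsigned char *buf, size_t len,
					 BeginPrepareMsg *out)
{
	/* type byte + three Int64 fields + one Int32 field */
	const size_t fixed = 1 + 8 + 8 + 8 + 4;

	if (len <= fixed || buf[0] != 'b')
		return -1;
	if (memchr(buf + fixed, '\0', len - fixed) == NULL)
		return -1;

	out->prepare_lsn = read_be(buf + 1, 8);
	out->end_lsn = read_be(buf + 9, 8);
	out->prepare_time = (int64_t) read_be(buf + 17, 8);
	out->xid = (uint32_t) read_be(buf + 25, 4);
	out->gid = (const char *) (buf + fixed);
	return 0;
}
```

The Prepare, Commit Prepared and Rollback Prepared messages differ in
their type byte and carry an extra flags byte (and, for Rollback
Prepared, a second timestamp), so a decoder for those would need
different offsets.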

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v91-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 4fbc02dfdacf7631acdd2d62cc7a5de62e3fbf55 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 30 Jun 2021 18:51:38 +1000
Subject: [PATCH v91] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
transactions. We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the following operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase enabled.

* CREATE/ALTER SUBSCRIPTION which tries to set options two_phase=true and streaming=true at the same time.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 291 ++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 148 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  31 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 +++++++++--
 src/backend/replication/logical/worker.c           | 351 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |   2 +-
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  14 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 43 files changed, 2395 insertions(+), 197 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d..0235639 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7643,6 +7643,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index a3562f3..e8cb78f 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2811,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2871,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction 
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7391,6 +7398,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepared transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index b3d1731..a6f9944 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1433905 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled then the decoded transactions are sent
+          to the subscriber on the PREPARE TRANSACTION. By default, the transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..6d3efb4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we are ensuring to match origin_lsn and
+ * origin_timestamp of prepared xact to avoid the possibility of a match of
+ * prepared xact from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index b862e59..4cfd763 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * missing of transactions and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -506,10 +557,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables are in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -816,7 +891,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -850,6 +926,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -873,7 +955,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -918,7 +1001,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -934,6 +1018,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1058,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				publist = merge_publications(sub->publications, stmt->publication, isadd, stmt->subname);
 
@@ -982,6 +1078,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1011,7 +1118,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eaa84a..19ea159 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -436,6 +437,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 150000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -851,7 +856,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -868,6 +873,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
+		if (two_phase)
+			appendStringInfoString(&cmd, " TWO_PHASE");
+
 		switch (snapshot_action)
 		{
 			case CRS_EXPORT_SNAPSHOT:
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 453efc5..74df75e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing preparing
+				 * transactions that have locked [user] catalog tables
+				 * exclusively but as of now we ask users not to do such an
+				 * operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d536a5f..d61ef4c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,12 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= slot->data.two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +540,22 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +616,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index cb42fcb..2c191de 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b8c5e2a..9f80794 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2576,7 +2576,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2667,7 +2667,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2714,7 +2714,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2734,7 +2734,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2753,19 +2753,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2783,12 +2784,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..a14a3d6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions, that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot, need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 682c107..aa3fce0 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready != NIL && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (table_states_not_ready == NIL && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1071,7 +1067,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1158,3 +1155,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index bbb659d..e48b821 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two-phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because, say, it
+ * was prior to the initial consistent point, but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker, as the sync location for
+ * it will already be ahead of the apply worker's current location. This would
+ * lead to an "empty prepare", because later when the apply worker does the
+ * commit prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until two_phase is properly available (ENABLED), the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING), it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING, it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares, but we ensure that such prepares are sent along with commit
+ * prepared; see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider the case where it was
+ * initially on and we have received some prepare; if we then turn it off, at
+ * commit time the server will send the entire transaction data along with the
+ * commit. With some more analysis, we could allow changing this option from
+ * off to on, but it is not clear that this alone would be useful.
+ *
+ * Finally, to avoid the problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get a prepare of the same GID more than once in genuine cases where
+ * we have defined multiple subscriptions for publications on the same server
+ * and the prepared transaction has operations on tables subscribed to by
+ * those subscriptions. For such cases, if we use the GID sent by the
+ * publisher, one of the prepares will be successful and the others will fail,
+ * in which case the server will send them again. Now, this can lead to a
+ * deadlock if the user has set synchronous_standby_names for all the
+ * subscriptions on the subscriber. To avoid such deadlocks, we generate a
+ * unique GID (consisting of the subscription oid and the xid of the prepared
+ * transaction) for each prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -804,6 +880,191 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a BEGIN PREPARE message")));
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (prepare_data.prepare_lsn != remote_final_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+								 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+								 LSN_FORMAT_ARGS(remote_final_lsn))));
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here we always prepare the transaction even if no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX: We could optimize this so that at commit prepared time we first
+	 * check whether we have prepared the transaction or not, but that doesn't
+	 * seem worthwhile because such cases shouldn't be common.
+	 */
+	begin_replication_step();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* There is no transaction when COMMIT PREPARED is called */
+	begin_replication_step();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
+		begin_replication_step();
+		FinishPreparedTransaction(gid, false);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2082,6 +2343,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2561,6 +2838,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3062,6 +3342,24 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+
+	if (!TransactionIdIsValid(xid))
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("invalid two-phase transaction ID")));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3072,6 +3370,7 @@ ApplyWorkerMain(Datum main_arg)
 	XLogRecPtr	origin_startpos;
 	char	   *myslotname;
 	WalRcvStreamOptions options;
+	int			server_version;
 
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
@@ -3230,15 +3529,59 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+
+	server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
 	options.proto.logical.proto_version =
-		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
+		server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
+		LOGICALREP_PROTO_VERSION_NUM;
+
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg_internal("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
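
For anyone who wants to check the subscriber-side GID scheme outside the server, the following standalone sketch reproduces the format string used by TwoPhaseTransactionGid in the worker.c changes above. It is only an illustration under stated assumptions: sketch_twophase_gid() is a hypothetical name, plain uint32_t stands in for Oid and TransactionId, and a -1 return replaces the ereport(ERROR) of the real function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for TwoPhaseTransactionGid(): build the GID
 * "pg_gid_<subscription oid>_<remote xid>" in the caller's buffer.
 *
 * Returns the snprintf() result, or -1 when xid is 0 (the invalid
 * transaction id), where the patch raises a protocol-violation error.
 */
int
sketch_twophase_gid(uint32_t subid, uint32_t xid, char *gid, size_t szgid)
{
	assert(gid != NULL && szgid > 0);

	if (xid == 0)
		return -1;

	return snprintf(gid, szgid, "pg_gid_%u_%u",
					(unsigned int) subid, (unsigned int) xid);
}
```

Because the subscription oid is part of the string, two subscriptions that receive the same remote transaction end up with different GIDs, which is the property the comment atop worker.c relies on to avoid the deadlock.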
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index abd5217..f63e17e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -839,18 +948,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1270,3 +1369,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
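
The option checks added to pgoutput above are spread over parse_output_parameters and pgoutput_startup, so here is a minimal standalone sketch of the combined decision, in the order the patch applies it. Nothing here is from the patch itself: the enum, sketch_check_two_phase() and its parameters are hypothetical, and the minimum protocol version is passed in as an argument because the value of LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is not shown in this part of the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes standing in for the ereport(ERROR) branches. */
typedef enum SketchTwoPhaseCheck
{
	SKETCH_2PC_NOT_REQUESTED,	/* two_phase option absent or false */
	SKETCH_2PC_ACCEPTED,		/* option given and usable */
	SKETCH_2PC_CONFLICTS_WITH_STREAMING,
	SKETCH_2PC_PROTO_TOO_OLD,
	SKETCH_2PC_PLUGIN_UNSUPPORTED
} SketchTwoPhaseCheck;

/*
 * Hypothetical helper mirroring the checks in the patch: first the
 * two_phase/streaming exclusion, then the protocol version, then whether
 * the decoding context has the two-phase callbacks available.
 */
SketchTwoPhaseCheck
sketch_check_two_phase(bool two_phase, bool streaming,
					   int proto_version, int min_twophase_proto,
					   bool ctx_twophase)
{
	assert(min_twophase_proto > 0);

	if (two_phase && streaming)
		return SKETCH_2PC_CONFLICTS_WITH_STREAMING;
	if (!two_phase)
		return SKETCH_2PC_NOT_REQUESTED;
	if (proto_version < min_twophase_proto)
		return SKETCH_2PC_PROTO_TOO_OLD;
	if (!ctx_twophase)
		return SKETCH_2PC_PLUGIN_UNSUPPORTED;

	return SKETCH_2PC_ACCEPTED;
}
```

Only the SKETCH_2PC_ACCEPTED outcome corresponds to setting ctx->twophase_opt_given = true in the patch; every error outcome corresponds to one of the ereport(ERROR) calls.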
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index e1e8ec2..0910546 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -242,7 +242,7 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
 			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8c18b4e..33b85d8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -283,6 +283,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2be9ad9..9a2bc37 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -370,7 +370,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 3211521..912144c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4320,6 +4321,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4363,9 +4365,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 150000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4386,6 +4395,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4411,6 +4421,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4438,6 +4450,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4479,6 +4492,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index ba9bc6d..efb8c30 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..28cf352 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,7 +6415,9 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary and streaming are only supported in v14 and higher.
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
@@ -6423,6 +6425,14 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/*
+		 * Two_phase is only supported in v15 and higher.
+		 */
+		if (pset.sversion >= 150000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 0ebd5aa..d6bf725 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2764,7 +2764,7 @@ psql_completion(const char *text, int start, int end)
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
 					  "enabled", "slot_name", "streaming",
-					  "synchronous_commit");
+					  "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 750d469..6ffa0f8 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See the comments atop worker.c for more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -92,6 +102,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 4d20563..632381b 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -87,6 +87,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index faa3a25..ebc43a0 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
+	bool		two_phase;
 	List	   *options;
 } CreateReplicationSlotCmd;
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..0b071a6 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 *
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_STREAMING command. We can't rely solely on the twophase
+	 * flag which only tells whether the plugin provided all the necessary
+	 * two-phase callbacks.
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..e20f2da 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction, in which case it can apply the rollback;
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d8..d7c785b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -297,7 +297,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -636,7 +640,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 2eb7e3a..34d95ea 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -84,11 +84,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 57f7dd9..ad6b4e4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+# Create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber. There will be 2
+# prepared transactions, one for each of the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 64c06cf..ee3a114 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1390,12 +1390,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

Attachment: v91-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From eec0842bf663a41afaf90991c3ed2075881eb556 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 30 Jun 2021 19:40:20 +1000
Subject: [PATCH v91] Skip empty transactions for logical replication.

Currently, logical replication sends every transaction to the subscriber
even when the transaction is empty (because it contains no changes from
the selected publications). Building and transmitting these empty
transactions wastes CPU cycles and network bandwidth.

This patch addresses the problem by postponing the BEGIN / BEGIN PREPARE
message until the first change is sent. When processing a COMMIT or
PREPARE message, if no change has been sent for that transaction, the
COMMIT or PREPARE message is not sent either. As a result, pgoutput skips
the BEGIN / COMMIT or BEGIN PREPARE / PREPARE messages for empty
transactions.
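
The deferred-BEGIN idea described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the code in this
patch: SketchTxn, sketch_begin(), sketch_change() and sketch_commit() are
hypothetical names, and "sending" a message is modelled as bumping a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-transaction state; not the struct used by pgoutput. */
typedef struct SketchTxn
{
	bool		sent_begin;		/* has BEGIN been emitted yet? */
	int			messages_out;	/* number of messages "sent" downstream */
} SketchTxn;

/* BEGIN callback: remember the transaction, but send nothing yet. */
static void
sketch_begin(SketchTxn *txn)
{
	assert(txn != NULL);
	txn->sent_begin = false;
	txn->messages_out = 0;
}

/* Change callback: emit the postponed BEGIN before the first change. */
static void
sketch_change(SketchTxn *txn, bool published)
{
	assert(txn != NULL);

	/* Changes outside the selected publications are filtered out. */
	if (!published)
		return;

	if (!txn->sent_begin)
	{
		txn->messages_out++;	/* the postponed BEGIN */
		txn->sent_begin = true;
	}
	txn->messages_out++;		/* the change itself */
}

/* COMMIT callback: returns true only if a COMMIT was actually sent. */
static bool
sketch_commit(SketchTxn *txn)
{
	assert(txn != NULL);

	/* Nothing was sent for this transaction, so skip COMMIT as well. */
	if (!txn->sent_begin)
		return false;

	txn->messages_out++;		/* the COMMIT */
	return true;
}
```

In this sketch, a transaction whose changes are all filtered out produces no
messages at all, while a transaction with two published changes produces four
(BEGIN, two changes, COMMIT). The same shape would apply to the
BEGIN PREPARE / PREPARE pair.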

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  38 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 158 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  46 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 286 insertions(+), 39 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 985db5c..c2468d2 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -884,11 +884,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check whether the
+      plugin has received this <command>PREPARE TRANSACTION</command>. If it
+      has, it can commit the transaction; otherwise, it can skip the commit.
+      The <parameter>gid</parameter> alone is not sufficient because the
+      downstream node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index c88ec1e..7415cf2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7550,6 +7550,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7564,6 +7571,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d61ef4c..67c762a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -936,7 +937,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -972,7 +974,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						"commit_prepared_cb")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9f80794..2724756 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2794,7 +2794,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1f6b432..15cbdbe 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -990,27 +990,39 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* There is no transaction when COMMIT PREPARED is called */
-	begin_replication_step();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In that case,
+	 * the walsender skips the empty transaction and the accompanying
+	 * prepare. Silently ignore the commit if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	end_replication_step();
-	CommitTransactionCommand();
+		/* There is no transaction when COMMIT PREPARED is called */
+		begin_replication_step();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 286119c..7ebdb4e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -132,6 +134,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -401,10 +408,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -419,8 +448,22 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
+	txn->output_plugin_private = NULL;
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -432,10 +475,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -450,8 +511,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty prepared transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -462,12 +533,33 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of COMMIT PREPARED of an empty transaction");
+			return;
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -480,8 +572,26 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of ROLLBACK PREPARED of an empty transaction");
+			return;
+		}
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -630,11 +740,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -668,6 +783,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -770,6 +894,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -777,6 +902,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have set up txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -813,6 +942,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -833,6 +971,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -845,6 +984,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* Output BEGIN if we haven't yet; skip this for streaming and non-transactional messages. */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 7a4804f..2fa60b5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d7c785b..ffc0b56 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -442,7 +442,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0e218e0..3d246be 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -87,9 +87,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 4c372a6..8a33641 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 25;
 
 ###############################
 # Setup
@@ -318,10 +318,9 @@ $node_publisher->safe_psql('postgres', "
 
 $node_publisher->wait_for_catchup($appname_copy);
 
-# Check that the transaction has been prepared on the subscriber, there will be 2
-# prepared transactions for the 2 subscriptions.
+# Check that the transaction has been prepared on the subscriber
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;;");
-is($result, qq(2), 'transaction is prepared on subscriber');
+is($result, qq(1), 'transaction is prepared on subscriber');
 
 # Now commit the insert and verify that it IS replicated
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
@@ -337,6 +336,45 @@ is($result, qq(2), 'replicated data in subscriber table');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+   "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot
+$node_publisher->safe_psql('postgres', "
+   BEGIN;
+   INSERT INTO tab_nopub SELECT generate_series(1,10);
+   PREPARE TRANSACTION 'empty_transaction';
+   COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+   'postgres', qq(
+       SELECT get_byte(data, 0)
+       FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+           'proto_version', '1',
+           'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+   'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ee3a114..3ef5152 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1599,6 +1599,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
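
For readers following the empty-transaction change in the patch above, here is a
minimal standalone sketch of the deferred-BEGIN idea. It is not the pgoutput
code: TxnState, Sink, sink_put, txn_begin, txn_change and txn_commit are
hypothetical names invented for illustration, and messages are reduced to
single characters (B, I, C). It only assumes what the patch describes: BEGIN is
postponed until the first published change, and COMMIT is skipped when BEGIN
was never sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-transaction state kept by the plugin. */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been emitted for this txn? */
} TxnState;

/* Hypothetical output sink: each message is appended as one character. */
typedef struct Sink
{
	char		buf[64];
	size_t		len;
} Sink;

static void
sink_put(Sink *sink, char msg)
{
	assert(sink->len + 1 < sizeof(sink->buf));
	sink->buf[sink->len++] = msg;
	sink->buf[sink->len] = '\0';
}

/* BEGIN callback: remember the transaction, but emit nothing yet. */
static void
txn_begin(TxnState *txn)
{
	txn->sent_begin = false;
}

/* Change callback: unpublished changes return before BEGIN is ever sent. */
static void
txn_change(TxnState *txn, Sink *sink, bool published)
{
	if (!published)
		return;

	/* Emit the postponed BEGIN on the first change that will be sent. */
	if (!txn->sent_begin)
	{
		sink_put(sink, 'B');
		txn->sent_begin = true;
	}
	sink_put(sink, 'I');
}

/* COMMIT callback: returns false (and emits nothing) for an empty txn. */
static bool
txn_commit(TxnState *txn, Sink *sink)
{
	if (!txn->sent_begin)
		return false;
	sink_put(sink, 'C');
	return true;
}
```

With a zero-initialized Sink, a transaction that only touches unpublished
tables leaves the buffer empty, while one with two published changes yields
"BIIC". The same flag is what lets the prepare, commit prepared and rollback
prepared callbacks in the patch decide whether anything was sent downstream.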

Attachment: v91-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From ea69094e24315ad9c2c1e4b2ddb2f0938cdd8d42 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 30 Jun 2021 19:21:33 +1000
Subject: [PATCH v91] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  10 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 138 ++++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1021 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index e8cb78f..c88ec1e 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare or Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7398,7 +7398,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7661,6 +7661,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1433905..702934e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 4cfd763..b50f5d6 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -926,12 +911,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e48b821..1f6b432 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1065,6 +1065,90 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to deadlock
+	 * when multiple subscriptions on the same node point to publications on
+	 * the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1280,30 +1364,20 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	if (in_streamed_transaction)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROTOCOL_VIOLATION),
-				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
 	/* Make sure we have an open transaction */
 	begin_replication_step();
 
@@ -1314,7 +1388,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -1335,7 +1409,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1414,6 +1488,32 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2359,6 +2459,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f63e17e..286119c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index e20f2da..7a4804f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index ad6b4e4..34ebca4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC with streaming enabled
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#368Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#367)
4 attachment(s)

On Wed, Jun 30, 2021 at 7:47 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Wed, Jun 30, 2021 at 6:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Pushed the patch after taking care of your suggestion. Now, the next
step is to rebase the remaining patches and adapt some of the checks
to PG-15.

Please find attached the latest patch set v91*

Differences from v90* are:

* This is the first patch set for PG15

* Rebased to HEAD @ today.

* Now the patch set has only 3 patches again because v90-0001,
v90-0002 are already pushed [1]

* Bumped all relevant server version checks to 150000

Adding a new patch (0004) to this patch set that handles skipping of
empty streamed transactions; patch-0003 did not handle them. To support
this, added a new flag "sent_stream_start" to PGOutputTxnData.
Also, streamed transactions which do not have any data will not be
stream committed, stream prepared or stream aborted.
Do review and let me know if you have any comments.
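
To make the idea behind the "sent_stream_start" flag easier to review,
here is a minimal standalone sketch in C. It is not the patch's code:
the struct, the function names and the message counter are all made up
for illustration, and only the flag name comes from the patch. The
sketch assumes the stream-start message is deferred until the first
change arrives, and that the closing messages are skipped when nothing
was ever sent.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-transaction output plugin state. */
typedef struct SketchTxnData
{
	bool		sent_stream_start;	/* stream start sent for this chunk? */
	int			messages_sent;		/* illustration only: messages emitted */
} SketchTxnData;

/* A stream chunk begins: send nothing yet, only reset the flag. */
static void
sketch_stream_start(SketchTxnData *txn)
{
	txn->sent_stream_start = false;
}

/* A change arrives: emit the deferred stream start first, if needed. */
static void
sketch_change(SketchTxnData *txn)
{
	if (!txn->sent_stream_start)
	{
		printf("STREAM START\n");
		txn->messages_sent++;
		txn->sent_stream_start = true;
	}
	printf("CHANGE\n");
	txn->messages_sent++;
}

/* The chunk ends: skip the stop message if the chunk was empty. */
static bool
sketch_stream_stop(SketchTxnData *txn)
{
	if (!txn->sent_stream_start)
		return false;
	printf("STREAM STOP\n");
	txn->messages_sent++;
	return true;
}

/* The transaction ends: skip the commit if nothing was ever streamed. */
static bool
sketch_stream_commit(SketchTxnData *txn)
{
	if (txn->messages_sent == 0)
		return false;
	printf("STREAM COMMIT\n");
	txn->messages_sent++;
	return true;
}
```

In this sketch, calling sketch_stream_start() followed directly by
sketch_stream_stop() and sketch_stream_commit() prints nothing, which is
the "empty streamed transaction" case. The same check would apply to the
stream prepare and stream abort paths.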

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v92-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From bce3f7d78404675be22b06e453235da3e15451c6 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 30 Jun 2021 22:58:10 -0400
Subject: [PATCH v92] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
transactions. We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch but this will allow the outside replication
solutions using streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the following operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase is enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase is enabled.

* CREATE/ALTER SUBSCRIPTION which tries to set options two_phase=true and streaming=true at the same time.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during slot creation,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 291 ++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 148 ++++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  31 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 +++++++++--
 src/backend/replication/logical/worker.c           | 351 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |   2 +-
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  14 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 43 files changed, 2395 insertions(+), 197 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d..0235639 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7643,6 +7643,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index a3562f3..e8cb78f 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2811,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2871,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7391,6 +7398,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepared transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index b3d1731..a6f9944 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1433905 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at PREPARE TRANSACTION time. Otherwise, a transaction prepared
+          on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..6d3efb4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid confusing prepared xacts
+ * that came from two different nodes.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index b862e59..4cfd763 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -69,7 +69,8 @@ parse_subscription_options(List *options,
 						   char **synchronous_commit,
 						   bool *refresh,
 						   bool *binary_given, bool *binary,
-						   bool *streaming_given, bool *streaming)
+						   bool *streaming_given, bool *streaming,
+						   bool *twophase_given, bool *twophase)
 {
 	ListCell   *lc;
 	bool		connect_given = false;
@@ -110,6 +111,11 @@ parse_subscription_options(List *options,
 		*streaming_given = false;
 		*streaming = false;
 	}
+	if (twophase)
+	{
+		*twophase_given = false;
+		*twophase = false;
+	}
 
 	/* Parse options */
 	foreach(lc, options)
@@ -215,6 +221,29 @@ parse_subscription_options(List *options,
 			*streaming_given = true;
 			*streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: twophase == NULL indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!twophase)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (*twophase_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			*twophase_given = true;
+			*twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -285,6 +314,21 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (twophase && *twophase_given && *twophase)
+	{
+		if (streaming && *streaming_given && *streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -337,6 +381,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	bool		copy_data;
 	bool		streaming;
 	bool		streaming_given;
+	bool		twophase;
+	bool		twophase_given;
 	char	   *synchronous_commit;
 	char	   *conninfo;
 	char	   *slotname;
@@ -361,7 +407,8 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 							   &synchronous_commit,
 							   NULL,	/* no "refresh" */
 							   &binary_given, &binary,
-							   &streaming_given, &streaming);
+							   &streaming_given, &streaming,
+							   &twophase_given, &twophase);
 
 	/*
 	 * Since creating a replication slot is not transactional, rolling back
@@ -429,6 +476,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -506,10 +557,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (create_slot)
 			{
+				bool		twophase_enabled = false;
+
 				Assert(slotname);
 
-				walrcv_create_slot(wrconn, slotname, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (twophase && !copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, slotname, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								slotname)));
@@ -816,7 +891,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   &synchronous_commit,
 										   NULL,	/* no "refresh" */
 										   &binary_given, &binary,
-										   &streaming_given, &streaming);
+										   &streaming_given, &streaming,
+										   NULL, NULL /* no "two_phase" */ );
 
 				if (slotname_given)
 				{
@@ -850,6 +926,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -873,7 +955,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no streaming */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				Assert(enabled_given);
 
 				if (!sub->slotname && enabled)
@@ -918,7 +1001,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 				values[Anum_pg_subscription_subpublications - 1] =
 					publicationListToArray(stmt->publication);
 				replaces[Anum_pg_subscription_subpublications - 1] = true;
@@ -934,6 +1018,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -963,7 +1058,8 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   &refresh,
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
 
 				publist = merge_publications(sub->publications, stmt->publication, isadd, stmt->subname);
 
@@ -982,6 +1078,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -1011,7 +1118,32 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 										   NULL,	/* no "synchronous_commit" */
 										   NULL,	/* no "refresh" */
 										   NULL, NULL,	/* no "binary" */
-										   NULL, NULL); /* no "streaming" */
+										   NULL, NULL,	/* no "streaming" */
+										   NULL, NULL); /* no "two_phase" */
+
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
 
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eaa84a..19ea159 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -436,6 +437,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 150000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -851,7 +856,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -868,6 +873,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
+		if (two_phase)
+			appendStringInfoString(&cmd, " TWO_PHASE");
+
 		switch (snapshot_action)
 		{
 			case CRS_EXPORT_SNAPSHOT:
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 453efc5..74df75e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing preparing
+				 * transactions that have locked [user] catalog tables
+				 * exclusively but as of now we ask users not to do such an
+				 * operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d536a5f..d61ef4c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,12 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= slot->data.two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +540,22 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +616,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index cb42fcb..2c191de 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b8c5e2a..9f80794 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2576,7 +2576,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2667,7 +2667,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2714,7 +2714,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2734,7 +2734,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2753,19 +2753,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2783,12 +2784,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..a14a3d6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions, that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot, need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 682c107..aa3fce0 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready != NIL && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (table_states_not_ready == NIL && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1071,7 +1067,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1158,3 +1155,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
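
For readers following the tablesync.c changes, here is a minimal standalone sketch of the decision that AllTablesyncsReady() and the new restart check in process_syncing_tables_for_apply() appear to implement. It is not part of the patch. The function names and the 'd'/'p'/'e' codes are assumptions made for illustration; the patch itself only refers to the LOGICALREP_TWOPHASE_STATE_* symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed single-character codes for the tri-state (hypothetical names). */
enum
{
	SKETCH_TWOPHASE_DISABLED = 'd',
	SKETCH_TWOPHASE_PENDING = 'p',
	SKETCH_TWOPHASE_ENABLED = 'e'
};

/*
 * Mirrors the return expression of AllTablesyncsReady(): a subscription
 * with no tables is never "all ready"; otherwise it is ready once the
 * not-READY list is empty.
 */
bool
sketch_all_tablesyncs_ready(bool has_subrels, int n_not_ready)
{
	assert(n_not_ready >= 0);
	return has_subrels && n_not_ready == 0;
}

/*
 * Mirrors the test that makes the apply worker exit so that two_phase can
 * be enabled on restart: only a PENDING subscription whose tables are all
 * READY qualifies.
 */
bool
sketch_should_restart_apply_worker(char twophasestate,
								   bool has_subrels, int n_not_ready)
{
	return twophasestate == SKETCH_TWOPHASE_PENDING &&
		sketch_all_tablesyncs_ready(has_subrels, n_not_ready);
}
```

Note that a PENDING subscription with zero tables never triggers the restart, which matches the comment about keeping ALTER SUBSCRIPTION ... REFRESH PUBLICATION usable.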
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index bbb659d..e48b821 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because, say, it
+ * was prior to the initial consistent point, but it might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions the subscription two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that initially it was on and
+ * we received some prepare; if we then turn it off, at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we could allow changing this option from off to on, but it is not
+ * clear that that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +329,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -804,6 +880,191 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a BEGIN PREPARE message")));
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (prepare_data.prepare_lsn != remote_final_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+								 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+								 LSN_FORMAT_ARGS(remote_final_lsn))));
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from same node point to publications on
+	 * the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	begin_replication_step();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* There is no transaction when COMMIT PREPARED is called */
+	begin_replication_step();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
+		begin_replication_step();
+		FinishPreparedTransaction(gid, false);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2082,6 +2343,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2561,6 +2838,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3062,6 +3342,24 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+
+	if (!TransactionIdIsValid(xid))
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("invalid two-phase transaction ID")));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3072,6 +3370,7 @@ ApplyWorkerMain(Datum main_arg)
 	XLogRecPtr	origin_startpos;
 	char	   *myslotname;
 	WalRcvStreamOptions options;
+	int			server_version;
 
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
@@ -3230,15 +3529,59 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+
+	server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
 	options.proto.logical.proto_version =
-		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
+		server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
+		LOGICALREP_PROTO_VERSION_NUM;
+
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
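
As an aside on the worker.c changes, the following standalone sketch restates two small pieces of logic from the hunks above: the server-version to protocol-version mapping in ApplyWorkerMain() and the GID format used by TwoPhaseTransactionGid(). It is not part of the patch. The sketch_* names are hypothetical, and the numeric protocol values 1, 2 and 3 are assumptions standing in for the LOGICALREP_PROTO_*_VERSION_NUM macros, whose values are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed protocol numbers (hypothetical names, values not from the patch). */
enum
{
	SKETCH_PROTO_BASE = 1,
	SKETCH_PROTO_STREAM = 2,
	SKETCH_PROTO_TWOPHASE = 3
};

/* Same cascade as the proto_version assignment in ApplyWorkerMain(). */
int
sketch_pick_proto_version(int server_version)
{
	return server_version >= 150000 ? SKETCH_PROTO_TWOPHASE :
		server_version >= 140000 ? SKETCH_PROTO_STREAM :
		SKETCH_PROTO_BASE;
}

/*
 * Same format string as TwoPhaseTransactionGid(). Plain unsigned ints stand
 * in for Oid and TransactionId, and 0 stands in for the invalid value of
 * each type.
 */
void
sketch_two_phase_gid(unsigned int subid, unsigned int xid,
					 char *gid, size_t szgid)
{
	assert(subid != 0);
	assert(xid != 0);
	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}
```

Because the subscription oid is part of the GID, two subscriptions applying the same remote xid produce different GIDs on the subscriber, which is the property the header comment relies on to avoid the deadlock.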
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index abd5217..f63e17e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option was passed to the
+		 * plugin; whether to enable it is decided at a later point. It
+		 * remains enabled if a previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -839,18 +948,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1270,3 +1369,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index e1e8ec2..0910546 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -242,7 +242,7 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
 			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8c18b4e..33b85d8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -283,6 +283,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2be9ad9..9a2bc37 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -370,7 +370,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 3211521..912144c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4320,6 +4321,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4363,9 +4365,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 150000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4386,6 +4395,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4411,6 +4421,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4438,6 +4450,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4479,6 +4492,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
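Note on the pg_dump hunk above: subtwophasestate is fetched as text, and dumpSubscription() emits ", two_phase = on" for any state other than disabled. A subscription still in the pending state is therefore dumped the same way as an enabled one. The following is a minimal standalone sketch of that test, not code from the patch. The helper name dump_emits_two_phase is hypothetical; the state characters are copied from the pg_subscription.h hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* State characters copied from the pg_subscription.h hunk in this patch. */
#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'

/*
 * Hypothetical helper (not part of the patch): returns true when the dump
 * would emit ", two_phase = on" for a subtwophasestate value fetched as
 * text. It mirrors the strcmp() test added to dumpSubscription().
 */
static bool
dump_emits_two_phase(const char *subtwophasestate)
{
	/* One-character string holding the "disabled" state. */
	const char	two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};

	assert(subtwophasestate != NULL);
	return strcmp(subtwophasestate, two_phase_disabled) != 0;
}
```

With this shape, "d" yields false while both "p" and "e" yield true, which matches the "!= 0" comparison in the hunk.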
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index ba9bc6d..efb8c30 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..28cf352 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,7 +6415,9 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary and streaming are only supported in v14 and higher.
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
@@ -6423,6 +6425,14 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/*
+		 * Two_phase is only supported in v15 and higher.
+		 */
+		if (pset.sversion >= 150000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 0ebd5aa..d6bf725 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2764,7 +2764,7 @@ psql_completion(const char *text, int start, int end)
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
 					  "enabled", "slot_name", "streaming",
-					  "synchronous_commit");
+					  "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
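Note on LookupGXact() declared above: only the prototype appears in this part of the patch, so its body is not shown here. The comment on LogicalRepRollbackPreparedTxnData in the logicalproto.h hunk explains why the gid alone is not enough to identify a prepared transaction on the downstream. The sketch below illustrates that idea with a toy in-memory list. ToyPreparedXact and toy_lookup_gxact are hypothetical names, the typedefs are simplified stand-ins, and none of this is the real implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the PostgreSQL typedefs (assumption). */
typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;

/* Hypothetical record describing one prepared transaction. */
typedef struct ToyPreparedXact
{
	const char *gid;
	XLogRecPtr	prepare_end_lsn;
	TimestampTz prepare_time;
} ToyPreparedXact;

/*
 * Hypothetical lookup: a prepared transaction matches only if the gid,
 * the prepare end LSN and the prepare timestamp all agree. A gid match
 * alone is not accepted, because a local transaction may reuse the gid.
 */
static bool
toy_lookup_gxact(const ToyPreparedXact *xacts, size_t nxacts,
				 const char *gid, XLogRecPtr prepare_end_lsn,
				 TimestampTz prepare_time)
{
	assert(gid != NULL);

	for (size_t i = 0; i < nxacts; i++)
	{
		if (xacts[i].prepare_end_lsn == prepare_end_lsn &&
			xacts[i].prepare_time == prepare_time &&
			strcmp(xacts[i].gid, gid) == 0)
			return true;
	}
	return false;
}
```

A caller handling ROLLBACK PREPARED could apply the rollback when such a lookup succeeds and skip it otherwise, which is the behaviour the logicalproto.h comment describes.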
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 750d469..6ffa0f8 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -92,6 +102,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 4d20563..632381b 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -87,6 +87,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index faa3a25..ebc43a0 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
+	bool		two_phase;
 	List	   *options;
 } CreateReplicationSlotCmd;
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..0b071a6 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 *
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_REPLICATION command. We can't rely solely on the
+	 * twophase flag, which only tells whether the plugin provided all the
+	 * necessary two-phase callbacks.
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..e20f2da 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * (prepare_end_lsn and prepare_time) is used to check whether the downstream
+ * has received this prepared transaction. If it has, it can apply the
+ * rollback; otherwise, it can skip the rollback operation. The gid alone is
+ * not sufficient because the downstream node can have a prepared transaction
+ * with the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
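Note on the logicalproto.h hunks above: the patch adds four message type bytes and bumps the maximum protocol version to 3. The sketch below shows one way a reader of the stream might classify the new bytes and gate them on the negotiated version. The byte values and the version number are taken from the hunks; toy_twophase_msg_name and toy_twophase_supported are hypothetical helpers, and the version gate is an assumption based on the "minimum protocol version" comment rather than code shown in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values copied from the logicalproto.h hunk in this patch. */
#define LOGICALREP_PROTO_MIN_VERSION_NUM 1
#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3

/*
 * Hypothetical helper: map a message type byte to a readable name for the
 * four two-phase messages added by the patch. Returns NULL for any other
 * byte, including the pre-existing message types.
 */
static const char *
toy_twophase_msg_name(char msgtype)
{
	switch (msgtype)
	{
		case 'b':
			return "BEGIN PREPARE";
		case 'P':
			return "PREPARE";
		case 'K':
			return "COMMIT PREPARED";
		case 'r':
			return "ROLLBACK PREPARED";
		default:
			return NULL;
	}
}

/*
 * Hypothetical helper: true when the negotiated protocol version is high
 * enough for two-phase messages (assumed gate, see the note above).
 */
static bool
toy_twophase_supported(int proto_version)
{
	assert(proto_version >= LOGICALREP_PROTO_MIN_VERSION_NUM);
	return proto_version >= LOGICALREP_PROTO_TWOPHASE_VERSION_NUM;
}
```

Note that the bytes are case sensitive: 'b' is BEGIN PREPARE while 'B' is the existing BEGIN message, and 'r' is ROLLBACK PREPARED while 'R' is RELATION.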
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d8..d7c785b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -297,7 +297,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -636,7 +640,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 2eb7e3a..34d95ea 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -84,11 +84,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 57f7dd9..ad6b4e4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# roll back after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+# Create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber. There will be
+# 2 prepared transactions, one for each of the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 64c06cf..ee3a114 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1390,12 +1390,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

Attachment: v92-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From fee7445a4149779d32c6e3422edc118ebd5e4562 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 30 Jun 2021 23:16:14 -0400
Subject: [PATCH v92] Skip empty transactions for logical replication.

Currently, logical replication sends every transaction to the subscriber
even if the transaction is empty (because it contains no changes from the
selected publications). Building and transmitting these empty transactions
wastes CPU cycles and network bandwidth.

This patch addresses the problem by postponing the BEGIN / BEGIN PREPARE
message until the first change is sent. When processing a COMMIT or PREPARE
message, if no change has been sent for that transaction, the COMMIT or
PREPARE message is not sent either. As a result, pgoutput skips the
BEGIN / COMMIT or BEGIN PREPARE / PREPARE messages for empty transactions.
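
To illustrate the deferral described above, here is a minimal standalone
sketch. It is not the pgoutput code from this patch: TxnState, Stream, emit()
and the txn_* functions are hypothetical names chosen for the example, and the
"stream" is just a text buffer. It only shows the idea of remembering whether
BEGIN was sent and skipping COMMIT when it was not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-transaction state (illustrative, not the patch's struct). */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been written to the stream? */
} TxnState;

/* Hypothetical output sink; must be zero-initialized before first use. */
typedef struct Stream
{
	char		buf[256];
} Stream;

/* Append one message, followed by ';', to the stream buffer. */
static void
emit(Stream *out, const char *msg)
{
	size_t		used = strlen(out->buf);

	snprintf(out->buf + used, sizeof(out->buf) - used, "%s;", msg);
}

/* BEGIN callback: send nothing yet, only record that BEGIN is pending. */
static void
txn_begin(TxnState *txn)
{
	txn->sent_begin = false;
}

/* Change callback: send the postponed BEGIN before the first change. */
static void
txn_change(Stream *out, TxnState *txn, const char *change)
{
	assert(out != NULL && txn != NULL);

	if (!txn->sent_begin)
	{
		emit(out, "BEGIN");
		txn->sent_begin = true;
	}
	emit(out, change);
}

/* COMMIT callback: if BEGIN was never sent, the transaction was empty. */
static void
txn_commit(Stream *out, TxnState *txn)
{
	if (!txn->sent_begin)
		return;					/* skip the empty transaction entirely */
	emit(out, "COMMIT");
}
```

With this shape, a transaction that produces no change for the selected
publications leaves the stream untouched, while a transaction with at least
one change is framed by BEGIN and COMMIT as before.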

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  38 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 158 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  46 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 286 insertions(+), 39 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 985db5c..c2468d2 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -884,11 +884,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check whether the
+      plugin has received this <command>PREPARE TRANSACTION</command>; if so,
+      it can commit the transaction, and otherwise it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index c88ec1e..7415cf2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7550,6 +7550,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7564,6 +7571,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d61ef4c..67c762a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -936,7 +937,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -972,7 +974,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						"commit_prepared_cb")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9f80794..2724756 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2794,7 +2794,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 1f6b432..15cbdbe 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -990,27 +990,39 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* There is no transaction when COMMIT PREPARED is called */
-	begin_replication_step();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In that case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore the commit if we don't find
+	 * the prepared transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	end_replication_step();
-	CommitTransactionCommand();
+		/* There is no transaction when COMMIT PREPARED is called */
+		begin_replication_step();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 286119c..7ebdb4e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -132,6 +134,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -401,10 +408,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -419,8 +448,22 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
+	txn->output_plugin_private = NULL;
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -432,10 +475,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -450,8 +511,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty prepared transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -462,12 +533,33 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of COMMIT PREPARED of an empty transaction");
+			return;
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -480,8 +572,26 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of ROLLBACK PREPARED of an empty transaction");
+			return;
+		}
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -630,11 +740,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -668,6 +783,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -770,6 +894,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -777,6 +902,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -813,6 +942,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -833,6 +971,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -845,6 +984,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 7a4804f..2fa60b5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d7c785b..ffc0b56 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -442,7 +442,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0e218e0..3d246be 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -87,9 +87,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 4c372a6..8a33641 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 25;
 
 ###############################
 # Setup
@@ -318,10 +318,9 @@ $node_publisher->safe_psql('postgres', "
 
 $node_publisher->wait_for_catchup($appname_copy);
 
-# Check that the transaction has been prepared on the subscriber, there will be 2
-# prepared transactions for the 2 subscriptions.
+# Check that the transaction has been prepared on the subscriber
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;;");
-is($result, qq(2), 'transaction is prepared on subscriber');
+is($result, qq(1), 'transaction is prepared on subscriber');
 
 # Now commit the insert and verify that it IS replicated
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
@@ -337,6 +336,45 @@ is($result, qq(2), 'replicated data in subscriber table');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+   "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot
+$node_publisher->safe_psql('postgres', "
+   BEGIN;
+   INSERT INTO tab_nopub SELECT generate_series(1,10);
+   PREPARE TRANSACTION 'empty_transaction';
+   COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+   'postgres', qq(
+       SELECT get_byte(data, 0)
+       FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+           'proto_version', '1',
+           'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+   'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ee3a114..3ef5152 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1599,6 +1599,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
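
For readers following the protocol change in the patch above, here is a rough sketch of the fixed-width part of the COMMIT PREPARED message, in the order the `pq_sendint64()` / `pq_sendint32()` calls in `logicalrep_write_commit_prepared()` emit it: flags, prepare end LSN, commit LSN, commit end LSN, prepare time, commit time, xid. Integers travel in network byte order. The struct and helper names (`CommitPreparedFields`, `put_be`, `get_be`, `encode_commit_prepared`, `decode_commit_prepared`) are made up for illustration, and anything outside the hunk (such as the leading message-type byte and the gid) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the fixed-width fields visible in the hunk. */
typedef struct CommitPreparedFields
{
	uint8_t		flags;
	uint64_t	prepare_end_lsn;
	uint64_t	commit_lsn;
	uint64_t	commit_end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	int64_t		commit_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
} CommitPreparedFields;

/* 1 flag byte + five 8-byte fields + one 4-byte xid */
#define COMMIT_PREPARED_FIXED_LEN (1 + 8 * 5 + 4)

/* Store the low nbytes bytes of v, most significant byte first. */
void
put_be(uint8_t *buf, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
}

/* Read nbytes bytes, most significant byte first. */
uint64_t
get_be(const uint8_t *buf, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Serialize the fields in wire order; returns the number of bytes written. */
size_t
encode_commit_prepared(const CommitPreparedFields *f, uint8_t *buf)
{
	size_t		off = 0;

	buf[off++] = f->flags;
	put_be(buf + off, f->prepare_end_lsn, 8);			off += 8;
	put_be(buf + off, f->commit_lsn, 8);				off += 8;
	put_be(buf + off, f->commit_end_lsn, 8);			off += 8;
	put_be(buf + off, (uint64_t) f->prepare_time, 8);	off += 8;
	put_be(buf + off, (uint64_t) f->commit_time, 8);	off += 8;
	put_be(buf + off, f->xid, 4);						off += 4;
	return off;
}

/*
 * Parse the fields back. Returns false if any LSN is zero (the invalid
 * LSN), loosely mirroring the sanity checks on the reading side.
 */
bool
decode_commit_prepared(const uint8_t *buf, CommitPreparedFields *f)
{
	size_t		off = 0;

	f->flags = buf[off++];
	f->prepare_end_lsn = get_be(buf + off, 8);			off += 8;
	f->commit_lsn = get_be(buf + off, 8);				off += 8;
	f->commit_end_lsn = get_be(buf + off, 8);			off += 8;
	f->prepare_time = (int64_t) get_be(buf + off, 8);	off += 8;
	f->commit_time = (int64_t) get_be(buf + off, 8);	off += 8;
	f->xid = (uint32_t) get_be(buf + off, 4);

	return f->prepare_end_lsn != 0 && f->commit_lsn != 0 &&
		f->commit_end_lsn != 0;
}
```

Because the two new fields are inserted in the middle of the message rather than appended, a reader built against the old layout would misparse every field after the flags byte, which is why the writer and reader are changed together.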

Attachment: v92-0004-Skip-empty-streaming-in-progress-transaction-for.patch (application/octet-stream)
From 57d4a12d8db6ea79a30cdfc535282c27bb948bbc Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 30 Jun 2021 23:27:43 -0400
Subject: [PATCH v92] Skip empty streaming in-progress transaction for logical
 replication.

This extends the skipping of empty transactions to also cover empty
streamed in-progress transactions.
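
A rough sketch of the per-stream variant, assuming a simplified model in which messages are only counted: `StreamTxn`, `stream_start`, `stream_change` and `stream_stop` are hypothetical names for illustration, not the pgoutput callbacks themselves. Each stream chunk resets its own "stream start sent" flag, so a chunk with no published changes emits neither STREAM START nor STREAM STOP.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-transaction state with one flag per level. */
typedef struct StreamTxn
{
	bool		sent_begin_txn;		/* anything sent for the transaction? */
	bool		sent_stream_start;	/* STREAM START sent for this chunk? */
	bool		in_streaming;		/* currently inside a stream chunk? */
	int			messages_sent;		/* stand-in for actual network writes */
} StreamTxn;

/* Start of a chunk: postpone STREAM START, only reset the per-chunk flag. */
void
stream_start(StreamTxn *txn)
{
	assert(!txn->in_streaming);		/* chunks cannot nest */
	txn->sent_stream_start = false;
	txn->in_streaming = true;
}

/* Change inside a chunk: the first published change sends STREAM START. */
void
stream_change(StreamTxn *txn, bool published)
{
	if (!published)
		return;

	if (!txn->sent_stream_start)
	{
		txn->messages_sent++;		/* the postponed STREAM START */
		txn->sent_begin_txn = txn->sent_stream_start = true;
	}
	txn->messages_sent++;			/* the change itself */
}

/* End of a chunk: returns false and sends nothing if the chunk was empty. */
bool
stream_stop(StreamTxn *txn)
{
	txn->in_streaming = false;

	if (!txn->sent_stream_start)
		return false;

	txn->messages_sent++;			/* STREAM STOP */
	return true;
}
```

Keeping two flags matters because a transaction may span several chunks: an earlier chunk can have sent changes (so the transaction as a whole is not empty) while a later chunk is still empty and can be dropped on its own.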
---
 src/backend/replication/pgoutput/pgoutput.c | 142 ++++++++++++++++++++++++----
 1 file changed, 124 insertions(+), 18 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7ebdb4e..02ed5a6 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -65,6 +65,8 @@ static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
+static void pgoutput_send_stream_start(struct LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn);
 static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
@@ -134,9 +136,21 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+/*
+ * Maintain the per-transaction level variables to track whether the
+ * transaction and/or streams have written any changes.
+ * BEGIN / BEGIN PREPARE is held back until the first
+ * change needs to be sent. In streaming mode the transaction can
+ * be decoded in streams, so along with maintaining whether the
+ * transaction has written any changes, we also need to track whether the
+ * current stream has written any changes. START STREAM is held back until
+ * the first change is streamed. This is done so that empty transactions and
+ * streams which do not have any changes can be dropped.
+ */
 typedef struct PGOutputTxnData
 {
 	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+	bool sent_stream_start; /* flag indicating if stream start has been sent */
 } PGOutputTxnData;
 
 /* Map used to remember which relation schemas we sent. */
@@ -746,9 +760,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
-	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
-	if (!in_streaming)
-		Assert(txndata);
+	/* should have set up txndata as part of BEGIN/BEGIN PREPARE/START STREAM */
+	Assert(txndata);
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -783,8 +796,11 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* If streaming, send STREAM START if we haven't yet */
+	if (in_streaming && !txndata->sent_stream_start)
+		pgoutput_send_stream_start(ctx, txn);
 	/* output BEGIN if we haven't yet */
-	if (!in_streaming && !txndata->sent_begin_txn)
+	else if (!txndata->sent_begin_txn)
 	{
 		if (rbtxn_prepared(txn))
 			pgoutput_begin_prepare(ctx, txn);
@@ -902,9 +918,8 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
-	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
-	if (!in_streaming)
-		Assert(txndata);
+	/* should have set up txndata as part of BEGIN/BEGIN PREPARE/START STREAM */
+	Assert(txndata);
 
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
@@ -942,8 +957,11 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* If streaming, send STREAM START if we haven't yet */
+		if (in_streaming && !txndata->sent_stream_start)
+			pgoutput_send_stream_start(ctx, txn);
 		/* output BEGIN if we haven't yet */
-		if (!in_streaming && !txndata->sent_begin_txn)
+		else if (!txndata->sent_begin_txn)
 		{
 			if (rbtxn_prepared(txn))
 				pgoutput_begin_prepare(ctx, txn);
@@ -984,16 +1002,24 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
-	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
-	if (!in_streaming && transactional)
+	/* Set up txndata for streaming and transactional messages */
+	if (in_streaming || transactional)
 	{
 		txndata = (PGOutputTxnData *) txn->output_plugin_private;
-		if (!txndata->sent_begin_txn)
+
+		/* If streaming, send STREAM START if we haven't yet */
+		if (in_streaming && !txndata->sent_stream_start)
+			pgoutput_send_stream_start(ctx, txn);
+		/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+		else if (transactional)
 		{
-			if (rbtxn_prepared(txn))
-				pgoutput_begin_prepare(ctx, txn);
-			else
-				pgoutput_begin(ctx, txn);
+			if (!txndata->sent_begin_txn)
+			{
+				if (rbtxn_prepared(txn))
+					pgoutput_begin_prepare(ctx, txn);
+				else
+					pgoutput_begin(ctx, txn);
+			}
 		}
 	}
 
@@ -1076,12 +1102,37 @@ static void
 pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 					  ReorderBufferTXN *txn)
 {
-	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData *txndata = txn->output_plugin_private;
 
 	/* we can't nest streaming of transactions */
 	Assert(!in_streaming);
 
 	/*
+	 * Don't send STREAM START here. Instead, set a flag indicating that it
+	 * hasn't been sent yet, and send it just before the first actual change
+	 * of this stream. This avoids sending empty streams that contain no
+	 * changes.
+	 */
+	if (txndata == NULL)
+	{
+		txndata =
+			MemoryContextAllocZero(ctx->context, sizeof(PGOutputTxnData));
+		txndata->sent_begin_txn = false;
+		txn->output_plugin_private = txndata;
+	}
+
+	txndata->sent_stream_start = false;
+	in_streaming = true;
+}
+
+static void
+pgoutput_send_stream_start(struct LogicalDecodingContext *ctx,
+						   ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*txndata = (PGOutputTxnData *) txn->output_plugin_private;
+
+	/*
 	 * If we already sent the first stream for this transaction then don't
 	 * send the origin id in the subsequent streams.
 	 */
@@ -1096,8 +1147,12 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 
 	OutputPluginWrite(ctx, true);
 
-	/* we're streaming a chunk of transaction now */
-	in_streaming = true;
+	/*
+	 * Set the flags that indicate that changes were sent as part of
+	 * the transaction and the stream.
+	 */
+	txndata->sent_begin_txn = txndata->sent_stream_start = true;
+
 }
 
 /*
@@ -1107,9 +1162,18 @@ static void
 pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
 					 ReorderBufferTXN *txn)
 {
+	PGOutputTxnData *data = txn->output_plugin_private;
+
 	/* we should be streaming a trasanction */
 	Assert(in_streaming);
 
+	if (!data->sent_stream_start)
+	{
+		in_streaming = false;
+		elog(DEBUG1, "Skipping replication of an empty transaction in stream stop");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_stream_stop(ctx->out);
 	OutputPluginWrite(ctx, true);
@@ -1128,6 +1192,8 @@ pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 					  XLogRecPtr abort_lsn)
 {
 	ReorderBufferTXN *toptxn;
+	PGOutputTxnData  *txndata;
+	bool sent_begin_txn;
 
 	/*
 	 * The abort should happen outside streaming block, even for streamed
@@ -1137,6 +1203,21 @@ pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 
 	/* determine the toplevel transaction */
 	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+	txndata = toptxn->output_plugin_private;
+	sent_begin_txn = txndata->sent_begin_txn;
+
+	if (txn->toptxn == NULL)
+	{
+		pfree(txndata);
+		txn->output_plugin_private = NULL;
+	}
+
+	if (!sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction in stream abort");
+		return;
+	}
+
 
 	Assert(rbtxn_is_streamed(toptxn));
 
@@ -1156,6 +1237,9 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 					   ReorderBufferTXN *txn,
 					   XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData *txndata = txn->output_plugin_private;
+	bool			sent_begin_txn = txndata->sent_begin_txn;
+
 	/*
 	 * The commit should happen outside streaming block, even for streamed
 	 * transactions. The transaction has to be marked as streamed, though.
@@ -1163,6 +1247,16 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 	Assert(!in_streaming);
 	Assert(rbtxn_is_streamed(txn));
 
+	pfree(txndata);
+	txn->output_plugin_private = NULL;
+
+	/* If no changes were part of this transaction then drop the commit */
+	if (!sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction in stream commit");
+		return;
+	}
+
 	OutputPluginUpdateProgress(ctx);
 
 	OutputPluginPrepareWrite(ctx, true);
@@ -1182,8 +1276,20 @@ pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
 							ReorderBufferTXN *txn,
 							XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData *txndata = txn->output_plugin_private;
+	bool			sent_begin_txn = txndata->sent_begin_txn;
+
 	Assert(rbtxn_is_streamed(txn));
 
+	pfree(txndata);
+	txn->output_plugin_private = NULL;
+
+	if (!sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction in stream prepare");
+		return;
+	}
+
 	OutputPluginUpdateProgress(ctx);
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
-- 
1.8.3.1
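
For anyone skimming the patch above, the deferred STREAM START logic can be sketched as a small standalone state machine. This is only an illustration under stated assumptions, not the pgoutput code: TxnState, emit() and messages_sent are hypothetical stand-ins (the real plugin keeps the flags in PGOutputTxnData and writes through OutputPluginPrepareWrite/OutputPluginWrite), and freeing of the per-transaction data is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for PGOutputTxnData; not the real struct. */
typedef struct TxnState
{
	bool		sent_begin_txn;		/* anything sent for this transaction? */
	bool		sent_stream_start;	/* STREAM START sent for current chunk? */
} TxnState;

static bool in_streaming = false;
static int	messages_sent = 0;	/* counts simulated protocol messages */

/* Hypothetical replacement for the real output-plugin write calls. */
static void
emit(const char *msg)
{
	printf("send: %s\n", msg);
	messages_sent++;
}

/* stream_start callback: only note that a chunk began, send nothing yet. */
static void
stream_start(TxnState *txn)
{
	assert(!in_streaming);
	txn->sent_stream_start = false;
	in_streaming = true;
}

/* change callback: send the deferred STREAM START before the first change. */
static void
stream_change(TxnState *txn, const char *change)
{
	assert(in_streaming);
	if (!txn->sent_stream_start)
	{
		emit("STREAM START");
		txn->sent_begin_txn = txn->sent_stream_start = true;
	}
	emit(change);
}

/* stream_stop callback: skip STREAM STOP if the chunk carried no changes. */
static void
stream_stop(TxnState *txn)
{
	assert(in_streaming);
	in_streaming = false;
	if (txn->sent_stream_start)
		emit("STREAM STOP");
}

/* stream_commit callback: returns true only if STREAM COMMIT was sent. */
static bool
stream_commit(TxnState *txn)
{
	assert(!in_streaming);
	if (!txn->sent_begin_txn)
		return false;			/* empty transaction, drop the commit */
	emit("STREAM COMMIT");
	return true;
}
```

Calling stream_start(), stream_stop() and stream_commit() on a transaction with no changes sends nothing at all, while a transaction with one change produces STREAM START, the change, STREAM STOP and STREAM COMMIT.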

Attachment: v92-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From c9eae99ed4f4c6941a8ea59efae26068132ea86b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Wed, 30 Jun 2021 23:11:30 -0400
Subject: [PATCH v92] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  10 -
 src/backend/commands/subscriptioncmds.c            |  21 -
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 138 ++++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1021 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl
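
As a reading aid for the Stream Prepare message this patch documents in protocol.sgml (Byte1('p'), Int8 flags, Int64 prepare LSN, Int64 end LSN, Int64 prepare timestamp, Int32 xid, NUL-terminated GID), here is a rough standalone sketch of the wire layout. It assumes the integers travel in network byte order, as with the other pq_sendint* users. StreamPrepareMsg, SKETCH_GIDSIZE, put_be/get_be and the encode/decode functions are hypothetical names for illustration only; the patch itself uses pq_send*/pq_getmsg* on a StringInfo, and apply_dispatch consumes the message-type byte before the read function runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumed buffer size, illustration only */

typedef struct StreamPrepareMsg
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} StreamPrepareMsg;

/* Store the low nbytes of v in big-endian order; returns bytes written. */
static size_t
put_be(uint8_t *buf, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return (size_t) nbytes;
}

/* Read an nbytes-wide big-endian unsigned integer. */
static uint64_t
get_be(const uint8_t *buf, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Encode a message; returns bytes written, or 0 if buf is too small. */
static size_t
encode_stream_prepare(const StreamPrepareMsg *m, uint8_t *buf, size_t buflen)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (buflen < 1 + 1 + 8 + 8 + 8 + 4 + gidlen)
		return 0;
	buf[off++] = 'p';			/* message type */
	buf[off++] = 0;				/* flags, currently unused */
	off += put_be(buf + off, m->prepare_lsn, 8);
	off += put_be(buf + off, m->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += put_be(buf + off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Decode a message; returns 0 on success, -1 if it is malformed. */
static int
decode_stream_prepare(const uint8_t *buf, size_t len, StreamPrepareMsg *m)
{
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;	/* bytes before the GID */
	const uint8_t *nul;
	size_t		gidlen;

	if (len <= fixed || buf[0] != 'p' || buf[1] != 0)
		return -1;
	m->prepare_lsn = get_be(buf + 2, 8);
	m->end_lsn = get_be(buf + 10, 8);
	m->prepare_time = (int64_t) get_be(buf + 18, 8);
	m->xid = (uint32_t) get_be(buf + 26, 4);

	/* The GID must be NUL-terminated and fit the destination buffer. */
	nul = memchr(buf + fixed, '\0', len - fixed);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + fixed)) + 1;
	if (gidlen > sizeof(m->gid))
		return -1;
	memcpy(m->gid, buf + fixed, gidlen);
	return 0;
}
```

With a GID of 'test_prepared_tab' the encoded message is 30 fixed bytes plus 18 bytes of GID including the terminator. The bounds check on the GID is a choice made in this sketch, not something the patch does.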

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index e8cb78f..c88ec1e 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7398,7 +7398,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7661,6 +7661,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1433905..702934e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 4cfd763..b50f5d6 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -314,21 +314,6 @@ parse_subscription_options(List *options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (twophase && *twophase_given && *twophase)
-	{
-		if (streaming && *streaming_given && *streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -926,12 +911,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (streaming_given)
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e48b821..1f6b432 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -331,7 +331,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1065,6 +1065,90 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1280,30 +1364,20 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	if (in_streamed_transaction)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROTOCOL_VIOLATION),
-				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
 	/* Make sure we have an open transaction */
 	begin_replication_step();
 
@@ -1314,7 +1388,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -1335,7 +1409,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1414,6 +1488,32 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2359,6 +2459,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f63e17e..286119c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index e20f2da..7a4804f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index ad6b4e4..34ebca4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
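
For readers checking the expected values in the TAP tests above, here is a small sketch (not part of the patch) that recomputes the 3334 and 3335 row counts. It assumes the table starts with only the rows a = 1 and a = 2, as the tests arrange; the helper name rows_after_2pc is made up for illustration.

```python
# Worked check of the row counts asserted in the TAP tests.
# Assumption: the table holds only a = 1 and a = 2 before the 2PC transaction.

def rows_after_2pc(first=3, last=5000, initial=(1, 2)):
    """Return the keys left after the INSERT / UPDATE / DELETE sequence."""
    # INSERT ... generate_series(3, 5000)
    keys = set(initial) | set(range(first, last + 1))
    # The UPDATE only rewrites column b, so the row count is unchanged.
    # DELETE ... WHERE mod(a,3) = 0
    return {a for a in keys if a % 3 != 0}

committed = rows_after_2pc()
print(len(committed))            # rows visible after COMMIT PREPARED
print(len(committed | {99999}))  # plus the single-row INSERT done outside the 2PC
print(5 in committed)            # a = 5 is not a multiple of 3, so it is kept
```

The second figure corresponds to the test that inserts (99999, 'foobar') between PREPARE and COMMIT PREPARED.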

#369tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
In reply to: Ajin Cherian (#368)
RE: [HACKERS] logical decoding of two-phase transactions

On Thursday, July 1, 2021 11:48 AM Ajin Cherian <itsajin@gmail.com> wrote:

Adding a new patch (0004) to this patch-set that handles skipping of
empty streamed transactions. patch-0003 did not
handle empty streamed transactions. To support this, added a new flag
"sent_stream_start" to PGOutputTxnData.
Also transactions which do not have any data will not be stream
committed or stream prepared or stream aborted.
Do review and let me know if you have any comments.

Thanks for your patch. I ran into an issue while using it. When a transaction contains TRUNCATE, the subscriber reports the error "ERROR: no data left in message" and the data is not replicated.

Steps to reproduce the issue:

(Set logical_decoding_work_mem to 64kB on the publisher so that streaming is triggered.)

------publisher------
create table test (a int primary key, b varchar);
create publication pub for table test;

------subscriber------
create table test (a int primary key, b varchar);
create subscription sub connection 'dbname=postgres' publication pub with(two_phase=on, streaming=on);

------publisher------
BEGIN;
TRUNCATE test;
INSERT INTO test SELECT i, md5(i::text) FROM generate_series(1001, 6000) s(i);
UPDATE test SET b = md5(b) WHERE mod(a,2) = 0;
DELETE FROM test WHERE mod(a,3) = 0;
COMMIT;

The above case works fine when the 0004 patch is removed, so I think the problem is in the 0004 patch. Please have a look.
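
As a reference point for anyone re-running these steps, the following sketch works out how many rows the subscriber should hold once the transaction replicates correctly. It only models the arithmetic of the statements above (TRUNCATE empties the table, so earlier contents do not matter); the function name expected_rows is hypothetical.

```python
# Expected row count for the reproduction steps, assuming the transaction
# replicates correctly. TRUNCATE clears the table first, so only the rows
# inserted by generate_series(1001, 6000) are considered.

def expected_rows(first, last):
    """Count keys in [first, last] that survive DELETE ... WHERE mod(a,3) = 0."""
    return sum(1 for a in range(first, last + 1) if a % 3 != 0)

expected = expected_rows(1001, 6000)
print(expected)
```

A count(*) on the subscriber's test table that matches this value would indicate the data was applied.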

Regards
Tang

#370Peter Smith
smithpb2250@gmail.com
In reply to: tanghy.fnst@fujitsu.com (#369)
4 attachment(s)

Please find attached the latest patch set v93*

Differences from v92* are:

* Rebased to HEAD @ today.

This rebase was made necessary by recent changes [1] to the
parse_subscription_options function.

----
[1]: https://github.com/postgres/postgres/commit/8aafb02616753f5c6c90bbc567636b73c0cbb9d4

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v93-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 5d1ae9e0f1a25ead55a92e886c1355a583ed2ba5 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 6 Jul 2021 14:00:04 +1000
Subject: [PATCH v93] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  10 -
 src/backend/commands/subscriptioncmds.c            |  25 --
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 138 ++++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1021 insertions(+), 83 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index e8cb78f..c88ec1e 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7398,7 +7398,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7661,6 +7661,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase stream prepare message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1433905..702934e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index b360892..ab3314c 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -352,25 +352,6 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (opts->twophase &&
-		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
-		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
-	{
-		if (opts->streaming &&
-			IsSet(supported_opts, SUBOPT_STREAMING) &&
-			IsSet(opts->specified_opts, SUBOPT_STREAMING))
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -939,12 +920,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index d112f8a..bf4bfeb 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -330,7 +330,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1045,6 +1045,90 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1260,30 +1344,20 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	if (in_streamed_transaction)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROTOCOL_VIOLATION),
-				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
 	/* Make sure we have an open transaction */
 	begin_replication_step();
 
@@ -1294,7 +1368,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -1315,7 +1389,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1394,6 +1468,32 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2337,6 +2437,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f63e17e..286119c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index e20f2da..7a4804f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index ad6b4e4..34ebca4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the subscriber crashes before the 2PC transaction is committed.
+# 3. After the subscriber is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then the publisher crashes before the 2PC transaction is committed.
+# 3. After the publisher is restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
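
For readers checking the expected counts in the streaming 2PC tests above: the value 3334 follows from the test data itself. After the INSERT the table holds a = 1..5000, the UPDATE does not change the row count, and the DELETE removes every multiple of 3. The following is only a small arithmetic sketch of that reasoning; the function name is made up for illustration and is not part of the patch.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not in the patch): count the rows with key
 * a = 1..hi that survive "DELETE FROM test_tab WHERE mod(a,3) = 0".
 */
static int
rows_surviving_delete(int hi)
{
	int			n = 0;

	for (int a = 1; a <= hi; a++)
	{
		if (a % 3 != 0)
			n++;
	}
	return n;
}
```

Calling rows_surviving_delete(5000) gives the 3334 rows the tests expect after COMMIT PREPARED. The tests that expect 3335 add one row (a = 99999) from outside the prepared transaction, after its DELETE has already run, so that row is not removed.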

Attachment: v93-0004-Skip-empty-streaming-in-progress-transaction-for.patch (application/octet-stream)
From 2bebd82bc25af8d1d5b3420fea183a090d207163 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 6 Jul 2021 14:22:01 +1000
Subject: [PATCH v93] Skip empty streaming in-progress transaction for logical
 replication.

This improves the behaviour of skipping empty transactions to also
include empty streamed in-progress transactions.
---
 src/backend/replication/pgoutput/pgoutput.c | 142 ++++++++++++++++++++++++----
 1 file changed, 124 insertions(+), 18 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7ebdb4e..02ed5a6 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -65,6 +65,8 @@ static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
+static void pgoutput_send_stream_start(struct LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn);
 static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
@@ -134,9 +136,21 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+/*
+ * Maintain the per-transaction level variables to track whether the
+ * transaction and/or streams have written any changes.
+ * BEGIN / BEGIN PREPARE is held back until the first
+ * change needs to be sent. In streaming mode the transaction can
+ * be decoded in streams, so along with maintaining whether the
+ * transaction has written any changes, we also need to track whether the
+ * current stream has written any changes. START STREAM is held back until
+ * the first change is streamed. This is done so that empty transactions and
+ * streams which do not have any changes can be dropped.
+ */
 typedef struct PGOutputTxnData
 {
 	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+	bool sent_stream_start; /* flag indicating if stream start has been sent */
 } PGOutputTxnData;
 
 /* Map used to remember which relation schemas we sent. */
@@ -746,9 +760,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
-	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
-	if (!in_streaming)
-		Assert(txndata);
+	/* should have set up txndata as part of BEGIN/BEGIN PREPARE/START STREAM */
+	Assert(txndata);
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -783,8 +796,11 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* If streaming, send STREAM START if we haven't yet */
+	if (in_streaming && !txndata->sent_stream_start)
+		pgoutput_send_stream_start(ctx, txn);
 	/* output BEGIN if we haven't yet */
-	if (!in_streaming && !txndata->sent_begin_txn)
+	else if (!txndata->sent_begin_txn)
 	{
 		if (rbtxn_prepared(txn))
 			pgoutput_begin_prepare(ctx, txn);
@@ -902,9 +918,8 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
-	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
-	if (!in_streaming)
-		Assert(txndata);
+	/* Should have set up txndata as part of BEGIN/BEGIN PREPARE/START STREAM */
+	Assert(txndata);
 
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
@@ -942,8 +957,11 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* If streaming, send STREAM START if we haven't yet */
+		if (in_streaming && !txndata->sent_stream_start)
+			pgoutput_send_stream_start(ctx, txn);
 		/* output BEGIN if we haven't yet */
-		if (!in_streaming && !txndata->sent_begin_txn)
+		else if (!txndata->sent_begin_txn)
 		{
 			if (rbtxn_prepared(txn))
 				pgoutput_begin_prepare(ctx, txn);
@@ -984,16 +1002,24 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
-	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
-	if (!in_streaming && transactional)
+	/* Set up txndata for streaming and transactional messages */
+	if (in_streaming || transactional)
 	{
 		txndata = (PGOutputTxnData *) txn->output_plugin_private;
-		if (!txndata->sent_begin_txn)
+
+		/* If streaming, send STREAM START if we haven't yet */
+		if (in_streaming && !txndata->sent_stream_start)
+			pgoutput_send_stream_start(ctx, txn);
+		/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+		else if (transactional)
 		{
-			if (rbtxn_prepared(txn))
-				pgoutput_begin_prepare(ctx, txn);
-			else
-				pgoutput_begin(ctx, txn);
+			if (!txndata->sent_begin_txn)
+			{
+				if (rbtxn_prepared(txn))
+					pgoutput_begin_prepare(ctx, txn);
+				else
+					pgoutput_begin(ctx, txn);
+			}
 		}
 	}
 
@@ -1076,12 +1102,37 @@ static void
 pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 					  ReorderBufferTXN *txn)
 {
-	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData *txndata = txn->output_plugin_private;
 
 	/* we can't nest streaming of transactions */
 	Assert(!in_streaming);
 
 	/*
+	 * Don't actually send stream start here, instead set a flag that indicates
+	 * that stream start hasn't been sent and wait for the first actual change
+	 * for this stream to be sent and then send stream start. This is done
+	 * to avoid sending empty streams without any changes.
+	 */
+	if (txndata == NULL)
+	{
+		txndata =
+			MemoryContextAllocZero(ctx->context, sizeof(PGOutputTxnData));
+		txndata->sent_begin_txn = false;
+		txn->output_plugin_private = txndata;
+	}
+
+	txndata->sent_stream_start = false;
+	in_streaming = true;
+}
+
+static void
+pgoutput_send_stream_start(struct LogicalDecodingContext *ctx,
+						   ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*txndata = (PGOutputTxnData *) txn->output_plugin_private;
+
+	/*
 	 * If we already sent the first stream for this transaction then don't
 	 * send the origin id in the subsequent streams.
 	 */
@@ -1096,8 +1147,12 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 
 	OutputPluginWrite(ctx, true);
 
-	/* we're streaming a chunk of transaction now */
-	in_streaming = true;
+	/*
+	 * Set the flags that indicate that changes were sent as part of
+	 * the transaction and the stream.
+	 */
+	txndata->sent_begin_txn = txndata->sent_stream_start = true;
+
 }
 
 /*
@@ -1107,9 +1162,18 @@ static void
 pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
 					 ReorderBufferTXN *txn)
 {
+	PGOutputTxnData *data = txn->output_plugin_private;
+
 	/* we should be streaming a trasanction */
 	Assert(in_streaming);
 
+	if (!data->sent_stream_start)
+	{
+		in_streaming = false;
+		elog(DEBUG1, "Skipping replication of an empty transaction in stream stop");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_stream_stop(ctx->out);
 	OutputPluginWrite(ctx, true);
@@ -1128,6 +1192,8 @@ pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 					  XLogRecPtr abort_lsn)
 {
 	ReorderBufferTXN *toptxn;
+	PGOutputTxnData  *txndata;
+	bool sent_begin_txn;
 
 	/*
 	 * The abort should happen outside streaming block, even for streamed
@@ -1137,6 +1203,21 @@ pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 
 	/* determine the toplevel transaction */
 	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+	txndata = toptxn->output_plugin_private;
+	sent_begin_txn = txndata->sent_begin_txn;
+
+	if (txn->toptxn == NULL)
+	{
+		pfree(txndata);
+		txn->output_plugin_private = NULL;
+	}
+
+	if (!sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction in stream abort");
+		return;
+	}
+
 
 	Assert(rbtxn_is_streamed(toptxn));
 
@@ -1156,6 +1237,9 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 					   ReorderBufferTXN *txn,
 					   XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData *txndata = txn->output_plugin_private;
+	bool			sent_begin_txn = txndata->sent_begin_txn;
+
 	/*
 	 * The commit should happen outside streaming block, even for streamed
 	 * transactions. The transaction has to be marked as streamed, though.
@@ -1163,6 +1247,16 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 	Assert(!in_streaming);
 	Assert(rbtxn_is_streamed(txn));
 
+	pfree(txndata);
+	txn->output_plugin_private = NULL;
+
+	/* If no changes were part of this transaction then drop the commit */
+	if (!sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction in stream commit");
+		return;
+	}
+
 	OutputPluginUpdateProgress(ctx);
 
 	OutputPluginPrepareWrite(ctx, true);
@@ -1182,8 +1276,20 @@ pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
 							ReorderBufferTXN *txn,
 							XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData *txndata = txn->output_plugin_private;
+	bool			sent_begin_txn = txndata->sent_begin_txn;
+
 	Assert(rbtxn_is_streamed(txn));
 
+	pfree(txndata);
+	txn->output_plugin_private = NULL;
+
+	if (!sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction in stream prepare");
+		return;
+	}
+
 	OutputPluginUpdateProgress(ctx);
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
-- 
1.8.3.1
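
The deferral logic in the patch above can be hard to follow from the hunks alone, so here is a minimal, self-contained sketch of the idea: STREAM START is held back until the first change of a stream, and STREAM STOP / STREAM COMMIT are dropped when nothing was sent. All names below (SketchTxn, SketchOut, sketch_*) are hypothetical stand-ins that only count messages; this is not the pgoutput code and it does not use any PostgreSQL API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PGOutputTxnData (illustration only). */
typedef struct SketchTxn
{
	bool		sent_begin_txn;		/* anything sent for this transaction? */
	bool		sent_stream_start;	/* STREAM START sent for this stream? */
} SketchTxn;

/* Hypothetical decoder state plus counters of messages "sent". */
typedef struct SketchOut
{
	bool		in_streaming;
	int			stream_starts;
	int			changes;
	int			stream_stops;
	int			commits;
} SketchOut;

/* Stream begins: send nothing yet, only reset the per-stream flag. */
static void
sketch_stream_start(SketchOut *out, SketchTxn *txn)
{
	assert(!out->in_streaming);
	txn->sent_stream_start = false;
	out->in_streaming = true;
}

/* First change of a stream sends the deferred STREAM START. */
static void
sketch_change(SketchOut *out, SketchTxn *txn)
{
	assert(out->in_streaming);
	if (!txn->sent_stream_start)
	{
		out->stream_starts++;
		txn->sent_begin_txn = txn->sent_stream_start = true;
	}
	out->changes++;
}

/* Stream ends: skip STREAM STOP if this stream sent nothing. */
static void
sketch_stream_stop(SketchOut *out, SketchTxn *txn)
{
	assert(out->in_streaming);
	out->in_streaming = false;
	if (!txn->sent_stream_start)
		return;
	out->stream_stops++;
}

/* Commit: skip it if no stream of this transaction sent anything. */
static void
sketch_stream_commit(SketchOut *out, SketchTxn *txn)
{
	assert(!out->in_streaming);
	if (!txn->sent_begin_txn)
		return;
	out->commits++;
}
```

In this sketch a transaction whose streams are all empty produces no messages at all, while a transaction with one non-empty stream followed by an empty one produces a single START/STOP pair and one commit. The sketch models only the streaming path; the non-streaming BEGIN / BEGIN PREPARE deferral in the patch follows the same pattern with sent_begin_txn.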

Attachment: v93-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 2424d5c5332a92a22687e6b295beac58439c27dc Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 6 Jul 2021 13:45:55 +1000
Subject: [PATCH v93] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
transactions. We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch, but it allows external replication solutions
that use the streaming replication protocol to use it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.
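
To illustrate the extended protocol mentioned in the list above, here is a
minimal sketch of how a client might decode a Begin Prepare message. It follows
the field layout documented in this patch's protocol.sgml changes: Byte1('b'),
three Int64 fields (prepare LSN, end LSN, prepare timestamp), an Int32 xid, and
a NUL-terminated GID string. The names BeginPrepareMsg, read_be64, read_be32
and parse_begin_prepare are hypothetical and not part of PostgreSQL; the sketch
assumes protocol integers arrive in network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the fields of a Begin Prepare message. */
typedef struct BeginPrepareMsg
{
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* transaction id */
	const char *gid;			/* points into the caller's buffer */
} BeginPrepareMsg;

/* Read a big-endian (network order) 64-bit integer. */
static uint64_t
read_be64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian (network order) 32-bit integer. */
static uint32_t
read_be32(const unsigned char *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
		((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Returns 0 on success, -1 if the buffer is not a well-formed message. */
int
parse_begin_prepare(const unsigned char *buf, size_t len, BeginPrepareMsg *out)
{
	/* type byte + three Int64 fields + one Int32 field */
	const size_t fixed = 1 + 8 + 8 + 8 + 4;

	assert(buf != NULL && out != NULL);

	if (len <= fixed || buf[0] != 'b')
		return -1;

	/* The GID must be NUL-terminated inside the buffer. */
	if (memchr(buf + fixed, '\0', len - fixed) == NULL)
		return -1;

	out->prepare_lsn = read_be64(buf + 1);
	out->end_lsn = read_be64(buf + 9);
	out->prepare_time = (int64_t) read_be64(buf + 17);
	out->xid = read_be32(buf + 25);
	out->gid = (const char *) (buf + fixed);
	return 0;
}
```

A caller would pass the payload of one logical replication message and, on a
zero return, read the GID and LSNs from the filled-in struct. The Prepare,
Commit Prepared and Rollback Prepared messages carry an extra flags byte after
the type byte, so their offsets differ from the ones used here.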

We don't support the following operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase is enabled, unless
copy_data = false.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when
two_phase is enabled, unless copy_data = false.

* CREATE/ALTER SUBSCRIPTION that tries to set two_phase = true and
streaming = true at the same time.

However, we must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase and,
moreover, we don't have a replication connection open, so we don't know
where to send the data anyway.
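
The resulting choice of initial two-phase state can be summarized with a small
sketch. It mirrors the condition used in CreateSubscription() in this patch
(opts.twophase && !opts.copy_data && tables != NIL) and the state codes
documented for subtwophasestate ('d', 'p', 'e'). The function name
initial_twophase_state and the TWOPHASE_STATE_* macros are hypothetical, and
the sketch assumes the command itself connects and creates the slot.

```c
#include <assert.h>
#include <stdbool.h>

/* State codes as documented for pg_subscription.subtwophasestate. */
#define TWOPHASE_STATE_DISABLED	'd'
#define TWOPHASE_STATE_PENDING	'p'
#define TWOPHASE_STATE_ENABLED	'e'

/*
 * Hypothetical helper: which two-phase state a new subscription starts in.
 *
 * two_phase - value of the two_phase subscription option
 * copy_data - value of the copy_data subscription option
 * ntables   - number of tables found in the publications
 */
char
initial_twophase_state(bool two_phase, bool copy_data, int ntables)
{
	assert(ntables >= 0);

	/* two_phase was not requested at all */
	if (!two_phase)
		return TWOPHASE_STATE_DISABLED;

	/*
	 * Tables exist and need no initial copy, so they start out READY and
	 * two-phase can be enabled up front.
	 */
	if (!copy_data && ntables > 0)
		return TWOPHASE_STATE_ENABLED;

	/* Otherwise wait until the initial table sync has finished. */
	return TWOPHASE_STATE_PENDING;
}
```

Note that a subscription with no tables stays pending even when copy_data is
false; per the patch comments, this keeps ALTER SUBSCRIPTION ... REFRESH
PUBLICATION usable afterwards.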

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 291 ++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 130 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  31 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 +++++++++--
 src/backend/replication/logical/worker.c           | 351 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |   2 +-
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  14 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 43 files changed, 2383 insertions(+), 191 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d..0235639 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7643,6 +7643,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index a3562f3..e8cb78f 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2811,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2871,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7391,6 +7398,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepared transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index b3d1731..a6f9944 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1433905 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..6d3efb4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index eb88d87..b360892 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,6 +59,7 @@
 #define SUBOPT_REFRESH				0x00000040
 #define SUBOPT_BINARY				0x00000080
 #define SUBOPT_STREAMING			0x00000100
+#define SUBOPT_TWOPHASE_COMMIT		0x00000200
 
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
@@ -79,6 +80,7 @@ typedef struct SubOpts
 	bool		refresh;
 	bool		binary;
 	bool		streaming;
+	bool		twophase;
 } SubOpts;
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
@@ -123,6 +125,8 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 		opts->binary = false;
 	if (IsSet(supported_opts, SUBOPT_STREAMING))
 		opts->streaming = false;
+	if (IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT))
+		opts->twophase = false;
 
 	/* Parse options */
 	foreach(lc, stmt_options)
@@ -237,6 +241,29 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 			opts->specified_opts |= SUBOPT_STREAMING;
 			opts->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c.
+			 *
+			 * Note: Unsupported twophase indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT))
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			opts->specified_opts |= SUBOPT_TWOPHASE_COMMIT;
+			opts->twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -325,6 +352,25 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (opts->twophase &&
+		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
+		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
+	{
+		if (opts->streaming &&
+			IsSet(supported_opts, SUBOPT_STREAMING) &&
+			IsSet(opts->specified_opts, SUBOPT_STREAMING))
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -385,7 +431,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	supported_opts = (SUBOPT_CONNECT | SUBOPT_ENABLED | SUBOPT_CREATE_SLOT |
 					  SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 					  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
-					  SUBOPT_STREAMING);
+					  SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT);
 	parse_subscription_options(stmt->options, supported_opts, &opts);
 
 	/*
@@ -455,6 +501,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(opts.enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(opts.binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(opts.streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(opts.twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (opts.slot_name)
@@ -532,10 +582,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (opts.create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(opts.slot_name);
 
-				walrcv_create_slot(wrconn, opts.slot_name, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. We enable it once all the tables are
+				 * synced and ready. This avoids race conditions such as
+				 * prepared transactions being skipped because the checks in
+				 * should_apply_changes_for_rel() prevent changes from being
+				 * applied while tablesync for the corresponding tables is
+				 * in progress. See comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (opts.twophase && !opts.copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, opts.slot_name, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								opts.slot_name)));
@@ -865,6 +939,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -927,6 +1007,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -966,6 +1057,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -986,6 +1088,30 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				parse_subscription_options(stmt->options, SUBOPT_COPY_DATA, &opts);
 
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
 				AlterSubscription_refresh(sub, opts.copy_data);
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eaa84a..19ea159 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -436,6 +437,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 150000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -851,7 +856,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -868,6 +873,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
+		if (two_phase)
+			appendStringInfoString(&cmd, " TWO_PHASE");
+
 		switch (snapshot_action)
 		{
 			case CRS_EXPORT_SNAPSHOT:
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 453efc5..74df75e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing preparing
+				 * transactions that have locked [user] catalog tables
+				 * exclusively but as of now we ask users not to do such an
+				 * operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d536a5f..d61ef4c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,12 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= slot->data.two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +540,22 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase option
+	 * is enabled at the time of slot creation, or when the two_phase option
+	 * is given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +616,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index cb42fcb..2c191de 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
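The BEGIN PREPARE body written by logicalrep_write_begin_prepare() above is three 64-bit fields, one 32-bit xid, and a NUL-terminated gid, all integers in network byte order. The following is only a standalone sketch of that layout, not the patch's code: it leaves out the leading message-type byte, and every `sketch_`/`Sketch` name, the buffer handling, and the `SKETCH_GIDSIZE` value are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200		/* assumed stand-in for GIDSIZE */
#define SKETCH_FIXED_LEN 28		/* 8 + 8 + 8 + 4 bytes of fixed fields */

typedef struct SketchBeginPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} SketchBeginPrepare;

/* Append an nbytes-wide big-endian integer; returns the new offset. */
static size_t
sketch_put_be(unsigned char *buf, size_t off, uint64_t v, int nbytes)
{
	for (int i = nbytes - 1; i >= 0; i--)
		buf[off++] = (unsigned char) (v >> (8 * i));
	return off;
}

/* Read an nbytes-wide big-endian integer and advance *off. */
static uint64_t
sketch_get_be(const unsigned char *buf, size_t *off, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[(*off)++];
	return v;
}

/* Encode the message body; the caller supplies a large enough buffer. */
static size_t
sketch_write_begin_prepare(unsigned char *buf, const SketchBeginPrepare *m)
{
	size_t		off = 0;
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */

	assert(gidlen <= SKETCH_GIDSIZE);
	off = sketch_put_be(buf, off, m->prepare_lsn, 8);
	off = sketch_put_be(buf, off, m->end_lsn, 8);
	off = sketch_put_be(buf, off, (uint64_t) m->prepare_time, 8);
	off = sketch_put_be(buf, off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Decode the body; returns 0 on success, -1 if a check fails. */
static int
sketch_read_begin_prepare(const unsigned char *buf, size_t len,
						  SketchBeginPrepare *m)
{
	size_t		off = 0;
	size_t		gidlen;
	const unsigned char *nul;

	if (len < SKETCH_FIXED_LEN + 1)
		return -1;
	m->prepare_lsn = sketch_get_be(buf, &off, 8);
	m->end_lsn = sketch_get_be(buf, &off, 8);
	if (m->prepare_lsn == 0 || m->end_lsn == 0)
		return -1;				/* mirrors the InvalidXLogRecPtr checks */
	m->prepare_time = (int64_t) sketch_get_be(buf, &off, 8);
	m->xid = (uint32_t) sketch_get_be(buf, &off, 4);

	/* The gid must be terminated inside the message and fit the buffer. */
	nul = memchr(buf + off, 0, len - off);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - (buf + off));
	if (gidlen >= SKETCH_GIDSIZE)
		return -1;
	memcpy(m->gid, buf + off, gidlen + 1);
	return 0;
}
```

Unlike the reader in the patch, the sketch bounds the gid copy explicitly, which is one way to keep an overlong gid from overrunning the fixed-size buffer.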
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b8c5e2a..9f80794 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2576,7 +2576,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2667,7 +2667,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2714,7 +2714,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2734,7 +2734,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2753,19 +2753,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2783,12 +2784,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..a14a3d6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions, that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot, need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
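The two_phase_at bookkeeping in logical.c, reorderbuffer.c and snapbuild.c above can be summarized in a small standalone model. This is a sketch under stated assumptions rather than the patch's code: the `Sketch` types and `sketch_` functions are hypothetical, and `plugin_supports_2pc` stands in for the initial value of ctx->twophase.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLsn;		/* hypothetical stand-in for XLogRecPtr */

typedef struct SketchSlot
{
	bool		two_phase;		/* slot marked for two-phase decoding */
	SketchLsn	two_phase_at;	/* LSN at which it was enabled */
} SketchSlot;

/*
 * Model of the CreateDecodingContext() logic: two-phase decoding is active
 * if the plugin supports it and either the slot is already marked or the
 * option was given at streaming start. The slot is marked on first enable
 * and its two_phase_at is not moved afterwards.
 */
static bool
sketch_start_decoding(SketchSlot *slot, bool plugin_supports_2pc,
					  bool opt_given, SketchLsn start_lsn)
{
	bool		twophase = plugin_supports_2pc &&
		(slot->two_phase || opt_given);

	assert(start_lsn != 0);
	if (twophase && !slot->two_phase)
	{
		slot->two_phase = true;
		slot->two_phase_at = start_lsn;
	}
	return twophase;
}

/*
 * Model of the check in ReorderBufferFinishPrepared(): a prepare that lies
 * before two_phase_at was skipped earlier, so its changes must be sent
 * together with COMMIT PREPARED. Rollbacks need no such replay.
 */
static bool
sketch_send_prepare_with_commit(const SketchSlot *slot,
								SketchLsn prepare_lsn, bool is_commit)
{
	return is_commit && prepare_lsn < slot->two_phase_at;
}
```

In this model a prepare at LSN 300 on a slot enabled at LSN 500 is replayed at commit prepared time, while a prepare at LSN 700 is assumed to have been decoded at prepare time.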
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 682c107..aa3fce0 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready != NIL && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (table_states_not_ready == NIL && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1071,7 +1067,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1158,3 +1155,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY.
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
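The subscription-side rules in this file (when AllTablesyncsReady() holds, when PENDING may become ENABLED, and when REFRESH is refused) can be written as a tiny state model. It is a sketch only: the enum and `sketch_` names are hypothetical, and it collapses the apply worker restart into a single transition function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum SketchTwoPhaseState
{
	SKETCH_TWOPHASE_DISABLED,
	SKETCH_TWOPHASE_PENDING,
	SKETCH_TWOPHASE_ENABLED
} SketchTwoPhaseState;

/* A subscription with no tables is never "all ready". */
static bool
sketch_all_tablesyncs_ready(int total_tables, int not_ready_tables)
{
	assert(not_ready_tables >= 0 && not_ready_tables <= total_tables);
	return total_tables > 0 && not_ready_tables == 0;
}

/* PENDING becomes ENABLED only once every tablesync is READY. */
static SketchTwoPhaseState
sketch_next_state(SketchTwoPhaseState cur, int total_tables,
				  int not_ready_tables)
{
	if (cur == SKETCH_TWOPHASE_PENDING &&
		sketch_all_tablesyncs_ready(total_tables, not_ready_tables))
		return SKETCH_TWOPHASE_ENABLED;
	return cur;
}

/* REFRESH is refused only when two_phase is ENABLED and copy_data is true. */
static bool
sketch_refresh_allowed(SketchTwoPhaseState state, bool copy_data)
{
	return !(state == SKETCH_TWOPHASE_ENABLED && copy_data);
}
```

A subscription with zero tables stays PENDING in this model, which is what keeps a later REFRESH with copy_data usable.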
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 5fc620c..d112f8a 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two-phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because, say, it
+ * was prior to the initial consistent point, but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepared, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider that initially it was on and
+ * we have received some prepare, then we turn it off; now at commit time the
+ * server will send the entire transaction data along with the commit. With some
+ * more analysis, we can allow changing this option from off to on, but it is
+ * not clear that this alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (which would need the two_phase option to
+ * be toggled from 'on' to 'off' and then back to 'on') there is a restriction
+ * for ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted
+ * when the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get a prepare of the same GID more than once in genuine cases where
+ * we have defined multiple subscriptions for publications on the same server
+ * and the prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by the publisher, one
+ * of the prepares will be successful and the others will fail, in which case
+ * the server will send them again. Now, this can lead to a deadlock if the user
+ * has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting of
+ * the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -255,6 +328,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -784,6 +860,191 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a BEGIN PREPARE message")));
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (prepare_data.prepare_lsn != remote_final_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+								 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+								 LSN_FORMAT_ARGS(remote_final_lsn))));
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from the same node pointing to
+	 * publications on the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	begin_replication_step();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* There is no transaction when COMMIT PREPARED is called */
+	begin_replication_step();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
+		begin_replication_step();
+		FinishPreparedTransaction(gid, false);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2060,6 +2321,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2539,6 +2816,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3040,6 +3320,24 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+
+	if (!TransactionIdIsValid(xid))
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("invalid two-phase transaction ID")));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3050,6 +3348,7 @@ ApplyWorkerMain(Datum main_arg)
 	XLogRecPtr	origin_startpos;
 	char	   *myslotname;
 	WalRcvStreamOptions options;
+	int			server_version;
 
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
@@ -3208,15 +3507,59 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+
+	server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
 	options.proto.logical.proto_version =
-		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
+		server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
+		LOGICALREP_PROTO_VERSION_NUM;
+
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
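
For anyone following the GID discussion above, here is a minimal standalone sketch of how the subscriber-side GID is formed. It only mirrors the format string from TwoPhaseTransactionGid(); the typedefs and the helper name make_subscriber_gid are stand-ins of mine and not part of the patch.

```c
#include <stdio.h>
#include <stddef.h>

/* Stand-ins for PostgreSQL's Oid and TransactionId (assumed 32-bit unsigned). */
typedef unsigned int Oid;
typedef unsigned int TransactionId;

#define GIDSIZE 200				/* same value as GIDSIZE in access/xact.h */

/*
 * Hypothetical helper mirroring the format string used by
 * TwoPhaseTransactionGid(): "pg_gid_<subscription oid>_<remote xid>".
 * Returns whatever snprintf returns, so a caller could detect truncation.
 */
static int
make_subscriber_gid(Oid subid, TransactionId xid, char *gid, size_t szgid)
{
	return snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}
```

Two subscriptions applying the same remote xid end up with different GIDs because the subscription oid is part of the string, which is what avoids the repeated prepare failures described in the header comment.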
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index abd5217..f63e17e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by
+		 * plugin and decide whether to enable it at a later point of time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -839,18 +948,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1270,3 +1369,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
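
To summarize the version and option checks spread across worker.c and pgoutput.c, a rough standalone sketch follows. The constants carry the values this patch puts in logicalproto.h; the function names and the result enum are hypothetical, and the check that the plugin registered all two-phase callbacks (ctx->twophase) is left out.

```c
#include <stdbool.h>

/* Values as defined in logicalproto.h by this patch. */
#define PROTO_VERSION_NUM			1
#define PROTO_STREAM_VERSION_NUM	2
#define PROTO_TWOPHASE_VERSION_NUM	3

/* Hypothetical result codes, only for this sketch. */
typedef enum
{
	TWOPHASE_REQ_OK,
	TWOPHASE_REQ_CONFLICTS_WITH_STREAMING,
	TWOPHASE_REQ_PROTO_TOO_OLD
} TwoPhaseRequestResult;

/* Subscriber side: pick the protocol version from the publisher's version. */
static int
choose_proto_version(int server_version)
{
	if (server_version >= 150000)
		return PROTO_TWOPHASE_VERSION_NUM;
	if (server_version >= 140000)
		return PROTO_STREAM_VERSION_NUM;
	return PROTO_VERSION_NUM;
}

/*
 * Publisher side: option parsing rejects two_phase together with streaming
 * first, then startup rejects two_phase on a too-old protocol version.
 */
static TwoPhaseRequestResult
check_two_phase_request(int proto_version, bool two_phase, bool streaming)
{
	if (two_phase && streaming)
		return TWOPHASE_REQ_CONFLICTS_WITH_STREAMING;
	if (two_phase && proto_version < PROTO_TWOPHASE_VERSION_NUM)
		return TWOPHASE_REQ_PROTO_TOO_OLD;
	return TWOPHASE_REQ_OK;
}
```

In the patch itself both failure cases are reported with ereport(ERROR); the enum here just makes the ordering of the checks visible.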
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index e1e8ec2..0910546 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -242,7 +242,7 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY TWO_PHASE LOGICAL plugin */
 			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8c18b4e..33b85d8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -283,6 +283,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2be9ad9..9a2bc37 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -370,7 +370,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 3211521..912144c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4320,6 +4321,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4363,9 +4365,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 150000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4386,6 +4395,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4411,6 +4421,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4438,6 +4450,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4479,6 +4492,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index ba9bc6d..efb8c30 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..28cf352 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,7 +6415,9 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary and streaming are only supported in v14 and higher.
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
@@ -6423,6 +6425,14 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/*
+		 * Two_phase is only supported in v15 and higher.
+		 */
+		if (pset.sversion >= 150000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 0ebd5aa..d6bf725 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2764,7 +2764,7 @@ psql_completion(const char *text, int start, int end)
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
 					  "enabled", "slot_name", "streaming",
-					  "synchronous_commit");
+					  "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
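
The body of LookupGXact() is not in this part of the patch, so the following is only a sketch of the matching rule as the comments describe it: a prepared transaction counts as "ours" when the GID, the prepare end LSN and the prepare timestamp all match. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of a prepared transaction on the subscriber. */
typedef struct PreparedXactSketch
{
	const char *gid;
	uint64_t	prepare_end_lsn;	/* stand-in for XLogRecPtr */
	int64_t		prepare_time;		/* stand-in for TimestampTz */
} PreparedXactSketch;

/*
 * Return true only if an entry matches on GID, prepare end LSN and prepare
 * time. A GID match alone is not enough, since a local prepared transaction
 * could happen to use the same identifier.
 */
static bool
lookup_prepared_sketch(const PreparedXactSketch *xacts, size_t nxacts,
					   const char *gid, uint64_t prepare_end_lsn,
					   int64_t prepare_time)
{
	for (size_t i = 0; i < nxacts; i++)
	{
		if (strcmp(xacts[i].gid, gid) == 0 &&
			xacts[i].prepare_end_lsn == prepare_end_lsn &&
			xacts[i].prepare_time == prepare_time)
			return true;
	}
	return false;
}
```

This is the check apply_handle_rollback_prepared() relies on to decide whether the ROLLBACK PREPARED can be skipped.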
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 750d469..6ffa0f8 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -92,6 +102,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
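
As a compact illustration of the tri-state, here is a standalone sketch of the transition done in ApplyWorkerMain() and of the pg_dump rule. The character values are the ones defined above; the function names are mine, and all_tablesyncs_ready is assumed to be false for a subscription with no tables, as the worker.c comment describes.

```c
#include <stdbool.h>

/* Same character values as the LOGICALREP_TWOPHASE_STATE_* defines. */
#define TWOPHASE_STATE_DISABLED 'd'
#define TWOPHASE_STATE_PENDING	'p'
#define TWOPHASE_STATE_ENABLED	'e'

/*
 * State after apply worker startup: PENDING becomes ENABLED only once all
 * tablesyncs are READY; DISABLED and ENABLED are never changed here.
 */
static char
twophase_state_at_startup(char state, bool all_tablesyncs_ready)
{
	if (state == TWOPHASE_STATE_PENDING && all_tablesyncs_ready)
		return TWOPHASE_STATE_ENABLED;
	return state;
}

/* Name used in the DEBUG1 message of the apply worker. */
static const char *
twophase_state_name(char state)
{
	switch (state)
	{
		case TWOPHASE_STATE_DISABLED:
			return "DISABLED";
		case TWOPHASE_STATE_PENDING:
			return "PENDING";
		case TWOPHASE_STATE_ENABLED:
			return "ENABLED";
		default:
			return "?";
	}
}

/* pg_dump emits "two_phase = on" for anything other than DISABLED. */
static bool
dump_two_phase_on(char state)
{
	return state != TWOPHASE_STATE_DISABLED;
}
```

Note that a dumped PENDING subscription is restored with two_phase = on, so it starts out PENDING again on the target.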
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 4d20563..632381b 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -87,6 +87,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index faa3a25..ebc43a0 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
+	bool		two_phase;
 	List	   *options;
 } CreateReplicationSlotCmd;
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..0b071a6 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 *
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_REPLICATION command. We can't rely solely on the
+	 * twophase flag which only tells whether the plugin provided all the
+	 * necessary two-phase callbacks.
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..e20f2da 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d8..d7c785b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -297,7 +297,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -636,7 +640,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 2eb7e3a..34d95ea 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -84,11 +84,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 57f7dd9..ad6b4e4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+# create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table is unchanged');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber. There will be
+# 2 prepared transactions, one for each of the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a72d53a..235ca54 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1390,12 +1390,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

v93-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From fd1337ec38e9fc4a91afda10c2f04ae7c9cf9b86 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 6 Jul 2021 14:13:48 +1000
Subject: [PATCH v93] Skip empty transactions for logical replication.

The current logical replication behaviour is to send every transaction to
the subscriber even if the transaction is empty (because it does not
contain changes from the selected publications). It is a waste of CPU
cycles and network bandwidth to build/transmit these empty transactions.

This patch addresses the above problem by postponing the BEGIN / BEGIN
PREPARE message until the first change. While processing a COMMIT or
PREPARE message, if no change has been sent for that transaction, the
COMMIT or PREPARE message is not sent either. It means that pgoutput will
skip the BEGIN / COMMIT or BEGIN PREPARE / PREPARE messages for
transactions that are empty.

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  38 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 158 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  46 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 286 insertions(+), 39 deletions(-)
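
The deferred-BEGIN idea described in the commit message can be sketched
outside the server as follows. This is only an illustrative sketch, not code
from the patch: TxnState, Stream, emit(), txn_begin(), txn_change(),
txn_commit() and replay() are hypothetical names, and a string buffer stands
in for the replication stream. It assumes that BEGIN is emitted lazily by the
first published change and that COMMIT is suppressed when BEGIN was never
sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-transaction state, loosely modelled on PGOutputTxnData. */
typedef struct TxnState
{
	bool		sent_begin;		/* has BEGIN been emitted yet? */
} TxnState;

/* Hypothetical stand-in for the replication stream: a log of messages. */
typedef struct Stream
{
	char		buf[128];
} Stream;

/* Append one message name to the stream, separated by a space. */
static void
emit(Stream *s, const char *msg)
{
	size_t		used = strlen(s->buf);

	snprintf(s->buf + used, sizeof(s->buf) - used, "%s%s",
			 used ? " " : "", msg);
}

/* Begin callback: remember that nothing has been sent, emit nothing. */
static void
txn_begin(TxnState *t)
{
	t->sent_begin = false;
}

/* Change callback: the first published change sends the deferred BEGIN. */
static void
txn_change(Stream *s, TxnState *t, bool published)
{
	assert(t != NULL);
	if (!published)
		return;					/* filtered out, transaction may stay empty */
	if (!t->sent_begin)
	{
		emit(s, "BEGIN");
		t->sent_begin = true;
	}
	emit(s, "CHANGE");
}

/* Commit callback: skip COMMIT when BEGIN was never sent. */
static bool
txn_commit(Stream *s, TxnState *t)
{
	if (!t->sent_begin)
		return false;			/* empty transaction, nothing to send */
	emit(s, "COMMIT");
	return true;
}

/* Run one transaction with the given changes and return the stream text. */
static const char *
replay(Stream *s, const bool *published, int nchanges)
{
	TxnState	t;

	s->buf[0] = '\0';
	txn_begin(&t);
	for (int i = 0; i < nchanges; i++)
		txn_change(s, &t, published[i]);
	txn_commit(s, &t);
	return s->buf;
}
```

Calling replay() with only unpublished changes leaves the stream empty, while
a transaction with at least one published change yields BEGIN, its changes,
and COMMIT.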

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 002efc8..123d2f1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -884,11 +884,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command>, in which case
+      it can commit the transaction; otherwise, it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index c88ec1e..7415cf2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7550,6 +7550,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7564,6 +7571,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d61ef4c..67c762a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -936,7 +937,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -972,7 +974,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						"commit_prepared_cb")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9f80794..2724756 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2794,7 +2794,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index bf4bfeb..a6cbfb1 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -970,27 +970,39 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* There is no transaction when COMMIT PREPARED is called */
-	begin_replication_step();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In that case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	end_replication_step();
-	CommitTransactionCommand();
+		/* There is no transaction when COMMIT PREPARED is called */
+		begin_replication_step();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 286119c..7ebdb4e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -132,6 +134,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -401,10 +408,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions would send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -419,8 +448,22 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
+	txn->output_plugin_private = NULL;
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -432,10 +475,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -450,8 +511,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty prepared transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -462,12 +533,33 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of COMMIT PREPARED of an empty transaction");
+			return;
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -480,8 +572,26 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of ROLLBACK of an empty transaction");
+			return;
+		}
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -630,11 +740,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -668,6 +783,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -770,6 +894,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -777,6 +902,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -813,6 +942,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -833,6 +971,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -845,6 +984,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 7a4804f..2fa60b5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d7c785b..ffc0b56 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -442,7 +442,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0e218e0..3d246be 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -87,9 +87,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 4c372a6..8a33641 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 25;
 
 ###############################
 # Setup
@@ -318,10 +318,9 @@ $node_publisher->safe_psql('postgres', "
 
 $node_publisher->wait_for_catchup($appname_copy);
 
-# Check that the transaction has been prepared on the subscriber, there will be 2
-# prepared transactions for the 2 subscriptions.
+# Check that the transaction has been prepared on the subscriber
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;;");
-is($result, qq(2), 'transaction is prepared on subscriber');
+is($result, qq(1), 'transaction is prepared on subscriber');
 
 # Now commit the insert and verify that it IS replicated
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
@@ -337,6 +336,45 @@ is($result, qq(2), 'replicated data in subscriber table');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+   "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot
+$node_publisher->safe_psql('postgres', "
+   BEGIN;
+   INSERT INTO tab_nopub SELECT generate_series(1,10);
+   PREPARE TRANSACTION 'empty_transaction';
+   COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+   'postgres', qq(
+       SELECT get_byte(data, 0)
+       FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+           'proto_version', '1',
+           'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+   'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 235ca54..1fa6a92 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1599,6 +1599,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1

#371Ajin Cherian
itsajin@gmail.com
In reply to: tanghy.fnst@fujitsu.com (#369)
4 attachment(s)

On Fri, Jul 2, 2021 at 8:18 PM tanghy.fnst@fujitsu.com
<tanghy.fnst@fujitsu.com> wrote:

Thanks for your patch. I ran into an issue while using it. When a transaction contains TRUNCATE, the subscriber reports the error "ERROR: no data left in message" and the data is not replicated.

Steps to reproduce the issue:

(Set logical_decoding_work_mem to 64kB on the publisher so that streaming is used.)

------publisher------
create table test (a int primary key, b varchar);
create publication pub for table test;

------subscriber------
create table test (a int primary key, b varchar);
create subscription sub connection 'dbname=postgres' publication pub with(two_phase=on, streaming=on);

------publisher------
BEGIN;
TRUNCATE test;
INSERT INTO test SELECT i, md5(i::text) FROM generate_series(1001, 6000) s(i);
UPDATE test SET b = md5(b) WHERE mod(a,2) = 0;
DELETE FROM test WHERE mod(a,3) = 0;
COMMIT;

The above case works fine once the 0004 patch is removed, so I think the problem is in the 0004 patch. Please have a look.

Thanks for the test!
I hadn't handled the case where the schema is sent as the first
change of the transaction, which happens while decoding the
TRUNCATE command. In this test case the schema was sent
without a preceding stream start, hence the error on the apply worker.
I have posted an updated patch with a fix. Please test and confirm.
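
The ordering problem described above (a schema message going out before the
transaction's opening message) can be pictured with a small toy model. This is
only a sketch, not the pgoutput code: the names TxnState, Wire, ensure_begin,
send_schema, send_truncate and send_commit are hypothetical, messages are
reduced to single letters, and 'B' stands in for whichever opening message
applies (BEGIN, BEGIN PREPARE, or a stream start in streaming mode). The idea
it illustrates is that every path that can write the first message of a
transaction, including the schema path, has to go through the same
"send the opening message if not yet sent" check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-transaction state, loosely modelled on a "begin sent" flag. */
typedef struct TxnState
{
	bool		sent_begin;		/* has the opening message been written? */
} TxnState;

/* Hypothetical output stream: each message is recorded as one letter. */
typedef struct Wire
{
	char		buf[64];
	size_t		len;
} Wire;

/* Append one message letter, keeping buf NUL-terminated. */
static void
wire_put(Wire *w, char msg)
{
	assert(w->len + 1 < sizeof(w->buf));
	w->buf[w->len++] = msg;
	w->buf[w->len] = '\0';
}

/* Write the opening message exactly once, before anything else. */
static void
ensure_begin(Wire *w, TxnState *txn)
{
	if (!txn->sent_begin)
	{
		wire_put(w, 'B');
		txn->sent_begin = true;
	}
}

/* The schema can be the first thing written, so it must check too. */
static void
send_schema(Wire *w, TxnState *txn)
{
	ensure_begin(w, txn);
	wire_put(w, 'R');
}

/* TRUNCATE sends the schema first, then the change itself. */
static void
send_truncate(Wire *w, TxnState *txn)
{
	send_schema(w, txn);
	ensure_begin(w, txn);		/* no-op here, kept for symmetry */
	wire_put(w, 'T');
}

/* Skip the closing message when nothing was sent for this transaction. */
static void
send_commit(Wire *w, TxnState *txn)
{
	if (!txn->sent_begin)
		return;
	wire_put(w, 'C');
}
```

In this model, calling send_truncate and then send_commit on a fresh
transaction records B, R, T, C in that order, while calling send_commit alone
on a fresh transaction records nothing. Dropping the ensure_begin call from
send_schema would record the schema letter before B, which is the kind of
ordering the apply worker cannot parse.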

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v94-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 676827135cd5b172a48a26fece81c8ca64ca8fed Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 6 Jul 2021 06:04:41 -0400
Subject: [PATCH v94] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase
transactions. We enable the two_phase once the initial data sync is over.

* Add a new option to enable two_phase while creating a slot. We don't use
this option in the patch, but it allows external replication solutions
that use the streaming replication protocol to enable it.

* Add new subscription TAP tests, and new subscription.sql regression tests.

* Update PG documentation.

We don't support the following operations:

* ALTER SUBSCRIPTION REFRESH PUBLICATION when two_phase is enabled.

* ALTER SUBSCRIPTION {SET|ADD|DROP} PUBLICATION WITH (refresh = true) when two_phase is enabled.

* CREATE/ALTER SUBSCRIPTION that tries to set options two_phase=true and streaming=true at the same time.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 291 ++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 130 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  31 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 196 +++++++++--
 src/backend/replication/logical/worker.c           | 351 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/repl_gram.y                |   2 +-
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |  14 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/nodes/replnodes.h                      |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   6 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 43 files changed, 2383 insertions(+), 191 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d..0235639 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7643,6 +7643,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled;
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement;
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index a3562f3..e8cb78f 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2811,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2871,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction 
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7391,6 +7398,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepared transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index b3d1731..a6f9944 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1433905 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, prepared transactions are sent to the
+          subscriber at <command>PREPARE TRANSACTION</command> time. By default, a
+          transaction prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when <literal>two_phase</literal> is enabled for the subscription, the
+          internal two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..6d3efb4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
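To illustrate why LookupGXact compares more than the GID, here is a small standalone model of the matching rule. It is only a sketch: the PreparedXact struct and lookup_prepared() are hypothetical stand-ins, and it leaves out the locking and the 2PC file/WAL reads that the real function does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of one prepared transaction. */
typedef struct PreparedXact
{
    bool        valid;				/* false while the GID is not yet valid */
    const char *gid;
    uint64_t    origin_lsn;			/* prepare end LSN on the origin node */
    int64_t     origin_timestamp;	/* prepare time on the origin node */
} PreparedXact;

/*
 * Return true only if a valid entry matches the GID *and* the origin LSN
 * *and* the origin timestamp. A GID match alone is not enough, because an
 * unrelated transaction may have been prepared with the same GID.
 */
bool
lookup_prepared(const PreparedXact *xacts, size_t n, const char *gid,
                uint64_t prepare_end_lsn, int64_t prepare_timestamp)
{
    for (size_t i = 0; i < n; i++)
    {
        const PreparedXact *x = &xacts[i];

        if (!x->valid || strcmp(x->gid, gid) != 0)
            continue;

        if (x->origin_lsn == prepare_end_lsn &&
            x->origin_timestamp == prepare_timestamp)
            return true;
    }
    return false;
}
```

In this model a locally prepared transaction would carry a zero origin LSN, so it does not match a prepare received from upstream even if the GIDs collide.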
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index eb88d87..b360892 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,6 +59,7 @@
 #define SUBOPT_REFRESH				0x00000040
 #define SUBOPT_BINARY				0x00000080
 #define SUBOPT_STREAMING			0x00000100
+#define SUBOPT_TWOPHASE_COMMIT		0x00000200
 
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
@@ -79,6 +80,7 @@ typedef struct SubOpts
 	bool		refresh;
 	bool		binary;
 	bool		streaming;
+	bool		twophase;
 } SubOpts;
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
@@ -123,6 +125,8 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 		opts->binary = false;
 	if (IsSet(supported_opts, SUBOPT_STREAMING))
 		opts->streaming = false;
+	if (IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT))
+		opts->twophase = false;
 
 	/* Parse options */
 	foreach(lc, stmt_options)
@@ -237,6 +241,29 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 			opts->specified_opts |= SUBOPT_STREAMING;
 			opts->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: Unsupported twophase indicates that this call originated from
+			 * AlterSubscription.
+			 */
+			if (!IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT))
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			opts->specified_opts |= SUBOPT_TWOPHASE_COMMIT;
+			opts->twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -325,6 +352,25 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (opts->twophase &&
+		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
+		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
+	{
+		if (opts->streaming &&
+			IsSet(supported_opts, SUBOPT_STREAMING) &&
+			IsSet(opts->specified_opts, SUBOPT_STREAMING))
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
+
 }
 
 /*
@@ -385,7 +431,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	supported_opts = (SUBOPT_CONNECT | SUBOPT_ENABLED | SUBOPT_CREATE_SLOT |
 					  SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 					  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
-					  SUBOPT_STREAMING);
+					  SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT);
 	parse_subscription_options(stmt->options, supported_opts, &opts);
 
 	/*
@@ -455,6 +501,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(opts.enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(opts.binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(opts.streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(opts.twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (opts.slot_name)
@@ -532,10 +582,34 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (opts.create_slot)
 			{
+				bool twophase_enabled = false;
+
 				Assert(opts.slot_name);
 
-				walrcv_create_slot(wrconn, opts.slot_name, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. It will be enabled once all the tables
+				 * are synced and ready. This avoids race conditions such as
+				 * prepared transactions being skipped because their changes
+				 * are not applied by should_apply_changes_for_rel() while
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false then
+				 * it is safe to enable two_phase up-front because those tables
+				 * are already initially in READY state. When the subscription
+				 * has no tables, we leave the twophase state as PENDING,
+				 * to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+				 */
+				if (opts.twophase && !opts.copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, opts.slot_name, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								opts.slot_name)));
@@ -865,6 +939,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -927,6 +1007,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -966,6 +1057,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -986,6 +1088,30 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				parse_subscription_options(stmt->options, SUBOPT_COPY_DATA, &opts);
 
+				/*
+				 * The subscription option "two_phase" requires that replication
+				 * has passed the initial table synchronization phase before the
+				 * two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
 				AlterSubscription_refresh(sub, opts.copy_data);
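The two-phase state rules spread across CreateSubscription and AlterSubscription above can be summarised in a small model. This is a sketch for discussion only. The enum and both function names are hypothetical, and "creates_slot" folds together the connect and create_slot conditions under which the patch reaches walrcv_create_slot().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the disabled/pending/enabled states. */
typedef enum TwoPhaseState
{
    TWOPHASE_DISABLED,
    TWOPHASE_PENDING,
    TWOPHASE_ENABLED
} TwoPhaseState;

/*
 * State recorded at CREATE SUBSCRIPTION time. Two-phase is enabled
 * up-front only when the slot is created here, no data copy is needed
 * and the subscription already has tables (they start out READY).
 * Otherwise a requested two_phase stays PENDING until tablesync is done.
 */
TwoPhaseState
initial_twophase_state(bool two_phase, bool creates_slot,
                       bool copy_data, bool has_tables)
{
    if (!two_phase)
        return TWOPHASE_DISABLED;
    if (creates_slot && !copy_data && has_tables)
        return TWOPHASE_ENABLED;
    return TWOPHASE_PENDING;
}

/*
 * A refresh is rejected only when two-phase is already ENABLED and the
 * refresh would copy data, since that would start a new table sync.
 */
bool
refresh_allowed(TwoPhaseState state, bool copy_data)
{
    return !(state == TWOPHASE_ENABLED && copy_data);
}
```

Note that a subscription with no tables stays PENDING even with copy_data = false, which is what keeps ALTER SUBSCRIPTION ... REFRESH PUBLICATION usable afterwards.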
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eaa84a..19ea159 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -436,6 +437,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 150000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -851,7 +856,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -868,6 +873,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
+		if (two_phase)
+			appendStringInfoString(&cmd, " TWO_PHASE");
+
 		switch (snapshot_action)
 		{
 			case CRS_EXPORT_SNAPSHOT:
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 453efc5..74df75e 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing preparing
+				 * transactions that have locked [user] catalog tables
+				 * exclusively but as of now we ask users not to do such an
+				 * operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d536a5f..d61ef4c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,12 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= slot->data.two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +540,22 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +616,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
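As a reading aid for the CreateDecodingContext hunk above, the gating of two-phase decoding can be modelled as follows. This is a sketch under simplifying assumptions: SlotModel and enable_twophase_at_start() are hypothetical names, and the real code also marks the slot dirty, saves it, and updates the snapshot builder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of the persistent slot data used here. */
typedef struct SlotModel
{
    bool     two_phase;		/* slot already marked for two-phase decoding */
    uint64_t two_phase_at;	/* LSN from which prepares are decoded */
} SlotModel;

/*
 * Decide whether two-phase decoding is active for this streaming start.
 * It needs plugin support, plus either a slot that already has two_phase
 * set or the two_phase option given at start. The first time it becomes
 * active, the slot remembers the start LSN as two_phase_at.
 */
bool
enable_twophase_at_start(SlotModel *slot, bool plugin_supports_twophase,
                         bool twophase_opt_given, uint64_t start_lsn)
{
    bool twophase = plugin_supports_twophase &&
        (slot->two_phase || twophase_opt_given);

    if (twophase && !slot->two_phase)
    {
        slot->two_phase = true;
        slot->two_phase_at = start_lsn;
    }
    return twophase;
}
```

A later restart leaves two_phase_at untouched, because the slot is already marked by then.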
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index cb42fcb..2c191de 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b8c5e2a..9f80794 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2576,7 +2576,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2667,7 +2667,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2714,7 +2714,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2734,7 +2734,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2753,19 +2753,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2783,12 +2784,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..a14a3d6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions, that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot, need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 682c107..aa3fce0 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready != NIL && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,37 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (table_states_not_ready == NIL && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become properly
+	 * 'enabled' at that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as PENDING,
+	 * which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1071,7 +1067,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1158,3 +1155,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But if
+		 * table_states_not_ready was empty we still need to check again to see
+		 * if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * When there are no tables, then return false.
+	 * When no tablesyncs are busy, then all are READY.
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the pg_subscription two_phase state of the specified subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
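
The AllTablesyncsReady() rule in the tablesync.c hunks above can be modelled outside the backend. What follows is only a minimal sketch under stated assumptions: SubModel and sketch_all_tablesyncs_ready are hypothetical names, not PostgreSQL APIs, and the two counters stand in for what the patch obtains from GetSubscriptionNotReadyRelations() and HasSubscriptionRelations().

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the catalog state the patch reads from
 * pg_subscription_rel. Not a PostgreSQL structure.
 */
typedef struct SubModel
{
	size_t		total_tables;		/* all relations of the subscription */
	size_t		not_ready_tables;	/* relations not yet in READY state */
} SubModel;

/*
 * Mirrors the return expression of AllTablesyncsReady() in the patch:
 * a subscription with no tables is never "all ready", so its two_phase
 * state stays PENDING.
 */
static bool
sketch_all_tablesyncs_ready(const SubModel *sub)
{
	bool		has_subrels;

	has_subrels = sub->not_ready_tables > 0 || sub->total_tables > 0;

	return has_subrels && sub->not_ready_tables == 0;
}
```

Under this model an empty subscription yields false, which is what keeps the tri-state at PENDING and leaves ALTER SUBSCRIPTION ... REFRESH PUBLICATION usable.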
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 5fc620c..d112f8a 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,78 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions the subscription two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it calls walrcv_startstreaming to enable the
+ * publisher for two-phase commit and updates the tri-state value
+ * PENDING -> ENABLED. Now, it is possible that during the time we have not
+ * enabled two_phase, the publisher (replication server) would have skipped some
+ * prepares but we ensure that such prepares are sent along with commit
+ * prepare, see ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION ...
+ * REFRESH PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider: initially it was on and we
+ * received some prepare, then we turn it off; now at commit time the server
+ * will send the entire transaction data along with the commit. With some more
+ * analysis, we can allow changing this option from off to on but not sure if
+ * that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once in genuine cases where
+ * we have defined multiple subscriptions for publications on the same
+ * server and the prepared transaction has operations on tables subscribed to
+ * those subscriptions. For such cases, if we use the GID sent by the
+ * publisher, one of the prepares will be successful and the others will fail,
+ * in which case the server will send them again. Now, this can lead to a
+ * deadlock if the user has set synchronous_standby_names for all the
+ * subscriptions on the subscriber. To avoid such deadlocks, we generate a
+ * unique GID (consisting of the subscription oid and the xid of the prepared
+ * transaction) for each prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +131,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -255,6 +328,9 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   TupleTableSlot *remoteslot,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
 
 /*
  * Should this worker apply changes for given relation.
@@ -784,6 +860,191 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+	char		gid[GIDSIZE];
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a BEGIN PREPARE message")));
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	/* The gid must not already be prepared. */
+	TwoPhaseTransactionGid(MySubscription->oid, begin_data.xid,
+						   gid, sizeof(gid));
+	Assert(!LookupGXact(gid, begin_data.end_lsn, begin_data.prepare_time));
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (prepare_data.prepare_lsn != remote_final_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+								 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+								 LSN_FORMAT_ARGS(remote_final_lsn))));
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions from the same node point to
+	 * publications on the same node. See the comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	begin_replication_step();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* There is no transaction when COMMIT PREPARED is called */
+	begin_replication_step();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
+		begin_replication_step();
+		FinishPreparedTransaction(gid, false);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2060,6 +2321,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2539,6 +2816,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3040,6 +3320,24 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+
+	if (!TransactionIdIsValid(xid))
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("invalid two-phase transaction ID")));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3050,6 +3348,7 @@ ApplyWorkerMain(Datum main_arg)
 	XLogRecPtr	origin_startpos;
 	char	   *myslotname;
 	WalRcvStreamOptions options;
+	int			server_version;
 
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
@@ -3208,15 +3507,59 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+
+	server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
 	options.proto.logical.proto_version =
-		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
+		server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
+		LOGICALREP_PROTO_VERSION_NUM;
+
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become properly ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
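
Two small pieces of the worker.c changes above are easy to check in isolation: the subscriber-side GID format and the protocol version chosen from the publisher's server version. The sketch below is a standalone approximation, not the backend code. The sketch_* names and SKETCH_GIDSIZE are hypothetical, plain unsigned int stands in for Oid and TransactionId, and the numeric protocol values 1, 2 and 3 are assumed values for LOGICALREP_PROTO_VERSION_NUM, LOGICALREP_PROTO_STREAM_VERSION_NUM and LOGICALREP_PROTO_TWOPHASE_VERSION_NUM.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumption: a buffer size comparable to the backend's GIDSIZE. */
#define SKETCH_GIDSIZE 200

/* Assumed numeric values of the LOGICALREP_PROTO_* constants. */
#define SKETCH_PROTO_VERSION			1
#define SKETCH_PROTO_STREAM_VERSION		2
#define SKETCH_PROTO_TWOPHASE_VERSION	3

/*
 * Build the subscriber-local GID in the "pg_gid_<subid>_<xid>" format used
 * by TwoPhaseTransactionGid() in the patch. Returns 0 on success and -1
 * when either identifier is zero (the invalid value for both types), where
 * the patch would Assert or raise an ERROR instead.
 */
static int
sketch_twophase_gid(unsigned int subid, unsigned int xid,
					char *gid, size_t szgid)
{
	if (subid == 0 || xid == 0)
		return -1;

	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
	return 0;
}

/*
 * Pick the logical replication protocol version from the publisher's
 * server version number, following the nested conditional in
 * ApplyWorkerMain().
 */
static int
sketch_proto_version(int server_version)
{
	return server_version >= 150000 ? SKETCH_PROTO_TWOPHASE_VERSION :
		server_version >= 140000 ? SKETCH_PROTO_STREAM_VERSION :
		SKETCH_PROTO_VERSION;
}
```

Because the GID is derived from the subscription oid and the remote xid, two subscriptions on the same subscriber that receive the same publisher transaction end up with distinct GIDs, which is how the header comment's deadlock scenario is avoided.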
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index abd5217..f63e17e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase and
+		 * streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable streaming and decoding of prepared transactions during
+		 * slot initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -839,18 +948,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1270,3 +1369,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index e1e8ec2..0910546 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -242,7 +242,7 @@ create_replication_slot:
 					cmd->options = $5;
 					$$ = (Node *) cmd;
 				}
-			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin */
+			/* CREATE_REPLICATION_SLOT slot TEMPORARY LOGICAL plugin [TWO_PHASE] */
 			| K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_opt_list
 				{
 					CreateReplicationSlotCmd *cmd;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8c18b4e..33b85d8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -283,6 +283,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2be9ad9..9a2bc37 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -370,7 +370,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 3211521..912144c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4320,6 +4321,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4363,9 +4365,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 150000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4386,6 +4395,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4411,6 +4421,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4438,6 +4450,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4479,6 +4492,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index ba9bc6d..efb8c30 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..28cf352 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6415,7 +6415,9 @@ describeSubscriptions(const char *pattern, bool verbose)
 
 	if (verbose)
 	{
-		/* Binary mode and streaming are only supported in v14 and higher */
+		/*
+		 * Binary and streaming are only supported in v14 and higher.
+		 */
 		if (pset.sversion >= 140000)
 			appendPQExpBuffer(&buf,
 							  ", subbinary AS \"%s\"\n"
@@ -6423,6 +6425,14 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/*
+		 * Two_phase is only supported in v15 and higher.
+		 */
+		if (pset.sversion >= 150000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 0ebd5aa..d6bf725 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2764,7 +2764,7 @@ psql_completion(const char *text, int start, int end)
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
 					  "enabled", "slot_name", "streaming",
-					  "synchronous_commit");
+					  "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 750d469..6ffa0f8 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See the comments atop worker.c for more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Decode 2PC PREPARE? */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -92,6 +102,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Decode 2PC PREPARE? */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 4d20563..632381b 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -87,6 +87,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index faa3a25..ebc43a0 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -56,6 +56,7 @@ typedef struct CreateReplicationSlotCmd
 	ReplicationKind kind;
 	char	   *plugin;
 	bool		temporary;
+	bool		two_phase;
 	List	   *options;
 } CreateReplicationSlotCmd;
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..0b071a6 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 *
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_REPLICATION command. We can't rely solely on the
+	 * twophase flag, which only tells whether the plugin provided all the
+	 * necessary two-phase callbacks.
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..e20f2da 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit PREPARE decoding. Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * prepare_end_lsn and prepare_time are used to check if the downstream has
+ * received this prepared transaction in which case it can apply the rollback,
+ * otherwise, it can skip the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction with
+ * the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d8..d7c785b 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -297,7 +297,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	} xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -636,7 +640,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 2eb7e3a..34d95ea 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -84,11 +84,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..9edf907 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,7 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Enable 2PC decoding of PREPARE */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +348,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +422,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 57f7dd9..ad6b4e4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+# Create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial publisher data was not copied to subscriber table');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber. There will be 2
+# prepared transactions, one for each of the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a72d53a..235ca54 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1390,12 +1390,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

v94-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 0dd635bf050a4d6262422399c606c559e50ee0d3 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 6 Jul 2021 06:16:43 -0400
Subject: [PATCH v94] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  10 -
 src/backend/commands/subscriptioncmds.c            |  25 --
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 138 ++++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1021 insertions(+), 83 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index e8cb78f..c88ec1e 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit, or Stream Abort message.
   </para>
 
   <para>
@@ -7398,7 +7398,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7661,6 +7661,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1433905..702934e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index b360892..ab3314c 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -352,25 +352,6 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (opts->twophase &&
-		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
-		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
-	{
-		if (opts->streaming &&
-			IsSet(supported_opts, SUBOPT_STREAMING) &&
-			IsSet(opts->specified_opts, SUBOPT_STREAMING))
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
-
 }
 
 /*
@@ -939,12 +920,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index d112f8a..bf4bfeb 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -330,7 +330,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   CmdType operation);
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1045,6 +1045,90 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * sent by the server for the prepared transaction, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See the comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1260,30 +1344,20 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	if (in_streamed_transaction)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROTOCOL_VIOLATION),
-				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
 	/* Make sure we have an open transaction */
 	begin_replication_step();
 
@@ -1294,7 +1368,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -1315,7 +1389,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1394,6 +1468,32 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2337,6 +2437,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f63e17e..286119c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase and
-		 * streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index e20f2da..7a4804f 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index ad6b4e4..34ebca4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC with streaming enabled
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1
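
A note on the expected counts used throughout the streaming tests above: the figure 3334 follows directly from the SQL. A minimal sketch of the arithmetic in C, where count_not_divisible() and expected_rows_after_commit() are hypothetical helpers for illustration only and are not part of the patch:

```c
#include <assert.h>

/*
 * Count the integers in [lo, hi] that are NOT divisible by divisor.
 * Hypothetical helper for illustration; not part of the patch.
 */
static int
count_not_divisible(int lo, int hi, int divisor)
{
	int			count = 0;

	assert(lo <= hi && divisor > 0);
	for (int a = lo; a <= hi; a++)
	{
		if (a % divisor != 0)
			count++;
	}
	return count;
}

/*
 * Rows visible in test_tab once the prepared transaction commits:
 * keys 1..2 already exist, the INSERT adds keys 3..5000, the UPDATE
 * does not change the row count, and the DELETE removes every key
 * with mod(a,3) = 0.
 */
static int
expected_rows_after_commit(void)
{
	return count_not_divisible(1, 5000, 3);
}
```

Of the 5000 keys, 1666 are multiples of 3, which leaves 3334. The row with key 99999 is inserted outside the prepared transaction, after its DELETE has already run, so it survives even though 99999 is a multiple of 3; that accounts for the 3335 in the INSERT-after-PREPARE test.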

v94-0004-Skip-empty-streaming-in-progress-transaction-for.patch (application/octet-stream)
From 5cc0ef33ee902706909116aef6c82c8821661fac Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 6 Jul 2021 07:02:26 -0400
Subject: [PATCH v94] Skip empty streaming in-progress transaction for logical
 replication.

This improves the behaviour of skipping empty transactions to also
include empty streamed in-progress transactions.
---
 src/backend/replication/pgoutput/pgoutput.c | 167 +++++++++++++++++++++++++---
 1 file changed, 149 insertions(+), 18 deletions(-)
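
For readers skimming the diff, the deferral idea can be sketched in isolation. The following is a simplified stand-in, not the patch itself: TxnData, messages_sent and the three functions are hypothetical names, and the stream-stop handling (skipping STREAM STOP when no STREAM START went out) is an assumption about how the idea would be completed rather than something taken from the hunks in this message.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the patch's PGOutputTxnData (hypothetical). */
typedef struct TxnData
{
	bool		sent_begin_txn;		/* unused here, kept to mirror the struct */
	bool		sent_stream_start;	/* has STREAM START gone out for this stream? */
	int			messages_sent;		/* illustration only: messages written so far */
} TxnData;

static bool in_streaming = false;

/* "Stream start" callback: only reset the flag, write nothing yet. */
static void
stream_start(TxnData *txndata)
{
	assert(!in_streaming);		/* streams cannot nest */
	txndata->sent_stream_start = false;
	in_streaming = true;
}

/* Change callback: emit the deferred STREAM START before the first change. */
static void
stream_change(TxnData *txndata)
{
	if (in_streaming && !txndata->sent_stream_start)
	{
		txndata->sent_stream_start = true;
		txndata->messages_sent++;	/* STREAM START */
	}
	txndata->messages_sent++;		/* the change itself */
}

/* "Stream stop" callback: assumed to skip STREAM STOP for an empty stream. */
static void
stream_stop(TxnData *txndata)
{
	assert(in_streaming);
	if (txndata->sent_stream_start)
		txndata->messages_sent++;	/* STREAM STOP */
	in_streaming = false;
}
```

With this shape, a stream that carries no changes writes nothing at all, while a stream with two changes writes four messages (start, two changes, stop).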

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7ebdb4e..86d0c0a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -65,6 +65,8 @@ static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
+static void pgoutput_send_stream_start(struct LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn);
 static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
@@ -134,9 +136,21 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+/*
+ * Maintain the per-transaction level variables to track whether the
+ * transaction and/or streams have written any changes.
+ * BEGIN / BEGIN PREPARE is held back until the first
+ * change needs to be sent. In streaming mode the transaction can
+ * be decoded in streams, so along with maintaining whether the
+ * transaction has written any changes, we also need to track whether the
+ * current stream has written any changes. START STREAM is held back until
+ * the first change is streamed. This is done so that empty transactions and
+ * streams which do not have any changes can be dropped.
+ */
 typedef struct PGOutputTxnData
 {
 	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+	bool sent_stream_start; /* flag indicating if stream start has been sent */
 } PGOutputTxnData;
 
 /* Map used to remember which relation schemas we sent. */
@@ -610,6 +624,8 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 	bool		schema_sent;
 	TransactionId xid = InvalidTransactionId;
 	TransactionId topxid = InvalidTransactionId;
+	PGOutputTxnData *txndata;
+	ReorderBufferTXN *toptxn;
 
 	/*
 	 * Remember XID of the (sub)transaction for the change. We don't care if
@@ -623,9 +639,16 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 		xid = change->txn->xid;
 
 	if (change->txn->toptxn)
+	{
 		topxid = change->txn->toptxn->xid;
+		toptxn = change->txn->toptxn;
+	}
 	else
+	{
 		topxid = xid;
+		toptxn = txn;
+	}
+
 
 	/*
 	 * Do we need to send the schema? We do track streamed transactions
@@ -648,6 +671,23 @@ maybe_send_schema(LogicalDecodingContext *ctx,
 	if (schema_sent)
 		return;
 
+	/* set up txndata */
+	txndata = toptxn->output_plugin_private;
+
+	/*
+	 * Before we send schema, make sure that STREAM START/BEGIN/BEGIN PREPARE
+	 * is sent. If not, send now.
+	 */
+	if (in_streaming && !txndata->sent_stream_start)
+		pgoutput_send_stream_start(ctx, toptxn);
+	else if (!txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(toptxn))
+			pgoutput_begin_prepare(ctx, toptxn);
+		else
+			pgoutput_begin(ctx, toptxn);
+	}
+
 	/*
 	 * Nope, so send the schema.  If the changes will be published using an
 	 * ancestor's schema, not the relation's own, send that ancestor's schema
@@ -746,9 +786,8 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
-	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
-	if (!in_streaming)
-		Assert(txndata);
+	/* should have set up txndata as part of BEGIN/BEGIN PREPARE/START STREAM */
+	Assert(txndata);
 
 	if (!is_publishable_relation(relation))
 		return;
@@ -783,8 +822,11 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* If streaming, send STREAM START if we haven't yet */
+	if (in_streaming && !txndata->sent_stream_start)
+		pgoutput_send_stream_start(ctx, txn);
 	/* output BEGIN if we haven't yet */
-	if (!in_streaming && !txndata->sent_begin_txn)
+	else if (!txndata->sent_begin_txn)
 	{
 		if (rbtxn_prepared(txn))
 			pgoutput_begin_prepare(ctx, txn);
@@ -902,9 +944,8 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
-	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
-	if (!in_streaming)
-		Assert(txndata);
+	/* Should have set up txndata as part of BEGIN/BEGIN PREPARE/START STREAM */
+	Assert(txndata);
 
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
@@ -942,8 +983,11 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* If streaming, send STREAM START if we haven't yet */
+		if (in_streaming && !txndata->sent_stream_start)
+			pgoutput_send_stream_start(ctx, txn);
 		/* output BEGIN if we haven't yet */
-		if (!in_streaming && !txndata->sent_begin_txn)
+		else if (!txndata->sent_begin_txn)
 		{
 			if (rbtxn_prepared(txn))
 				pgoutput_begin_prepare(ctx, txn);
@@ -984,16 +1028,24 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
-	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
-	if (!in_streaming && transactional)
+	/* Set up txndata for streaming and transactional messages */
+	if (in_streaming || transactional)
 	{
 		txndata = (PGOutputTxnData *) txn->output_plugin_private;
-		if (!txndata->sent_begin_txn)
+
+		/* If streaming, send STREAM START if we haven't yet */
+		if (in_streaming && !txndata->sent_stream_start)
+			pgoutput_send_stream_start(ctx, txn);
+		/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+		else if (transactional)
 		{
-			if (rbtxn_prepared(txn))
-				pgoutput_begin_prepare(ctx, txn);
-			else
-				pgoutput_begin(ctx, txn);
+			if (!txndata->sent_begin_txn)
+			{
+				if (rbtxn_prepared(txn))
+					pgoutput_begin_prepare(ctx, txn);
+				else
+					pgoutput_begin(ctx, txn);
+			}
 		}
 	}
 
@@ -1076,12 +1128,37 @@ static void
 pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 					  ReorderBufferTXN *txn)
 {
-	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData *txndata = txn->output_plugin_private;
 
 	/* we can't nest streaming of transactions */
 	Assert(!in_streaming);
 
 	/*
+	 * Don't actually send stream start here, instead set a flag that indicates
+	 * that stream start hasn't been sent and wait for the first actual change
+	 * for this stream to be sent and then send stream start. This is done
+	 * to avoid sending empty streams without any changes.
+	 */
+	if (txndata == NULL)
+	{
+		txndata =
+			MemoryContextAllocZero(ctx->context, sizeof(PGOutputTxnData));
+		txndata->sent_begin_txn = false;
+		txn->output_plugin_private = txndata;
+	}
+
+	txndata->sent_stream_start = false;
+	in_streaming = true;
+}
+
+static void
+pgoutput_send_stream_start(struct LogicalDecodingContext *ctx,
+						   ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*txndata = (PGOutputTxnData *) txn->output_plugin_private;
+
+	/*
 	 * If we already sent the first stream for this transaction then don't
 	 * send the origin id in the subsequent streams.
 	 */
@@ -1096,8 +1173,11 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 
 	OutputPluginWrite(ctx, true);
 
-	/* we're streaming a chunk of transaction now */
-	in_streaming = true;
+	/*
+	 * Set the flags that indicate that changes were sent as part of
+	 * the transaction and the stream.
+	 */
+	txndata->sent_begin_txn = txndata->sent_stream_start = true;
 }
 
 /*
@@ -1107,9 +1187,18 @@ static void
 pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
 					 ReorderBufferTXN *txn)
 {
+	PGOutputTxnData *data = txn->output_plugin_private;
+
 	/* we should be streaming a trasanction */
 	Assert(in_streaming);
 
+	if (!data->sent_stream_start)
+	{
+		in_streaming = false;
+		elog(DEBUG1, "Skipping replication of an empty transaction in stream stop");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_stream_stop(ctx->out);
 	OutputPluginWrite(ctx, true);
@@ -1128,6 +1217,8 @@ pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 					  XLogRecPtr abort_lsn)
 {
 	ReorderBufferTXN *toptxn;
+	PGOutputTxnData  *txndata;
+	bool sent_begin_txn;
 
 	/*
 	 * The abort should happen outside streaming block, even for streamed
@@ -1137,6 +1228,21 @@ pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 
 	/* determine the toplevel transaction */
 	toptxn = (txn->toptxn) ? txn->toptxn : txn;
+	txndata = toptxn->output_plugin_private;
+	sent_begin_txn = txndata->sent_begin_txn;
+
+	if (txn->toptxn == NULL)
+	{
+		pfree(txndata);
+		txn->output_plugin_private = NULL;
+	}
+
+	if (!sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction in stream abort");
+		return;
+	}
+
 
 	Assert(rbtxn_is_streamed(toptxn));
 
@@ -1156,6 +1262,9 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 					   ReorderBufferTXN *txn,
 					   XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData *txndata = txn->output_plugin_private;
+	bool			sent_begin_txn = txndata->sent_begin_txn;
+
 	/*
 	 * The commit should happen outside streaming block, even for streamed
 	 * transactions. The transaction has to be marked as streamed, though.
@@ -1163,6 +1272,16 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 	Assert(!in_streaming);
 	Assert(rbtxn_is_streamed(txn));
 
+	pfree(txndata);
+	txn->output_plugin_private = NULL;
+
+	/* If no changes were part of this transaction then drop the commit */
+	if (!sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction in stream commit");
+		return;
+	}
+
 	OutputPluginUpdateProgress(ctx);
 
 	OutputPluginPrepareWrite(ctx, true);
@@ -1182,8 +1301,20 @@ pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
 							ReorderBufferTXN *txn,
 							XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData *txndata = txn->output_plugin_private;
+	bool			sent_begin_txn = txndata->sent_begin_txn;
+
 	Assert(rbtxn_is_streamed(txn));
 
+	pfree(txndata);
+	txn->output_plugin_private = NULL;
+
+	if (!sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction in stream prepare");
+		return;
+	}
+
 	OutputPluginUpdateProgress(ctx);
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
-- 
1.8.3.1

Attachment: v94-0003-Skip-empty-transactions-for-logical-replication.patch (application/octet-stream)
From 69f11f5c107182bb16f9373028d1ccf07a2cd024 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Tue, 6 Jul 2021 06:49:08 -0400
Subject: [PATCH v94] Skip empty transactions for logical replication.

The current logical replication behaviour is to send every transaction to
the subscriber, even if the transaction is empty (because it does not
contain changes from the selected publications). Building and transmitting
these empty transactions wastes CPU cycles and network bandwidth.

This patch addresses the problem by postponing the BEGIN / BEGIN PREPARE
message until the first change is sent. When processing a COMMIT or PREPARE
message, if no change was sent for that transaction, the COMMIT or PREPARE
message is not sent either. As a result, pgoutput skips the BEGIN / COMMIT
or BEGIN PREPARE / PREPARE messages for empty transactions.

Discussion:
https://postgr.es/m/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c           |   7 +-
 doc/src/sgml/logicaldecoding.sgml               |  12 +-
 doc/src/sgml/protocol.sgml                      |  15 +++
 src/backend/replication/logical/logical.c       |   9 +-
 src/backend/replication/logical/proto.c         |  16 ++-
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/replication/logical/worker.c        |  38 ++++--
 src/backend/replication/pgoutput/pgoutput.c     | 158 +++++++++++++++++++++++-
 src/include/replication/logicalproto.h          |   8 +-
 src/include/replication/output_plugin.h         |   4 +-
 src/include/replication/reorderbuffer.h         |   4 +-
 src/test/subscription/t/020_messages.pl         |   5 +-
 src/test/subscription/t/021_twophase.pl         |  46 ++++++-
 src/tools/pgindent/typedefs.list                |   1 +
 14 files changed, 286 insertions(+), 39 deletions(-)

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index e5cd84e..408dbfc 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -86,7 +86,9 @@ static void pg_decode_prepare_txn(LogicalDecodingContext *ctx,
 								  XLogRecPtr prepare_lsn);
 static void pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx,
 										  ReorderBufferTXN *txn,
-										  XLogRecPtr commit_lsn);
+										  XLogRecPtr commit_lsn,
+										  XLogRecPtr prepare_end_lsn,
+										  TimestampTz prepare_time);
 static void pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 											ReorderBufferTXN *txn,
 											XLogRecPtr prepare_end_lsn,
@@ -390,7 +392,8 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* COMMIT PREPARED callback */
 static void
 pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							  XLogRecPtr commit_lsn)
+							  XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							  TimestampTz prepare_time)
 {
 	TestDecodingData *data = ctx->output_plugin_private;
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 002efc8..123d2f1 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -884,11 +884,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
       The required <function>commit_prepared_cb</function> callback is called
       whenever a transaction <command>COMMIT PREPARED</command> has been decoded.
       The <parameter>gid</parameter> field, which is part of the
-      <parameter>txn</parameter> parameter, can be used in this callback.
+      <parameter>txn</parameter> parameter, can be used in this callback. The
+      parameters <parameter>prepare_end_lsn</parameter> and
+      <parameter>prepare_time</parameter> can be used to check if the plugin
+      has received this <command>PREPARE TRANSACTION</command> in which case
+      it can commit the transaction, otherwise, it can skip the commit. The
+      <parameter>gid</parameter> alone is not sufficient because the downstream
+      node can have a prepared transaction with the same identifier.
 <programlisting>
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
-                                               XLogRecPtr commit_lsn);
+                                               XLogRecPtr commit_lsn,
+                                               XLogRecPtr prepare_end_lsn,
+                                               TimestampTz prepare_time);
 </programlisting>
      </para>
     </sect3>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index c88ec1e..7415cf2 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7550,6 +7550,13 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                The end LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 The LSN of the commit prepared.
 </para></listitem>
 </varlistentry>
@@ -7564,6 +7571,14 @@ are available since protocol version 3.
 <varlistentry>
 <term>Int64</term>
 <listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
                 Commit timestamp of the transaction. The value is in number
                 of microseconds since PostgreSQL epoch (2000-01-01).
 </para></listitem>
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d61ef4c..67c762a 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -63,7 +63,8 @@ static void begin_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn
 static void prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 							   XLogRecPtr prepare_lsn);
 static void commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-									   XLogRecPtr commit_lsn);
+									   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+									   TimestampTz prepare_time);
 static void rollback_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 										 XLogRecPtr prepare_end_lsn, TimestampTz prepare_time);
 static void change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
@@ -936,7 +937,8 @@ prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 
 static void
 commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
-						   XLogRecPtr commit_lsn)
+						   XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+						   TimestampTz prepare_time)
 {
 	LogicalDecodingContext *ctx = cache->private_data;
 	LogicalErrorCallbackState state;
@@ -972,7 +974,8 @@ commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 						"commit_prepared_cb")));
 
 	/* do the actual work: call callback */
-	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn);
+	ctx->callbacks.commit_prepared_cb(ctx, txn, commit_lsn, prepare_end_lsn,
+									  prepare_time);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 8e03006..4653d6d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -206,7 +206,9 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
  */
 void
 logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-								 XLogRecPtr commit_lsn)
+								 XLogRecPtr commit_lsn,
+								 XLogRecPtr prepare_end_lsn,
+								 TimestampTz prepare_time)
 {
 	uint8		flags = 0;
 
@@ -222,8 +224,10 @@ logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
 	pq_sendbyte(out, flags);
 
 	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
 	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 
@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
 
 	/* read fields */
+	prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in commit prepared message");
 	prepare_data->commit_lsn = pq_getmsgint64(in);
 	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "commit_lsn is not set in commit prepared message");
-	prepare_data->end_lsn = pq_getmsgint64(in);
-	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_end_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_end_lsn is not set in commit prepared message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->commit_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 9f80794..2724756 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2794,7 +2794,7 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	txn->origin_lsn = origin_lsn;
 
 	if (is_commit)
-		rb->commit_prepared(rb, txn, commit_lsn);
+		rb->commit_prepared(rb, txn, commit_lsn, prepare_end_lsn, prepare_time);
 	else
 		rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index bf4bfeb..a6cbfb1 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -970,27 +970,39 @@ apply_handle_commit_prepared(StringInfo s)
 	/* Compute GID for two_phase transactions. */
 	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
 						   gid, sizeof(gid));
-
-	/* There is no transaction when COMMIT PREPARED is called */
-	begin_replication_step();
-
 	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
+	 * It is possible that we haven't received the prepare because
+	 * the transaction did not have any changes relevant to this
+	 * subscription and was essentially an empty prepare. In which case,
+	 * the walsender is optimized to drop the empty transaction and the
+	 * accompanying prepare. Silently ignore if we don't find the prepared
+	 * transaction.
 	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.commit_time;
+	if (LookupGXact(gid, prepare_data.prepare_end_lsn,
+					prepare_data.prepare_time))
+	{
 
-	FinishPreparedTransaction(gid, true);
-	end_replication_step();
-	CommitTransactionCommand();
+		/* There is no transaction when COMMIT PREPARED is called */
+		begin_replication_step();
+
+		/*
+		 * Update origin state so we can restart streaming from correct position
+		 * in case of crash.
+		 */
+		replorigin_session_origin_lsn = prepare_data.commit_end_lsn;
+		replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+		FinishPreparedTransaction(gid, true);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
 	pgstat_report_stat(false);
 
-	store_flush_position(prepare_data.end_lsn);
+	store_flush_position(prepare_data.commit_end_lsn);
 	in_remote_transaction = false;
 
 	/* Process any tables that are being synchronized in parallel. */
-	process_syncing_tables(prepare_data.end_lsn);
+	process_syncing_tables(prepare_data.commit_end_lsn);
 
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 286119c..7ebdb4e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -56,7 +56,9 @@ static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
 static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
 								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
-										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn,
+										 XLogRecPtr prepare_end_lsn,
+										 TimestampTz prepare_time);
 static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 										   ReorderBufferTXN *txn,
 										   XLogRecPtr prepare_end_lsn,
@@ -132,6 +134,11 @@ typedef struct RelationSyncEntry
 	TupleConversionMap *map;
 } RelationSyncEntry;
 
+typedef struct PGOutputTxnData
+{
+	bool sent_begin_txn;    /* flag indicating whether begin has been sent */
+} PGOutputTxnData;
+
 /* Map used to remember which relation schemas we sent. */
 static HTAB *RelationSyncCache = NULL;
 
@@ -401,10 +408,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	PGOutputTxnData    *data = MemoryContextAllocZero(ctx->context,
+														sizeof(PGOutputTxnData));
+
+	/*
+	 * Don't send BEGIN message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN and COMMIT messages to subscribers,
+	 * using bandwidth on something with little/no use for logical replication.
+	 */
+	data->sent_begin_txn = false;
+	txn->output_plugin_private = data;
+}
+
+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -419,8 +448,22 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					XLogRecPtr commit_lsn)
 {
+	PGOutputTxnData	*data = (PGOutputTxnData *) txn->output_plugin_private;
+	bool            skip;
+
+	Assert(data);
+	skip = !data->sent_begin_txn;
+	pfree(data);
+	txn->output_plugin_private = NULL;
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip COMMIT message if nothing was sent */
+	if (skip)
+	{
+		elog(DEBUG1, "Skipping replication of an empty transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_commit(ctx->out, txn, commit_lsn);
 	OutputPluginWrite(ctx, true);
@@ -432,10 +475,28 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 static void
 pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+	/*
+	 * Don't send BEGIN PREPARE message here. Instead, postpone it until the first
+	 * change. In logical replication, a common scenario is to replicate a set
+	 * of tables (instead of all tables) and transactions whose changes were on
+	 * table(s) that are not published will produce empty transactions. These
+	 * empty transactions will send BEGIN PREPARE and COMMIT PREPARED messages
+	 * to subscribers, using bandwidth on something with little/no use
+	 * for logical replication.
+	 */
+	pgoutput_begin_txn(ctx, txn);
+}
+
+static void
+pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
 	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
 
+	Assert(data);
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin_prepare(ctx->out, txn);
+	data->sent_begin_txn = true;
 
 	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
 					 send_replication_origin);
@@ -450,8 +511,18 @@ static void
 pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					 XLogRecPtr prepare_lsn)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
+	Assert(data);
 	OutputPluginUpdateProgress(ctx);
 
+	/* skip PREPARE message if nothing was sent */
+	if (!data->sent_begin_txn)
+	{
+		elog(DEBUG1, "Skipping replication of an empty prepared transaction");
+		return;
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
 	OutputPluginWrite(ctx, true);
@@ -462,12 +533,33 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
  */
 static void
 pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
-							 XLogRecPtr commit_lsn)
+							 XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn,
+							 TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending COMMIT PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of COMMIT PREPARED of an empty transaction");
+			return;
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
-	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn, prepare_end_lsn,
+									 prepare_time);
 	OutputPluginWrite(ctx, true);
 }
 
@@ -480,8 +572,26 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
 							   XLogRecPtr prepare_end_lsn,
 							   TimestampTz prepare_time)
 {
+	PGOutputTxnData    *data = (PGOutputTxnData *) txn->output_plugin_private;
+
 	OutputPluginUpdateProgress(ctx);
 
+	/*
+	 * skip sending ROLLBACK PREPARED message if prepared transaction
+	 * has not been sent.
+	 */
+	if (data)
+	{
+		bool skip = !data->sent_begin_txn;
+		pfree(data);
+		txn->output_plugin_private = NULL;
+		if (skip)
+		{
+			elog(DEBUG1,
+				 "Skipping replication of ROLLBACK of an empty transaction");
+			return;
+		}
+	}
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
 									   prepare_time);
@@ -630,11 +740,16 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				Relation relation, ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	TransactionId xid = InvalidTransactionId;
 	Relation	ancestor = NULL;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	if (!is_publishable_relation(relation))
 		return;
 
@@ -668,6 +783,15 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			Assert(false);
 	}
 
+	/* output BEGIN if we haven't yet */
+	if (!in_streaming && !txndata->sent_begin_txn)
+	{
+		if (rbtxn_prepared(txn))
+			pgoutput_begin_prepare(ctx, txn);
+		else
+			pgoutput_begin(ctx, txn);
+	}
+
 	/* Avoid leaking memory by using and resetting our own context */
 	old = MemoryContextSwitchTo(data->context);
 
@@ -770,6 +894,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				  int nrelations, Relation relations[], ReorderBufferChange *change)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;
 	MemoryContext old;
 	RelationSyncEntry *relentry;
 	int			i;
@@ -777,6 +902,10 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	Oid		   *relids;
 	TransactionId xid = InvalidTransactionId;
 
+	/* If not streaming, should have setup txndata as part of BEGIN/BEGIN PREPARE */
+	if (!in_streaming)
+		Assert(txndata);
+
 	/* Remember the xid for the change in streaming mode. See pgoutput_change. */
 	if (in_streaming)
 		xid = change->txn->xid;
@@ -813,6 +942,15 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (nrelids > 0)
 	{
+		/* output BEGIN if we haven't yet */
+		if (!in_streaming && !txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+
 		OutputPluginPrepareWrite(ctx, true);
 		logicalrep_write_truncate(ctx->out,
 								  xid,
@@ -833,6 +971,7 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				 const char *message)
 {
 	PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
+	PGOutputTxnData *txndata;
 	TransactionId xid = InvalidTransactionId;
 
 	if (!data->messages)
@@ -845,6 +984,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (in_streaming)
 		xid = txn->xid;
 
+	/* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */
+	if (!in_streaming && transactional)
+	{
+		txndata = (PGOutputTxnData *) txn->output_plugin_private;
+		if (!txndata->sent_begin_txn)
+		{
+			if (rbtxn_prepared(txn))
+				pgoutput_begin_prepare(ctx, txn);
+			else
+				pgoutput_begin(ctx, txn);
+		}
+	}
+
 	OutputPluginPrepareWrite(ctx, true);
 	logicalrep_write_message(ctx->out,
 							 xid,
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 7a4804f..2fa60b5 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -150,8 +150,10 @@ typedef struct LogicalRepPreparedTxnData
  */
 typedef struct LogicalRepCommitPreparedTxnData
 {
+	XLogRecPtr	prepare_end_lsn;
 	XLogRecPtr	commit_lsn;
-	XLogRecPtr	end_lsn;
+	XLogRecPtr	commit_end_lsn;
+	TimestampTz prepare_time;
 	TimestampTz commit_time;
 	TransactionId xid;
 	char		gid[GIDSIZE];
@@ -190,7 +192,9 @@ extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 extern void logicalrep_read_prepare(StringInfo in,
 									LogicalRepPreparedTxnData *prepare_data);
 extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
-											 XLogRecPtr commit_lsn);
+											 XLogRecPtr commit_lsn,
+											 XLogRecPtr prepare_end_lsn,
+											 TimestampTz prepare_time);
 extern void logicalrep_read_commit_prepared(StringInfo in,
 											LogicalRepCommitPreparedTxnData *prepare_data);
 extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
index 810495e..0d28306 100644
--- a/src/include/replication/output_plugin.h
+++ b/src/include/replication/output_plugin.h
@@ -128,7 +128,9 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
  */
 typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /*
  * Called for ROLLBACK PREPARED.
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index d7c785b..ffc0b56 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -442,7 +442,9 @@ typedef void (*ReorderBufferPrepareCB) (ReorderBuffer *rb,
 /* commit prepared callback signature */
 typedef void (*ReorderBufferCommitPreparedCB) (ReorderBuffer *rb,
 											   ReorderBufferTXN *txn,
-											   XLogRecPtr commit_lsn);
+											   XLogRecPtr commit_lsn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
 
 /* rollback  prepared callback signature */
 typedef void (*ReorderBufferRollbackPreparedCB) (ReorderBuffer *rb,
diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0e218e0..3d246be 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -87,9 +87,8 @@ $result = $node_publisher->safe_psql(
 			'publication_names', 'tap_pub')
 ));
 
-# 66 67 == B C == BEGIN COMMIT
-is( $result, qq(66
-67),
+# no message and no BEGIN and COMMIT because of empty transaction optimization
+is($result, qq(),
 	'option messages defaults to false so message (M) is not available on slot'
 );
 
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 4c372a6..8a33641 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -6,7 +6,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 24;
+use Test::More tests => 25;
 
 ###############################
 # Setup
@@ -318,10 +318,9 @@ $node_publisher->safe_psql('postgres', "
 
 $node_publisher->wait_for_catchup($appname_copy);
 
-# Check that the transaction has been prepared on the subscriber, there will be 2
-# prepared transactions for the 2 subscriptions.
+# Check that the transaction has been prepared on the subscriber
 $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;;");
-is($result, qq(2), 'transaction is prepared on subscriber');
+is($result, qq(1), 'transaction is prepared on subscriber');
 
 # Now commit the insert and verify that it IS replicated
 $node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
@@ -337,6 +336,45 @@ is($result, qq(2), 'replicated data in subscriber table');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
 $node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
 
+##############################
+# Test empty prepares
+##############################
+
+# create a table that is not part of the publication
+$node_publisher->safe_psql('postgres',
+   "CREATE TABLE tab_nopub (a int PRIMARY KEY)");
+
+# disable the subscription so that we can peek at the slot
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
+
+# wait for the replication slot to become inactive in the publisher
+$node_publisher->poll_query_until('postgres',
+   "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
+
+# create a transaction with no changes relevant to the slot
+$node_publisher->safe_psql('postgres', "
+   BEGIN;
+   INSERT INTO tab_nopub SELECT generate_series(1,10);
+   PREPARE TRANSACTION 'empty_transaction';
+   COMMIT PREPARED 'empty_transaction';");
+
+# peek at the contents of the slot
+$result = $node_publisher->safe_psql(
+   'postgres', qq(
+       SELECT get_byte(data, 0)
+       FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
+           'proto_version', '1',
+           'publication_names', 'tap_pub')
+));
+
+# the empty transaction should be skipped
+is($result, qq(),
+   'empty transaction dropped on slot'
+);
+
+# enable the subscription to test cleanup
+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub ENABLE");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 235ca54..1fa6a92 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1599,6 +1599,7 @@ PGMessageField
 PGModuleMagicFunction
 PGNoticeHooks
 PGOutputData
+PGOutputTxnData
 PGPROC
 PGP_CFB
 PGP_Context
-- 
1.8.3.1
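
The test added above expects a prepared transaction with no published changes to leave nothing on the slot. One plausible way to get that behaviour, sketched here as an assumption since the pgoutput changes themselves are not reproduced in this excerpt, is to defer the BEGIN message until the first change that is actually sent. TxnState, Sink, sink_init and the txn_* functions are hypothetical names; 'B' and 'C' are the BEGIN and COMMIT type bytes (66 and 67) mentioned in the test comment, and 'I' stands for an insert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-transaction state; not the patch's PGOutputTxnData. */
typedef struct TxnState
{
	bool		sent_begin;
} TxnState;

/* Hypothetical output sink: records one message-type byte per message. */
typedef struct Sink
{
	char		buf[64];
	size_t		len;
} Sink;

void
sink_init(Sink *s)
{
	memset(s, 0, sizeof(*s));
}

static void
emit(Sink *s, char type)
{
	assert(s->len < sizeof(s->buf) - 1);
	s->buf[s->len++] = type;
	s->buf[s->len] = '\0';
}

/* Remember that a transaction started, but do not send BEGIN yet. */
void
txn_begin(TxnState *txn)
{
	txn->sent_begin = false;
}

/* Send BEGIN lazily, just before the first change that is published. */
void
txn_change(TxnState *txn, Sink *s, bool published)
{
	if (!published)
		return;
	if (!txn->sent_begin)
	{
		emit(s, 'B');
		txn->sent_begin = true;
	}
	emit(s, 'I');
}

/* Only send COMMIT if a BEGIN went out; otherwise the transaction is skipped. */
void
txn_commit(TxnState *txn, Sink *s)
{
	if (txn->sent_begin)
		emit(s, 'C');
}
```

With this shape, a transaction that only touches unpublished tables produces no output at all, which is what the empty-result assertions in the test check for.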

#372tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
In reply to: Ajin Cherian (#371)
RE: [HACKERS] logical decoding of two-phase transactions

On Tuesday, July 6, 2021 7:18 PM Ajin Cherian <itsajin@gmail.com> wrote:

thanks for the test!
I hadn't updated the case where sending schema across was the first
change of the transaction as part of the decoding of the
truncate command. In this test case, the schema was sent across
without the stream start, hence the error on the apply worker.
I have updated with a fix. Please do a test and confirm.

Thanks for your patch.
I have tested and confirmed that the issue was fixed.

Regards
Tang

#373Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#370)
1 attachment(s)

On Tue, Jul 6, 2021 at 9:58 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v93*

Thanks, I have gone through the 0001 patch and made a number of
changes. (a) Removed some of the code which was leftover from previous
versions, (b) Removed the Assert in apply_handle_begin_prepare() as I
don't think that makes sense, (c) added/changed comments and made a
few other cosmetic changes, (d) ran pgindent.

Let me know what you think of the attached?

--
With Regards,
Amit Kapila.

Attachments:

v95-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From 5f4c86d34fd971799387738a01ed96725be7aa39 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 6 Jul 2021 13:45:55 +1000
Subject: [PATCH v95] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable
two-phase transactions. We enable the two_phase once the initial data sync
is over.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

The streaming option is not allowed with this new two_phase option. Support
for combining the two can be added in a separate patch.

We don't allow toggling the two_phase option of a subscription because it can
lead to an inconsistent replica. For the same reason, we don't allow
refreshing the publication once two_phase is enabled for a subscription
unless the copy_data option is false.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi, Greg Nancarrow
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 291 ++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 131 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  31 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 197 +++++++++--
 src/backend/replication/logical/worker.c           | 347 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |   8 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   7 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 41 files changed, 2375 insertions(+), 189 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d..79c26c1 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7643,6 +7643,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State code:
+       <literal>d</literal> = two_phase mode was not requested, so is disabled,
+       <literal>p</literal> = two_phase mode was requested, but is pending enablement,
+       <literal>e</literal> = two_phase mode was requested, and is enabled.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index a3562f3..e8cb78f 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2811,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2871,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction 
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7391,6 +7398,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepared transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
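
As a reading aid for the Begin Prepare layout documented above, here is a minimal C sketch of how a client might decode that message. It assumes the integers are sent in network byte order, as elsewhere in the PostgreSQL wire protocol, and that the GID is a NUL-terminated string. BeginPrepareMsg, read_be64, read_be32 and parse_begin_prepare are hypothetical names for illustration, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the documented fields; not a PostgreSQL type. */
typedef struct BeginPrepareMsg
{
	uint64_t	prepare_lsn;	/* LSN of the prepare */
	uint64_t	end_lsn;		/* end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	const char *gid;			/* points into the caller's buffer */
} BeginPrepareMsg;

static uint64_t
read_be64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

static uint32_t
read_be32(const unsigned char *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
		((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/*
 * Returns 0 on success, or -1 if the buffer is too short, does not start
 * with 'b', or the GID is not NUL-terminated within the buffer.
 */
int
parse_begin_prepare(const unsigned char *buf, size_t len, BeginPrepareMsg *out)
{
	const size_t fixed = 1 + 8 + 8 + 8 + 4; /* type byte + fixed-width fields */

	assert(buf != NULL && out != NULL);
	if (len <= fixed || buf[0] != 'b')
		return -1;
	if (memchr(buf + fixed, '\0', len - fixed) == NULL)
		return -1;

	out->prepare_lsn = read_be64(buf + 1);
	out->end_lsn = read_be64(buf + 9);
	out->prepare_time = (int64_t) read_be64(buf + 17);
	out->xid = read_be32(buf + 25);
	out->gid = (const char *) (buf + fixed);
	return 0;
}
```

The Prepare, Commit Prepared and Rollback Prepared messages could be decoded the same way, after accounting for their extra flags byte and, for Rollback Prepared, the second timestamp.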
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index b3d1731..a6f9944 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1433905 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, the decoded transactions are sent
+          to the subscriber at PREPARE TRANSACTION time. By default, a transaction
+          prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
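
The lifecycle described above can be sketched as a tiny state function. This is only an illustration under assumptions: the state codes 'd', 'p' and 'e' are the ones this patch documents for subtwophasestate, while options_are_compatible, initial_twophase_state and next_twophase_state are hypothetical helpers, and the decision of when all tables count as synced is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* State codes as documented for pg_subscription.subtwophasestate. */
#define TWOPHASE_STATE_DISABLED 'd'
#define TWOPHASE_STATE_PENDING  'p'
#define TWOPHASE_STATE_ENABLED  'e'

/* Hypothetical check: two_phase and streaming cannot both be requested. */
bool
options_are_compatible(bool two_phase, bool streaming)
{
	return !(two_phase && streaming);
}

/* State recorded when the subscription is created. */
char
initial_twophase_state(bool two_phase_requested)
{
	return two_phase_requested ? TWOPHASE_STATE_PENDING
		: TWOPHASE_STATE_DISABLED;
}

/* A pending request becomes enabled only after initial table sync is done. */
char
next_twophase_state(char current, bool all_tables_synced)
{
	assert(current == TWOPHASE_STATE_DISABLED ||
		   current == TWOPHASE_STATE_PENDING ||
		   current == TWOPHASE_STATE_ENABLED);

	if (current == TWOPHASE_STATE_PENDING && all_tables_synced)
		return TWOPHASE_STATE_ENABLED;
	return current;
}
```

Note that there is no transition out of 'd' or 'e' in this sketch, which mirrors the rule that the option cannot be toggled after the subscription is created.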
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..6d3efb4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact, to avoid matching a prepared xact
+ * that originated from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
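
The matching rule described in the LookupGXact comment can be illustrated outside the server with a small standalone sketch. PreparedEntry and lookup_prepared are hypothetical stand-ins (the real function reads the 2PC state file or WAL while holding TwoPhaseStateLock); only the three-way comparison is modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one prepared-transaction entry. */
typedef struct PreparedEntry
{
	const char *gid;
	uint64_t	origin_lsn;			/* end LSN of the remote PREPARE */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
	bool		valid;
} PreparedEntry;

bool
lookup_prepared(const PreparedEntry *entries, size_t n, const char *gid,
				uint64_t prepare_end_lsn, int64_t origin_prepare_timestamp)
{
	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		const PreparedEntry *e = &entries[i];

		/* Skip not-yet-valid entries and entries with a different GID. */
		if (!e->valid || strcmp(e->gid, gid) != 0)
			continue;

		/* The GID alone is not enough: origin LSN and time must match too. */
		if (e->origin_lsn == prepare_end_lsn &&
			e->origin_timestamp == origin_prepare_timestamp)
			return true;
	}
	return false;
}
```

A locally prepared transaction that happens to reuse the same GID would carry a different origin LSN and timestamp, so it would not be mistaken for the one received from upstream.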
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index eb88d87..971eb88 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,6 +59,7 @@
 #define SUBOPT_REFRESH				0x00000040
 #define SUBOPT_BINARY				0x00000080
 #define SUBOPT_STREAMING			0x00000100
+#define SUBOPT_TWOPHASE_COMMIT		0x00000200
 
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
@@ -79,6 +80,7 @@ typedef struct SubOpts
 	bool		refresh;
 	bool		binary;
 	bool		streaming;
+	bool		twophase;
 } SubOpts;
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
@@ -123,6 +125,8 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 		opts->binary = false;
 	if (IsSet(supported_opts, SUBOPT_STREAMING))
 		opts->streaming = false;
+	if (IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT))
+		opts->twophase = false;
 
 	/* Parse options */
 	foreach(lc, stmt_options)
@@ -237,6 +241,29 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 			opts->specified_opts |= SUBOPT_STREAMING;
 			opts->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could lose
+			 * transactions and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: Unsupported twophase indicates that this call originated
+			 * from AlterSubscription.
+			 */
+			if (!IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT))
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			opts->specified_opts |= SUBOPT_TWOPHASE_COMMIT;
+			opts->twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -325,6 +352,25 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (opts->twophase &&
+		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
+		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
+	{
+		if (opts->streaming &&
+			IsSet(supported_opts, SUBOPT_STREAMING) &&
+			IsSet(opts->specified_opts, SUBOPT_STREAMING))
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+			/*- translator: both %s are strings of the form "option = value" */
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
 }
 
 /*
@@ -385,7 +431,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	supported_opts = (SUBOPT_CONNECT | SUBOPT_ENABLED | SUBOPT_CREATE_SLOT |
 					  SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 					  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
-					  SUBOPT_STREAMING);
+					  SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT);
 	parse_subscription_options(stmt->options, supported_opts, &opts);
 
 	/*
@@ -455,6 +501,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(opts.enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(opts.binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(opts.streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(opts.twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (opts.slot_name)
@@ -532,10 +582,35 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (opts.create_slot)
 			{
+				bool		twophase_enabled = false;
+
 				Assert(opts.slot_name);
 
-				walrcv_create_slot(wrconn, opts.slot_name, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables is in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false
+				 * then it is safe to enable two_phase up-front because those
+				 * tables are already initially in READY state. When the
+				 * subscription has no tables, we leave the twophase state as
+				 * PENDING, to allow ALTER SUBSCRIPTION ... REFRESH
+				 * PUBLICATION to work.
+				 */
+				if (opts.twophase && !opts.copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, opts.slot_name, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								opts.slot_name)));
@@ -865,6 +940,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -927,6 +1008,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -966,6 +1058,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -986,6 +1089,30 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				parse_subscription_options(stmt->options, SUBOPT_COPY_DATA, &opts);
 
+				/*
+				 * The subscription option "two_phase" requires that
+				 * replication has passed the initial table synchronization
+				 * phase before the two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
 				AlterSubscription_refresh(sub, opts.copy_data);
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eaa84a..19ea159 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -436,6 +437,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 150000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -851,7 +856,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -868,6 +873,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
+		if (two_phase)
+			appendStringInfoString(&cmd, " TWO_PHASE");
+
 		switch (snapshot_action)
 		{
 			case CRS_EXPORT_SNAPSHOT:
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 453efc5..2874dc0 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing
+				 * preparing transactions that have locked [user] catalog
+				 * tables exclusively but as of now, we ask users not to do
+				 * such an operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d536a5f..d61ef4c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,12 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= slot->data.two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +540,22 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +616,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index cb42fcb..2c191de 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
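For anyone reading along, the PREPARE message written by logicalrep_write_prepare() above has this layout: one message-type byte, one flags byte (currently always 0), prepare_lsn, end_lsn and prepare_time as 64-bit integers, the xid as a 32-bit integer, and then the GID as a NUL-terminated string. Here is a minimal standalone sketch of that layout, not the patch's code. Assumptions: the DEMO_* names, the 'P' tag value and the 200-byte GID buffer are placeholders made up for illustration; integers are written in network byte order as pq_sendint64/pq_sendint32 do; and the reader checks the tag byte itself, whereas in the patch the caller has already consumed it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values: the real tag byte and GID size are not shown here. */
#define DEMO_MSG_PREPARE 'P'
#define DEMO_GIDSIZE 200

typedef struct DemoPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[DEMO_GIDSIZE];
} DemoPrepare;

/* Store the low nbytes of v in big-endian (network) order. */
static size_t
put_be(uint8_t *buf, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return (size_t) nbytes;
}

/* Load an nbytes-wide big-endian integer. */
static uint64_t
get_be(const uint8_t *buf, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Returns bytes written; buf must hold 30 + strlen(gid) + 1 bytes. */
size_t
demo_write_prepare(uint8_t *buf, const DemoPrepare *p)
{
	size_t		off = 0;
	size_t		gidlen = strlen(p->gid) + 1;	/* include the NUL */

	buf[off++] = DEMO_MSG_PREPARE;
	buf[off++] = 0;				/* flags, unused for now */
	off += put_be(buf + off, p->prepare_lsn, 8);
	off += put_be(buf + off, p->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) p->prepare_time, 8);
	off += put_be(buf + off, p->xid, 4);
	memcpy(buf + off, p->gid, gidlen);
	return off + gidlen;
}

/* Returns 0 on success, -1 if the message is malformed. */
int
demo_read_prepare(const uint8_t *buf, size_t len, DemoPrepare *p)
{
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	if (len <= fixed || buf[0] != DEMO_MSG_PREPARE || buf[1] != 0)
		return -1;
	p->prepare_lsn = get_be(buf + 2, 8);
	p->end_lsn = get_be(buf + 10, 8);
	p->prepare_time = (int64_t) get_be(buf + 18, 8);
	p->xid = (uint32_t) get_be(buf + 26, 4);
	if (p->prepare_lsn == 0 || p->end_lsn == 0) /* 0 is the invalid LSN */
		return -1;

	/* Bounded copy of the GID instead of a bare strcpy(). */
	gid = buf + fixed;
	nul = memchr(gid, '\0', len - fixed);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= DEMO_GIDSIZE)
		return -1;
	memcpy(p->gid, gid, gidlen + 1);
	return 0;
}
```

The sketch rejects a GID that does not fit the destination buffer, which the strcpy() calls in the read functions above rely on the sender to guarantee.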
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b8c5e2a..9f80794 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2576,7 +2576,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2667,7 +2667,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2714,7 +2714,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2734,7 +2734,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2753,19 +2753,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2783,12 +2784,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
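To make the interplay between the logical.c and reorderbuffer.c hunks easier to follow, here is a small standalone model of the two decisions, a sketch rather than the patch's code. The Demo* names are made up for illustration. The plugin_supports_2pc flag stands in for the initial value of ctx->twophase, which this sketch assumes is set only when the plugin provides all the two-phase callbacks. prepare_lsn corresponds to txn->final_lsn as it stands when ReorderBufferFinishPrepared is entered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two_phase fields of a replication slot. */
typedef struct DemoSlot
{
	bool		two_phase;
	uint64_t	two_phase_at;
} DemoSlot;

/*
 * Models the CreateDecodingContext logic: two-phase decoding is in effect
 * if the plugin supports it and either the slot was created with two_phase
 * or the option was given at streaming start.  The first time it is turned
 * on for an existing slot, remember the LSN at which that happened.
 */
bool
demo_enable_two_phase(DemoSlot *slot, bool plugin_supports_2pc,
					  bool opt_given, uint64_t start_lsn)
{
	bool		twophase = plugin_supports_2pc &&
		(slot->two_phase || opt_given);

	if (twophase && !slot->two_phase)
	{
		slot->two_phase = true;
		slot->two_phase_at = start_lsn;
	}
	return twophase;
}

/*
 * Models the check in ReorderBufferFinishPrepared: a prepare that happened
 * before two_phase_at was never sent, so its changes must be replayed
 * together with COMMIT PREPARED.  Nothing is replayed for a rollback.
 */
bool
demo_send_prepare_with_commit(uint64_t prepare_lsn, uint64_t two_phase_at,
							  bool is_commit)
{
	return prepare_lsn < two_phase_at && is_commit;
}
```

For example, if two_phase is enabled at LSN 1000, a transaction prepared at LSN 900 and committed later is sent in full at COMMIT PREPARED, while one prepared at LSN 1100 was already decoded at prepare time and only needs the commit message.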
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..a14a3d6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions, that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot, need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 682c107..edd877d 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,38 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become 'enabled' at
+	 * that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as
+	 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+	 * work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1071,7 +1068,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1158,3 +1156,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But
+		 * if table_states_not_ready was empty we still need to check again to
+		 * see if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * Return false when the subscription has no tables or not all tables are
+	 * in READY state; return true otherwise.
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the two_phase state of the specified subscription in pg_subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
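The tri-state rules are spread over subscriptioncmds.c and tablesync.c, so here is a compact standalone model of them, a sketch under stated assumptions and not the patch's code. The Demo* enum and function names are invented for illustration (the catalog stores the state as a char whose values are not shown in this part of the patch), table counts are passed in as plain integers instead of being fetched from the catalog, and demo_initial_state() only covers the CREATE SUBSCRIPTION path that creates the slot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical enum mirroring the DISABLED / PENDING / ENABLED tri-state. */
typedef enum DemoTwoPhaseState
{
	DEMO_TWOPHASE_DISABLED,
	DEMO_TWOPHASE_PENDING,
	DEMO_TWOPHASE_ENABLED
} DemoTwoPhaseState;

/* Same expression as AllTablesyncsReady(): no tables means "not ready". */
bool
demo_all_tablesyncs_ready(int ntables, int nnotready)
{
	return ntables > 0 && nnotready == 0;
}

/*
 * State chosen at CREATE SUBSCRIPTION: two_phase can be enabled up-front
 * only when there are tables and no initial copy is going to happen.
 */
DemoTwoPhaseState
demo_initial_state(bool twophase_opt, bool copy_data, int ntables)
{
	if (!twophase_opt)
		return DEMO_TWOPHASE_DISABLED;
	if (!copy_data && ntables > 0)
		return DEMO_TWOPHASE_ENABLED;
	return DEMO_TWOPHASE_PENDING;
}

/* The apply worker restarts once a PENDING subscription is fully synced. */
bool
demo_apply_worker_should_restart(DemoTwoPhaseState state,
								 int ntables, int nnotready)
{
	return state == DEMO_TWOPHASE_PENDING &&
		demo_all_tablesyncs_ready(ntables, nnotready);
}

/* REFRESH with copy_data is refused once two_phase is ENABLED. */
bool
demo_refresh_allowed(DemoTwoPhaseState state, bool copy_data)
{
	return !(state == DEMO_TWOPHASE_ENABLED && copy_data);
}
```

Note the empty-subscription case: with zero tables the state stays PENDING and the worker never restarts, which is what keeps ALTER SUBSCRIPTION ... REFRESH PUBLICATION with copy_data usable there.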
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 5fc620c..144ad9d 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,79 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepare, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this, and similar prepare confusions the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_sync_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it starts streaming with the two_phase option
+ * which in turn enables the decoding of two-phase commits at the publisher.
+ * Then, it updates the tri-state value from PENDING to ENABLED. Now, it is
+ * possible that during the time we have not enabled two_phase, the publisher
+ * (replication server) would have skipped some prepares but we ensure that
+ * such prepares are sent along with commit prepared, see
+ * ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider, initially, it was on and we
+ * have received some prepare, then we turn it off; now at commit time the
+ * server will send the entire transaction data along with the commit. With
+ * some more analysis, we can allow changing this option from off to on but
+ * not sure if that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get a prepare of the same GID more than once in genuine cases where
+ * we have defined multiple subscriptions for publications on the same server
+ * and the prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by the publisher, one
+ * of the prepares will be successful and the others will fail, in which case
+ * the server will send them again. Now, this can lead to a deadlock if the
+ * user has set synchronous_standby_names for all the subscriptions on the
+ * subscriber. To avoid such deadlocks, we generate a unique GID (consisting of
+ * the subscription oid and the xid of the prepared transaction) for each
+ * prepared transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +132,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +330,10 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -784,6 +862,185 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a BEGIN PREPARE message")));
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (prepare_data.prepare_lsn != remote_final_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+								 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+								 LSN_FORMAT_ARGS(remote_final_lsn))));
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server, as that can lead to a
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	begin_replication_step();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* There is no transaction when COMMIT PREPARED is called */
+	begin_replication_step();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
+		begin_replication_step();
+		FinishPreparedTransaction(gid, false);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2060,6 +2317,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2539,6 +2812,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3040,6 +3316,24 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(OidIsValid(subid));
+
+	if (!TransactionIdIsValid(xid))
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("invalid two-phase transaction ID")));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3050,6 +3344,7 @@ ApplyWorkerMain(Datum main_arg)
 	XLogRecPtr	origin_startpos;
 	char	   *myslotname;
 	WalRcvStreamOptions options;
+	int			server_version;
 
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
@@ -3208,15 +3503,59 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+
+	server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
 	options.proto.logical.proto_version =
-		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
+		server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
+		LOGICALREP_PROTO_VERSION_NUM;
+
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when the two_phase mode is requested by the user, it remains
+		 * as the tri-state PENDING until all tablesyncs have reached READY
+		 * state. Only then, can it become ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
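
For readers following the GID discussion in the worker.c header comment: a minimal standalone sketch of how the subscriber-side GID is formed, assuming plain unsigned ints in place of Oid and TransactionId. The names sketch_twophase_gid, SketchOid, SketchXid and SKETCH_GIDSIZE are made up for illustration and are not part of the patch. Unlike TwoPhaseTransactionGid, which raises an ERROR on an invalid xid, this sketch returns -1, and it also reports truncation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for Oid and TransactionId (32-bit unsigned). */
typedef unsigned int SketchOid;
typedef unsigned int SketchXid;

/* Assumed buffer size for this sketch; the patch uses GIDSIZE. */
#define SKETCH_GIDSIZE 200

/*
 * Write "pg_gid_<subid>_<xid>" into gid.
 *
 * Returns 0 on success, -1 if xid is zero (treated as invalid here) or if
 * the buffer is too small to hold the whole identifier.
 */
static int
sketch_twophase_gid(SketchOid subid, SketchXid xid, char *gid, size_t szgid)
{
	int			n;

	/* The subscription id must be valid, as in the patch's Assert. */
	assert(subid != 0);

	if (xid == 0)
		return -1;

	n = snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);

	/* snprintf returns the length it wanted to write; detect truncation. */
	if (n < 0 || (size_t) n >= szgid)
		return -1;

	return 0;
}
```

Because the subscription id is part of the string, two subscriptions that receive the same remote xid end up preparing under different GIDs, which is what avoids the duplicate-GID failure described in the comment.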
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index abd5217..e4314af 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase
+		 * and streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here, we just check whether the two-phase option is passed by the
+		 * plugin and decide whether to enable it at a later point in time. It
+		 * remains enabled if the previous start-up has done so. But we only
+		 * allow the option to be passed in with a sufficient version of the
+		 * protocol, and when the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -839,18 +948,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1270,3 +1369,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
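
The option checks in pgoutput.c above are spread over parse_output_parameters and pgoutput_startup. A minimal standalone sketch of the combined decision, in the same order as the patch, might look like the following. All names here (SketchResult, sketch_check_two_phase, SKETCH_PROTO_TWOPHASE_VERSION) are hypothetical; the real code reports failures with ereport(ERROR) rather than returning a code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mirrors LOGICALREP_PROTO_TWOPHASE_VERSION_NUM (3) from logicalproto.h. */
#define SKETCH_PROTO_TWOPHASE_VERSION 3

typedef enum SketchResult
{
	SKETCH_OK,
	SKETCH_ERR_EXCLUSIVE,		/* two_phase and streaming both requested */
	SKETCH_ERR_PROTO,			/* proto_version too old for two_phase */
	SKETCH_ERR_NO_CALLBACKS		/* plugin lacks the two-phase callbacks */
} SketchResult;

/*
 * Decide whether a START_STREAMING request may use two_phase.
 *
 * On SKETCH_OK, *opt_given is set the way the patch sets
 * ctx->twophase_opt_given; on any error it is left untouched.
 */
static SketchResult
sketch_check_two_phase(bool two_phase, bool streaming, int proto_version,
					   bool plugin_has_callbacks, bool *opt_given)
{
	/* Checked first, while the options are being parsed. */
	if (two_phase && streaming)
		return SKETCH_ERR_EXCLUSIVE;

	/* Option not requested: nothing more to validate. */
	if (!two_phase)
	{
		*opt_given = false;
		return SKETCH_OK;
	}

	if (proto_version < SKETCH_PROTO_TWOPHASE_VERSION)
		return SKETCH_ERR_PROTO;

	if (!plugin_has_callbacks)
		return SKETCH_ERR_NO_CALLBACKS;

	*opt_given = true;
	return SKETCH_OK;
}
```

The point of keeping opt_given separate from the callback check is the one made in the logical.h comment: having the callbacks only says the plugin could do two-phase decoding, not that the client asked for it.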
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8c18b4e..33b85d8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -283,6 +283,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2be9ad9..9a2bc37 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -370,7 +370,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 3211521..912144c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4320,6 +4321,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4363,9 +4365,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 150000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4386,6 +4395,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4411,6 +4421,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4438,6 +4450,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4479,6 +4492,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index ba9bc6d..efb8c30 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..ba658f7 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6423,6 +6423,12 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/* Two_phase is only supported in v15 and higher */
+		if (pset.sversion >= 150000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 0ebd5aa..d6bf725 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2764,7 +2764,7 @@ psql_completion(const char *text, int start, int end)
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
 					  "enabled", "slot_name", "streaming",
-					  "synchronous_commit");
+					  "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 750d469..2106149 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Stream two-phase transactions */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -92,6 +102,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Allow streaming two-phase transactions */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 4d20563..632381b 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -87,6 +87,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..e0f513b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 *
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_STREAMING command. We can't rely solely on the
+	 * twophase flag which only tells whether the plugin provided all the
+	 * necessary two-phase callbacks.
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..63de90d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit decoding (at prepare time). Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare information
+ * (prepare_end_lsn and prepare_time) is used to check whether the downstream
+ * has received this prepared transaction. If it has, it can apply the
+ * rollback; otherwise, it can skip the rollback operation. The gid alone is
+ * not sufficient because the downstream node can have a prepared transaction
+ * with the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d8..5b40ff7 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -297,7 +297,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	}			xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -636,7 +640,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 2eb7e3a..34d95ea 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -84,11 +84,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..0b607ed 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,8 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Streaming of two-phase transactions at
+									 * prepare time */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +349,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +423,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 57f7dd9..ad6b4e4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+#create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber; there will be 2
+# prepared transactions for the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9a0936e..227f92c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1390,12 +1390,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

#374vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#373)

On Thu, Jul 8, 2021 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 6, 2021 at 9:58 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v93*

Thanks, I have gone through the 0001 patch and made a number of
changes. (a) Removed some of the code which was leftover from previous
versions, (b) Removed the Assert in apply_handle_begin_prepare() as I
don't think that makes sense, (c) added/changed comments and made a
few other cosmetic changes, (d) ran pgindent.

Let me know what you think of the attached?

The patch looks good to me, I don't have any comments.

Regards,
Vignesh

#375Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#374)

On Thu, Jul 8, 2021 at 10:08 PM vignesh C <vignesh21@gmail.com> wrote:

On Thu, Jul 8, 2021 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 6, 2021 at 9:58 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v93*

Thanks, I have gone through the 0001 patch and made a number of
changes. (a) Removed some of the code which was leftover from previous
versions, (b) Removed the Assert in apply_handle_begin_prepare() as I
don't think that makes sense, (c) added/changed comments and made a
few other cosmetic changes, (d) ran pgindent.

Let me know what you think of the attached?

The patch looks good to me, I don't have any comments.

I tried the v95-0001 patch.

- The patch applied cleanly and all build / testing was OK.
- The documentation also builds OK.
- I checked all v95-0001 / v93-0001 differences and found no problems.
- Furthermore, I noted that v95-0001 patch is passing the cfbot [1].

So this patch LGTM.

------
[1]: http://cfbot.cputube.org/patch_33_2914.log

Kind Regards,
Peter Smith.
Fujitsu Australia

#376Ajin Cherian
itsajin@gmail.com
In reply to: Peter Smith (#375)

On Fri, Jul 9, 2021 at 9:13 AM Peter Smith <smithpb2250@gmail.com> wrote:

I tried the v95-0001 patch.

- The patch applied cleanly and all build / testing was OK.
- The documentation also builds OK.
- I checked all v95-0001 / v93-0001 differences and found no problems.
- Furthermore, I noted that v95-0001 patch is passing the cfbot [1].

So this patch LGTM.

Applied, reviewed and tested the patch.
Also ran a 5 level cascaded standby setup running a modified pgbench
that does two phase commits and it ran fine.
Did some testing using empty transactions and found no issues.
The patch looks good to me.

regards,
Ajin Cherian

#377tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
In reply to: Ajin Cherian (#376)
RE: [HACKERS] logical decoding of two-phase transactions

On Friday, July 9, 2021 2:56 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Fri, Jul 9, 2021 at 9:13 AM Peter Smith <smithpb2250@gmail.com> wrote:

I tried the v95-0001 patch.

- The patch applied cleanly and all build / testing was OK.
- The documentation also builds OK.
- I checked all v95-0001 / v93-0001 differences and found no problems.
- Furthermore, I noted that v95-0001 patch is passing the cfbot [1].

So this patch LGTM.

Applied, reviewed and tested the patch.
Also ran a 5 level cascaded standby setup running a modified pgbench
that does two phase commits and it ran fine.
Did some testing using empty transactions and found no issues.
The patch looks good to me.

I did some cross-version tests on patch v95 (publisher is PG14 and subscriber is PG15, or publisher is PG15 and subscriber is PG14; two_phase option set to on or off/default). It worked as expected, and data was replicated correctly.
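
For readers following the cross-version combinations above, here is a minimal sketch in C of the version rule stated in the protocol documentation part of the patch: protocol version 2 is supported only for server version 14 and above, and version 3 (two-phase) only for server version 15 and above. The function names are hypothetical and chosen for illustration; this is not code from the patch, and it assumes the usual numeric server version encoding (for example 150000 for PG15).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: highest logical replication protocol version a
 * publisher with the given numeric server version can speak, following
 * the rule in the patch's protocol.sgml changes.
 */
int
choose_proto_version(int server_version_num)
{
	assert(server_version_num > 0);

	if (server_version_num >= 150000)
		return 3;				/* two-phase transactions */
	if (server_version_num >= 140000)
		return 2;				/* streaming of in-progress transactions */
	return 1;					/* base protocol */
}

/*
 * Hypothetical helper: true if the publisher can speak the protocol
 * version that carries two-phase messages.
 */
bool
publisher_supports_two_phase(int server_version_num)
{
	return choose_proto_version(server_version_num) >= 3;
}
```

In this sketch a PG14 publisher yields version 2 and a PG15 publisher yields version 3. How the real apply worker reacts to a two_phase request against an older publisher is not modelled here.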

I also tested some scenarios using synchronized replication, and they worked fine in my cases.

So this patch LGTM.

Regards
Tang

#378Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#375)
1 attachment(s)

On Fri, Jul 9, 2021 at 4:43 AM Peter Smith <smithpb2250@gmail.com> wrote:

The patch looks good to me, I don't have any comments.

I tried the v95-0001 patch.

- The patch applied cleanly and all build / testing was OK.
- The documentation also builds OK.
- I checked all v95-0001 / v93-0001 differences and found no problems.
- Furthermore, I noted that v95-0001 patch is passing the cfbot [1].

So this patch LGTM.

Thanks, I took another pass over it and made a few changes in docs and
comments. I am planning to push this next week sometime (by 14th July)
unless there are more comments from you or someone else. Just to
summarize, this patch will add support for prepared transactions to
built-in logical replication. To add support for streaming
transactions at prepare time into the
built-in logical replication, we need to do the following things: (a)
Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol. (b) Modify
the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare. (c) Add a new SUBSCRIPTION
option "two_phase" to allow users to enable
two-phase transactions. We enable the two_phase once the initial data
sync is over. Refer to comments atop worker.c in the patch and commit
message to see further details about this patch. After this patch,
there is a follow-up patch to allow streaming and two-phase options
together which I feel needs some more review and can be committed
separately.
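
The enablement rule in point (c) can be sketched as a tiny state machine over the three subtwophasestate codes documented in the patch (d = disabled, p = pending enablement, e = enabled). This is an illustrative sketch only: the macro and function names below are hypothetical, and the real logic lives in the patch's worker.c and tablesync.c.

```c
#include <assert.h>
#include <stdbool.h>

/* State codes as documented for pg_subscription.subtwophasestate. */
#define TWOPHASE_STATE_DISABLED 'd'
#define TWOPHASE_STATE_PENDING  'p'
#define TWOPHASE_STATE_ENABLED  'e'

/*
 * Hypothetical helper: state recorded when the subscription is created.
 * Requesting two_phase does not enable it immediately; it starts out
 * pending.
 */
char
initial_twophase_state(bool two_phase_requested)
{
	return two_phase_requested ? TWOPHASE_STATE_PENDING
							   : TWOPHASE_STATE_DISABLED;
}

/*
 * Hypothetical helper: a pending subscription becomes enabled only once
 * the initial data sync is over. Disabled and enabled states never change
 * here, matching the rule that two_phase cannot be toggled afterwards.
 */
char
next_twophase_state(char current, bool initial_sync_done)
{
	assert(current == TWOPHASE_STATE_DISABLED ||
		   current == TWOPHASE_STATE_PENDING ||
		   current == TWOPHASE_STATE_ENABLED);

	if (current == TWOPHASE_STATE_PENDING && initial_sync_done)
		return TWOPHASE_STATE_ENABLED;
	return current;
}
```

The TAP tests in the patch wait for the same end state by polling pg_subscription until no row has a subtwophasestate other than 'e'.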

--
With Regards,
Amit Kapila.

Attachments:

v96-0001-Add-support-for-prepared-transactions-to-built-i.patch (application/octet-stream)
From aa2cf92396126eee7256d0656caa10c3590fd5ae Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 6 Jul 2021 13:45:55 +1000
Subject: [PATCH v96] Add support for prepared transactions to built-in logical
 replication.

To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable
two-phase transactions. We enable the two_phase once the initial data sync
is over.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

The streaming option is not allowed with this new two_phase option. This
can be done as a separate patch.

We don't allow to toggle two_phase option of a subscription because it can
lead to an inconsistent replica. For the same reason, we don't allow to
refresh the publication once the two_phase is enabled for a subscription
unless copy_data option is false.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi, Greg Nancarrow
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
---
 contrib/test_decoding/test_decoding.c              |  12 +-
 doc/src/sgml/catalogs.sgml                         |  12 +
 doc/src/sgml/protocol.sgml                         | 291 ++++++++++++++++-
 doc/src/sgml/ref/alter_subscription.sgml           |   5 +
 doc/src/sgml/ref/create_subscription.sgml          |  37 +++
 doc/src/sgml/ref/pg_dump.sgml                      |   7 +-
 src/backend/access/transam/twophase.c              |  68 ++++
 src/backend/catalog/pg_subscription.c              |  34 ++
 src/backend/catalog/system_views.sql               |   2 +-
 src/backend/commands/subscriptioncmds.c            | 131 +++++++-
 .../libpqwalreceiver/libpqwalreceiver.c            |  10 +-
 src/backend/replication/logical/decode.c           |  11 +-
 src/backend/replication/logical/logical.c          |  31 +-
 src/backend/replication/logical/origin.c           |   7 +-
 src/backend/replication/logical/proto.c            | 217 ++++++++++++-
 src/backend/replication/logical/reorderbuffer.c    |  25 +-
 src/backend/replication/logical/snapbuild.c        |  33 +-
 src/backend/replication/logical/tablesync.c        | 197 +++++++++--
 src/backend/replication/logical/worker.c           | 347 +++++++++++++++++++-
 src/backend/replication/pgoutput/pgoutput.c        | 201 +++++++++---
 src/backend/replication/slot.c                     |   1 +
 src/backend/replication/walreceiver.c              |   2 +-
 src/bin/pg_dump/pg_dump.c                          |  20 +-
 src/bin/pg_dump/pg_dump.h                          |   1 +
 src/bin/psql/describe.c                            |   8 +-
 src/bin/psql/tab-complete.c                        |   2 +-
 src/include/access/twophase.h                      |   2 +
 src/include/catalog/pg_subscription.h              |  11 +
 src/include/catalog/pg_subscription_rel.h          |   1 +
 src/include/replication/logical.h                  |  10 +
 src/include/replication/logicalproto.h             |  73 ++++-
 src/include/replication/pgoutput.h                 |   1 +
 src/include/replication/reorderbuffer.h            |   8 +-
 src/include/replication/slot.h                     |   7 +-
 src/include/replication/snapbuild.h                |   5 +-
 src/include/replication/walreceiver.h              |   7 +-
 src/include/replication/worker_internal.h          |   3 +
 src/test/regress/expected/subscription.out         | 109 ++++---
 src/test/regress/sql/subscription.sql              |  25 ++
 src/test/subscription/t/021_twophase.pl            | 359 +++++++++++++++++++++
 src/test/subscription/t/022_twophase_cascade.pl    | 235 ++++++++++++++
 src/tools/pgindent/typedefs.list                   |   3 +
 42 files changed, 2381 insertions(+), 190 deletions(-)
 create mode 100644 src/test/subscription/t/021_twophase.pl
 create mode 100644 src/test/subscription/t/022_twophase_cascade.pl

diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index de1b692..e5cd84e 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -339,7 +339,7 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -382,7 +382,7 @@ pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -404,7 +404,7 @@ pg_decode_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -428,7 +428,7 @@ pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -853,7 +853,7 @@ pg_decode_stream_prepare(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.prepare_time));
 
 	OutputPluginWrite(ctx, true);
 }
@@ -882,7 +882,7 @@ pg_decode_stream_commit(LogicalDecodingContext *ctx,
 
 	if (data->include_timestamp)
 		appendStringInfo(ctx->out, " (at %s)",
-						 timestamptz_to_str(txn->commit_time));
+						 timestamptz_to_str(txn->xact_time.commit_time));
 
 	OutputPluginWrite(ctx, true);
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d..0f5d25b 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7643,6 +7643,18 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subtwophasestate</structfield> <type>char</type>
+      </para>
+      <para>
+       State codes for two-phase mode:
+       <literal>d</literal> = disabled,
+       <literal>p</literal> = pending enablement,
+       <literal>e</literal> = enabled
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index a3562f3..e8cb78f 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2811,11 +2811,17 @@ The commands accepted in replication mode are:
      </term>
      <listitem>
       <para>
-       Protocol version. Currently versions <literal>1</literal> and
-       <literal>2</literal> are supported. The version <literal>2</literal>
-       is supported only for server version 14 and above, and it allows
-       streaming of large in-progress transactions.
-     </para>
+       Protocol version. Currently versions <literal>1</literal>, <literal>2</literal>,
+       and <literal>3</literal> are supported.
+      </para>
+      <para>
+       Version <literal>2</literal> is supported only for server version 14
+       and above, and it allows streaming of large in-progress transactions.
+      </para>
+      <para>
+       Version <literal>3</literal> is supported only for server version 15
+       and above, and it allows streaming of two-phase transactions.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -2871,10 +2877,11 @@ The commands accepted in replication mode are:
   <para>
    The logical replication protocol sends individual transactions one by one.
    This means that all messages between a pair of Begin and Commit messages
-   belong to the same transaction. It also sends changes of large in-progress
-   transactions between a pair of Stream Start and Stream Stop messages. The
-   last stream of such a transaction contains Stream Commit or Stream Abort
-   message.
+   belong to the same transaction. Similarly, all messages between a pair of
+   Begin Prepare and Prepare messages belong to the same transaction.
+   It also sends changes of large in-progress transactions between a pair of
+   Stream Start and Stream Stop messages. The last stream of such a transaction
+   contains a Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7391,6 +7398,272 @@ Stream Abort
 </variablelist>
 
 <para>
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+are available since protocol version 3.
+</para>
+
+<variablelist>
+
+<varlistentry>
+
+<term>Begin Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('b')</term>
+<listitem><para>
+                Identifies the message as the beginning of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('P')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepared transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Commit Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('K')</term>
+<listitem><para>
+                Identifies the message as the commit of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the commit prepared.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the commit prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Commit timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+
+<term>Rollback Prepared</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('r')</term>
+<listitem><para>
+                Identifies the message as the rollback of a two-phase transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the rollback prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Rollback timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+
+<para>
 
 The following message parts are shared by the above messages.
 
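For readers of the protocol.sgml changes above, here is a minimal sketch of how a client might decode the Begin Prepare message, assuming the usual protocol conventions (integers in network byte order, String as a NUL-terminated byte sequence). The struct and function names (BeginPrepareMsg, parse_begin_prepare) are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for a decoded Begin Prepare message. */
typedef struct BeginPrepareMsg
{
	uint64_t	prepare_lsn;	/* the LSN of the prepare */
	uint64_t	end_lsn;		/* the end LSN of the prepared transaction */
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;			/* Xid of the transaction */
	const char *gid;			/* points into the caller's buffer */
} BeginPrepareMsg;

/* Read a big-endian (network byte order) 64-bit value. */
static uint64_t
read_be64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian (network byte order) 32-bit value. */
static uint32_t
read_be32(const unsigned char *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
		((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/*
 * Decode Byte1('b'), Int64, Int64, Int64, Int32, String.
 * Returns 0 on success, -1 if the buffer is not a well-formed message.
 */
int
parse_begin_prepare(const unsigned char *buf, size_t len, BeginPrepareMsg *msg)
{
	const size_t fixed = 1 + 8 + 8 + 8 + 4;	/* bytes before the GID */

	if (len <= fixed || buf[0] != 'b')
		return -1;

	/* The GID must be NUL-terminated inside the buffer. */
	if (memchr(buf + fixed, '\0', len - fixed) == NULL)
		return -1;

	msg->prepare_lsn = read_be64(buf + 1);
	msg->end_lsn = read_be64(buf + 9);
	msg->prepare_time = (int64_t) read_be64(buf + 17);
	msg->xid = read_be32(buf + 25);
	msg->gid = (const char *) (buf + fixed);
	return 0;
}
```

The Prepare and Commit Prepared messages could be decoded the same way, with one extra Int8 flags byte after the type byte; Rollback Prepared additionally carries a second timestamp.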
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index b3d1731..a6f9944 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -67,6 +67,11 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO <
    Commands <command>ALTER SUBSCRIPTION ... REFRESH PUBLICATION</command> and
    <command>ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...</command> with refresh
    option as true cannot be executed inside a transaction block.
+
+   These commands also cannot be executed when the subscription has
+   <literal>two_phase</literal> commit enabled, unless <literal>copy_data = false</literal>.
+   See column <literal>subtwophasestate</literal> of
+   <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e812bee..1433905 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -237,6 +237,43 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           are fully decoded on the publisher, and only then sent to the
           subscriber as a whole.
          </para>
+
+         <para>
+          The <literal>streaming</literal> option cannot be used with the
+          <literal>two_phase</literal> option.
+         </para>
+
+        </listitem>
+       </varlistentry>
+       <varlistentry>
+        <term><literal>two_phase</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether two-phase commit is enabled for this subscription.
+          The default is <literal>false</literal>.
+         </para>
+
+         <para>
+          When two-phase commit is enabled, decoded transactions are sent to the
+          subscriber at <command>PREPARE TRANSACTION</command> time. By default, a
+          transaction prepared on the publisher is decoded as a normal transaction at commit.
+         </para>
+
+         <para>
+          The two-phase commit implementation requires that the replication has
+          successfully passed the initial table synchronization phase. This means
+          even when two_phase is enabled for the subscription, the internal
+          two-phase state remains temporarily "pending" until the initialization
+          phase is completed. See column
+          <literal>subtwophasestate</literal> of <xref linkend="catalog-pg-subscription"/>
+          to know the actual two-phase state.
+         </para>
+
+         <para>
+          The <literal>two_phase</literal> option cannot be used with the
+          <literal>streaming</literal> option.
+         </para>
+
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index a6c0788..2360986 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1405,7 +1405,12 @@ CREATE DATABASE foo WITH TEMPLATE template0;
    servers.  It is then up to the user to reactivate the subscriptions in a
    suitable way.  If the involved hosts have changed, the connection
    information might have to be changed.  It might also be appropriate to
-   truncate the target tables before initiating a new full table copy.
+   truncate the target tables before initiating a new full table copy.  If users
+   intend to copy initial data during refresh, they must create the slot with
+   <literal>two_phase = false</literal>.  After the initial sync, the
+   <literal>two_phase</literal> option will be automatically enabled by the
+   subscriber if the subscription was originally created with the
+   <literal>two_phase = true</literal> option.
   </para>
  </refsect1>
 
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f67d813..6d3efb4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2458,3 +2458,71 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
 }
+
+/*
+ * LookupGXact
+ *		Check if the prepared transaction with the given GID, lsn and timestamp
+ *		exists.
+ *
+ * Note that we always compare with the LSN where prepare ends because that is
+ * what is stored as origin_lsn in the 2PC file.
+ *
+ * This function is primarily used to check if the prepared transaction
+ * received from the upstream (remote node) already exists. Checking only GID
+ * is not sufficient because a different prepared xact with the same GID can
+ * exist on the same node. So, we also match the origin_lsn and
+ * origin_timestamp of the prepared xact to avoid matching a prepared xact
+ * that came from a different node.
+ */
+bool
+LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn,
+			TimestampTz origin_prepare_timestamp)
+{
+	int			i;
+	bool		found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+		/* Ignore not-yet-valid GIDs. */
+		if (gxact->valid && strcmp(gxact->gid, gid) == 0)
+		{
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+
+			/*
+			 * We are not expecting collisions of GXACTs (same gid) between
+			 * publisher and subscribers, so we perform all I/O while holding
+			 * TwoPhaseStateLock for simplicity.
+			 *
+			 * To move the I/O out of the lock, we need to ensure that no
+			 * other backend commits the prepared xact in the meantime. We can
+			 * do this optimization if we encounter many collisions in GID
+			 * between publisher and subscriber.
+			 */
+			if (gxact->ondisk)
+				buf = ReadTwoPhaseFile(gxact->xid, false);
+			else
+			{
+				Assert(gxact->prepare_start_lsn);
+				XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+			}
+
+			hdr = (TwoPhaseFileHeader *) buf;
+
+			if (hdr->origin_lsn == prepare_end_lsn &&
+				hdr->origin_timestamp == origin_prepare_timestamp)
+			{
+				found = true;
+				pfree(buf);
+				break;
+			}
+
+			pfree(buf);
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+	return found;
+}
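The matching rule in LookupGXact can be illustrated with a small standalone model. This is only a sketch: PreparedXact and lookup_prepared are hypothetical stand-ins, the array replaces TwoPhaseState->prepXacts, and the locking and 2PC file/WAL reads done by the real function are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one prepared transaction entry. */
typedef struct PreparedXact
{
	bool		valid;				/* entry is fully prepared */
	const char *gid;				/* user-defined global identifier */
	uint64_t	origin_lsn;			/* LSN where the remote prepare ended */
	int64_t		origin_timestamp;	/* remote prepare timestamp */
} PreparedXact;

/*
 * Return true only if an entry matches on GID *and* on origin LSN *and* on
 * origin timestamp.  A GID match alone is not enough, because an unrelated
 * prepared transaction may reuse the same GID.
 */
bool
lookup_prepared(const PreparedXact *xacts, int n, const char *gid,
				uint64_t prepare_end_lsn, int64_t prepare_timestamp)
{
	for (int i = 0; i < n; i++)
	{
		const PreparedXact *x = &xacts[i];

		/* Ignore not-yet-valid entries and entries with other GIDs. */
		if (!x->valid || strcmp(x->gid, gid) != 0)
			continue;

		if (x->origin_lsn == prepare_end_lsn &&
			x->origin_timestamp == prepare_timestamp)
			return true;
	}
	return false;
}
```

Note that the loop keeps scanning after a GID-only match, mirroring the patch, where the break happens only once all three fields agree.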
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..25021e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -68,6 +68,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->enabled = subform->subenabled;
 	sub->binary = subform->subbinary;
 	sub->stream = subform->substream;
+	sub->twophasestate = subform->subtwophasestate;
 
 	/* Get conninfo */
 	datum = SysCacheGetAttr(SUBSCRIPTIONOID,
@@ -450,6 +451,39 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	table_close(rel, RowExclusiveLock);
 }
 
+/*
+ * Does the subscription have any relations?
+ *
+ * Use this function only to know true/false, and when you have no need for the
+ * List returned by GetSubscriptionRelations.
+ */
+bool
+HasSubscriptionRelations(Oid subid)
+{
+	Relation	rel;
+	ScanKeyData skey[1];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, 1, skey);
+
+	/* If even a single tuple exists then the subscription has tables. */
+	has_subrels = HeapTupleIsValid(systable_getnext(scan));
+
+	/* Cleanup */
+	systable_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	return has_subrels;
+}
 
 /*
  * Get all relations for subscription.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984..55f6e37 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1255,5 +1255,5 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subslotname, subsynccommit, subpublications)
+              substream, subtwophasestate, subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index eb88d87..971eb88 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -59,6 +59,7 @@
 #define SUBOPT_REFRESH				0x00000040
 #define SUBOPT_BINARY				0x00000080
 #define SUBOPT_STREAMING			0x00000100
+#define SUBOPT_TWOPHASE_COMMIT		0x00000200
 
 /* check if the 'val' has 'bits' set */
 #define IsSet(val, bits)  (((val) & (bits)) == (bits))
@@ -79,6 +80,7 @@ typedef struct SubOpts
 	bool		refresh;
 	bool		binary;
 	bool		streaming;
+	bool		twophase;
 } SubOpts;
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
@@ -123,6 +125,8 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 		opts->binary = false;
 	if (IsSet(supported_opts, SUBOPT_STREAMING))
 		opts->streaming = false;
+	if (IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT))
+		opts->twophase = false;
 
 	/* Parse options */
 	foreach(lc, stmt_options)
@@ -237,6 +241,29 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 			opts->specified_opts |= SUBOPT_STREAMING;
 			opts->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			/*
+			 * Do not allow toggling of two_phase option. Doing so could cause
+			 * transactions to be missed and lead to an inconsistent replica.
+			 * See comments atop worker.c
+			 *
+			 * Note: Unsupported twophase indicates that this call originated
+			 * from AlterSubscription.
+			 */
+			if (!IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT))
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized subscription parameter: \"%s\"", defel->defname)));
+
+			if (IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+
+			opts->specified_opts |= SUBOPT_TWOPHASE_COMMIT;
+			opts->twophase = defGetBoolean(defel);
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -325,6 +352,25 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
+
+	/*
+	 * Do additional checking for the disallowed combination of two_phase and
+	 * streaming. While streaming and two_phase can theoretically be
+	 * supported, it needs more analysis to allow them together.
+	 */
+	if (opts->twophase &&
+		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
+		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
+	{
+		if (opts->streaming &&
+			IsSet(supported_opts, SUBOPT_STREAMING) &&
+			IsSet(opts->specified_opts, SUBOPT_STREAMING))
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+			/*- translator: both %s are strings of the form "option = value" */
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase = true", "streaming = true")));
+	}
 }
 
 /*
@@ -385,7 +431,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	supported_opts = (SUBOPT_CONNECT | SUBOPT_ENABLED | SUBOPT_CREATE_SLOT |
 					  SUBOPT_SLOT_NAME | SUBOPT_COPY_DATA |
 					  SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
-					  SUBOPT_STREAMING);
+					  SUBOPT_STREAMING | SUBOPT_TWOPHASE_COMMIT);
 	parse_subscription_options(stmt->options, supported_opts, &opts);
 
 	/*
@@ -455,6 +501,10 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(opts.enabled);
 	values[Anum_pg_subscription_subbinary - 1] = BoolGetDatum(opts.binary);
 	values[Anum_pg_subscription_substream - 1] = BoolGetDatum(opts.streaming);
+	values[Anum_pg_subscription_subtwophasestate - 1] =
+		CharGetDatum(opts.twophase ?
+					 LOGICALREP_TWOPHASE_STATE_PENDING :
+					 LOGICALREP_TWOPHASE_STATE_DISABLED);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (opts.slot_name)
@@ -532,10 +582,35 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 			 */
 			if (opts.create_slot)
 			{
+				bool		twophase_enabled = false;
+
 				Assert(opts.slot_name);
 
-				walrcv_create_slot(wrconn, opts.slot_name, false,
+				/*
+				 * Even if two_phase is set, don't create the slot with
+				 * two-phase enabled. Will enable it once all the tables are
+				 * synced and ready. This avoids race-conditions like prepared
+				 * transactions being skipped due to changes not being applied
+				 * due to checks in should_apply_changes_for_rel() when
+				 * tablesync for the corresponding tables are in progress. See
+				 * comments atop worker.c.
+				 *
+				 * Note that if tables were specified but copy_data is false
+				 * then it is safe to enable two_phase up-front because those
+				 * tables are already initially in READY state. When the
+				 * subscription has no tables, we leave the twophase state as
+				 * PENDING, to allow ALTER SUBSCRIPTION ... REFRESH
+				 * PUBLICATION to work.
+				 */
+				if (opts.twophase && !opts.copy_data && tables != NIL)
+					twophase_enabled = true;
+
+				walrcv_create_slot(wrconn, opts.slot_name, false, twophase_enabled,
 								   CRS_NOEXPORT_SNAPSHOT, NULL);
+
+				if (twophase_enabled)
+					UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+
 				ereport(NOTICE,
 						(errmsg("created replication slot \"%s\" on publisher",
 								opts.slot_name)));
@@ -865,6 +940,12 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
+					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("cannot set %s for two-phase enabled subscription",
+										"streaming = true")));
+
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
@@ -927,6 +1008,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Make sure refresh sees the new list of publications. */
@@ -966,6 +1058,17 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
 								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
 
+					/*
+					 * See ALTER_SUBSCRIPTION_REFRESH for details why this is
+					 * not allowed.
+					 */
+					if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false"
+										 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 					PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");
 
 					/* Only refresh the added/dropped list of publications. */
@@ -986,6 +1089,30 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				parse_subscription_options(stmt->options, SUBOPT_COPY_DATA, &opts);
 
+				/*
+				 * The subscription option "two_phase" requires that
+				 * replication has passed the initial table synchronization
+				 * phase before the two_phase becomes properly enabled.
+				 *
+				 * But, having reached this two-phase commit "enabled" state
+				 * we must not allow any subsequent table initialization to
+				 * occur. So the ALTER SUBSCRIPTION ... REFRESH is disallowed
+				 * when the user had requested two_phase = on mode.
+				 *
+				 * The exception to this restriction is when copy_data =
+				 * false, because when copy_data is false the tablesync will
+				 * start already in READY state and will exit directly without
+				 * doing anything.
+				 *
+				 * For more details see comments atop worker.c.
+				 */
+				if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+							 errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+									 ", or use DROP/CREATE SUBSCRIPTION.")));
+
 				PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... REFRESH");
 
 				AlterSubscription_refresh(sub, opts.copy_data);
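The subscription-side rules in subscriptioncmds.c above can be summarized in a small standalone model. This is a sketch under stated assumptions: the function names are hypothetical, the state codes are the ones documented for subtwophasestate, and the create_slot parameter stands for "CREATE SUBSCRIPTION itself creates the slot", which is the only path where the patch enables two_phase up front.

```c
#include <assert.h>
#include <stdbool.h>

/* State codes as documented for pg_subscription.subtwophasestate. */
#define TWOPHASE_STATE_DISABLED	'd'
#define TWOPHASE_STATE_PENDING	'p'
#define TWOPHASE_STATE_ENABLED	'e'

/* two_phase = true and streaming = true are mutually exclusive. */
bool
options_conflict(bool two_phase, bool streaming)
{
	return two_phase && streaming;
}

/*
 * State stored at CREATE SUBSCRIPTION time.  two_phase starts as PENDING
 * and is switched to ENABLED immediately only when the slot is created
 * here, no initial copy is requested, and the subscription has tables
 * (which are then already in READY state).
 */
char
initial_twophase_state(bool two_phase, bool create_slot,
					   bool copy_data, bool has_tables)
{
	if (!two_phase)
		return TWOPHASE_STATE_DISABLED;
	if (create_slot && !copy_data && has_tables)
		return TWOPHASE_STATE_ENABLED;
	return TWOPHASE_STATE_PENDING;
}

/* A refresh with copy_data is rejected once two_phase is ENABLED. */
bool
refresh_allowed(char twophase_state, bool copy_data)
{
	return !(twophase_state == TWOPHASE_STATE_ENABLED && copy_data);
}
```

A subscription without tables stays in PENDING even with copy_data = false, which keeps ALTER SUBSCRIPTION ... REFRESH PUBLICATION usable afterwards.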
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 6eaa84a..19ea159 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -73,6 +73,7 @@ static void libpqrcv_send(WalReceiverConn *conn, const char *buffer,
 static char *libpqrcv_create_slot(WalReceiverConn *conn,
 								  const char *slotname,
 								  bool temporary,
+								  bool two_phase,
 								  CRSSnapshotAction snapshot_action,
 								  XLogRecPtr *lsn);
 static pid_t libpqrcv_get_backend_pid(WalReceiverConn *conn);
@@ -436,6 +437,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 			PQserverVersion(conn->streamConn) >= 140000)
 			appendStringInfoString(&cmd, ", streaming 'on'");
 
+		if (options->proto.logical.twophase &&
+			PQserverVersion(conn->streamConn) >= 150000)
+			appendStringInfoString(&cmd, ", two_phase 'on'");
+
 		pubnames = options->proto.logical.publication_names;
 		pubnames_str = stringlist_to_identifierstr(conn->streamConn, pubnames);
 		if (!pubnames_str)
@@ -851,7 +856,7 @@ libpqrcv_send(WalReceiverConn *conn, const char *buffer, int nbytes)
  */
 static char *
 libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
-					 bool temporary, CRSSnapshotAction snapshot_action,
+					 bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
 					 XLogRecPtr *lsn)
 {
 	PGresult   *res;
@@ -868,6 +873,9 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
 	if (conn->logical)
 	{
 		appendStringInfoString(&cmd, " LOGICAL pgoutput");
+		if (two_phase)
+			appendStringInfoString(&cmd, " TWO_PHASE");
+
 		switch (snapshot_action)
 		{
 			case CRS_EXPORT_SNAPSHOT:
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 453efc5..2874dc0 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -374,11 +374,10 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				 *
 				 * XXX Now, this can even lead to a deadlock if the prepare
 				 * transaction is waiting to get it logically replicated for
-				 * distributed 2PC. Currently, we don't have an in-core
-				 * implementation of prepares for distributed 2PC but some
-				 * out-of-core logical replication solution can have such an
-				 * implementation. They need to inform users to not have locks
-				 * on catalog tables in such transactions.
+				 * distributed 2PC. This can be avoided by disallowing
+				 * preparing transactions that have locked [user] catalog
+				 * tables exclusively but as of now, we ask users not to do
+				 * such an operation.
 				 */
 				DecodePrepare(ctx, buf, &parsed);
 				break;
@@ -735,7 +734,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	if (two_phase)
 	{
 		ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
-									SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
+									SnapBuildGetTwoPhaseAt(ctx->snapshot_builder),
 									commit_time, origin_id, origin_lsn,
 									parsed->twophase_gid, true);
 	}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index d536a5f..d61ef4c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
 		AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
-								need_full_snapshot, slot->data.initial_consistent_point);
+								need_full_snapshot, slot->data.two_phase_at);
 
 	ctx->reorder->private_data = ctx;
 
@@ -432,10 +432,12 @@ CreateInitDecodingContext(const char *plugin,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= slot->data.two_phase;
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -538,10 +540,22 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 	MemoryContextSwitchTo(old_context);
 
 	/*
-	 * We allow decoding of prepared transactions iff the two_phase option is
-	 * enabled at the time of slot creation.
+	 * We allow decoding of prepared transactions when the two_phase is
+	 * enabled at the time of slot creation, or when the two_phase option is
+	 * given at the streaming start, provided the plugin supports all the
+	 * callbacks for two-phase.
 	 */
-	ctx->twophase &= MyReplicationSlot->data.two_phase;
+	ctx->twophase &= (slot->data.two_phase || ctx->twophase_opt_given);
+
+	/* Mark slot to allow two_phase decoding if not already marked */
+	if (ctx->twophase && !slot->data.two_phase)
+	{
+		slot->data.two_phase = true;
+		slot->data.two_phase_at = start_lsn;
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+	}
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
@@ -602,7 +616,8 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 
 	SpinLockAcquire(&slot->mutex);
 	slot->data.confirmed_flush = ctx->reader->EndRecPtr;
-	slot->data.initial_consistent_point = ctx->reader->EndRecPtr;
+	if (slot->data.two_phase)
+		slot->data.two_phase_at = ctx->reader->EndRecPtr;
 	SpinLockRelease(&slot->mutex);
 }
 
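The two_phase gating added to CreateDecodingContext can be modelled roughly as follows. SlotModel and resolve_twophase are hypothetical names, and plugin_supports_twophase stands in for the value ctx->twophase is assumed to carry before this check (whether the output plugin provides all two-phase callbacks); persisting the slot is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the persistent slot data. */
typedef struct SlotModel
{
	bool		two_phase;		/* slot allows two-phase decoding */
	uint64_t	two_phase_at;	/* LSN from which two-phase is enabled */
} SlotModel;

/*
 * Decide whether prepared transactions are decoded for this session.
 * Two-phase decoding is active if the plugin supports it and either the
 * slot was already marked two_phase or the option was given at streaming
 * start.  In the latter case the slot is marked, with start_lsn recorded
 * as the point from which two-phase decoding applies.
 */
bool
resolve_twophase(bool plugin_supports_twophase, bool twophase_opt_given,
				 uint64_t start_lsn, SlotModel *slot)
{
	bool		twophase = plugin_supports_twophase;

	twophase &= (slot->two_phase || twophase_opt_given);

	/* Mark slot to allow two_phase decoding if not already marked. */
	if (twophase && !slot->two_phase)
	{
		slot->two_phase = true;
		slot->two_phase_at = start_lsn;
	}
	return twophase;
}
```

Once the slot is marked, later sessions decode prepared transactions even without the option, and two_phase_at is not moved again.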
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index cb42fcb..2c191de 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -973,8 +973,11 @@ replorigin_advance(RepOriginId node,
 
 	/*
 	 * Due to - harmless - race conditions during a checkpoint we could see
-	 * values here that are older than the ones we already have in memory.
-	 * Don't overwrite those.
+	 * values here that are older than the ones we already have in memory. We
+	 * could also see older values for prepared transactions when the prepare
+	 * is sent at a later point of time along with commit prepared and there
+	 * are other transaction commits between prepare and commit prepared. See
+	 * ReorderBufferFinishPrepared. Don't overwrite those.
 	 */
 	if (go_backward || replication_state->remote_lsn < remote_commit)
 		replication_state->remote_lsn = remote_commit;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1cf59e0..13c8c3b 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 	pq_sendint32(out, txn->xid);
 }
 
@@ -85,7 +85,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
@@ -107,6 +107,217 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 }
 
 /*
+ * Write BEGIN PREPARE to the output stream.
+ */
+void
+logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn)
+{
+	pq_sendbyte(out, LOGICAL_REP_MSG_BEGIN_PREPARE);
+
+	/* fixed fields */
+	pq_sendint64(out, txn->final_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction BEGIN PREPARE from the stream.
+ */
+void
+logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data)
+{
+	/* read fields */
+	begin_data->prepare_lsn = pq_getmsgint64(in);
+	if (begin_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn not set in begin prepare message");
+	begin_data->end_lsn = pq_getmsgint64(in);
+	if (begin_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn not set in begin prepare message");
+	begin_data->prepare_time = pq_getmsgint64(in);
+	begin_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(begin_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write PREPARE to the output stream.
+ */
+void
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+	Assert(rbtxn_prepared(txn));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_lsn is not set in prepare message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in prepare message");
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write COMMIT PREPARED to the output stream.
+ */
+void
+logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+								 XLogRecPtr commit_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, commit_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction COMMIT PREPARED from the stream.
+ */
+void
+logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *prepare_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in commit prepared message", flags);
+
+	/* read fields */
+	prepare_data->commit_lsn = pq_getmsgint64(in);
+	if (prepare_data->commit_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "commit_lsn is not set in commit prepared message");
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	if (prepare_data->end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "end_lsn is not set in commit prepared message");
+	prepare_data->commit_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(prepare_data->gid, pq_getmsgstring(in));
+}
+
+/*
+ * Write ROLLBACK PREPARED to the output stream.
+ */
+void
+logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+								   XLogRecPtr prepare_end_lsn,
+								   TimestampTz prepare_time)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_end_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, prepare_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read transaction ROLLBACK PREPARED from the stream.
+ */
+void
+logicalrep_read_rollback_prepared(StringInfo in,
+								  LogicalRepRollbackPreparedTxnData *rollback_data)
+{
+	/* read flags */
+	uint8		flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in rollback prepared message", flags);
+
+	/* read fields */
+	rollback_data->prepare_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "prepare_end_lsn is not set in rollback prepared message");
+	rollback_data->rollback_end_lsn = pq_getmsgint64(in);
+	if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr)
+		elog(ERROR, "rollback_end_lsn is not set in rollback prepared message");
+	rollback_data->prepare_time = pq_getmsgint64(in);
+	rollback_data->rollback_time = pq_getmsgint64(in);
+	rollback_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strcpy(rollback_data->gid, pq_getmsgstring(in));
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
@@ -841,7 +1052,7 @@ logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, txn->xact_time.commit_time);
 }
 
 /*
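For readers following the wire format, the BEGIN PREPARE layout produced by
logicalrep_write_begin_prepare() above can be sketched outside the backend.
This is a minimal standalone illustration under stated assumptions, not the
patch's code: the Sketch* names, the 'b' tag value and the 200-byte GID buffer
are placeholders invented here, and the sketch keeps the tag byte in the
buffer it parses, whereas in the patch the tag is consumed before the read
function runs. It does mirror the field order shown above (two LSNs, the
prepare time, the xid, then a NUL-terminated gid) and the network byte order
that the pq_sendint* helpers use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200				/* assumed size of the GID buffer */
#define SKETCH_MSG_BEGIN_PREPARE 'b'	/* placeholder message tag */

typedef struct SketchBeginPrepare
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[SKETCH_GIDSIZE];
} SketchBeginPrepare;

/* Store the low nbytes of v in big-endian (network) order. */
static size_t
sketch_put_be(uint8_t *buf, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		buf[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return (size_t) nbytes;
}

/* Load an nbytes-wide big-endian integer. */
static uint64_t
sketch_get_be(const uint8_t *buf, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | buf[i];
	return v;
}

/* Serialize the message; returns bytes written, or 0 if buf is too small. */
static size_t
sketch_write_begin_prepare(uint8_t *buf, size_t cap,
						   const SketchBeginPrepare *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (1 + 8 + 8 + 8 + 4 + gidlen > cap)
		return 0;

	buf[off++] = SKETCH_MSG_BEGIN_PREPARE;
	off += sketch_put_be(buf + off, m->prepare_lsn, 8);
	off += sketch_put_be(buf + off, m->end_lsn, 8);
	off += sketch_put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += sketch_put_be(buf + off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Parse the message; returns 1 on success, 0 if it is malformed. */
static int
sketch_read_begin_prepare(const uint8_t *buf, size_t len,
						  SketchBeginPrepare *m)
{
	const size_t fixed = 1 + 8 + 8 + 8 + 4;
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	if (len <= fixed || buf[0] != SKETCH_MSG_BEGIN_PREPARE)
		return 0;

	m->prepare_lsn = sketch_get_be(buf + 1, 8);
	m->end_lsn = sketch_get_be(buf + 9, 8);
	m->prepare_time = (int64_t) sketch_get_be(buf + 17, 8);
	m->xid = (uint32_t) sketch_get_be(buf + 25, 4);

	/* Like the reader above, reject unset (zero) LSNs. */
	if (m->prepare_lsn == 0 || m->end_lsn == 0)
		return 0;

	/* Bounded copy: the gid must be terminated and fit in the buffer. */
	gid = buf + fixed;
	nul = memchr(gid, '\0', len - fixed);
	if (nul == NULL)
		return 0;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= SKETCH_GIDSIZE)
		return 0;
	memcpy(m->gid, gid, gidlen + 1);
	return 1;
}
```

The sketch bounds the gid copy explicitly, a design choice made here for the
standalone case; the patch's readers copy into a pre-allocated buffer.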
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1b4f4a5..7378beb 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2576,7 +2576,7 @@ ReorderBufferReplay(ReorderBufferTXN *txn,
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2667,7 +2667,7 @@ ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
 	 */
 	txn->final_lsn = prepare_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = prepare_time;
+	txn->xact_time.prepare_time = prepare_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
@@ -2714,7 +2714,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 	Assert(txn->final_lsn != InvalidXLogRecPtr);
 
 	ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-						txn->commit_time, txn->origin_id, txn->origin_lsn);
+						txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 
 	/*
 	 * We send the prepare for the concurrently aborted xacts so that later
@@ -2734,7 +2734,7 @@ ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
 void
 ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 							XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-							XLogRecPtr initial_consistent_point,
+							XLogRecPtr two_phase_at,
 							TimestampTz commit_time, RepOriginId origin_id,
 							XLogRecPtr origin_lsn, char *gid, bool is_commit)
 {
@@ -2753,19 +2753,20 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 	 * be later used for rollback.
 	 */
 	prepare_end_lsn = txn->end_lsn;
-	prepare_time = txn->commit_time;
+	prepare_time = txn->xact_time.prepare_time;
 
 	/* add the gid in the txn */
 	txn->gid = pstrdup(gid);
 
 	/*
 	 * It is possible that this transaction is not decoded at prepare time
-	 * either because by that time we didn't have a consistent snapshot or it
-	 * was decoded earlier but we have restarted. We only need to send the
-	 * prepare if it was not decoded earlier. We don't need to decode the xact
-	 * for aborts if it is not done already.
+	 * either because by that time we didn't have a consistent snapshot, or
+	 * two_phase was not enabled, or it was decoded earlier but we have
+	 * restarted. We only need to send the prepare if it was not decoded
+	 * earlier. We don't need to decode the xact for aborts if it is not done
+	 * already.
 	 */
-	if ((txn->final_lsn < initial_consistent_point) && is_commit)
+	if ((txn->final_lsn < two_phase_at) && is_commit)
 	{
 		txn->txn_flags |= RBTXN_PREPARE;
 
@@ -2783,12 +2784,12 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 		 * prepared after the restart.
 		 */
 		ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
-							txn->commit_time, txn->origin_id, txn->origin_lsn);
+							txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
 	}
 
 	txn->final_lsn = commit_lsn;
 	txn->end_lsn = end_lsn;
-	txn->commit_time = commit_time;
+	txn->xact_time.commit_time = commit_time;
 	txn->origin_id = origin_id;
 	txn->origin_lsn = origin_lsn;
 
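The condition in ReorderBufferFinishPrepared() above decides whether the
prepare has to be replayed together with COMMIT PREPARED. A minimal sketch of
that decision in isolation, assuming plain 64-bit integers stand in for
XLogRecPtr (SketchLsn and the function name are hypothetical, not part of the
patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLsn;		/* stand-in for XLogRecPtr */

/*
 * Returns true when the transaction contents and the prepare must be sent
 * at commit prepared time: the prepare happened before the point at which
 * two-phase decoding became active, and the transaction is being committed
 * rather than rolled back.
 */
static bool
sketch_must_replay_prepare(SketchLsn prepare_lsn, SketchLsn two_phase_at,
						   bool is_commit)
{
	return is_commit && prepare_lsn < two_phase_at;
}
```

A prepare at or after two_phase_at was already decoded at prepare time, so it
is not replayed; a rollback never replays it.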
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 04f3355..a14a3d6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -165,15 +165,15 @@ struct SnapBuild
 	XLogRecPtr	start_decoding_at;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which two-phase decoding was enabled or LSN at which we found a
+	 * consistent point at the time of slot creation.
 	 *
-	 * The prepared transactions that are not covered by initial snapshot
-	 * needs to be sent later along with commit prepared and they must be
-	 * before this point.
+	 * The prepared transactions, that were skipped because previously
+	 * two-phase was not enabled or are not covered by initial snapshot, need
+	 * to be sent later along with commit prepared and they must be before
+	 * this point.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Don't start decoding WAL until the "xl_running_xacts" information
@@ -281,7 +281,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 						TransactionId xmin_horizon,
 						XLogRecPtr start_lsn,
 						bool need_full_snapshot,
-						XLogRecPtr initial_consistent_point)
+						XLogRecPtr two_phase_at)
 {
 	MemoryContext context;
 	MemoryContext oldcontext;
@@ -309,7 +309,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
 	builder->initial_xmin_horizon = xmin_horizon;
 	builder->start_decoding_at = start_lsn;
 	builder->building_full_snapshot = need_full_snapshot;
-	builder->initial_consistent_point = initial_consistent_point;
+	builder->two_phase_at = two_phase_at;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -370,12 +370,21 @@ SnapBuildCurrentState(SnapBuild *builder)
 }
 
 /*
- * Return the LSN at which the snapshot was exported
+ * Return the LSN at which the two-phase decoding was first enabled.
  */
 XLogRecPtr
-SnapBuildInitialConsistentPoint(SnapBuild *builder)
+SnapBuildGetTwoPhaseAt(SnapBuild *builder)
 {
-	return builder->initial_consistent_point;
+	return builder->two_phase_at;
+}
+
+/*
+ * Set the LSN at which two-phase decoding is enabled.
+ */
+void
+SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr)
+{
+	builder->two_phase_at = ptr;
 }
 
 /*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 682c107..f07983a 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -96,6 +96,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/indexing.h"
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 #include "commands/copy.h"
@@ -114,8 +115,11 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 
 static bool table_states_valid = false;
+static List *table_states_not_ready = NIL;
+static bool FetchTableStates(bool *started_tx);
 
 StringInfo	copybuf = NULL;
 
@@ -362,7 +366,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 		Oid			relid;
 		TimestampTz last_start_time;
 	};
-	static List *table_states = NIL;
 	static HTAB *last_start_times = NULL;
 	ListCell   *lc;
 	bool		started_tx = false;
@@ -370,42 +373,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	Assert(!IsTransactionState());
 
 	/* We need up-to-date sync state info for subscription tables here. */
-	if (!table_states_valid)
-	{
-		MemoryContext oldctx;
-		List	   *rstates;
-		ListCell   *lc;
-		SubscriptionRelState *rstate;
-
-		/* Clean the old list. */
-		list_free_deep(table_states);
-		table_states = NIL;
-
-		StartTransactionCommand();
-		started_tx = true;
-
-		/* Fetch all non-ready tables. */
-		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
-
-		/* Allocate the tracking info in a permanent memory context. */
-		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
-		foreach(lc, rstates)
-		{
-			rstate = palloc(sizeof(SubscriptionRelState));
-			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
-			table_states = lappend(table_states, rstate);
-		}
-		MemoryContextSwitchTo(oldctx);
-
-		table_states_valid = true;
-	}
+	FetchTableStates(&started_tx);
 
 	/*
 	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
-	if (table_states && !last_start_times)
+	if (table_states_not_ready && !last_start_times)
 	{
 		HASHCTL		ctl;
 
@@ -419,16 +394,38 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	 * Clean up the hash table when we're done with all tables (just to
 	 * release the bit of memory).
 	 */
-	else if (!table_states && last_start_times)
+	else if (!table_states_not_ready && last_start_times)
 	{
 		hash_destroy(last_start_times);
 		last_start_times = NULL;
 	}
 
 	/*
+	 * Even when the two_phase mode is requested by the user, it remains as
+	 * 'pending' until all tablesyncs have reached READY state.
+	 *
+	 * When this happens, we restart the apply worker and (if the conditions
+	 * are still ok) then the two_phase tri-state will become 'enabled' at
+	 * that time.
+	 *
+	 * Note: If the subscription has no tables then leave the state as
+	 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+	 * work.
+	 */
+	if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+		AllTablesyncsReady())
+	{
+		ereport(LOG,
+				(errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
+						MySubscription->name)));
+
+		proc_exit(0);
+	}
+
+	/*
 	 * Process all tables that are being synchronized.
 	 */
-	foreach(lc, table_states)
+	foreach(lc, table_states_not_ready)
 	{
 		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
 
@@ -1071,7 +1068,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 	 * slot leading to a dangling slot on the server.
 	 */
 	HOLD_INTERRUPTS();
-	walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ ,
+	walrcv_create_slot(LogRepWorkerWalRcvConn,
+					   slotname, false /* permanent */ , false /* two_phase */ ,
 					   CRS_USE_SNAPSHOT, origin_startpos);
 	RESUME_INTERRUPTS();
 
@@ -1158,3 +1156,134 @@ copy_table_done:
 	wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
 	return slotname;
 }
+
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ *
+ * Note: If this function started the transaction (indicated by the parameter)
+ * then it is the caller's responsibility to commit it.
+ */
+static bool
+FetchTableStates(bool *started_tx)
+{
+	static bool has_subrels = false;
+
+	*started_tx = false;
+
+	if (!table_states_valid)
+	{
+		MemoryContext oldctx;
+		List	   *rstates;
+		ListCell   *lc;
+		SubscriptionRelState *rstate;
+
+		/* Clean the old lists. */
+		list_free_deep(table_states_not_ready);
+		table_states_not_ready = NIL;
+
+		if (!IsTransactionState())
+		{
+			StartTransactionCommand();
+			*started_tx = true;
+		}
+
+		/* Fetch all non-ready tables. */
+		rstates = GetSubscriptionNotReadyRelations(MySubscription->oid);
+
+		/* Allocate the tracking info in a permanent memory context. */
+		oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+		foreach(lc, rstates)
+		{
+			rstate = palloc(sizeof(SubscriptionRelState));
+			memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState));
+			table_states_not_ready = lappend(table_states_not_ready, rstate);
+		}
+		MemoryContextSwitchTo(oldctx);
+
+		/*
+		 * Does the subscription have tables?
+		 *
+		 * If there were not-READY relations found then we know it does. But
+		 * if table_states_not_ready was empty we still need to check again to
+		 * see if there are 0 tables.
+		 */
+		has_subrels = (list_length(table_states_not_ready) > 0) ||
+			HasSubscriptionRelations(MySubscription->oid);
+
+		table_states_valid = true;
+	}
+
+	return has_subrels;
+}
+
+/*
+ * If the subscription has no tables then return false.
+ *
+ * Otherwise, are all tablesyncs READY?
+ *
+ * Note: This function is not suitable to be called from outside of apply or
+ * tablesync workers because MySubscription needs to be already initialized.
+ */
+bool
+AllTablesyncsReady(void)
+{
+	bool		started_tx = false;
+	bool		has_subrels = false;
+
+	/* We need up-to-date sync state info for subscription tables here. */
+	has_subrels = FetchTableStates(&started_tx);
+
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+	}
+
+	/*
+	 * Return false when there are no tables in subscription or not all tables
+	 * are in ready state; true otherwise.
+	 */
+	return has_subrels && list_length(table_states_not_ready) == 0;
+}
+
+/*
+ * Update the two_phase state of the specified subscription in pg_subscription.
+ */
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
+{
+	Relation	rel;
+	HeapTuple	tup;
+	bool		nulls[Natts_pg_subscription];
+	bool		replaces[Natts_pg_subscription];
+	Datum		values[Natts_pg_subscription];
+
+	Assert(new_state == LOGICALREP_TWOPHASE_STATE_DISABLED ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_PENDING ||
+		   new_state == LOGICALREP_TWOPHASE_STATE_ENABLED);
+
+	rel = table_open(SubscriptionRelationId, RowExclusiveLock);
+	tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(suboid));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR,
+			 "cache lookup failed for subscription oid %u",
+			 suboid);
+
+	/* Form a new tuple. */
+	memset(values, 0, sizeof(values));
+	memset(nulls, false, sizeof(nulls));
+	memset(replaces, false, sizeof(replaces));
+
+	/* And update/set two_phase state */
+	values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+	replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+
+	tup = heap_modify_tuple(tup, RelationGetDescr(rel),
+							values, nulls, replaces);
+	CatalogTupleUpdate(rel, &tup->t_self, tup);
+
+	heap_freetuple(tup);
+	table_close(rel, RowExclusiveLock);
+}
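The readiness rule implemented by FetchTableStates() and AllTablesyncsReady()
above, together with the restart check added to
process_syncing_tables_for_apply(), reduces to two small predicates. The
sketch below is only an illustration of that logic: the enum and function
names are hypothetical, and the catalog lookups are replaced by plain
parameters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the LOGICALREP_TWOPHASE_STATE_* values. */
typedef enum SketchTwoPhaseState
{
	SKETCH_TWOPHASE_DISABLED,
	SKETCH_TWOPHASE_PENDING,
	SKETCH_TWOPHASE_ENABLED
} SketchTwoPhaseState;

/*
 * A subscription with no tables is never "all ready"; otherwise it is ready
 * once no relation remains in a not-READY state.
 */
static bool
sketch_all_tablesyncs_ready(bool has_subrels, int n_not_ready)
{
	return has_subrels && n_not_ready == 0;
}

/*
 * The apply worker exits (to be restarted with two_phase enabled) only while
 * the tri-state is still PENDING and every tablesync has reached READY.
 */
static bool
sketch_should_restart_apply_worker(SketchTwoPhaseState state,
								   bool has_subrels, int n_not_ready)
{
	return state == SKETCH_TWOPHASE_PENDING &&
		sketch_all_tablesyncs_ready(has_subrels, n_not_ready);
}
```

Treating an empty subscription as not ready is what keeps the state at
PENDING when there are no tables, as the comment in
process_syncing_tables_for_apply() describes.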
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 5fc620c..161cd92 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -49,6 +49,79 @@
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
+ *
+ * TWO_PHASE TRANSACTIONS
+ * ----------------------
+ * Two phase transactions are replayed at prepare and then committed or
+ * rolled back at commit prepared and rollback prepared respectively. It is
+ * possible to have a prepared transaction that arrives at the apply worker
+ * when the tablesync is busy doing the initial copy. In this case, the apply
+ * worker skips all the prepared operations [e.g. inserts] while the tablesync
+ * is still busy (see the condition of should_apply_changes_for_rel). The
+ * tablesync worker might not get such a prepared transaction because say it
+ * was prior to the initial consistent point but might have got some later
+ * commits. Now, the tablesync worker will exit without doing anything for the
+ * prepared transaction skipped by the apply worker as the sync location for it
+ * will be already ahead of the apply worker's current location. This would lead
+ * to an "empty prepare", because later when the apply worker does the commit
+ * prepared, there is nothing in it (the inserts were skipped earlier).
+ *
+ * To avoid this and similar prepare confusions, the subscription's two_phase
+ * commit is enabled only after the initial sync is over. The two_phase option
+ * has been implemented as a tri-state with values DISABLED, PENDING, and
+ * ENABLED.
+ *
+ * Even if the user specifies they want a subscription with two_phase = on,
+ * internally it will start with a tri-state of PENDING which only becomes
+ * ENABLED after all tablesync initializations are completed - i.e. when all
+ * tablesync workers have reached their READY state. In other words, the value
+ * PENDING is only a temporary state for subscription start-up.
+ *
+ * Until the two_phase is properly available (ENABLED) the subscription will
+ * behave as if two_phase = off. When the apply worker detects that all
+ * tablesyncs have become READY (while the tri-state was PENDING) it will
+ * restart the apply worker process. This happens in
+ * process_syncing_tables_for_apply.
+ *
+ * When the (re-started) apply worker finds that all tablesyncs are READY for a
+ * two_phase tri-state of PENDING it starts streaming messages with the
+ * two_phase option which in turn enables the decoding of two-phase commits at
+ * the publisher. Then, it updates the tri-state value from PENDING to ENABLED.
+ * Now, it is possible that during the time we have not enabled two_phase, the
+ * publisher (replication server) would have skipped some prepares but we
+ * ensure that such prepares are sent along with commit prepared, see
+ * ReorderBufferFinishPrepared.
+ *
+ * If the subscription has no tables then a two_phase tri-state PENDING is
+ * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH
+ * PUBLICATION which might otherwise be disallowed (see below).
+ *
+ * If ever a user needs to be aware of the tri-state value, they can fetch it
+ * from the pg_subscription catalog (see column subtwophasestate).
+ *
+ * We don't allow toggling the two_phase option of a subscription because it
+ * can lead to an inconsistent replica. Consider the case where it was
+ * initially on and we received some prepare, then we turned it off: at commit
+ * time the server would send the entire transaction data along with the
+ * commit. With some more analysis, we can allow changing this option from off
+ * to on, but it is not clear that that alone would be useful.
+ *
+ * Finally, to avoid problems mentioned in previous paragraphs from any
+ * subsequent (not READY) tablesyncs (need to toggle two_phase option from 'on'
+ * to 'off' and then again back to 'on') there is a restriction for
+ * ALTER SUBSCRIPTION REFRESH PUBLICATION. This command is not permitted when
+ * the two_phase tri-state is ENABLED, except when copy_data = false.
+ *
+ * We can get prepare of the same GID more than once for the genuine cases
+ * where we have defined multiple subscriptions for publications on the same
+ * server and prepared transaction has operations on tables subscribed to those
+ * subscriptions. For such cases, if we use the GID sent by publisher one of
+ * the prepares will be successful and others will fail, in which case the
+ * server will send them again. Now, this can lead to a deadlock if user has
+ * set synchronous_standby_names for all the subscriptions on subscriber. To
+ * avoid such deadlocks, we generate a unique GID (consisting of the
+ * subscription oid and the xid of the prepared transaction) for each prepare
+ * transaction on the subscriber.
  *-------------------------------------------------------------------------
  */
 
@@ -59,6 +132,7 @@
 
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
@@ -256,6 +330,10 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 									   LogicalRepTupleData *newtup,
 									   CmdType operation);
 
+/* Compute GID for two_phase transactions */
+static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
+
+
 /*
  * Should this worker apply changes for given relation.
  *
@@ -784,6 +862,185 @@ apply_handle_commit(StringInfo s)
 }
 
 /*
+ * Handle BEGIN PREPARE message.
+ */
+static void
+apply_handle_begin_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData begin_data;
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a BEGIN PREPARE message")));
+
+	logicalrep_read_begin_prepare(s, &begin_data);
+
+	remote_final_lsn = begin_data.prepare_lsn;
+
+	in_remote_transaction = true;
+
+	pgstat_report_activity(STATE_RUNNING, NULL);
+}
+
+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_prepare(s, &prepare_data);
+
+	if (prepare_data.prepare_lsn != remote_final_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+								 LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+								 LSN_FORMAT_ARGS(remote_final_lsn))));
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use the GID of
+	 * the prepared transaction sent by the server as that can lead to
+	 * deadlock when multiple subscriptions on one node point to publications
+	 * on the same publisher node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * Unlike commit, here, we always prepare the transaction even though no
+	 * change has happened in this transaction. It is done this way because at
+	 * commit prepared time, we won't know whether we have skipped preparing a
+	 * transaction because of no change.
+	 *
+	 * XXX, We can optimize such that at commit prepared time, we first check
+	 * whether we have prepared the transaction or not but that doesn't seem
+	 * worthwhile because such cases shouldn't be common.
+	 */
+	begin_replication_step();
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a COMMIT PREPARED of a previously PREPARED transaction.
+ */
+static void
+apply_handle_commit_prepared(StringInfo s)
+{
+	LogicalRepCommitPreparedTxnData prepare_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_commit_prepared(s, &prepare_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/* There is no transaction when COMMIT PREPARED is called */
+	begin_replication_step();
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.commit_time;
+
+	FinishPreparedTransaction(gid, true);
+	end_replication_step();
+	CommitTransactionCommand();
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
+ * Handle a ROLLBACK PREPARED of a previously PREPARED TRANSACTION.
+ */
+static void
+apply_handle_rollback_prepared(StringInfo s)
+{
+	LogicalRepRollbackPreparedTxnData rollback_data;
+	char		gid[GIDSIZE];
+
+	logicalrep_read_rollback_prepared(s, &rollback_data);
+
+	/* Compute GID for two_phase transactions. */
+	TwoPhaseTransactionGid(MySubscription->oid, rollback_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * It is possible that we haven't received prepare because it occurred
+	 * before walsender reached a consistent point or the two_phase was still
+	 * not enabled by that time, so in such cases, we need to skip rollback
+	 * prepared.
+	 */
+	if (LookupGXact(gid, rollback_data.prepare_end_lsn,
+					rollback_data.prepare_time))
+	{
+		/*
+		 * Update origin state so we can restart streaming from correct
+		 * position in case of crash.
+		 */
+		replorigin_session_origin_lsn = rollback_data.rollback_end_lsn;
+		replorigin_session_origin_timestamp = rollback_data.rollback_time;
+
+		/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
+		begin_replication_step();
+		FinishPreparedTransaction(gid, false);
+		end_replication_step();
+		CommitTransactionCommand();
+	}
+
+	pgstat_report_stat(false);
+
+	store_flush_position(rollback_data.rollback_end_lsn);
+	in_remote_transaction = false;
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(rollback_data.rollback_end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2060,6 +2317,22 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_STREAM_COMMIT:
 			apply_handle_stream_commit(s);
 			return;
+
+		case LOGICAL_REP_MSG_BEGIN_PREPARE:
+			apply_handle_begin_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_PREPARE:
+			apply_handle_prepare(s);
+			return;
+
+		case LOGICAL_REP_MSG_COMMIT_PREPARED:
+			apply_handle_commit_prepared(s);
+			return;
+
+		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
+			apply_handle_rollback_prepared(s);
+			return;
 	}
 
 	ereport(ERROR,
@@ -2539,6 +2812,9 @@ maybe_reread_subscription(void)
 	/* !slotname should never happen when enabled is true. */
 	Assert(newsub->slotname);
 
+	/* two-phase should not be altered */
+	Assert(newsub->twophasestate == MySubscription->twophasestate);
+
 	/*
 	 * Exit if any parameter that affects the remote connection was changed.
 	 * The launcher will start a new worker.
@@ -3040,6 +3316,24 @@ cleanup_subxact_info()
 	subxact_data.nsubxacts_max = 0;
 }
 
+/*
+ * Form the prepared transaction GID for two_phase transactions.
+ *
+ * Return the GID in the supplied buffer.
+ */
+static void
+TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
+{
+	Assert(subid != InvalidOid);
+
+	if (!TransactionIdIsValid(xid))
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("invalid two-phase transaction ID")));
+
+	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
+}
+
 /* Logical Replication Apply worker entry point */
 void
 ApplyWorkerMain(Datum main_arg)
@@ -3050,6 +3344,7 @@ ApplyWorkerMain(Datum main_arg)
 	XLogRecPtr	origin_startpos;
 	char	   *myslotname;
 	WalRcvStreamOptions options;
+	int			server_version;
 
 	/* Attach to slot */
 	logicalrep_worker_attach(worker_slot);
@@ -3208,15 +3503,59 @@ ApplyWorkerMain(Datum main_arg)
 	options.logical = true;
 	options.startpoint = origin_startpos;
 	options.slotname = myslotname;
+
+	server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
 	options.proto.logical.proto_version =
-		walrcv_server_version(LogRepWorkerWalRcvConn) >= 140000 ?
-		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
+		server_version >= 150000 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
+		server_version >= 140000 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
+		LOGICALREP_PROTO_VERSION_NUM;
+
 	options.proto.logical.publication_names = MySubscription->publications;
 	options.proto.logical.binary = MySubscription->binary;
 	options.proto.logical.streaming = MySubscription->stream;
+	options.proto.logical.twophase = false;
+
+	if (!am_tablesync_worker())
+	{
+		/*
+		 * Even when two_phase mode is requested by the user, the tri-state
+		 * stays PENDING until all tablesyncs have reached READY state. Only
+		 * then can it become ENABLED.
+		 *
+		 * Note: If the subscription has no tables then leave the state as
+		 * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+		 * work.
+		 */
+		if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+			AllTablesyncsReady())
+		{
+			/* Start streaming with two_phase enabled */
+			options.proto.logical.twophase = true;
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
 
-	/* Start normal logical streaming replication. */
-	walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+			StartTransactionCommand();
+			UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
+			MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED;
+			CommitTransactionCommand();
+		}
+		else
+		{
+			walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+		}
+
+		ereport(DEBUG1,
+				(errmsg("logical replication apply worker for subscription \"%s\" two_phase is %s",
+						MySubscription->name,
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+						MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+						"?")));
+	}
+	else
+	{
+		/* Start normal logical streaming replication. */
+		walrcv_startstreaming(LogRepWorkerWalRcvConn, &options);
+	}
 
 	/* Run the main loop. */
 	LogicalRepApplyLoop(origin_startpos);
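
For readers following the worker.c changes above, here is a minimal standalone sketch of two small pieces of that logic: the protocol version chosen from the publisher's server version, and the "pg_gid_<subid>_<xid>" GID format. It is not the patch code. The function names select_proto_version and form_twophase_gid, the PROTO_* macros, and the use of plain uint32_t in place of Oid and TransactionId are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Protocol version numbers, mirroring the values this patch puts in logicalproto.h. */
#define PROTO_VERSION           1
#define PROTO_STREAM_VERSION    2
#define PROTO_TWOPHASE_VERSION  3

/* Pick the highest protocol version the publisher is expected to understand. */
int
select_proto_version(int server_version)
{
	if (server_version >= 150000)
		return PROTO_TWOPHASE_VERSION;
	if (server_version >= 140000)
		return PROTO_STREAM_VERSION;
	return PROTO_VERSION;
}

/*
 * Build "pg_gid_<subid>_<xid>" in the caller's buffer.
 * Returns the length snprintf wanted to write, so the caller can detect
 * truncation by comparing the result against szgid.
 */
int
form_twophase_gid(uint32_t subid, uint32_t xid, char *gid, size_t szgid)
{
	assert(subid != 0);			/* 0 stands in for InvalidOid here */
	return snprintf(gid, szgid, "pg_gid_%u_%u",
					(unsigned int) subid, (unsigned int) xid);
}
```

Because the GID is derived only from the subscription OID and the remote xid, the apply worker can recompute it when COMMIT PREPARED or ROLLBACK PREPARED arrives instead of remembering it from the PREPARE.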
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index abd5217..e4314af 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -51,6 +51,16 @@ static void pgoutput_message(LogicalDecodingContext *ctx,
 							 Size sz, const char *message);
 static bool pgoutput_origin_filter(LogicalDecodingContext *ctx,
 								   RepOriginId origin_id);
+static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx,
+									   ReorderBufferTXN *txn);
+static void pgoutput_prepare_txn(LogicalDecodingContext *ctx,
+								 ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
+static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx,
+										 ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+										   ReorderBufferTXN *txn,
+										   XLogRecPtr prepare_end_lsn,
+										   TimestampTz prepare_time);
 static void pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 								  ReorderBufferTXN *txn);
 static void pgoutput_stream_stop(struct LogicalDecodingContext *ctx,
@@ -70,6 +80,9 @@ static void publication_invalidation_cb(Datum arg, int cacheid,
 										uint32 hashvalue);
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
 									LogicalDecodingContext *ctx);
+static void send_repl_origin(LogicalDecodingContext *ctx,
+							 RepOriginId origin_id, XLogRecPtr origin_lsn,
+							 bool send_origin);
 
 /*
  * Entry in the map used to remember which relation schemas we sent.
@@ -145,6 +158,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->truncate_cb = pgoutput_truncate;
 	cb->message_cb = pgoutput_message;
 	cb->commit_cb = pgoutput_commit_txn;
+
+	cb->begin_prepare_cb = pgoutput_begin_prepare_txn;
+	cb->prepare_cb = pgoutput_prepare_txn;
+	cb->commit_prepared_cb = pgoutput_commit_prepared_txn;
+	cb->rollback_prepared_cb = pgoutput_rollback_prepared_txn;
 	cb->filter_by_origin_cb = pgoutput_origin_filter;
 	cb->shutdown_cb = pgoutput_shutdown;
 
@@ -156,6 +174,8 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_change_cb = pgoutput_change;
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
+	/* transaction streaming - two-phase commit */
+	cb->stream_prepare_cb = NULL;
 }
 
 static void
@@ -167,10 +187,12 @@ parse_output_parameters(List *options, PGOutputData *data)
 	bool		binary_option_given = false;
 	bool		messages_option_given = false;
 	bool		streaming_given = false;
+	bool		two_phase_option_given = false;
 
 	data->binary = false;
 	data->streaming = false;
 	data->messages = false;
+	data->two_phase = false;
 
 	foreach(lc, options)
 	{
@@ -246,8 +268,29 @@ parse_output_parameters(List *options, PGOutputData *data)
 
 			data->streaming = defGetBoolean(defel);
 		}
+		else if (strcmp(defel->defname, "two_phase") == 0)
+		{
+			if (two_phase_option_given)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			two_phase_option_given = true;
+
+			data->two_phase = defGetBoolean(defel);
+		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
+
+		/*
+		 * Do additional checking for the disallowed combination of two_phase
+		 * and streaming. While streaming and two_phase can theoretically be
+		 * supported, it needs more analysis to allow them together.
+		 */
+		if (data->two_phase && data->streaming)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("%s and %s are mutually exclusive options",
+							"two_phase", "streaming")));
 	}
 }
 
@@ -319,6 +362,27 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 		/* Also remember we're currently not streaming any transaction. */
 		in_streaming = false;
 
+		/*
+		 * Here we only record whether the two-phase option was passed in;
+		 * the decision to enable it is made at a later point. It remains
+		 * enabled if a previous start-up already enabled it. The option is
+		 * accepted only with a sufficient protocol version, and only when
+		 * the output plugin supports it.
+		 */
+		if (!data->two_phase)
+			ctx->twophase_opt_given = false;
+		else if (data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
+							data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
+		else if (!ctx->twophase)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("two-phase commit requested, but not supported by output plugin")));
+		else
+			ctx->twophase_opt_given = true;
+
 		/* Init publication state. */
 		data->publications = NIL;
 		publications_valid = false;
@@ -331,8 +395,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 	}
 	else
 	{
-		/* Disable the streaming during the slot initialization mode. */
+		/*
+		 * Disable the streaming and prepared transactions during the slot
+		 * initialization mode.
+		 */
 		ctx->streaming = false;
+		ctx->twophase = false;
 	}
 }
 
@@ -347,29 +415,8 @@ pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_begin(ctx->out, txn);
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		/*----------
-		 * XXX: which behaviour do we want here?
-		 *
-		 * Alternatives:
-		 *	- don't send origin message if origin name not found
-		 *	  (that's what we do now)
-		 *	- throw error - that will break replication, not good
-		 *	- send some special "unknown" origin
-		 *----------
-		 */
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, txn->origin_lsn);
-		}
-
-	}
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 }
@@ -389,6 +436,68 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 
 /*
+ * BEGIN PREPARE callback
+ */
+static void
+pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+	bool		send_replication_origin = txn->origin_id != InvalidRepOriginId;
+
+	OutputPluginPrepareWrite(ctx, !send_replication_origin);
+	logicalrep_write_begin_prepare(ctx->out, txn);
+
+	send_repl_origin(ctx, txn->origin_id, txn->origin_lsn,
+					 send_replication_origin);
+
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * PREPARE callback
+ */
+static void
+pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+					 XLogRecPtr prepare_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * COMMIT PREPARED callback
+ */
+static void
+pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+							 XLogRecPtr commit_lsn)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_commit_prepared(ctx->out, txn, commit_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
+ * ROLLBACK PREPARED callback
+ */
+static void
+pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx,
+							   ReorderBufferTXN *txn,
+							   XLogRecPtr prepare_end_lsn,
+							   TimestampTz prepare_time)
+{
+	OutputPluginUpdateProgress(ctx);
+
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_rollback_prepared(ctx->out, txn, prepare_end_lsn,
+									   prepare_time);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Write the current schema of the relation and its ancestor (if any) if not
  * done yet.
  */
@@ -839,18 +948,8 @@ pgoutput_stream_start(struct LogicalDecodingContext *ctx,
 	OutputPluginPrepareWrite(ctx, !send_replication_origin);
 	logicalrep_write_stream_start(ctx->out, txn->xid, !rbtxn_is_streamed(txn));
 
-	if (send_replication_origin)
-	{
-		char	   *origin;
-
-		if (replorigin_by_oid(txn->origin_id, true, &origin))
-		{
-			/* Message boundary */
-			OutputPluginWrite(ctx, false);
-			OutputPluginPrepareWrite(ctx, true);
-			logicalrep_write_origin(ctx->out, origin, InvalidXLogRecPtr);
-		}
-	}
+	send_repl_origin(ctx, txn->origin_id, InvalidXLogRecPtr,
+					 send_replication_origin);
 
 	OutputPluginWrite(ctx, true);
 
@@ -1270,3 +1369,33 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 		entry->pubactions.pubtruncate = false;
 	}
 }
+
+/* Send Replication origin */
+static void
+send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
+				 XLogRecPtr origin_lsn, bool send_origin)
+{
+	if (send_origin)
+	{
+		char	   *origin;
+
+		/*----------
+		 * XXX: which behaviour do we want here?
+		 *
+		 * Alternatives:
+		 *  - don't send origin message if origin name not found
+		 *    (that's what we do now)
+		 *  - throw error - that will break replication, not good
+		 *  - send some special "unknown" origin
+		 *----------
+		 */
+		if (replorigin_by_oid(origin_id, true, &origin))
+		{
+			/* Message boundary */
+			OutputPluginWrite(ctx, false);
+			OutputPluginPrepareWrite(ctx, true);
+
+			logicalrep_write_origin(ctx->out, origin, origin_lsn);
+		}
+	}
+}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 8c18b4e..33b85d8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -283,6 +283,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
 	slot->data.database = db_specific ? MyDatabaseId : InvalidOid;
 	slot->data.persistency = persistency;
 	slot->data.two_phase = two_phase;
+	slot->data.two_phase_at = InvalidXLogRecPtr;
 
 	/* and then data only present in shared memory */
 	slot->just_dirtied = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2be9ad9..9a2bc37 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -370,7 +370,7 @@ WalReceiverMain(void)
 					 "pg_walreceiver_%lld",
 					 (long long int) walrcv_get_backend_pid(wrconn));
 
-			walrcv_create_slot(wrconn, slotname, true, 0, NULL);
+			walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
 
 			SpinLockAcquire(&walrcv->mutex);
 			strlcpy(walrcv->slotname, slotname, NAMEDATALEN);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 3211521..912144c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -51,6 +51,7 @@
 #include "catalog/pg_largeobject_d.h"
 #include "catalog/pg_largeobject_metadata_d.h"
 #include "catalog/pg_proc_d.h"
+#include "catalog/pg_subscription.h"
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
@@ -4320,6 +4321,7 @@ getSubscriptions(Archive *fout)
 	int			i_subname;
 	int			i_rolname;
 	int			i_substream;
+	int			i_subtwophasestate;
 	int			i_subconninfo;
 	int			i_subslotname;
 	int			i_subsynccommit;
@@ -4363,9 +4365,16 @@ getSubscriptions(Archive *fout)
 		appendPQExpBufferStr(query, " false AS subbinary,\n");
 
 	if (fout->remoteVersion >= 140000)
-		appendPQExpBufferStr(query, " s.substream\n");
+		appendPQExpBufferStr(query, " s.substream,\n");
 	else
-		appendPQExpBufferStr(query, " false AS substream\n");
+		appendPQExpBufferStr(query, " false AS substream,\n");
+
+	if (fout->remoteVersion >= 150000)
+		appendPQExpBufferStr(query, " s.subtwophasestate\n");
+	else
+		appendPQExpBuffer(query,
+						  " '%c' AS subtwophasestate\n",
+						  LOGICALREP_TWOPHASE_STATE_DISABLED);
 
 	appendPQExpBufferStr(query,
 						 "FROM pg_subscription s\n"
@@ -4386,6 +4395,7 @@ getSubscriptions(Archive *fout)
 	i_subpublications = PQfnumber(res, "subpublications");
 	i_subbinary = PQfnumber(res, "subbinary");
 	i_substream = PQfnumber(res, "substream");
+	i_subtwophasestate = PQfnumber(res, "subtwophasestate");
 
 	subinfo = pg_malloc(ntups * sizeof(SubscriptionInfo));
 
@@ -4411,6 +4421,8 @@ getSubscriptions(Archive *fout)
 			pg_strdup(PQgetvalue(res, i, i_subbinary));
 		subinfo[i].substream =
 			pg_strdup(PQgetvalue(res, i, i_substream));
+		subinfo[i].subtwophasestate =
+			pg_strdup(PQgetvalue(res, i, i_subtwophasestate));
 
 		if (strlen(subinfo[i].rolname) == 0)
 			pg_log_warning("owner of subscription \"%s\" appears to be invalid",
@@ -4438,6 +4450,7 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	char	  **pubnames = NULL;
 	int			npubnames = 0;
 	int			i;
+	char		two_phase_disabled[] = {LOGICALREP_TWOPHASE_STATE_DISABLED, '\0'};
 
 	if (!(subinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 		return;
@@ -4479,6 +4492,9 @@ dumpSubscription(Archive *fout, const SubscriptionInfo *subinfo)
 	if (strcmp(subinfo->substream, "f") != 0)
 		appendPQExpBufferStr(query, ", streaming = on");
 
+	if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
+		appendPQExpBufferStr(query, ", two_phase = on");
+
 	if (strcmp(subinfo->subsynccommit, "off") != 0)
 		appendPQExpBuffer(query, ", synchronous_commit = %s", fmtId(subinfo->subsynccommit));
 
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index ba9bc6d..efb8c30 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -639,6 +639,7 @@ typedef struct _SubscriptionInfo
 	char	   *subslotname;
 	char	   *subbinary;
 	char	   *substream;
+	char	   *subtwophasestate;
 	char	   *subsynccommit;
 	char	   *subpublications;
 } SubscriptionInfo;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2abf255..ba658f7 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6389,7 +6389,7 @@ describeSubscriptions(const char *pattern, bool verbose)
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
 	static const bool translate_columns[] = {false, false, false, false,
-	false, false, false, false};
+	false, false, false, false, false};
 
 	if (pset.sversion < 100000)
 	{
@@ -6423,6 +6423,12 @@ describeSubscriptions(const char *pattern, bool verbose)
 							  gettext_noop("Binary"),
 							  gettext_noop("Streaming"));
 
+		/* Two_phase is only supported in v15 and higher */
+		if (pset.sversion >= 150000)
+			appendPQExpBuffer(&buf,
+							  ", subtwophasestate AS \"%s\"\n",
+							  gettext_noop("Two phase commit"));
+
 		appendPQExpBuffer(&buf,
 						  ",  subsynccommit AS \"%s\"\n"
 						  ",  subconninfo AS \"%s\"\n",
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 0ebd5aa..d6bf725 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2764,7 +2764,7 @@ psql_completion(const char *text, int start, int end)
 	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
 		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
 					  "enabled", "slot_name", "streaming",
-					  "synchronous_commit");
+					  "synchronous_commit", "two_phase");
 
 /* CREATE TRIGGER --- is allowed inside CREATE SCHEMA, so use TailMatches */
 
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 91786da..e27e1a8 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -58,4 +58,6 @@ extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
 extern void PrepareRedoRemove(TransactionId xid, bool giveWarning);
 extern void restoreTwoPhaseData(void);
+extern bool LookupGXact(const char *gid, XLogRecPtr prepare_at_lsn,
+						TimestampTz origin_prepare_timestamp);
 #endif							/* TWOPHASE_H */
diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h
index 750d469..2106149 100644
--- a/src/include/catalog/pg_subscription.h
+++ b/src/include/catalog/pg_subscription.h
@@ -22,6 +22,14 @@
 
 #include "nodes/pg_list.h"
 
+/*
+ * two_phase tri-state values. See comments atop worker.c to know more about
+ * these states.
+ */
+#define LOGICALREP_TWOPHASE_STATE_DISABLED 'd'
+#define LOGICALREP_TWOPHASE_STATE_PENDING 'p'
+#define LOGICALREP_TWOPHASE_STATE_ENABLED 'e'
+
 /* ----------------
  *		pg_subscription definition. cpp turns this into
  *		typedef struct FormData_pg_subscription
@@ -57,6 +65,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW
 
 	bool		substream;		/* Stream in-progress transactions. */
 
+	char		subtwophasestate;	/* Stream two-phase transactions */
+
 #ifdef CATALOG_VARLEN			/* variable-length fields start here */
 	/* Connection string to the publisher */
 	text		subconninfo BKI_FORCE_NOT_NULL;
@@ -92,6 +102,7 @@ typedef struct Subscription
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
+	char		twophasestate;	/* Allow streaming two-phase transactions */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 4d20563..632381b 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -87,6 +87,7 @@ extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
 extern void RemoveSubscriptionRel(Oid subid, Oid relid);
 
+extern bool HasSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionRelations(Oid subid);
 extern List *GetSubscriptionNotReadyRelations(Oid subid);
 
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6..e0f513b 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -90,6 +90,16 @@ typedef struct LogicalDecodingContext
 	bool		twophase;
 
 	/*
+	 * Is two-phase option given by output plugin?
+	 *
+	 * This flag indicates that the plugin passed in the two-phase option as
+	 * part of the START_REPLICATION command. We can't rely solely on the
+	 * twophase flag, which only tells whether the plugin provided all the
+	 * necessary two-phase callbacks.
+	 */
+	bool		twophase_opt_given;
+
+	/*
 	 * State for writing output.
 	 */
 	bool		accept_writes;
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 55b90c0..63de90d 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -13,6 +13,7 @@
 #ifndef LOGICAL_PROTO_H
 #define LOGICAL_PROTO_H
 
+#include "access/xact.h"
 #include "replication/reorderbuffer.h"
 #include "utils/rel.h"
 
@@ -26,12 +27,16 @@
  * connect time.
  *
  * LOGICALREP_PROTO_STREAM_VERSION_NUM is the minimum protocol version with
- * support for streaming large transactions.
+ * support for streaming large transactions. Introduced in PG14.
+ *
+ * LOGICALREP_PROTO_TWOPHASE_VERSION_NUM is the minimum protocol version with
+ * support for two-phase commit decoding (at prepare time). Introduced in PG15.
  */
 #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
 #define LOGICALREP_PROTO_VERSION_NUM 1
 #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
 
 /*
  * Logical message types
@@ -55,6 +60,10 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_RELATION = 'R',
 	LOGICAL_REP_MSG_TYPE = 'Y',
 	LOGICAL_REP_MSG_MESSAGE = 'M',
+	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
+	LOGICAL_REP_MSG_PREPARE = 'P',
+	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
+	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r',
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
@@ -122,6 +131,48 @@ typedef struct LogicalRepCommitData
 	TimestampTz committime;
 } LogicalRepCommitData;
 
+/*
+ * Prepared transaction protocol information for begin_prepare and prepare.
+ */
+typedef struct LogicalRepPreparedTxnData
+{
+	XLogRecPtr	prepare_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz prepare_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepPreparedTxnData;
+
+/*
+ * Prepared transaction protocol information for commit prepared.
+ */
+typedef struct LogicalRepCommitPreparedTxnData
+{
+	XLogRecPtr	commit_lsn;
+	XLogRecPtr	end_lsn;
+	TimestampTz commit_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepCommitPreparedTxnData;
+
+/*
+ * Rollback Prepared transaction protocol information. The prepare details
+ * prepare_end_lsn and prepare_time are used to check whether the downstream
+ * has received this prepared transaction. If it has, it can apply the
+ * rollback; otherwise, it skips the rollback operation. The gid alone is not
+ * sufficient because the downstream node can have a prepared transaction
+ * with the same identifier.
+ */
+typedef struct LogicalRepRollbackPreparedTxnData
+{
+	XLogRecPtr	prepare_end_lsn;
+	XLogRecPtr	rollback_end_lsn;
+	TimestampTz prepare_time;
+	TimestampTz rollback_time;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+} LogicalRepRollbackPreparedTxnData;
+
 extern void logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn);
 extern void logicalrep_read_begin(StringInfo in,
 								  LogicalRepBeginData *begin_data);
@@ -129,6 +180,24 @@ extern void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 									XLogRecPtr commit_lsn);
 extern void logicalrep_read_commit(StringInfo in,
 								   LogicalRepCommitData *commit_data);
+extern void logicalrep_write_begin_prepare(StringInfo out, ReorderBufferTXN *txn);
+extern void logicalrep_read_begin_prepare(StringInfo in,
+										  LogicalRepPreparedTxnData *begin_data);
+extern void logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+									 XLogRecPtr prepare_lsn);
+extern void logicalrep_read_prepare(StringInfo in,
+									LogicalRepPreparedTxnData *prepare_data);
+extern void logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn,
+											 XLogRecPtr commit_lsn);
+extern void logicalrep_read_commit_prepared(StringInfo in,
+											LogicalRepCommitPreparedTxnData *prepare_data);
+extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
+											   XLogRecPtr prepare_end_lsn,
+											   TimestampTz prepare_time);
+extern void logicalrep_read_rollback_prepared(StringInfo in,
+											  LogicalRepRollbackPreparedTxnData *rollback_data);
+
+
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
 extern char *logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn);
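
The comment on LogicalRepRollbackPreparedTxnData explains why the gid alone cannot decide whether to apply a ROLLBACK PREPARED. A minimal sketch of that matching rule follows. The LocalPrepared struct and matches_rollback function are hypothetical stand-ins, not the LookupGXact implementation, and plain integer types replace XLogRecPtr and TimestampTz.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of a transaction prepared locally by the apply worker. */
typedef struct
{
	const char *gid;
	uint64_t	prepare_end_lsn;	/* origin LSN recorded at PREPARE */
	int64_t		prepare_time;		/* origin timestamp recorded at PREPARE */
} LocalPrepared;

/*
 * Return true only when the gid, the prepare end LSN and the prepare time
 * all agree with the values carried by the ROLLBACK PREPARED message.
 * A gid match alone could refer to a different prepared transaction.
 */
bool
matches_rollback(const LocalPrepared *p, const char *gid,
				 uint64_t prepare_end_lsn, int64_t prepare_time)
{
	return strcmp(p->gid, gid) == 0 &&
		p->prepare_end_lsn == prepare_end_lsn &&
		p->prepare_time == prepare_time;
}
```

When the check fails, the subscriber never prepared this transaction, so there is nothing to roll back; per apply_handle_rollback_prepared it still advances the flush position.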
diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h
index 51e7c03..0dc460f 100644
--- a/src/include/replication/pgoutput.h
+++ b/src/include/replication/pgoutput.h
@@ -27,6 +27,7 @@ typedef struct PGOutputData
 	bool		binary;
 	bool		streaming;
 	bool		messages;
+	bool		two_phase;
 } PGOutputData;
 
 #endif							/* PGOUTPUT_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index ba257d8..5b40ff7 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -297,7 +297,11 @@ typedef struct ReorderBufferTXN
 	 * Commit or Prepare time, only known when we read the actual commit or
 	 * prepare record.
 	 */
-	TimestampTz commit_time;
+	union
+	{
+		TimestampTz commit_time;
+		TimestampTz prepare_time;
+	}			xact_time;
 
 	/*
 	 * The base snapshot is used to decode all changes until either this
@@ -636,7 +640,7 @@ void		ReorderBufferCommit(ReorderBuffer *, TransactionId,
 								TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
 void		ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
 										XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
-										XLogRecPtr initial_consistent_point,
+										XLogRecPtr two_phase_at,
 										TimestampTz commit_time,
 										RepOriginId origin_id, XLogRecPtr origin_lsn,
 										char *gid, bool is_commit);
diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
index 2eb7e3a..34d95ea 100644
--- a/src/include/replication/slot.h
+++ b/src/include/replication/slot.h
@@ -84,11 +84,10 @@ typedef struct ReplicationSlotPersistentData
 	XLogRecPtr	confirmed_flush;
 
 	/*
-	 * LSN at which we found a consistent point at the time of slot creation.
-	 * This is also the point where we have exported a snapshot for the
-	 * initial copy.
+	 * LSN at which we enabled two_phase commit for this slot or LSN at which
+	 * we found a consistent point at the time of slot creation.
 	 */
-	XLogRecPtr	initial_consistent_point;
+	XLogRecPtr	two_phase_at;
 
 	/*
 	 * Allow decoding of prepared transactions?
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index fbabce6..de72124 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -62,7 +62,7 @@ extern void CheckPointSnapBuild(void);
 extern SnapBuild *AllocateSnapshotBuilder(struct ReorderBuffer *cache,
 										  TransactionId xmin_horizon, XLogRecPtr start_lsn,
 										  bool need_full_snapshot,
-										  XLogRecPtr initial_consistent_point);
+										  XLogRecPtr two_phase_at);
 extern void FreeSnapshotBuilder(SnapBuild *cache);
 
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
@@ -76,7 +76,8 @@ extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder,
 											TransactionId xid);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *snapstate, XLogRecPtr ptr);
-extern XLogRecPtr SnapBuildInitialConsistentPoint(SnapBuild *builder);
+extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
+extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
 extern void SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn,
 							   TransactionId xid, int nsubxacts,
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25..0b607ed 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -181,6 +181,8 @@ typedef struct
 			List	   *publication_names;	/* String list of publications */
 			bool		binary; /* Ask publisher to use binary */
 			bool		streaming;	/* Streaming of large transactions */
+			bool		twophase;	/* Streaming of two-phase transactions at
+									 * prepare time */
 		}			logical;
 	}			proto;
 } WalRcvStreamOptions;
@@ -347,6 +349,7 @@ typedef void (*walrcv_send_fn) (WalReceiverConn *conn,
 typedef char *(*walrcv_create_slot_fn) (WalReceiverConn *conn,
 										const char *slotname,
 										bool temporary,
+										bool two_phase,
 										CRSSnapshotAction snapshot_action,
 										XLogRecPtr *lsn);
 
@@ -420,8 +423,8 @@ extern PGDLLIMPORT WalReceiverFunctionsType *WalReceiverFunctions;
 	WalReceiverFunctions->walrcv_receive(conn, buffer, wait_fd)
 #define walrcv_send(conn, buffer, nbytes) \
 	WalReceiverFunctions->walrcv_send(conn, buffer, nbytes)
-#define walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn) \
-	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, snapshot_action, lsn)
+#define walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn) \
+	WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, snapshot_action, lsn)
 #define walrcv_get_backend_pid(conn) \
 	WalReceiverFunctions->walrcv_get_backend_pid(conn)
 #define walrcv_exec(conn, exec, nRetTypes, retTypes) \
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 179eb43..41c7487 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -86,6 +86,9 @@ extern void ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
 											  char *originname, int szorgname);
 extern char *LogicalRepSyncTableStart(XLogRecPtr *origin_startpos);
 
+extern bool AllTablesyncsReady(void);
+extern void UpdateTwoPhaseState(Oid suboid, char new_state);
+
 void		process_syncing_tables(XLogRecPtr current_lsn);
 void		invalidate_syncing_table_states(Datum arg, int cacheid,
 											uint32 hashvalue);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 57f7dd9..ad6b4e4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -76,10 +76,10 @@ ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
 ERROR:  invalid connection string syntax: missing "=" after "foobar" in connection info string
 
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET PUBLICATION testpub2, testpub3 WITH (refresh = false);
@@ -91,10 +91,10 @@ ERROR:  subscription "regress_doesnotexist" does not exist
 ALTER SUBSCRIPTION regress_testsub SET (create_slot = false);
 ERROR:  unrecognized subscription parameter: "create_slot"
 \dRs+
-                                                                List of subscriptions
-      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | off                | dbname=regress_doesnotexist2
+                                                                          List of subscriptions
+      Name       |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+-----------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | off                | dbname=regress_doesnotexist2
 (1 row)
 
 BEGIN;
@@ -126,10 +126,10 @@ ALTER SUBSCRIPTION regress_testsub_foo SET (synchronous_commit = foobar);
 ERROR:  invalid value for parameter "synchronous_commit": "foobar"
 HINT:  Available values: local, remote_write, remote_apply, on, off.
 \dRs+
-                                                                  List of subscriptions
-        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Synchronous commit |           Conninfo           
----------------------+---------------------------+---------+---------------------+--------+-----------+--------------------+------------------------------
- regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | local              | dbname=regress_doesnotexist2
+                                                                            List of subscriptions
+        Name         |           Owner           | Enabled |     Publication     | Binary | Streaming | Two phase commit | Synchronous commit |           Conninfo           
+---------------------+---------------------------+---------+---------------------+--------+-----------+------------------+--------------------+------------------------------
+ regress_testsub_foo | regress_subscription_user | f       | {testpub2,testpub3} | f      | f         | d                | local              | dbname=regress_doesnotexist2
 (1 row)
 
 -- rename back to keep the rest simple
@@ -162,19 +162,19 @@ ERROR:  binary requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, binary = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | t      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (binary = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -185,19 +185,19 @@ ERROR:  streaming requires a Boolean value
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true);
 WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 ALTER SUBSCRIPTION regress_testsub SET (streaming = false);
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication already exists
@@ -212,10 +212,10 @@ ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refr
 ALTER SUBSCRIPTION regress_testsub ADD PUBLICATION testpub1, testpub2 WITH (refresh = false);
 ERROR:  publication "testpub1" is already in subscription "regress_testsub"
 \dRs+
-                                                                    List of subscriptions
-      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-----------------------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | off                | dbname=regress_doesnotexist
+                                                                             List of subscriptions
+      Name       |           Owner           | Enabled |         Publication         | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-----------------------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub,testpub1,testpub2} | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 -- fail - publication used more then once
@@ -233,10 +233,10 @@ ERROR:  unrecognized subscription parameter: "copy_data"
 -- ok - delete publications
 ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1, testpub2 WITH (refresh = false);
 \dRs+
-                                                            List of subscriptions
-      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Synchronous commit |          Conninfo           
------------------+---------------------------+---------+-------------+--------+-----------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | off                | dbname=regress_doesnotexist
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | d                | off                | dbname=regress_doesnotexist
 (1 row)
 
 DROP SUBSCRIPTION regress_testsub;
@@ -263,6 +263,43 @@ ALTER SUBSCRIPTION regress_testsub DISABLE;
 ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+ERROR:  two_phase requires a Boolean value
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+ERROR:  unrecognized subscription parameter: "two_phase"
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+ERROR:  cannot set streaming = true for two-phase enabled subscription
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+\dRs+
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+(1 row)
+
+DROP SUBSCRIPTION regress_testsub;
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+ERROR:  two_phase = true and streaming = true are mutually exclusive options
+\dRs+
+                                            List of subscriptions
+ Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
+------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
+(0 rows)
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 308c098..b732871 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -202,6 +202,31 @@ ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 DROP FUNCTION func;
 
+-- fail - two_phase must be boolean
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = foo);
+
+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
+
+\dRs+
+--fail - alter of two_phase option not supported.
+ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
+
+--fail - cannot set streaming when two_phase enabled
+ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+
+\dRs+
+
+DROP SUBSCRIPTION regress_testsub;
+
+-- fail - two_phase and streaming are mutually exclusive.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+
+\dRs+
+
+
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
new file mode 100644
index 0000000..4c372a6
--- /dev/null
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,359 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full SELECT generate_series(1,10);
+	PREPARE TRANSACTION 'some_initial_data';
+	COMMIT PREPARED 'some_initial_data';");
+
+# Setup structure on subscriber
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub FOR TABLE tab_full");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then COMMIT PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets committed on subscriber
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check that 2PC gets replicated to subscriber
+# then ROLLBACK PREPARED
+###############################
+
+$node_publisher->safe_psql('postgres',"
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that ROLLBACK PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(0), 'Rows rolled back are not on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher and subscriber crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (12);
+    INSERT INTO tab_full VALUES (13);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (12,13);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (subscriber only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (14);
+    INSERT INTO tab_full VALUES (15);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (14,15);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Check that COMMIT PREPARED is decoded properly on crash restart
+# (publisher only crash)
+###############################
+
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_full VALUES (16);
+    INSERT INTO tab_full VALUES (17);
+    PREPARE TRANSACTION 'test_prepared_tab';");
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (16,17);");
+is($result, qq(2), 'Rows inserted via 2PC are visible on the subscriber');
+
+###############################
+# Test nested transaction with 2PC
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# COMMIT
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check the transaction state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber');
+
+# check inserts are visible. 22 should be rolled back. 21 should be committed.
+$result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are on the subscriber');
+
+###############################
+# Test using empty GID
+###############################
+
+# check that 2PC gets replicated to subscriber
+$node_publisher->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (51);
+	PREPARE TRANSACTION '';");
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# ROLLBACK
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';");
+
+# check that 2PC gets aborted on subscriber
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# copy_data=false and two_phase
+###############################
+
+# Create some test tables for copy tests
+$node_publisher->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_publisher->safe_psql('postgres', "INSERT INTO tab_copy SELECT generate_series(1,5);");
+$node_subscriber->safe_psql('postgres', "CREATE TABLE tab_copy (a int PRIMARY KEY)");
+$node_subscriber->safe_psql('postgres', "INSERT INTO tab_copy VALUES (88);");
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Setup logical replication
+$node_publisher->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_copy FOR TABLE tab_copy;");
+
+my $appname_copy = 'appname_copy';
+$node_subscriber->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_copy
+	CONNECTION '$publisher_connstr application_name=$appname_copy'
+	PUBLICATION tap_pub_copy
+	WITH (two_phase=on, copy_data=false);");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Also wait for initial table sync to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+# Check that the initial table data was NOT replicated (because we said copy_data=false)
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(1), 'initial data in subscriber table');
+
+# Now do a prepare on publisher and check that it IS replicated
+$node_publisher->safe_psql('postgres', "
+    BEGIN;
+    INSERT INTO tab_copy VALUES (99);
+    PREPARE TRANSACTION 'mygid';");
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+# Check that the transaction has been prepared on the subscriber. There will be
+# 2 prepared transactions, one for each of the 2 subscriptions.
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(2), 'transaction is prepared on subscriber');
+
+# Now commit the insert and verify that it IS replicated
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'mygid';");
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(6), 'publisher inserted data');
+
+$node_publisher->wait_for_catchup($appname_copy);
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;");
+is($result, qq(2), 'replicated data in subscriber table');
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_copy;");
+$node_publisher->safe_psql('postgres', "DROP PUBLICATION tap_pub_copy;");
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
new file mode 100644
index 0000000..e61d28a
--- /dev/null
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_C->start;
+
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_A->safe_psql('postgres', "
+	INSERT INTO tab_full SELECT generate_series(1,10);");
+
+# Create the same tables on node_B and node_C
+$node_B->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE tab_full (a int PRIMARY KEY)");
+
+# Setup logical replication
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then COMMIT PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (11);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 11;");
+is($result, qq(1), 'Row inserted via 2PC has committed on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# check that 2PC gets replicated to subscriber(s)
+# then ROLLBACK PREPARED
+###############################
+
+# 2PC PREPARE
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (12);
+	PREPARE TRANSACTION 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab_full';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a = 12;");
+is($result, qq(0), 'Row inserted via 2PC is not present on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test nested transactions with 2PC
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO tab_full VALUES (21);
+	SAVEPOINT sp_inner;
+	INSERT INTO tab_full VALUES (22);
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# 22 should be rolled back.
+# 21 should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
+is($result, qq(21), 'Rows committed are present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9a0936e..227f92c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1390,12 +1390,15 @@ LogicalOutputPluginWriterUpdateProgress
 LogicalOutputPluginWriterWrite
 LogicalRepBeginData
 LogicalRepCommitData
+LogicalRepCommitPreparedTxnData
 LogicalRepCtxStruct
 LogicalRepMsgType
 LogicalRepPartMapEntry
+LogicalRepPreparedTxnData
 LogicalRepRelId
 LogicalRepRelMapEntry
 LogicalRepRelation
+LogicalRepRollbackPreparedTxnData
 LogicalRepTupleData
 LogicalRepTyp
 LogicalRepWorker
-- 
1.8.3.1

#379Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#378)

On Sun, Jul 11, 2021 at 8:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 9, 2021 at 4:43 AM Peter Smith <smithpb2250@gmail.com> wrote:

The patch looks good to me, I don't have any comments.

I tried the v95-0001 patch.

- The patch applied cleanly and all build / testing was OK.
- The documentation also builds OK.
- I checked all v95-0001 / v93-0001 differences and found no problems.
- Furthermore, I noted that v95-0001 patch is passing the cfbot [1].

So this patch LGTM.

Thanks, I took another pass over it and made a few changes in docs and
comments. I am planning to push this next week sometime (by 14th July)
unless there are more comments from you or someone else. Just to
summarize, this patch will add support for prepared transactions to
built-in logical replication. To add support for streaming
transactions at prepare time into the
built-in logical replication, we need to do the following things: (a)
Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol. (b) Modify
the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare. (c) Add a new SUBSCRIPTION
option "two_phase" to allow users to enable
two-phase transactions. We enable the two_phase once the initial data
sync is over. Refer to comments atop worker.c in the patch and commit
message to see further details about this patch. After this patch,
there is a follow-up patch to allow streaming and two-phase options
together which I feel needs some more review and can be committed
separately.
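
As a rough illustration of the rule that two_phase is enabled only once the initial data sync is over, the state change can be modelled as below. This is a simplified sketch, not the code in worker.c: the type and function names are hypothetical, the per-table sync status is reduced to a single ready flag, and only the state letters 'd', 'p' and 'e' (as stored in subtwophasestate) are taken from PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State letters as stored in pg_subscription.subtwophasestate. */
typedef enum TwoPhaseStateSketch
{
    TWOPHASE_DISABLED = 'd',    /* two_phase was not requested */
    TWOPHASE_PENDING = 'p',     /* requested, initial sync still running */
    TWOPHASE_ENABLED = 'e'      /* requested and active */
} TwoPhaseStateSketch;

/* Hypothetical, simplified view of one table's initial-sync status. */
typedef struct TableSyncSketch
{
    const char *relname;
    bool        ready;          /* true once the initial copy has finished */
} TableSyncSketch;

/* True when no table is still being synchronized. */
static bool
all_tables_ready(const TableSyncSketch *tables, size_t ntables)
{
    assert(tables != NULL || ntables == 0);

    for (size_t i = 0; i < ntables; i++)
    {
        if (!tables[i].ready)
            return false;
    }
    return true;
}

/*
 * Only a pending subscription can move to enabled, and only after every
 * table has finished its initial sync.  How the real worker treats a
 * subscription with no tables at all is not modelled here.
 */
static TwoPhaseStateSketch
next_twophase_state(TwoPhaseStateSketch cur,
                    const TableSyncSketch *tables, size_t ntables)
{
    if (cur == TWOPHASE_PENDING && all_tables_ready(tables, ntables))
        return TWOPHASE_ENABLED;
    return cur;
}
```

This is also why the TAP tests poll pg_subscription until no row has subtwophasestate other than 'e' before issuing PREPARE TRANSACTION on the publisher.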

FYI - I repeated the same verification of the v96-0001 patch as I did
previously for v95-0001

- The v96 patch applied cleanly and all build / testing was OK.
- The documentation also builds OK.
- I checked the v95-0001 / v96-0001 differences and found no problems.
- Furthermore, I noted that v96-0001 patch is passing the cfbot.

LGTM.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#380Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#379)

On Mon, Jul 12, 2021 at 9:14 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Sun, Jul 11, 2021 at 8:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 9, 2021 at 4:43 AM Peter Smith <smithpb2250@gmail.com> wrote:

The patch looks good to me, I don't have any comments.

I tried the v95-0001 patch.

- The patch applied cleanly and all build / testing was OK.
- The documentation also builds OK.
- I checked all v95-0001 / v93-0001 differences and found no problems.
- Furthermore, I noted that v95-0001 patch is passing the cfbot [1].

So this patch LGTM.

Thanks, I took another pass over it and made a few changes in docs and
comments. I am planning to push this next week sometime (by 14th July)
unless there are more comments from you or someone else. Just to
summarize, this patch will add support for prepared transactions to
built-in logical replication. To add support for streaming
transactions at prepare time into the
built-in logical replication, we need to do the following things: (a)
Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol. (b) Modify
the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare. (c) Add a new SUBSCRIPTION
option "two_phase" to allow users to enable
two-phase transactions. We enable the two_phase once the initial data
sync is over. Refer to comments atop worker.c in the patch and commit
message to see further details about this patch. After this patch,
there is a follow-up patch to allow streaming and two-phase options
together which I feel needs some more review and can be committed
separately.

FYI - I repeated the same verification of the v96-0001 patch as I did
previously for v95-0001

- The v96 patch applied cleanly and all build / testing was OK.
- The documentation also builds OK.
- I checked the v95-0001 / v96-0001 differences and found no problems.
- Furthermore, I noted that v96-0001 patch is passing the cfbot.

LGTM.

Pushed.

Feel free to submit the remaining patches after rebase. Is it possible
to post patches related to skipping empty transactions in the other
thread [1] where that topic is being discussed?

[1]: /messages/by-id/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com

--
With Regards,
Amit Kapila.

#381Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#380)
1 attachment(s)

On Wed, Jul 14, 2021 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 12, 2021 at 9:14 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Sun, Jul 11, 2021 at 8:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 9, 2021 at 4:43 AM Peter Smith <smithpb2250@gmail.com> wrote:

The patch looks good to me, I don't have any comments.

I tried the v95-0001 patch.

- The patch applied cleanly and all build / testing was OK.
- The documentation also builds OK.
- I checked all v95-0001 / v93-0001 differences and found no problems.
- Furthermore, I noted that v95-0001 patch is passing the cfbot [1].

So this patch LGTM.

Thanks, I took another pass over it and made a few changes in docs and
comments. I am planning to push this next week sometime (by 14th July)
unless there are more comments from you or someone else. Just to
summarize, this patch will add support for prepared transactions to
built-in logical replication. To add support for streaming
transactions at prepare time into the
built-in logical replication, we need to do the following things: (a)
Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol. (b) Modify
the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare. (c) Add a new SUBSCRIPTION
option "two_phase" to allow users to enable
two-phase transactions. We enable the two_phase once the initial data
sync is over. Refer to comments atop worker.c in the patch and commit
message to see further details about this patch. After this patch,
there is a follow-up patch to allow streaming and two-phase options
together which I feel needs some more review and can be committed
separately.

FYI - I repeated the same verification of the v96-0001 patch as I did
previously for v95-0001

- The v96 patch applied cleanly and all build / testing was OK.
- The documentation also builds OK.
- I checked the v95-0001 / v96-0001 differences and found no problems.
- Furthermore, I noted that v96-0001 patch is passing the cfbot.

LGTM.

Pushed.

Feel free to submit the remaining patches after rebase. Is it possible
to post patches related to skipping empty transactions in the other
thread [1] where that topic is being discussed?

[1] - /messages/by-id/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com

Please find attached the latest patch set v97*

* Rebased v94* to HEAD @ today.

This rebase was made necessary by the recent push of the first patch
from this set.

v94-0001 ==> already pushed [1]
v94-0002 ==> v97-0001
v94-0003 ==> will be relocated to other thread [2]
v94-0004 ==> this is omitted for now

----
[1]: https://github.com/postgres/postgres/commit/a8fd13cab0ba815e9925dc9676e6309f699b5f72
[2]: /messages/by-id/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
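
For readers who want to see the wire layout at a glance: the attached patch documents a new Stream Prepare message consisting of Byte1('p'), an Int8 flags field (must be 0), three Int64 fields (prepare LSN, end LSN, prepare timestamp), an Int32 xid and a NUL-terminated GID string. The following is only a standalone sketch of that layout, assuming integers are sent in network byte order. The struct, the helper names and the 200-byte GID buffer are hypothetical and are not the pq_send*/pq_getmsg* routines the patch itself uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GIDSIZE 200      /* hypothetical buffer size for the GID */

/* Fixed part: type byte, flags, three Int64 fields, one Int32 field. */
#define SKETCH_FIXED_LEN (1 + 1 + 8 + 8 + 8 + 4)

typedef struct StreamPrepareSketch
{
    uint64_t    prepare_lsn;    /* LSN of the prepare */
    uint64_t    end_lsn;        /* end LSN of the prepared transaction */
    int64_t     prepare_time;   /* microseconds since 2000-01-01 */
    uint32_t    xid;            /* transaction id */
    char        gid[SKETCH_GIDSIZE];    /* user-defined GID */
} StreamPrepareSketch;

/* Store v big-endian (network byte order) in the first n bytes of buf. */
static void
put_be(unsigned char *buf, uint64_t v, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] = (unsigned char) (v >> (8 * (n - 1 - i)));
}

/* Load an n-byte big-endian integer from buf. */
static uint64_t
get_be(const unsigned char *buf, int n)
{
    uint64_t    v = 0;

    for (int i = 0; i < n; i++)
        v = (v << 8) | buf[i];
    return v;
}

/* Returns the number of bytes written, or 0 if buf is too small. */
static size_t
encode_stream_prepare(unsigned char *buf, size_t buflen,
                      const StreamPrepareSketch *msg)
{
    size_t      gidlen;
    size_t      off = 0;

    assert(buf != NULL && msg != NULL);
    gidlen = strlen(msg->gid) + 1;      /* include the terminating NUL */
    if (buflen < SKETCH_FIXED_LEN + gidlen)
        return 0;

    buf[off++] = 'p';           /* Byte1('p'): Stream Prepare */
    buf[off++] = 0;             /* Int8 flags: currently unused, must be 0 */
    put_be(buf + off, msg->prepare_lsn, 8);
    off += 8;
    put_be(buf + off, msg->end_lsn, 8);
    off += 8;
    put_be(buf + off, (uint64_t) msg->prepare_time, 8);
    off += 8;
    put_be(buf + off, msg->xid, 4);
    off += 4;
    memcpy(buf + off, msg->gid, gidlen);
    return off + gidlen;
}

/* Returns 0 on success, -1 on a malformed message. */
static int
decode_stream_prepare(const unsigned char *buf, size_t len,
                      StreamPrepareSketch *msg)
{
    const unsigned char *gid;
    const unsigned char *nul;

    assert(buf != NULL && msg != NULL);
    if (len <= SKETCH_FIXED_LEN || buf[0] != 'p' || buf[1] != 0)
        return -1;

    msg->prepare_lsn = get_be(buf + 2, 8);
    msg->end_lsn = get_be(buf + 10, 8);
    msg->prepare_time = (int64_t) get_be(buf + 18, 8);
    msg->xid = (uint32_t) get_be(buf + 26, 4);

    /* The GID must be NUL-terminated inside the message and fit the buffer. */
    gid = buf + SKETCH_FIXED_LEN;
    nul = memchr(gid, '\0', len - SKETCH_FIXED_LEN);
    if (nul == NULL || (size_t) (nul - gid) >= SKETCH_GIDSIZE)
        return -1;
    memcpy(msg->gid, gid, (size_t) (nul - gid) + 1);
    return 0;
}
```

Unlike the patch's read function, this decoder bounds the GID copy by the destination size and rejects an oversized or unterminated GID; that check is a choice made for the sketch, not a description of what the patch does.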

Attachments:

v97-0001-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 221078bcba35628a1af8a602dcee1374b1554bd8 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Wed, 14 Jul 2021 17:00:59 +1000
Subject: [PATCH v97] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  10 -
 src/backend/commands/subscriptioncmds.c            |  25 --
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 138 ++++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 11 files changed, 1021 insertions(+), 83 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index e8cb78f..c88ec1e 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare or Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7398,7 +7398,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7661,6 +7661,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1433905..702934e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5f834a9..cdddaac 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -352,25 +352,6 @@ parse_subscription_options(List *stmt_options, bits32 supported_opts, SubOpts *o
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (opts->twophase &&
-		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
-		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
-	{
-		if (opts->streaming &&
-			IsSet(supported_opts, SUBOPT_STREAMING) &&
-			IsSet(opts->specified_opts, SUBOPT_STREAMING))
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-			/*- translator: both %s are strings of the form "option = value" */
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
 }
 
 /*
@@ -940,12 +921,6 @@ AlterSubscription(AlterSubscriptionStmt *stmt, bool isTopLevel)
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..8e03006 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b9a7a7f..d014b3e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -332,7 +332,7 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1041,6 +1041,90 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from same node point to publications on
+	 * the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1256,30 +1340,20 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	if (in_streamed_transaction)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROTOCOL_VIOLATION),
-				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
 	/* Make sure we have an open transaction */
 	begin_replication_step();
 
@@ -1290,7 +1364,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -1311,7 +1385,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1390,6 +1464,32 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes.", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2333,6 +2433,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index e4314af..286119c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase
-		 * and streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 63de90d..d193b41 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index ad6b4e4..34ebca4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback post the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Check that a 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#382vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#381)

On Wed, Jul 14, 2021 at 2:03 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Wed, Jul 14, 2021 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 12, 2021 at 9:14 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Sun, Jul 11, 2021 at 8:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 9, 2021 at 4:43 AM Peter Smith <smithpb2250@gmail.com> wrote:

The patch looks good to me, I don't have any comments.

I tried the v95-0001 patch.

- The patch applied cleanly and all build / testing was OK.
- The documentation also builds OK.
- I checked all v95-0001 / v93-0001 differences and found no problems.
- Furthermore, I noted that v95-0001 patch is passing the cfbot [1].

So this patch LGTM.

Thanks, I took another pass over it and made a few changes in docs and
comments. I am planning to push this next week sometime (by 14th July)
unless there are more comments from you or someone else. Just to
summarize, this patch will add support for prepared transactions to
built-in logical replication. To add support for streaming
transactions at prepare time into the
built-in logical replication, we need to do the following things: (a)
Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol. (b) Modify
the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare. (c) Add a new SUBSCRIPTION
option "two_phase" to allow users to enable
two-phase transactions. We enable the two_phase once the initial data
sync is over. Refer to comments atop worker.c in the patch and commit
message to see further details about this patch. After this patch,
there is a follow-up patch to allow streaming and two-phase options
together which I feel needs some more review and can be committed
separately.

FYI - I repeated the same verification of the v96-0001 patch as I did
previously for v95-0001

- The v96 patch applied cleanly and all build / testing was OK.
- The documentation also builds OK.
- I checked the v95-0001 / v96-0001 differences and found no problems.
- Furthermore, I noted that v96-0001 patch is passing the cfbot.

LGTM.

Pushed.

Feel free to submit the remaining patches after rebase. Is it possible
to post patches related to skipping empty transactions in the other
thread [1] where that topic is being discussed?

[1] - /messages/by-id/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com

Please find attached the latest patch set v97*

* Rebased v94* to HEAD @ today.

Thanks for the updated patch. The patch applies cleanly and the tests pass.
I have a couple of comments:
1) Should we include "stream_prepare_cb" here in the
logicaldecoding-streaming section of the logicaldecoding.sgml
documentation:
To reduce the apply lag caused by large transactions, an output plugin
may provide additional callback to support incremental streaming of
in-progress transactions. There are multiple required streaming
callbacks (stream_start_cb, stream_stop_cb, stream_abort_cb,
stream_commit_cb and stream_change_cb) and two optional callbacks
(stream_message_cb and stream_truncate_cb).

2) Should we add an example for stream_prepare_cb here in
logicaldecoding-streaming section of logicaldecoding.sgml
documentation:
One example sequence of streaming callback calls for one transaction
may look like this:

stream_start_cb(...); <-- start of first block of changes
stream_change_cb(...);
stream_change_cb(...);
stream_message_cb(...);
stream_change_cb(...);
...
stream_change_cb(...);
stream_stop_cb(...); <-- end of first block of changes

stream_start_cb(...); <-- start of second block of changes
stream_change_cb(...);
stream_change_cb(...);
stream_change_cb(...);
...
stream_message_cb(...);
stream_change_cb(...);
stream_stop_cb(...); <-- end of second block of changes

stream_commit_cb(...); <-- commit of the streamed transaction

Regards,
Vignesh
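
For readers following the callback listing above, here is a minimal standalone sketch of how such a sequence could be checked for shape. It is not PostgreSQL code: the `DemoCallback` enum and `sequence_is_well_formed` are hypothetical names invented for illustration. It assumes, as the later v98 documentation hunk in this thread suggests, that a streamed transaction ends either with stream_commit_cb or with stream_prepare_cb followed by commit_prepared_cb. Abort and rollback paths are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the callbacks named in the listing above. */
typedef enum DemoCallback
{
    CB_STREAM_START,
    CB_STREAM_CHANGE,
    CB_STREAM_MESSAGE,
    CB_STREAM_STOP,
    CB_STREAM_COMMIT,
    CB_STREAM_PREPARE,
    CB_COMMIT_PREPARED
} DemoCallback;

/*
 * Return true if the sequence consists of one or more start/stop blocks
 * followed by either a stream commit, or a stream prepare and then a
 * commit prepared. Anything after the final callback is rejected.
 */
static bool
sequence_is_well_formed(const DemoCallback *seq, size_t n)
{
    bool    in_block = false;   /* between stream_start and stream_stop */
    bool    prepared = false;   /* stream_prepare seen, awaiting commit */
    bool    finished = false;   /* final callback already seen */
    size_t  blocks = 0;         /* number of streamed blocks so far */

    assert(seq != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (finished)
            return false;

        switch (seq[i])
        {
            case CB_STREAM_START:
                if (in_block || prepared)
                    return false;
                in_block = true;
                blocks++;
                break;
            case CB_STREAM_CHANGE:
            case CB_STREAM_MESSAGE:
                if (!in_block)
                    return false;
                break;
            case CB_STREAM_STOP:
                if (!in_block)
                    return false;
                in_block = false;
                break;
            case CB_STREAM_COMMIT:
                if (in_block || prepared || blocks == 0)
                    return false;
                finished = true;
                break;
            case CB_STREAM_PREPARE:
                if (in_block || prepared || blocks == 0)
                    return false;
                prepared = true;
                break;
            case CB_COMMIT_PREPARED:
                if (!prepared)
                    return false;
                finished = true;
                break;
        }
    }
    return finished;
}
```

Under these assumptions, a sequence that calls the prepare callback while a block is still open is rejected, while the two-phase ending is accepted only after the last stream_stop_cb.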

#383Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#380)

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed.

Coverity thinks this has security issues, and I agree.

/srv/coverity/git/pgsql-git/postgresql/src/backend/replication/logical/proto.c: 144 in logicalrep_read_begin_prepare()
143 /* read gid (copy it into a pre-allocated buffer) */

CID 1487517: Security best practices violations (STRING_OVERFLOW)
You might overrun the 200-character fixed-size string "begin_data->gid" by copying the return value of "pq_getmsgstring" without checking the length.

144 strcpy(begin_data->gid, pq_getmsgstring(in));

200 /* read gid (copy it into a pre-allocated buffer) */

CID 1487515: Security best practices violations (STRING_OVERFLOW)
You might overrun the 200-character fixed-size string "prepare_data->gid" by copying the return value of "pq_getmsgstring" without checking the length.

201 strcpy(prepare_data->gid, pq_getmsgstring(in));

256 /* read gid (copy it into a pre-allocated buffer) */

CID 1487516: Security best practices violations (STRING_OVERFLOW)
You might overrun the 200-character fixed-size string "prepare_data->gid" by copying the return value of "pq_getmsgstring" without checking the length.

257 strcpy(prepare_data->gid, pq_getmsgstring(in));

316 /* read gid (copy it into a pre-allocated buffer) */

CID 1487519: Security best practices violations (STRING_OVERFLOW)
You might overrun the 200-character fixed-size string "rollback_data->gid" by copying the return value of "pq_getmsgstring" without checking the length.

317 strcpy(rollback_data->gid, pq_getmsgstring(in));

I think you'd be way better off making the gid fields be "char *"
and pstrdup'ing the result of pq_getmsgstring. Another possibility
perhaps is to use strlcpy, but I'd only go that way if it's important
to constrain the received strings to 200 bytes.

regards, tom lane
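
The two options discussed in the message above can be sketched in isolation. This is a minimal standalone illustration, not PostgreSQL code: `bounded_copy` is a hypothetical stand-in that mimics the truncating behaviour of strlcpy (which is not available in every libc), `dup_string` is a hypothetical stand-in for the pstrdup approach using plain `malloc`, and `DEMO_GIDSIZE` is an illustrative size far smaller than the 200-byte buffer in the thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_GIDSIZE 8  /* illustrative only; the thread discusses 200 bytes */

/*
 * Hypothetical stand-in for strlcpy: copy at most size - 1 bytes, always
 * NUL-terminate when size > 0, and return strlen(src) so that the caller
 * can detect truncation (return value >= size).
 */
static size_t
bounded_copy(char *dst, const char *src, size_t size)
{
    size_t  srclen;

    assert(dst != NULL && src != NULL);
    srclen = strlen(src);

    if (size > 0)
    {
        size_t  n = (srclen < size - 1) ? srclen : size - 1;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}

/*
 * Hypothetical stand-in for the "char *" plus pstrdup option: allocate
 * exactly as much as the incoming string needs. The caller frees it.
 */
static char *
dup_string(const char *src)
{
    size_t  len;
    char   *copy;

    assert(src != NULL);
    len = strlen(src) + 1;
    copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, src, len);
    return copy;
}
```

With a fixed buffer, an over-long identifier is silently cut to fit, and the return value is the only signal that this happened. With the duplicating variant nothing is cut, but nothing limits the length either, which is the trade-off the message describes.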

#384Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#383)

On Mon, Jul 19, 2021 at 1:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed.

I think you'd be way better off making the gid fields be "char *"
and pstrdup'ing the result of pq_getmsgstring. Another possibility
perhaps is to use strlcpy, but I'd only go that way if it's important
to constrain the received strings to 200 bytes.

I think it is important to constrain length to 200 bytes for this case
as here we receive a prepared transaction identifier which according
to docs [1] has a max length of 200 bytes. Also, in
ParseCommitRecord() and ParseAbortRecord(), we are using strlcpy with
200 as max length to copy prepare transaction identifier. So, I think
it is better to use strlcpy here unless you or Peter feels otherwise.

[1]: https://www.postgresql.org/docs/devel/sql-prepare-transaction.html

--
With Regards,
Amit Kapila.

#385Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#384)
1 attachment(s)

On Mon, Jul 19, 2021 at 12:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 19, 2021 at 1:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed.

I think you'd be way better off making the gid fields be "char *"
and pstrdup'ing the result of pq_getmsgstring. Another possibility
perhaps is to use strlcpy, but I'd only go that way if it's important
to constrain the received strings to 200 bytes.

I think it is important to constrain length to 200 bytes for this case
as here we receive a prepared transaction identifier which according
to docs [1] has a max length of 200 bytes. Also, in
ParseCommitRecord() and ParseAbortRecord(), we are using strlcpy with
200 as max length to copy prepare transaction identifier. So, I think
it is better to use strlcpy here unless you or Peter feels otherwise.

OK. I have implemented this reported [1] potential buffer overrun
using the constraining strlcpy, because the GID limitation of 200
bytes is already mentioned in the documentation [2].

PSA.

------
[1]: /messages/by-id/161029.1626639923@sss.pgh.pa.us
[2]: https://www.postgresql.org/docs/devel/sql-prepare-transaction.html

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v1-0001-Fix-potential-buffer-overruns.patch (application/octet-stream)
From f167c8bb435e7d7a69d95930917d90243f07117b Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 19 Jul 2021 13:34:03 +1000
Subject: [PATCH v1] Fix potential buffer overruns.

Prevent potential buffer overruns when using strcpy to gid buffer.
Exposed by Coverity. Reported by Tom Lane [1].

[1] https://www.postgresql.org/message-id/161029.1626639923%40sss.pgh.pa.us
---
 src/backend/replication/logical/proto.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..0a48c4d 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -141,7 +141,7 @@ logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_da
 	begin_data->xid = pq_getmsgint(in, 4);
 
 	/* read gid (copy it into a pre-allocated buffer) */
-	strcpy(begin_data->gid, pq_getmsgstring(in));
+	strlcpy(begin_data->gid, pq_getmsgstring(in), GIDSIZE);
 }
 
 /*
@@ -198,7 +198,7 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
 	prepare_data->xid = pq_getmsgint(in, 4);
 
 	/* read gid (copy it into a pre-allocated buffer) */
-	strcpy(prepare_data->gid, pq_getmsgstring(in));
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), GIDSIZE);
 }
 
 /*
@@ -254,7 +254,7 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 	prepare_data->xid = pq_getmsgint(in, 4);
 
 	/* read gid (copy it into a pre-allocated buffer) */
-	strcpy(prepare_data->gid, pq_getmsgstring(in));
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), GIDSIZE);
 }
 
 /*
@@ -314,7 +314,7 @@ logicalrep_read_rollback_prepared(StringInfo in,
 	rollback_data->xid = pq_getmsgint(in, 4);
 
 	/* read gid (copy it into a pre-allocated buffer) */
-	strcpy(rollback_data->gid, pq_getmsgstring(in));
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), GIDSIZE);
 }
 
 /*
-- 
1.8.3.1

#386Greg Nancarrow
gregn4422@gmail.com
In reply to: Peter Smith (#381)

On Wed, Jul 14, 2021 at 6:33 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v97*

I couldn't spot any significant issues in the v97-0001 patch, but
do have the following trivial feedback comments:

(1) doc/src/sgml/protocol.sgml
Suggestion:

BEFORE:
+   contains a Stream Prepare or Stream Commit or Stream Abort message.
AFTER:
+   contains a Stream Prepare, Stream Commit or Stream Abort message.

(2) src/backend/replication/logical/worker.c
It seems a bit weird to add a forward declaration here, without a
comment, like for the one immediately above it

/* Compute GID for two_phase transactions */
static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char
*gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);

(3) src/backend/replication/logical/worker.c
Other DEBUG1 messages don't end with "."

+ elog(DEBUG1, "apply_handle_stream_prepare: replayed %d
(all) changes.", nchanges);

Regards,
Greg Nancarrow
Fujitsu Australia

#387Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#385)

On Mon, Jul 19, 2021 at 9:19 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, Jul 19, 2021 at 12:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 19, 2021 at 1:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed.

I think you'd be way better off making the gid fields be "char *"
and pstrdup'ing the result of pq_getmsgstring. Another possibility
perhaps is to use strlcpy, but I'd only go that way if it's important
to constrain the received strings to 200 bytes.

I think it is important to constrain length to 200 bytes for this case
as here we receive a prepared transaction identifier which according
to docs [1] has a max length of 200 bytes. Also, in
ParseCommitRecord() and ParseAbortRecord(), we are using strlcpy with
200 as max length to copy prepare transaction identifier. So, I think
it is better to use strlcpy here unless you or Peter feels otherwise.

OK. I have implemented this reported [1] potential buffer overrun
using the constraining strlcpy, because the GID limitation of 200
bytes is already mentioned in the documentation [2].

This will work, but I think it is better to use the sizeof the gid
buffer, as we do in ParseCommitRecord() and ParseAbortRecord(). Then,
if for some unforeseen reason we later change the size of the gid buffer
to differ from GIDSIZE, the code will still work seamlessly.

--
With Regards,
Amit Kapila.
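
The point about sizeof in the message above can be illustrated with a small standalone sketch. None of this is PostgreSQL code: `DemoPreparedTxnData`, `demo_gid_capacity` and `demo_read_gid` are hypothetical names, `DEMO_GIDSIZE` stands in for GIDSIZE, and `snprintf` is used as a portable bounded copy in place of strlcpy. The buffer is deliberately declared with a size that differs from the constant, to show which bound still matches the destination.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_GIDSIZE 200    /* stand-in for the named constant */

/* Hypothetical struct; the buffer size intentionally differs from the constant. */
typedef struct DemoPreparedTxnData
{
    unsigned int xid;
    char         gid[DEMO_GIDSIZE + 16];
} DemoPreparedTxnData;

/* The bound is taken from the destination itself, so it follows any resize. */
static size_t
demo_gid_capacity(const DemoPreparedTxnData *data)
{
    return sizeof(data->gid);
}

/* Bounded, always NUL-terminated copy of an incoming identifier. */
static void
demo_read_gid(DemoPreparedTxnData *data, const char *wire_gid)
{
    assert(data != NULL && wire_gid != NULL);
    snprintf(data->gid, sizeof(data->gid), "%s", wire_gid);
}
```

Had the copy been bounded by `DEMO_GIDSIZE` instead, it would still be safe here, but only because the buffer happens to be larger than the constant. If the buffer were ever made smaller, the constant would overstate the available space, whereas sizeof on the member cannot drift from the declaration.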

#388Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#387)
1 attachment(s)

On Mon, Jul 19, 2021 at 4:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 19, 2021 at 9:19 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, Jul 19, 2021 at 12:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 19, 2021 at 1:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed.

I think you'd be way better off making the gid fields be "char *"
and pstrdup'ing the result of pq_getmsgstring. Another possibility
perhaps is to use strlcpy, but I'd only go that way if it's important
to constrain the received strings to 200 bytes.

I think it is important to constrain length to 200 bytes for this case
as here we receive a prepared transaction identifier which according
to docs [1] has a max length of 200 bytes. Also, in
ParseCommitRecord() and ParseAbortRecord(), we are using strlcpy with
200 as max length to copy prepare transaction identifier. So, I think
it is better to use strlcpy here unless you or Peter feels otherwise.

OK. I have implemented this reported [1] potential buffer overrun
using the constraining strlcpy, because the GID limitation of 200
bytes is already mentioned in the documentation [2].

This will work, but I think it is better to use the sizeof the gid
buffer, as we do in ParseCommitRecord() and ParseAbortRecord(). Then,
if for some unforeseen reason we later change the size of the gid buffer
to differ from GIDSIZE, the code will still work seamlessly.

Modified as requested. PSA patch v2.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v2-0001-Fix-potential-buffer-overruns.patch (application/octet-stream)
From 90252434ed58b3607ba73eb79eb46125afcd5099 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 19 Jul 2021 17:15:10 +1000
Subject: [PATCH v2] Fix potential buffer overruns.

Prevent potential buffer overruns when using strcpy to gid buffer.
Exposed by Coverity. Reported by Tom Lane [1].

[1] https://www.postgresql.org/message-id/161029.1626639923%40sss.pgh.pa.us
---
 src/backend/replication/logical/proto.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 13c8c3b..a245252 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -141,7 +141,7 @@ logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_da
 	begin_data->xid = pq_getmsgint(in, 4);
 
 	/* read gid (copy it into a pre-allocated buffer) */
-	strcpy(begin_data->gid, pq_getmsgstring(in));
+	strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));
 }
 
 /*
@@ -198,7 +198,7 @@ logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
 	prepare_data->xid = pq_getmsgint(in, 4);
 
 	/* read gid (copy it into a pre-allocated buffer) */
-	strcpy(prepare_data->gid, pq_getmsgstring(in));
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
 }
 
 /*
@@ -254,7 +254,7 @@ logicalrep_read_commit_prepared(StringInfo in, LogicalRepCommitPreparedTxnData *
 	prepare_data->xid = pq_getmsgint(in, 4);
 
 	/* read gid (copy it into a pre-allocated buffer) */
-	strcpy(prepare_data->gid, pq_getmsgstring(in));
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
 }
 
 /*
@@ -314,7 +314,7 @@ logicalrep_read_rollback_prepared(StringInfo in,
 	rollback_data->xid = pq_getmsgint(in, 4);
 
 	/* read gid (copy it into a pre-allocated buffer) */
-	strcpy(rollback_data->gid, pq_getmsgstring(in));
+	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
 }
 
 /*
-- 
1.8.3.1

#389Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#388)

On Mon, Jul 19, 2021 at 1:00 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Mon, Jul 19, 2021 at 4:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

OK. I have implemented this reported [1] potential buffer overrun
using the constraining strlcpy, because the GID limitation of 200
bytes is already mentioned in the documentation [2].

This will work, but I think it is better to use the sizeof the gid
buffer, as we do in ParseCommitRecord() and ParseAbortRecord(). Then,
if for some unforeseen reason we later change the size of the gid buffer
to differ from GIDSIZE, the code will still work seamlessly.

Modified as requested. PSA patch v2.

LGTM. I'll push this tomorrow unless Tom or someone else has any comments.

--
With Regards,
Amit Kapila.

#390Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#389)
1 attachment(s)

Please find attached the latest patch set v98*

Patches:

v97-0001 --> v98-0001

Differences:

* Rebased to HEAD @ yesterday.

* Code/Docs changes:

1. Fixed the same strcpy problem as reported by Tom Lane [1] for the
previous 2PC patch.

2. Addressed all feedback suggestions given by Greg [2].

3. Added some more documentation as suggested by Vignesh [3].

----
[1]: /messages/by-id/161029.1626639923@sss.pgh.pa.us
[2]: /messages/by-id/CAJcOf-ckGONzyAj0Y70ju_tfLWF819JYb=dv9p5AnoZxm50j0g@mail.gmail.com
[3]: /messages/by-id/CALDaNm0LVY5A98xrgaodynnj6c=WQ5=ZMpauC44aRio7-jWBYQ@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v98-0001-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From 9a18a13eb20ba383f09208789a4ca2802284c773 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 20 Jul 2021 12:36:09 +1000
Subject: [PATCH v98] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/logicaldecoding.sgml                  |  11 +-
 doc/src/sgml/protocol.sgml                         |  68 +++-
 doc/src/sgml/ref/create_subscription.sgml          |  10 -
 src/backend/commands/subscriptioncmds.c            |  25 --
 src/backend/replication/logical/proto.c            |  60 +++
 src/backend/replication/logical/worker.c           | 138 ++++++-
 src/backend/replication/pgoutput/pgoutput.c        |  33 +-
 src/include/replication/logicalproto.h             |  10 +-
 src/test/regress/expected/subscription.out         |  24 +-
 src/test/regress/sql/subscription.sql              |  12 +-
 src/test/subscription/t/023_twophase_stream.pl     | 453 +++++++++++++++++++++
 .../subscription/t/024_twophase_cascade_stream.pl  | 271 ++++++++++++
 12 files changed, 1032 insertions(+), 83 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl
 create mode 100644 src/test/subscription/t/024_twophase_cascade_stream.pl

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 002efc8..f6832c2 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1199,6 +1199,9 @@ OutputPluginWrite(ctx, true);
     <function>stream_abort_cb</function>, <function>stream_commit_cb</function>
     and <function>stream_change_cb</function>) and two optional callbacks
     (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+    Also, if streaming of two-phase commands is to be supported, then additional
+    callbacks must be provided. (See <xref linkend="logicaldecoding-two-phase-commits"/>
+    for details).
    </para>
 
    <para>
@@ -1237,7 +1240,13 @@ stream_start_cb(...);   &lt;-- start of second block of changes
   stream_change_cb(...);
 stream_stop_cb(...);    &lt;-- end of second block of changes
 
-stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+
+[a. when using normal commit]
+stream_commit_cb(...);    &lt;-- commit of the streamed transaction
+
+[b. when using two-phase commit]
+stream_prepare_cb(...);   &lt;-- prepare the streamed transaction
+commit_prepared_cb(...);  &lt;-- commit of the prepared transaction
 </programlisting>
    </para>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index e8cb78f..221589a 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction 
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit or Stream Abort message.
   </para>
 
   <para>
@@ -7398,7 +7398,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7661,6 +7661,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare message for a large in-progress transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
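For reviewers who want to check the Stream Prepare layout documented above without a running server, here is a minimal standalone sketch of the wire format: Byte1('p'), Int8 flags, Int64 prepare LSN, Int64 end LSN, Int64 timestamp, Int32 xid, then a NUL-terminated GID, with integers in network byte order. It is not the patch's code. The struct and function names (StreamPrepareMsg, stream_prepare_encode, stream_prepare_decode) are hypothetical, and the 200-byte GID buffer is an assumption chosen to mirror GIDSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for one Stream Prepare message (not a PostgreSQL type). */
typedef struct StreamPrepareMsg
{
	uint8_t		flags;			/* currently unused, must be 0 */
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;	/* microseconds since 2000-01-01 */
	uint32_t	xid;
	char		gid[200];		/* assumed bound, mirroring GIDSIZE */
} StreamPrepareMsg;

/* Fixed-size part: type byte + flags + three Int64 fields + Int32 xid. */
#define STREAM_PREPARE_FIXED (1 + 1 + 8 + 8 + 8 + 4)

/* Store the low nbytes of v at p in network (big-endian) byte order. */
static void
put_be(uint8_t *p, uint64_t v, int nbytes)
{
	for (int i = nbytes - 1; i >= 0; i--)
	{
		p[i] = (uint8_t) (v & 0xFF);
		v >>= 8;
	}
}

/* Read nbytes at p as a big-endian unsigned integer. */
static uint64_t
get_be(const uint8_t *p, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Serialize m into buf. Returns bytes written, or 0 if cap is too small. */
size_t
stream_prepare_encode(const StreamPrepareMsg *m, uint8_t *buf, size_t cap)
{
	size_t		gidlen;
	size_t		off = 0;

	assert(m != NULL && buf != NULL);
	gidlen = strlen(m->gid) + 1;	/* include the terminating NUL */
	if (cap < STREAM_PREPARE_FIXED + gidlen)
		return 0;

	buf[off++] = 'p';
	buf[off++] = m->flags;
	put_be(buf + off, m->prepare_lsn, 8);
	off += 8;
	put_be(buf + off, m->end_lsn, 8);
	off += 8;
	put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += 8;
	put_be(buf + off, m->xid, 4);
	off += 4;
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/* Parse buf into m. Returns 0 on success, -1 on malformed input. */
int
stream_prepare_decode(const uint8_t *buf, size_t len, StreamPrepareMsg *m)
{
	const uint8_t *gid;
	const uint8_t *nul;
	size_t		gidlen;

	assert(buf != NULL && m != NULL);
	if (len <= STREAM_PREPARE_FIXED || buf[0] != 'p')
		return -1;
	if (buf[1] != 0)			/* flags are reserved and must be 0 */
		return -1;

	m->flags = buf[1];
	m->prepare_lsn = get_be(buf + 2, 8);
	m->end_lsn = get_be(buf + 10, 8);
	m->prepare_time = (int64_t) get_be(buf + 18, 8);
	m->xid = (uint32_t) get_be(buf + 26, 4);

	/* The GID must be NUL-terminated within the message and fit the buffer. */
	gid = buf + STREAM_PREPARE_FIXED;
	nul = memchr(gid, '\0', len - STREAM_PREPARE_FIXED);
	if (nul == NULL)
		return -1;
	gidlen = (size_t) (nul - gid);
	if (gidlen >= sizeof(m->gid))
		return -1;
	memcpy(m->gid, gid, gidlen + 1);
	return 0;
}
```

Two deliberate differences from the patch: the sketch checks the 'p' type byte itself, whereas in the apply worker apply_dispatch consumes it before logicalrep_read_stream_prepare runs, and it rejects an over-long GID instead of truncating it the way strlcpy does.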
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1433905..702934e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 239d263..5cf64b6 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -333,25 +333,6 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (opts->twophase &&
-		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
-		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
-	{
-		if (opts->streaming &&
-			IsSet(supported_opts, SUBOPT_STREAMING) &&
-			IsSet(opts->specified_opts, SUBOPT_STREAMING))
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-			/*- translator: both %s are strings of the form "option = value" */
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
 }
 
 /*
@@ -924,12 +905,6 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index a245252..00990ca 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -318,6 +318,66 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	uint8		flags = 0;
+
+	pq_sendbyte(out, LOGICAL_REP_MSG_STREAM_PREPARE);
+
+	/*
+	 * This should only ever happen for two-phase commit transactions, in
+	 * which case we expect to have a valid GID.
+	 */
+	Assert(txn->gid != NULL);
+
+	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
+
+	/* send the flags field */
+	pq_sendbyte(out, flags);
+
+	/* send fields */
+	pq_sendint64(out, prepare_lsn);
+	pq_sendint64(out, txn->end_lsn);
+	pq_sendint64(out, txn->xact_time.prepare_time);
+	pq_sendint32(out, txn->xid);
+
+	/* send gid */
+	pq_sendstring(out, txn->gid);
+}
+
+/*
+ * Read STREAM PREPARE from the input stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	uint8		flags;
+
+	/* read flags */
+	flags = pq_getmsgbyte(in);
+
+	if (flags != 0)
+		elog(ERROR, "unrecognized flags %u in stream prepare message", flags);
+
+	/* read fields */
+	prepare_data->prepare_lsn = pq_getmsgint64(in);
+	prepare_data->end_lsn = pq_getmsgint64(in);
+	prepare_data->prepare_time = pq_getmsgint64(in);
+	prepare_data->xid = pq_getmsgint(in, 4);
+
+	/* read gid (copy it into a pre-allocated buffer) */
+	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b9a7a7f..ac505e1 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -333,6 +333,8 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
 
+/* Common streaming function to apply all the spooled messages */
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -1041,6 +1043,90 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	int			nchanges = 0;
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute a unique GID for two_phase transactions. We don't use the GID
+	 * of the prepared transaction sent by the server, as that can lead to
+	 * deadlock when multiple subscriptions on the same node point to
+	 * publications on the same node. See the comments atop worker.c.
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+
+	nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data.end_lsn;
+	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+	PrepareTransactionBlock(gid);
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes", nchanges);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1256,30 +1342,20 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	if (in_streamed_transaction)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROTOCOL_VIOLATION),
-				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
 	/* Make sure we have an open transaction */
 	begin_replication_step();
 
@@ -1290,7 +1366,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -1311,7 +1387,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1390,6 +1466,32 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return nchanges;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+	int			nchanges = 0;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	nchanges = apply_spooled_messages(xid, commit_data.commit_lsn);
+
+	elog(DEBUG1, "apply_handle_stream_commit: replayed %d (all) changes", nchanges);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
@@ -2333,6 +2435,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index e4314af..286119c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase
-		 * and streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 63de90d..d193b41 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -124,6 +125,7 @@ typedef struct LogicalRepBeginData
 	TransactionId xid;
 } LogicalRepBeginData;
 
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData
 {
 	XLogRecPtr	commit_lsn;
@@ -243,4 +245,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index ad6b4e4..34ebca4 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -279,27 +279,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index b732871..e304852 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -209,23 +209,25 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c90e3f6
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,453 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 27;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC ROLLBACK PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is rolled back.
+# 3. After servers are restarted the pending transaction is rolled back.
+#
+# Expect all inserted data is gone.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# rollback after the restart
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are rolled back
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only subscriber crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then 1 server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: only publisher crashes)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# insert, update, delete enough data to cause streaming
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->stop('immediate');
+$node_publisher->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Do DELETE after PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row DELETE is done for one of the records that was inserted by the 2PC transaction
+# 4. Then there is a COMMIT PREPARED.
+#
+# Expect all the 2PC data rows on the subscriber (since in fact delete at step 3 would do nothing
+# because that record was not yet committed at the time of the delete).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# DELETE one of the prepared 2PC records before they get committed (we are outside of the 2PC transaction)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a = 5");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber. Nothing was deleted');
+
+# confirm the "deleted" row was in fact not deleted
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab WHERE a = 5");
+is($result, qq(1), 'The row we deleted before the commit still exists on subscriber.');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Try 2PC transaction works using an empty GID literal
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION '';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED '';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
diff --git a/src/test/subscription/t/024_twophase_cascade_stream.pl b/src/test/subscription/t/024_twophase_cascade_stream.pl
new file mode 100644
index 0000000..3a0be82
--- /dev/null
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 31;
+
+###############################
+# Setup a cascade of pub/sub nodes.
+# node_A -> node_B -> node_C
+###############################
+
+# Initialize nodes
+# node_A
+my $node_A = get_new_node('node_A');
+$node_A->init(allows_streaming => 'logical');
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_A->start;
+# node_B
+my $node_B = get_new_node('node_B');
+$node_B->init(allows_streaming => 'logical');
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_B->start;
+# node_C
+my $node_C = get_new_node('node_C');
+$node_C->init(allows_streaming => 'logical');
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_C->start;
+
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+
+# node_A (pub) -> node_B (sub)
+my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
+$node_A->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_A FOR TABLE test_tab");
+my $appname_B = 'tap_sub_B';
+$node_B->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_B
+	CONNECTION '$node_A_connstr application_name=$appname_B'
+	PUBLICATION tap_pub_A
+	WITH (streaming = on, two_phase = on)");
+
+# node_B (pub) -> node_C (sub)
+my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
+$node_B->safe_psql('postgres',
+	"CREATE PUBLICATION tap_pub_B FOR TABLE test_tab");
+my $appname_C = 'tap_sub_C';
+$node_C->safe_psql('postgres',	"
+	CREATE SUBSCRIPTION tap_sub_C
+	CONNECTION '$node_B_connstr application_name=$appname_C'
+	PUBLICATION tap_pub_B
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# Also wait for two-phase to be enabled
+my $twophase_query = "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_B->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+$node_C->poll_query_until('postgres', $twophase_query)
+	or die "Timed out while waiting for subscriber to enable twophase";
+
+is(1,1, "Cascade setup is complete");
+
+my $result;
+
+###############################
+# Check initial data was copied to subscriber(s)
+###############################
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber C');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC ROLLBACK
+$node_A->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction is aborted on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Row inserted by 2PC is not present. Only initial data remains on subscriber C');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. There are 2 rows only in the table (from previous test)
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows rolled back are not present on subscriber C');
+
+###############################
+# check all the cleanup
+###############################
+
+# cleanup the node_B => node_C pub/sub
+$node_C->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_C");
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node C');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node B');
+
+# cleanup the node_A => node_B pub/sub
+$node_B->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub_B");
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber node B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber node B');
+$result = $node_A->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher node A');
+
+# shutdown
+$node_C->stop('fast');
+$node_B->stop('fast');
+$node_A->stop('fast');
-- 
1.8.3.1

#391Peter Smith
smithpb2250@gmail.com
In reply to: Greg Nancarrow (#386)

On Mon, Jul 19, 2021 at 3:28 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Wed, Jul 14, 2021 at 6:33 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v97*

I couldn't spot any significant issues in the v97-0001 patch, but
do have the following trivial feedback comments:

(1) doc/src/sgml/protocol.sgml
Suggestion:

BEFORE:
+   contains a Stream Prepare or Stream Commit or Stream Abort message.
AFTER:
+   contains a Stream Prepare, Stream Commit or Stream Abort message.

(2) src/backend/replication/logical/worker.c
It seems a bit weird to add a forward declaration here, without a
comment, like for the one immediately above it

/* Compute GID for two_phase transactions */
static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char
*gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);

(3) src/backend/replication/logical/worker.c
Other DEBUG1 messages don't end with "."

+ elog(DEBUG1, "apply_handle_stream_prepare: replayed %d
(all) changes.", nchanges);

Thanks for the feedback. All these are fixed as suggested in v98.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#392Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#382)

On Fri, Jul 16, 2021 at 4:08 PM vignesh C <vignesh21@gmail.com> wrote:

[...]

Thanks for the updated patch. The patch applies cleanly and the tests pass.
I had a couple of comments:
1) Should we include "stream_prepare_cb" here in
logicaldecoding-streaming section of logicaldecoding.sgml
documentation:
To reduce the apply lag caused by large transactions, an output plugin
may provide additional callback to support incremental streaming of
in-progress transactions. There are multiple required streaming
callbacks (stream_start_cb, stream_stop_cb, stream_abort_cb,
stream_commit_cb and stream_change_cb) and two optional callbacks
(stream_message_cb and stream_truncate_cb).

Modified in v98. The information about 'stream_prepare_cb' and friends
is given in detail in section 49.10 so I added a link to that page.

2) Should we add an example for stream_prepare_cb here in
logicaldecoding-streaming section of logicaldecoding.sgml
documentation:
One example sequence of streaming callback calls for one transaction
may look like this:

stream_start_cb(...); <-- start of first block of changes
stream_change_cb(...);
stream_change_cb(...);
stream_message_cb(...);
stream_change_cb(...);
...
stream_change_cb(...);
stream_stop_cb(...); <-- end of first block of changes

stream_start_cb(...); <-- start of second block of changes
stream_change_cb(...);
stream_change_cb(...);
stream_change_cb(...);
...
stream_message_cb(...);
stream_change_cb(...);
stream_stop_cb(...); <-- end of second block of changes

stream_commit_cb(...); <-- commit of the streamed transaction

Modified in v98. I felt it would be too verbose to add another full
example since it would be 90% the same as the current example. So I
have combined the information.
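
For readers following along, the combined sequence can be sketched roughly as below. This is only a toy model, not output plugin code: the callback names are the ones discussed above, but Trace, trace_add(), emit_block() and replay_streamed_txn() are hypothetical helpers invented for this illustration, and the real callbacks take context and transaction arguments that are omitted here. It assumes a streamed two-phase transaction ends with stream_prepare_cb followed later by commit_prepared_cb, in place of stream_commit_cb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical trace buffer: records callback names in call order. */
typedef struct Trace
{
	char		buf[256];
} Trace;

/* Append one callback name to the trace, comma separated. */
static void
trace_add(Trace *t, const char *name)
{
	size_t		used = strlen(t->buf);

	assert(used + strlen(name) + 2 <= sizeof(t->buf));
	snprintf(t->buf + used, sizeof(t->buf) - used, "%s%s",
			 used ? "," : "", name);
}

/* One streamed block: start, nchanges changes, stop. */
static void
emit_block(Trace *t, int nchanges)
{
	trace_add(t, "stream_start_cb");
	for (int i = 0; i < nchanges; i++)
		trace_add(t, "stream_change_cb");
	trace_add(t, "stream_stop_cb");
}

/*
 * Replay a transaction streamed in two blocks. Only the tail differs:
 * a normal commit ends with stream_commit_cb, while a two-phase
 * transaction ends with stream_prepare_cb and is finished later by
 * commit_prepared_cb.
 */
static void
replay_streamed_txn(Trace *t, bool two_phase)
{
	t->buf[0] = '\0';
	emit_block(t, 2);
	emit_block(t, 1);

	if (two_phase)
	{
		trace_add(t, "stream_prepare_cb");
		trace_add(t, "commit_prepared_cb");
	}
	else
		trace_add(t, "stream_commit_cb");
}
```

Calling replay_streamed_txn(&t, true) and printing t.buf shows that the start/change/stop blocks are identical in both cases, which is why a second full example in the docs would be mostly duplication.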

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#393Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#390)

On Tue, Jul 20, 2021 at 9:24 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v98*

Review comments:
================
1.
/*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)

Let's extract this common functionality (common to current code and
the patch) as a separate patch? I think we can commit this as a
separate patch.

2.
apply_spooled_messages()
{
..
elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
nchanges, path);
..
}

You have this DEBUG1 message in apply_spooled_messages and its
callers. You can remove it from callers as the patch already has
another debug message to indicate whether it is stream prepare or
stream commit. Also, if this is the only reason to return nchanges
from apply_spooled_messages() then we can get rid of that as well.

3.
+ /*
+ * 2. Mark the transaction as prepared. - Similar code as for
+ * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+ */
+
+ /*
+ * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+ * called within the PrepareTransactionBlock below.
+ */
+ BeginTransactionBlock();
+ CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data.end_lsn;
+ replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+ PrepareTransactionBlock(gid);

I think you can move this part into a common function
apply_handle_prepare_internal. If that is possible then you might want
to move this part into a common functionality patch as mentioned in
point-1.
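
A rough sketch of what such a common function might look like is below. It is an assumption-laden illustration, not the patch: the typedefs, the PrepareDataSketch struct, the call_log buffer and the stubbed BeginTransactionBlock(), CommitTransactionCommand() and PrepareTransactionBlock() are stand-ins so that the fragment compiles on its own (the stubs only record that they were called, and their signatures are simplified). The signature of apply_handle_prepare_internal() is likewise a guess.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in typedefs; the real ones come from the backend headers. */
typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;

/* Hypothetical subset of the prepare message data used below. */
typedef struct PrepareDataSketch
{
	XLogRecPtr	end_lsn;
	TimestampTz prepare_time;
} PrepareDataSketch;

/* Stand-ins for the replication origin session state. */
static XLogRecPtr replorigin_session_origin_lsn;
static TimestampTz replorigin_session_origin_timestamp;

/* Records the order of the stubbed transaction calls. */
static char call_log[128];

static void
log_call(const char *name)
{
	assert(strlen(call_log) + strlen(name) + 2 <= sizeof(call_log));
	strcat(call_log, name);
	strcat(call_log, ";");
}

/* Stubs only: the real functions live in the backend. */
static void BeginTransactionBlock(void) { log_call("Begin"); }
static void CommitTransactionCommand(void) { log_call("Commit"); }
static void PrepareTransactionBlock(const char *gid) { (void) gid; log_call("Prepare"); }

/*
 * Common part of the streamed and non-streamed prepare handlers:
 * open the transaction block, record the origin position so streaming
 * can restart correctly after a crash, then prepare under the given GID.
 */
static void
apply_handle_prepare_internal(const PrepareDataSketch *prepare_data,
							  const char *gid)
{
	/* Balances the EndTransactionBlock done inside PrepareTransactionBlock. */
	BeginTransactionBlock();
	CommitTransactionCommand();	/* completes the preceding Begin */

	replorigin_session_origin_lsn = prepare_data->end_lsn;
	replorigin_session_origin_timestamp = prepare_data->prepare_time;

	PrepareTransactionBlock(gid);
}
```

With this shape, both apply_handle_prepare() and apply_handle_stream_prepare() would reduce to reading their message, doing any handler-specific work, and calling the shared function.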

4.
+ xid = logicalrep_read_stream_prepare(s, &prepare_data);
+ elog(DEBUG1, "received prepare for streamed transaction %u", xid);

It is better to have an empty line between the above code lines for
the sake of clarity.

5.
+/* Commit (and abort) information */
typedef struct LogicalRepCommitData

How is this structure related to abort? Even if it is, why does this
comment belong in this patch?

6. Most of the code in logicalrep_write_stream_prepare() and
logicalrep_write_prepare() is the same except for the message. I think if we
want we can handle both of them with a single message by setting some
flag for the stream case, but probably there will be some additional
checking required on the worker side. What do you think? I think if we
want to keep them separate then at least we should keep the common
functionality in logicalrep_write_*/logicalrep_read_* in separate
functions. This way we will avoid minor inconsistencies between the stream and
non-stream functions.
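
To make the second option concrete, here is a minimal sketch of the "keep them separate but share the body" idea. Everything in it is hypothetical and heavily simplified: OutBuf and the put_* helpers stand in for StringInfo and the pq_send* routines, the message carries only a type byte, an xid and a GID (the real message has more fields), and the type byte values are illustrative rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative message type bytes; not taken from the patch. */
enum
{
	MSG_PREPARE = 'P',
	MSG_STREAM_PREPARE = 'p'
};

/* Hypothetical stand-in for StringInfo. */
typedef struct OutBuf
{
	unsigned char data[128];
	size_t		len;
} OutBuf;

static void
put_byte(OutBuf *out, unsigned char b)
{
	assert(out->len < sizeof(out->data));
	out->data[out->len++] = b;
}

/* Write a 32-bit value in network byte order. */
static void
put_uint32(OutBuf *out, uint32_t v)
{
	for (int shift = 24; shift >= 0; shift -= 8)
		put_byte(out, (unsigned char) ((v >> shift) & 0xFF));
}

/* Write a string including its terminating NUL. */
static void
put_string(OutBuf *out, const char *s)
{
	size_t		n = strlen(s) + 1;

	for (size_t i = 0; i < n; i++)
		put_byte(out, (unsigned char) s[i]);
}

/* Shared body: the only thing the callers vary is the type byte. */
static void
write_prepare_common(OutBuf *out, unsigned char type,
					 uint32_t xid, const char *gid)
{
	out->len = 0;
	put_byte(out, type);
	put_uint32(out, xid);
	put_string(out, gid);
}

static void
write_prepare(OutBuf *out, uint32_t xid, const char *gid)
{
	write_prepare_common(out, MSG_PREPARE, xid, gid);
}

static void
write_stream_prepare(OutBuf *out, uint32_t xid, const char *gid)
{
	write_prepare_common(out, MSG_STREAM_PREPARE, xid, gid);
}
```

Because both wrappers go through one body, the two messages cannot drift apart field by field; a matching read-side helper would follow the same pattern.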

7.
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit or Stream Abort message.

I am not sure if it is correct to mention Stream Prepare here because
after that we will send commit prepared as well for such a
transaction. So, I think we should remove this change.

8.
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
\dRs+

+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);

Is there a reason for this change in the tests?

9.
I think this contains a lot of streaming tests in 023_twophase_stream.
Let's keep just one test for crash-restart scenario (+# Check that 2PC
COMMIT PREPARED is decoded properly on crash restart.) where both
publisher and subscriber get restarted. I think others are covered in
one way or another by other existing tests. Apart from that, I also
don't see the need for the below tests:
# Do DELETE after PREPARE but before COMMIT PREPARED.
This is mostly the same as the previous test where the patch is testing Insert
# Try 2PC transaction works using an empty GID literal
This is covered in 021_twophase.

10.
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.

In the above comment, you might want to say something about streaming.
In general, I am not sure if it really adds value to have this
many streaming tests for the cascaded setup, and to do the whole setup
again after we have already done it in 022_twophase_cascade. I think it is
sufficient to do just one or two streaming tests by enhancing
022_twophase_cascade; you can alter the subscription to enable streaming
after doing the non-streaming tests.

11. Have you verified that all these tests went through the streaming
code path? If not, you can temporarily enable the DEBUG message in
apply_handle_stream_prepare() and see if all tests hit that.

--
With Regards,
Amit Kapila.

#394Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#393)

On Fri, Jul 23, 2021 at 8:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 20, 2021 at 9:24 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v98*

Review comments:
================

[...]

With Regards,
Amit Kapila.

Thanks for your review comments.

I have been working through them today and hope to post the v99*
patches tomorrow.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#395Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#393)

On Fri, Jul 23, 2021 at 8:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 20, 2021 at 9:24 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v98*

Review comments:
================

All the following review comments are addressed in v99* patch set.

1.
/*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
*/
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)

Let's extract this common functionality (common to current code and
the patch) as a separate patch? I think we can commit this as a
separate patch.

Done. Split patches as requested.

2.
apply_spooled_messages()
{
..
elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
nchanges, path);
..
}

You have this DEBUG1 message in apply_spooled_messages and its
callers. You can remove it from callers as the patch already has
another debug message to indicate whether it is stream prepare or
stream commit. Also, if this is the only reason to return nchanges
from apply_spooled_messages() then we can get rid of that as well.

Done.
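
For readers following along, the shape of this refactoring can be sketched
as a toy model. This is not the worker.c code: every toy_* name, the
ToyApplyState struct and the integer "changes" are hypothetical, chosen only
to show one shared replay routine that owns the "replayed" debug line, with
thin commit/prepare callers that each log only their own message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the apply worker's state (illustration only). */
typedef struct ToyApplyState
{
	int			applied;	/* number of spooled changes applied */
	int			sum;		/* toy "effect" of applying the changes */
	int			prepared;	/* set once the transaction is prepared */
	int			committed;	/* set once the transaction is committed */
	unsigned long final_lsn;	/* LSN handed in by the caller */
} ToyApplyState;

/*
 * Common spool replay. It returns nothing: the only debug line about the
 * number of replayed changes lives here, so callers do not need a count.
 */
static void
toy_apply_spooled_messages(ToyApplyState *st, const int *spool,
						   size_t nchanges, unsigned long lsn)
{
	st->final_lsn = lsn;

	for (size_t i = 0; i < nchanges; i++)
	{
		st->sum += spool[i];
		st->applied++;
	}

	fprintf(stderr, "replayed %zu (all) changes\n", nchanges);
}

/* Stream commit: log, replay the spool, then commit. */
static void
toy_handle_stream_commit(ToyApplyState *st, unsigned int xid,
						 const int *spool, size_t nchanges,
						 unsigned long commit_lsn)
{
	fprintf(stderr, "received commit for streamed transaction %u\n", xid);

	toy_apply_spooled_messages(st, spool, nchanges, commit_lsn);
	st->committed = 1;
}

/* Stream prepare: log, replay the spool, then mark as prepared. */
static void
toy_handle_stream_prepare(ToyApplyState *st, unsigned int xid,
						  const int *spool, size_t nchanges,
						  unsigned long prepare_lsn)
{
	fprintf(stderr, "received prepare for streamed transaction %u\n", xid);

	toy_apply_spooled_messages(st, spool, nchanges, prepare_lsn);
	st->prepared = 1;
}
```

Both callers differ only in the message they log and in what they do after
the replay, which is why the replay itself is worth sharing.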

3.
+ /*
+ * 2. Mark the transaction as prepared. - Similar code as for
+ * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+ */
+
+ /*
+ * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+ * called within the PrepareTransactionBlock below.
+ */
+ BeginTransactionBlock();
+ CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data.end_lsn;
+ replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+ PrepareTransactionBlock(gid);

I think you can move this part into a common function
apply_handle_prepare_internal. If that is possible then you might want
to move this part into a common functionality patch as mentioned in
point-1.

Done. (The common function is included in patch 0001)

4.
+ xid = logicalrep_read_stream_prepare(s, &prepare_data);
+ elog(DEBUG1, "received prepare for streamed transaction %u", xid);

It is better to have an empty line between the above code lines for
the sake of clarity.

Done.

5.
+/* Commit (and abort) information */
typedef struct LogicalRepCommitData

How is this structure related to abort? Even if it is, why this
comment belongs to this patch?

OK. Removed this from the patch.

6. Most of the code in logicalrep_write_stream_prepare() and
logicalrep_write_prepare() is same except for message. I think if we
want we can handle both of them with a single message by setting some
flag for stream case but probably there will be some additional
checking required on the worker-side. What do you think? I think if we
want to keep them separate then at least we should keep the common
functionality in logicalrep_write_*/logicalrep_read_* in separate
functions. This way we will avoid minor inconsistencies in-stream and
non-stream functions.

Done. (The common functions are included in patch 0001).
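
To illustrate why one common routine is enough, here is a standalone sketch
of the message layout both variants share (type byte, flags, prepare LSN,
end LSN, prepare time, xid, NUL-terminated gid, all integers in network
byte order). It does not use StringInfo or the pq_* routines. The names
PrepareMsg, encode_prepare_common, decode_prepare_body and GID_MAX are
made up for this example, and 'P' for the plain PREPARE byte is assumed
here; only 'p' for STREAM PREPARE comes from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 'p' is from the patch; 'P' for plain PREPARE is an assumption here. */
enum { MSG_PREPARE = 'P', MSG_STREAM_PREPARE = 'p' };

#define GID_MAX 200				/* hypothetical gid buffer size */

typedef struct PrepareMsg
{
	uint8_t		flags;			/* currently unused, must be 0 */
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[GID_MAX];
} PrepareMsg;

/* Store the low nbytes of v in big-endian order; returns nbytes. */
static size_t
put_be(uint8_t *dst, uint64_t v, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		dst[i] = (uint8_t) (v >> (8 * (nbytes - 1 - i)));
	return nbytes;
}

/* Read nbytes of big-endian data. */
static uint64_t
get_be(const uint8_t *src, size_t nbytes)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | src[i];
	return v;
}

/*
 * Common writer: only the leading type byte differs between the two
 * messages. Returns bytes written, or 0 if the buffer is too small.
 */
static size_t
encode_prepare_common(uint8_t *buf, size_t cap, uint8_t type,
					  const PrepareMsg *m)
{
	size_t		gidlen = strlen(m->gid) + 1;	/* include the NUL */
	size_t		off = 0;

	if (cap < 1 + 1 + 8 + 8 + 8 + 4 + gidlen)
		return 0;

	buf[off++] = type;
	buf[off++] = m->flags;
	off += put_be(buf + off, m->prepare_lsn, 8);
	off += put_be(buf + off, m->end_lsn, 8);
	off += put_be(buf + off, (uint64_t) m->prepare_time, 8);
	off += put_be(buf + off, m->xid, 4);
	memcpy(buf + off, m->gid, gidlen);
	return off + gidlen;
}

/*
 * Common reader for the bytes after the type byte. Returns 0 on success,
 * -1 on a short message, nonzero flags, or an unset (zero) LSN.
 */
static int
decode_prepare_body(const uint8_t *body, size_t len, PrepareMsg *m)
{
	const size_t fixed = 1 + 8 + 8 + 8 + 4;
	const uint8_t *nul;
	size_t		gidlen;

	if (len <= fixed)
		return -1;
	nul = memchr(body + fixed, 0, len - fixed);
	if (nul == NULL)
		return -1;

	m->flags = body[0];
	m->prepare_lsn = get_be(body + 1, 8);
	m->end_lsn = get_be(body + 9, 8);
	m->prepare_time = (int64_t) get_be(body + 17, 8);
	m->xid = (uint32_t) get_be(body + 25, 4);
	if (m->flags != 0 || m->prepare_lsn == 0 || m->end_lsn == 0)
		return -1;

	/* Copy the gid, truncating if it does not fit the buffer. */
	gidlen = (size_t) (nul - (body + fixed));
	if (gidlen >= GID_MAX)
		gidlen = GID_MAX - 1;
	memcpy(m->gid, body + fixed, gidlen);
	m->gid[gidlen] = '\0';
	return 0;
}
```

The reader starts after the type byte on the assumption that a dispatcher
has already consumed it to pick the handler.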

7.
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
Begin Prepare and Prepare messages belong to the same transaction.
It also sends changes of large in-progress transactions between a pair of
Stream Start and Stream Stop messages. The last stream of such a transaction
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit or Stream Abort message.

I am not sure if it is correct to mention Stream Prepare here because
after that we will send commit prepared as well for such a
transaction. So, I think we should remove this change.

Done.

8.
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
\dRs+

+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);

Is there a reason for this change in the tests?

Yes, the setting of slot_name = NONE really belongs with the DROP
SUBSCRIPTION. Similarly, the \dRs+ is done to test the effect of
setting the streaming option (not the slot_name = NONE). Since I
needed to add a new DROP SUBSCRIPTION (because the streaming option
now works), I also refactored this existing test to make all the
test formats consistent.

9.
I think this contains a lot of streaming tests in 023_twophase_stream.
Let's keep just one test for crash-restart scenario (+# Check that 2PC
COMMIT PREPARED is decoded properly on crash restart.) where both
publisher and subscriber get restarted. I think others are covered in
one or another way by other existing tests. Apart from that, I also
don't see the need for the below tests:
# Do DELETE after PREPARE but before COMMIT PREPARED.
This is mostly the same as the previous test where the patch is testing Insert
# Try 2PC transaction works using an empty GID literal
This is covered in 021_twophase.

Done. Removed all the excessive tests as you suggested.

10.
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.

In the above comment, you might want to say something about streaming.
In general, I am not sure if it is really adding value to have these
many streaming tests for cascaded setup and doing the whole setup
again after we have done in tests 022_twophase_cascade. I think it is
sufficient to do just one or two streaming tests by enhancing
022_twophase_cascade, you can alter subscription to enable streaming
after doing non-streaming tests.

Done. Removed the 024 TAP tests and instead merged the streaming
cascade tests into 022_twophase_cascade.pl, as you suggested.

11. Have you verified that all these tests went through the streaming
code path? If not, you can once enable DEBUG message in
apply_handle_stream_prepare() and see if all tests hit that.

Yes, this was verified a long time ago when the tests were first
written. Anyway, just to be certain, I temporarily modified the code as
suggested and confirmed from the logfiles that the tests are running
through apply_handle_stream_prepare.

------
Kind Regards,
Peter Smith.
Fujitsu Australia.

#396Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#395)
2 attachment(s)

Please find attached the latest patch set v99*

v98-0001 --> split into v99-0001 + v99-0002

Differences:

* Rebased to HEAD @ yesterday.

* Addresses review comments from Amit [1] and split the v98 patch as requested.

----
[1]: /messages/by-id/CAA4eK1+izpAybqpEFp8+Rx=C1Z1H_XLcRod_WYjBRv2Rn+DO2w@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v99-0001-Refactor-to-make-common-functions.patch (application/octet-stream)
From 65d08446d76874ef4a5120290abca8aad6a5ff50 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 27 Jul 2021 14:14:21 +1000
Subject: [PATCH v99] Refactor to make common functions.

This is a non-functional change only to refactor code to extract
some replication logic into static functions.

This is just done as preparation for the 2PC streaming patch which
also shares this common logic.
---
 src/backend/replication/logical/proto.c  | 44 +++++++++++++----
 src/backend/replication/logical/worker.c | 82 +++++++++++++++++++++-----------
 2 files changed, 87 insertions(+), 39 deletions(-)

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index a245252..4b392cb 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -145,15 +145,15 @@ logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_da
 }
 
 /*
- * Write PREPARE to the output stream.
+ * Common code for logicalrep_write_prepare and logicalrep_write_stream_prepare.
  */
-void
-logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
-						 XLogRecPtr prepare_lsn)
+static void
+logicalrep_write_prepare_common(StringInfo out, LogicalRepMsgType type,
+								ReorderBufferTXN *txn, XLogRecPtr prepare_lsn)
 {
 	uint8		flags = 0;
 
-	pq_sendbyte(out, LOGICAL_REP_MSG_PREPARE);
+	pq_sendbyte(out, type);
 
 	/*
 	 * This should only ever happen for two-phase commit transactions, in
@@ -161,6 +161,7 @@ logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 	 */
 	Assert(txn->gid != NULL);
 	Assert(rbtxn_prepared(txn));
+	Assert(TransactionIdIsValid(txn->xid));
 
 	/* send the flags field */
 	pq_sendbyte(out, flags);
@@ -176,29 +177,52 @@ logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 }
 
 /*
- * Read transaction PREPARE from the stream.
+ * Write PREPARE to the output stream.
  */
 void
-logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
+						 XLogRecPtr prepare_lsn)
+{
+	logicalrep_write_prepare_common(out, LOGICAL_REP_MSG_PREPARE,
+									txn, prepare_lsn);
+}
+
+/*
+ * Common code for logicalrep_read_prepare and logicalrep_read_stream_prepare.
+ */
+static TransactionId
+logicalrep_read_prepare_common(StringInfo in, char *msgtype,
+							   LogicalRepPreparedTxnData *prepare_data)
 {
 	/* read flags */
 	uint8		flags = pq_getmsgbyte(in);
 
 	if (flags != 0)
-		elog(ERROR, "unrecognized flags %u in prepare message", flags);
+		elog(ERROR, "unrecognized flags %u in %s message", flags, msgtype);
 
 	/* read fields */
 	prepare_data->prepare_lsn = pq_getmsgint64(in);
 	if (prepare_data->prepare_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "prepare_lsn is not set in prepare message");
+		elog(ERROR, "prepare_lsn is not set in %s message", msgtype);
 	prepare_data->end_lsn = pq_getmsgint64(in);
 	if (prepare_data->end_lsn == InvalidXLogRecPtr)
-		elog(ERROR, "end_lsn is not set in prepare message");
+		elog(ERROR, "end_lsn is not set in %s message", msgtype);
 	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
 
 	/* read gid (copy it into a pre-allocated buffer) */
 	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
+
+	return prepare_data->xid;
+}
+
+/*
+ * Read transaction PREPARE from the stream.
+ */
+void
+logicalrep_read_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	logicalrep_read_prepare_common(in, "prepare", prepare_data);
 }
 
 /*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b9a7a7f..f16ba68 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -333,6 +333,8 @@ static void apply_handle_tuple_routing(ApplyExecutionData *edata,
 /* Compute GID for two_phase transactions */
 static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid);
 
+/* Common streaming function to apply all the spooled messages */
+static void apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
 
 /*
  * Should this worker apply changes for given relation.
@@ -885,6 +887,29 @@ apply_handle_begin_prepare(StringInfo s)
 }
 
 /*
+ * Common function to prepare the GID.
+ */
+static void
+apply_handle_prepare_internal(LogicalRepPreparedTxnData *prepare_data, char *gid)
+{
+	/*
+	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+	 * called within the PrepareTransactionBlock below.
+	 */
+	BeginTransactionBlock();
+	CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+	/*
+	 * Update origin state so we can restart streaming from correct position
+	 * in case of crash.
+	 */
+	replorigin_session_origin_lsn = prepare_data->end_lsn;
+	replorigin_session_origin_timestamp = prepare_data->prepare_time;
+
+	PrepareTransactionBlock(gid);
+}
+
+/*
  * Handle PREPARE message.
  */
 static void
@@ -923,21 +948,8 @@ apply_handle_prepare(StringInfo s)
 	 */
 	begin_replication_step();
 
-	/*
-	 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
-	 * called within the PrepareTransactionBlock below.
-	 */
-	BeginTransactionBlock();
-	CommitTransactionCommand(); /* Completes the preceding Begin command. */
-
-	/*
-	 * Update origin state so we can restart streaming from correct position
-	 * in case of crash.
-	 */
-	replorigin_session_origin_lsn = prepare_data.end_lsn;
-	replorigin_session_origin_timestamp = prepare_data.prepare_time;
+	apply_handle_prepare_internal(&prepare_data, gid);
 
-	PrepareTransactionBlock(gid);
 	end_replication_step();
 	CommitTransactionCommand();
 	pgstat_report_stat(false);
@@ -1256,30 +1268,19 @@ apply_handle_stream_abort(StringInfo s)
 }
 
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
  */
 static void
-apply_handle_stream_commit(StringInfo s)
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
-	TransactionId xid;
 	StringInfoData s2;
 	int			nchanges;
 	char		path[MAXPGPATH];
 	char	   *buffer = NULL;
-	LogicalRepCommitData commit_data;
 	StreamXidHash *ent;
 	MemoryContext oldcxt;
 	BufFile    *fd;
 
-	if (in_streamed_transaction)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROTOCOL_VIOLATION),
-				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
-
-	xid = logicalrep_read_stream_commit(s, &commit_data);
-
-	elog(DEBUG1, "received commit for streamed transaction %u", xid);
-
 	/* Make sure we have an open transaction */
 	begin_replication_step();
 
@@ -1290,7 +1291,7 @@ apply_handle_stream_commit(StringInfo s)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -1311,7 +1312,7 @@ apply_handle_stream_commit(StringInfo s)
 
 	MemoryContextSwitchTo(oldcxt);
 
-	remote_final_lsn = commit_data.commit_lsn;
+	remote_final_lsn = lsn;
 
 	/*
 	 * Make sure the handle apply_dispatch methods are aware we're in a remote
@@ -1390,6 +1391,29 @@ apply_handle_stream_commit(StringInfo s)
 	elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
 		 nchanges, path);
 
+	return;
+}
+
+/*
+ * Handle STREAM COMMIT message.
+ */
+static void
+apply_handle_stream_commit(StringInfo s)
+{
+	TransactionId xid;
+	LogicalRepCommitData commit_data;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM COMMIT message without STREAM STOP")));
+
+	xid = logicalrep_read_stream_commit(s, &commit_data);
+
+	elog(DEBUG1, "received commit for streamed transaction %u", xid);
+
+	apply_spooled_messages(xid, commit_data.commit_lsn);
+
 	apply_handle_commit_internal(s, &commit_data);
 
 	/* unlink the files with serialized changes and subxact info */
-- 
1.8.3.1

v99-0002-Add-prepare-API-support-for-streaming-transactio.patch (application/octet-stream)
From a4f94e428380d5ac8506456bc6b5287186c1ae7d Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 27 Jul 2021 15:34:40 +1000
Subject: [PATCH v99] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/logicaldecoding.sgml               |  11 +-
 doc/src/sgml/protocol.sgml                      |  66 +++++-
 doc/src/sgml/ref/create_subscription.sgml       |  10 -
 src/backend/commands/subscriptioncmds.c         |  25 ---
 src/backend/replication/logical/proto.c         |  21 ++
 src/backend/replication/logical/worker.c        |  71 ++++++
 src/backend/replication/pgoutput/pgoutput.c     |  33 ++-
 src/include/replication/logicalproto.h          |   9 +-
 src/test/regress/expected/subscription.out      |  24 +-
 src/test/regress/sql/subscription.sql           |  13 +-
 src/test/subscription/t/022_twophase_cascade.pl | 153 ++++++++++++-
 src/test/subscription/t/023_twophase_stream.pl  | 281 ++++++++++++++++++++++++
 12 files changed, 642 insertions(+), 75 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 89b8090..0d0de29 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1199,6 +1199,9 @@ OutputPluginWrite(ctx, true);
     <function>stream_abort_cb</function>, <function>stream_commit_cb</function>
     and <function>stream_change_cb</function>) and two optional callbacks
     (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+    Also, if streaming of two-phase commands is to be supported, then additional
+    callbacks must be provided. (See <xref linkend="logicaldecoding-two-phase-commits"/>
+    for details).
    </para>
 
    <para>
@@ -1237,7 +1240,13 @@ stream_start_cb(...);   &lt;-- start of second block of changes
   stream_change_cb(...);
 stream_stop_cb(...);    &lt;-- end of second block of changes
 
-stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+
+[a. when using normal commit]
+stream_commit_cb(...);    &lt;-- commit of the streamed transaction
+
+[b. when using two-phase commit]
+stream_prepare_cb(...);   &lt;-- prepare the streamed transaction
+commit_prepared_cb(...);  &lt;-- commit of the prepared transaction
 </programlisting>
    </para>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index e8cb78f..8ec5542 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7398,7 +7398,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7661,6 +7661,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare for a large in-progress transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1433905..702934e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 22ae982..5157f44 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -335,25 +335,6 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (opts->twophase &&
-		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
-		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
-	{
-		if (opts->streaming &&
-			IsSet(supported_opts, SUBOPT_STREAMING) &&
-			IsSet(opts->specified_opts, SUBOPT_STREAMING))
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-			/*- translator: both %s are strings of the form "option = value" */
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
 }
 
 /*
@@ -933,12 +914,6 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 4b392cb..7f327c4 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -342,6 +342,27 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	logicalrep_write_prepare_common(out, LOGICAL_REP_MSG_STREAM_PREPARE,
+									txn, prepare_lsn);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	return logicalrep_read_prepare_common(in, "stream prepare", prepare_data);
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f16ba68..f17f3f9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1053,6 +1053,73 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+	char		gid[GIDSIZE];
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * Compute unique GID for two_phase transactions. We don't use GID of
+	 * prepared transaction sent by server as that can lead to deadlock when
+	 * we have multiple subscriptions from same node point to publications on
+	 * the same node. See comments atop worker.c
+	 */
+	TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
+						   gid, sizeof(gid));
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+	apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+	apply_handle_prepare_internal(&prepare_data, gid);
+
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -2357,6 +2424,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index e4314af..286119c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase
-		 * and streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 63de90d..c99a931 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -243,4 +244,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 67f92b3..77b4437 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -282,27 +282,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 88743ab..fcb73a4 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -212,23 +212,26 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index e61d28a..edc62ee 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -2,11 +2,14 @@
 # Copyright (c) 2021, PostgreSQL Global Development Group
 
 # Test cascading logical replication of 2PC.
+#
+# Includes tests for options 2PC (not-streaming) and also for 2PC (streaming).
+#
 use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 27;
+use Test::More tests => 38;
 
 ###############################
 # Setup a cascade of pub/sub nodes.
@@ -17,20 +20,20 @@ use Test::More tests => 27;
 # node_A
 my $node_A = get_new_node('node_A');
 $node_A->init(allows_streaming => 'logical');
-$node_A->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
 $node_A->start;
 # node_B
 my $node_B = get_new_node('node_B');
 $node_B->init(allows_streaming => 'logical');
-$node_B->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf',	qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
 $node_B->start;
 # node_C
 my $node_C = get_new_node('node_C');
 $node_C->init(allows_streaming => 'logical');
-$node_C->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
 $node_C->start;
 
 # Create some pre-existing content on node_A
@@ -45,12 +48,29 @@ $node_B->safe_psql('postgres',
 $node_C->safe_psql('postgres',
 	"CREATE TABLE tab_full (a int PRIMARY KEY)");
 
+# Create some pre-existing content on node_A (uses same DDL as 015_stream.pl)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
 # Setup logical replication
 
+# -----------------------
+# 2PC NON-STREAMING TESTS
+# -----------------------
+
 # node_A (pub) -> node_B (sub)
 my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
 $node_A->safe_psql('postgres',
-	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full, test_tab");
 my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
@@ -61,7 +81,7 @@ $node_B->safe_psql('postgres',	"
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
 $node_B->safe_psql('postgres',
-	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full, test_tab");
 my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
@@ -203,6 +223,121 @@ is($result, qq(21), 'Rows committed are present on subscriber B');
 $result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
 is($result, qq(21), 'Rows committed are present on subscriber C');
 
+# ---------------------
+# 2PC + STREAMING TESTS
+# ---------------------
+
+# Setup logical replication (streaming = on)
+
+$node_B->safe_psql('postgres',	"
+	ALTER SUBSCRIPTION tap_sub_B
+	SET (streaming = on);");
+
+$node_C->safe_psql('postgres',	"
+	ALTER SUBSCRIPTION tap_sub_C
+	SET (streaming = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. Cleanup from previous test leaving only 2 rows
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..903e3ca
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,281 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 18;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1

#397Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#396)

On Tue, Jul 27, 2021 at 11:41 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v99*

v98-0001 --> split into v99-0001 + v99-0002

I pushed the first refactoring patch after making a few modifications, as below.
1.
- /* open the spool file for the committed transaction */
+ /* Open the spool file for the committed/prepared transaction */
  changes_filename(path, MyLogicalRepWorker->subid, xid);

In the above comment, we don't need to say prepared. It can be done as
part of the second patch.

2.
+apply_handle_prepare_internal(LogicalRepPreparedTxnData
*prepare_data, char *gid)

I don't think there is any need for this function to take gid as
input. It can compute the gid by itself instead of the callers doing it.

3.
+static TransactionId
+logicalrep_read_prepare_common(StringInfo in,
char *msgtype,
+                               LogicalRepPreparedTxnData *prepare_data)

I don't think the above function needs to return xid because it is
already present as part of prepare_data. Even if it is required for
some reason by the second patch, let's do it as part of that patch, but
I don't think it is required for the second patch.

4.
 /*
- * Write PREPARE to the output stream.
+ * Common code for logicalrep_write_prepare and
logicalrep_write_stream_prepare.
  */

Here and at another similar place, we don't need to refer to
logicalrep_write_stream_prepare as that is part of the second patch.
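
Points 2 and 3 above can be sketched in isolation. This is only a minimal, self-contained illustration and not the code from the patch: the struct, the byte-reading helper, the function names and the "pg_gid_%u_%u" format are all assumptions made for the example. The reader returns nothing because the xid is already stored in the output struct, and the apply-side helper derives the GID from the subscription id and that xid, so callers do not pass one in.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for LogicalRepPreparedTxnData. */
typedef struct PreparedTxnSketch
{
	uint64_t	prepare_lsn;
	uint64_t	end_lsn;
	int64_t		prepare_time;
	uint32_t	xid;
	char		gid[200];
} PreparedTxnSketch;

/* Read an nbytes-wide big-endian integer and advance the cursor. */
static uint64_t
read_be(const unsigned char **p, int nbytes)
{
	uint64_t	v = 0;

	for (int i = 0; i < nbytes; i++)
		v = (v << 8) | *(*p)++;
	return v;
}

/*
 * Fill 'out' from a message body laid out as: flags (1 byte),
 * prepare LSN (8), end LSN (8), prepare time (8), xid (4), and a
 * NUL-terminated gid. Returns void: a caller that needs the xid
 * reads out->xid.
 */
static void
read_prepare_sketch(const unsigned char *buf, PreparedTxnSketch *out)
{
	const unsigned char *p = buf;

	assert(buf != NULL && out != NULL);
	(void) read_be(&p, 1);		/* flags, unused in this sketch */
	out->prepare_lsn = read_be(&p, 8);
	out->end_lsn = read_be(&p, 8);
	out->prepare_time = (int64_t) read_be(&p, 8);
	out->xid = (uint32_t) read_be(&p, 4);
	snprintf(out->gid, sizeof(out->gid), "%s", (const char *) p);
}

/*
 * Build the subscriber-local gid from the subscription id and the
 * xid already held in 'data' (the format string is an assumption).
 */
static void
make_local_gid_sketch(uint32_t subid, const PreparedTxnSketch *data,
					  char *gid, size_t szgid)
{
	snprintf(gid, szgid, "pg_gid_%u_%u",
			 (unsigned) subid, (unsigned) data->xid);
}
```

With this shape, neither the reader's caller nor the apply function's caller has to carry the xid or gid around separately; both come from the one struct.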

A few comments on the 0002 patch:
=================================
1.
+# ---------------------
+# 2PC + STREAMING TESTS
+# ---------------------
+
+# Setup logical replication (streaming = on)
+
+$node_B->safe_psql('postgres', "
+ ALTER SUBSCRIPTION tap_sub_B
+ SET (streaming = on);");
+
+$node_C->safe_psql('postgres', "
+ ALTER SUBSCRIPTION tap_sub_C
+ SET (streaming = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);

This is not the right way to determine if the new streaming option is
enabled on the publisher. Even if there is no restart of apply workers
(and walsender) after you have enabled the option, the above wait will
succeed. You need to do something like below as we are doing in
001_rep_changes.pl:

$oldpid = $node_publisher->safe_psql('postgres',
"SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
);
$node_subscriber->safe_psql('postgres',
"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH
(copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name
= 'tap_sub';"
) or die "Timed out while waiting for apply to restart";

2.
+# Create some pre-existing content on publisher (uses same DDL as
015_stream test)

Here, in the comment, I don't see the need to say "uses same DDL ...".
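
The logic behind point 1 can be shown apart from the TAP framework. The following is a rough sketch with made-up names (wait_for_restart_sketch, FakeWalsender, fake_lookup), not anything from the test suite: it remembers the old walsender pid and keeps polling until a different pid is reported, which is the condition a plain catch-up wait does not check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Callback that reports the current walsender pid (0 if none). */
typedef int (*pid_lookup_fn) (void *state);

/*
 * Return true once 'lookup' reports a valid pid that differs from
 * old_pid, or false if max_attempts polls pass without a change.
 */
static bool
wait_for_restart_sketch(pid_lookup_fn lookup, void *state,
						int old_pid, int max_attempts)
{
	assert(lookup != NULL);
	for (int i = 0; i < max_attempts; i++)
	{
		int			pid = lookup(state);

		if (pid > 0 && pid != old_pid)
			return true;
		/* a real test would sleep briefly here before retrying */
	}
	return false;
}

/* Fake pid source: reports old_pid until restart_after calls have passed. */
typedef struct FakeWalsender
{
	int			calls;
	int			restart_after;
	int			old_pid;
	int			new_pid;
} FakeWalsender;

static int
fake_lookup(void *state)
{
	FakeWalsender *w = state;

	w->calls++;
	return (w->calls > w->restart_after) ? w->new_pid : w->old_pid;
}
```

The old pid has to be captured before the ALTER SUBSCRIPTION is issued; otherwise the poll may compare the new pid against itself and never succeed.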

--
With Regards,
Amit Kapila.

#398Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#397)
1 attachment(s)

Attachments:

v100-0001-Add-prepare-API-support-for-streaming-transacti.patch (application/octet-stream)
From 6c4be193f268ebad4344eb904894c737258ebc1a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Fri, 30 Jul 2021 13:35:10 +1000
Subject: [PATCH v100] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/logicaldecoding.sgml               |  11 +-
 doc/src/sgml/protocol.sgml                      |  66 +++++-
 doc/src/sgml/ref/create_subscription.sgml       |  10 -
 src/backend/commands/subscriptioncmds.c         |  25 ---
 src/backend/replication/logical/proto.c         |  31 ++-
 src/backend/replication/logical/worker.c        |  63 +++++-
 src/backend/replication/pgoutput/pgoutput.c     |  33 ++-
 src/include/replication/logicalproto.h          |   9 +-
 src/test/regress/expected/subscription.out      |  24 +-
 src/test/regress/sql/subscription.sql           |  13 +-
 src/test/subscription/t/022_twophase_cascade.pl | 166 +++++++++++++-
 src/test/subscription/t/023_twophase_stream.pl  | 281 ++++++++++++++++++++++++
 12 files changed, 654 insertions(+), 78 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 89b8090..0d0de29 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1199,6 +1199,9 @@ OutputPluginWrite(ctx, true);
     <function>stream_abort_cb</function>, <function>stream_commit_cb</function>
     and <function>stream_change_cb</function>) and two optional callbacks
     (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+    Also, if streaming of two-phase commands is to be supported, then additional
+    callbacks must be provided. (See <xref linkend="logicaldecoding-two-phase-commits"/>
+    for details).
    </para>
 
    <para>
@@ -1237,7 +1240,13 @@ stream_start_cb(...);   &lt;-- start of second block of changes
   stream_change_cb(...);
 stream_stop_cb(...);    &lt;-- end of second block of changes
 
-stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+
+[a. when using normal commit]
+stream_commit_cb(...);    &lt;-- commit of the streamed transaction
+
+[b. when using two-phase commit]
+stream_prepare_cb(...);   &lt;-- prepare the streamed transaction
+commit_prepared_cb(...);  &lt;-- commit of the prepared transaction
 </programlisting>
    </para>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index e8cb78f..8ec5542 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7398,7 +7398,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7661,6 +7661,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare for a large in-progress transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1433905..702934e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 22ae982..5157f44 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -335,25 +335,6 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (opts->twophase &&
-		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
-		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
-	{
-		if (opts->streaming &&
-			IsSet(supported_opts, SUBOPT_STREAMING) &&
-			IsSet(opts->specified_opts, SUBOPT_STREAMING))
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-			/*- translator: both %s are strings of the form "option = value" */
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
 }
 
 /*
@@ -933,12 +914,6 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2d77456..5f14da2 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -145,7 +145,8 @@ logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_da
 }
 
 /*
- * The core functionality for logicalrep_write_prepare.
+ * The core functionality for logicalrep_write_prepare and
+ * logicalrep_write_stream_prepare.
  */
 static void
 logicalrep_write_prepare_common(StringInfo out, LogicalRepMsgType type,
@@ -188,7 +189,8 @@ logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 }
 
 /*
- * The core functionality for logicalrep_read_prepare.
+ * The core functionality for logicalrep_read_prepare and
+ * logicalrep_read_stream_prepare.
  */
 static void
 logicalrep_read_prepare_common(StringInfo in, char *msgtype,
@@ -209,6 +211,8 @@ logicalrep_read_prepare_common(StringInfo in, char *msgtype,
 		elog(ERROR, "end_lsn is not set in %s message", msgtype);
 	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
+	if (prepare_data->xid == InvalidTransactionId)
+		elog(ERROR, "invalid two-phase transaction ID in %s message", msgtype);
 
 	/* read gid (copy it into a pre-allocated buffer) */
 	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
@@ -340,6 +344,29 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	logicalrep_write_prepare_common(out, LOGICAL_REP_MSG_STREAM_PREPARE,
+									txn, prepare_lsn);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	logicalrep_read_prepare_common(in, "stream prepare", prepare_data);
+
+	return prepare_data->xid;
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3f499b1..5f2a38b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1054,6 +1054,63 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+	TransactionId xid;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	xid = logicalrep_read_stream_prepare(s, &prepare_data);
+
+	elog(DEBUG1, "received prepare for streamed transaction %u", xid);
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit)
+	 */
+	apply_spooled_messages(xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared. - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+	 */
+	apply_handle_prepare_internal(&prepare_data);
+
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info */
+	stream_cleanup_files(MyLogicalRepWorker->subid, xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1292,7 +1349,7 @@ apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -2358,6 +2415,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index e4314af..286119c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase
-		 * and streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 63de90d..c99a931 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -243,4 +244,10 @@ extern void logicalrep_write_stream_abort(StringInfo out, TransactionId xid,
 extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
 										 TransactionId *subxid);
 
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+													LogicalRepPreparedTxnData *prepare_data);
+
+
 #endif							/* LOGICAL_PROTO_H */
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 67f92b3..77b4437 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -282,27 +282,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 88743ab..fcb73a4 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -212,23 +212,26 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);
 
 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index d7cc999..650a77a 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -2,11 +2,14 @@
 # Copyright (c) 2021, PostgreSQL Global Development Group
 
 # Test cascading logical replication of 2PC.
+#
+# Includes tests for 2PC without streaming and for 2PC with streaming.
+#
 use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 27;
+use Test::More tests => 38;
 
 ###############################
 # Setup a cascade of pub/sub nodes.
@@ -17,20 +20,20 @@ use Test::More tests => 27;
 # node_A
 my $node_A = PostgresNode->new('node_A');
 $node_A->init(allows_streaming => 'logical');
-$node_A->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
 $node_A->start;
 # node_B
 my $node_B = PostgresNode->new('node_B');
 $node_B->init(allows_streaming => 'logical');
-$node_B->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
 $node_B->start;
 # node_C
 my $node_C = PostgresNode->new('node_C');
 $node_C->init(allows_streaming => 'logical');
-$node_C->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
 $node_C->start;
 
 # Create some pre-existing content on node_A
@@ -45,12 +48,29 @@ $node_B->safe_psql('postgres',
 $node_C->safe_psql('postgres',
 	"CREATE TABLE tab_full (a int PRIMARY KEY)");
 
+# Create some pre-existing content on node_A
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
 # Setup logical replication
 
+# -----------------------
+# 2PC NON-STREAMING TESTS
+# -----------------------
+
 # node_A (pub) -> node_B (sub)
 my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
 $node_A->safe_psql('postgres',
-	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full, test_tab");
 my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
@@ -61,7 +81,7 @@ $node_B->safe_psql('postgres',	"
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
 $node_B->safe_psql('postgres',
-	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full, test_tab");
 my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
@@ -203,6 +223,134 @@ is($result, qq(21), 'Rows committed are present on subscriber B');
 $result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
 is($result, qq(21), 'Rows committed are present on subscriber C');
 
+# ---------------------
+# 2PC + STREAMING TESTS
+# ---------------------
+
+my $oldpid_B = $node_A->safe_psql('postgres', "
+    SELECT pid FROM pg_stat_replication
+    WHERE application_name = '$appname_B';");
+my $oldpid_C = $node_B->safe_psql('postgres', "
+    SELECT pid FROM pg_stat_replication
+    WHERE application_name = '$appname_C';");
+
+# Setup logical replication (streaming = on)
+
+$node_B->safe_psql('postgres',	"
+	ALTER SUBSCRIPTION tap_sub_B
+	SET (streaming = on);");
+$node_C->safe_psql('postgres',	"
+	ALTER SUBSCRIPTION tap_sub_C
+	SET (streaming = on)");
+
+# Wait for subscribers to finish initialization
+
+$node_A->poll_query_until('postgres', "
+    SELECT pid != $oldpid_B FROM pg_stat_replication
+    WHERE application_name = '$appname_B';"
+) or die "Timed out while waiting for apply to restart";
+$node_B->poll_query_until('postgres', "
+    SELECT pid != $oldpid_C FROM pg_stat_replication
+    WHERE application_name = '$appname_C';"
+) or die "Timed out while waiting for apply to restart";
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. Cleanup from previous test leaving only 2 rows
+# 1. Insert one more row
+# 2. Record a SAVEPOINT
+# 3. Data is streamed using 2PC
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts
+# 5. Then COMMIT PREPARED
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1)
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..4ddfbf79
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,281 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 18;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgresNode->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_publisher->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgresNode->new('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf',
+	qq(max_prepared_transactions = 10));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# (Note: both publisher and subscriber crash/restart)
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber.
+# (the 3334 + inserted 1 = 3335)
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1
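
A quick cross-check of the row counts the tests above expect (3334, 3335 and 3). This is only a sketch of the arithmetic, assuming rows 1 and 2 already exist, rows 3..5000 are inserted, every row with a % 3 = 0 is then deleted, and the UPDATE leaves the count unchanged. The function name rows_after_2pc is made up for illustration.

```c
#include <assert.h>

/*
 * Rows left in test_tab once the prepared transaction's changes apply:
 * keys 1..max_a exist, then every key divisible by 3 is deleted.
 * (Hypothetical helper, used only to check the expected counts.)
 */
int
rows_after_2pc(int max_a)
{
    int count = 0;

    assert(max_a >= 2);

    for (int a = 1; a <= max_a; a++)
    {
        if (a % 3 != 0)
            count++;
    }
    return count;
}
```

With max_a = 5000 this gives 3334 for the plain COMMIT PREPARED case. The extra row (99999, 'foobar') is inserted outside the 2PC transaction, after its DELETE has already run, so the commit case expects 3334 + 1 = 3335 and the rollback case expects the original 2 + 1 = 3.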

#399Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#397)

On Thu, Jul 29, 2021 at 9:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jul 27, 2021 at 11:41 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v99*

v98-0001 --> split into v99-0001 + v99-0002

Pushed the first refactoring patch after making a few modifications, as below.
1.
- /* open the spool file for the committed transaction */
+ /* Open the spool file for the committed/prepared transaction */
changes_filename(path, MyLogicalRepWorker->subid, xid);

In the above comment, we don't need to say prepared. It can be done as
part of the second patch.

Updated comment in v100.

2.
+apply_handle_prepare_internal(LogicalRepPreparedTxnData *prepare_data, char *gid)

I don't think there is any need for this function to take gid as
input. It can compute the gid itself instead of having the callers do it.

OK.

3.
+static TransactionId
+logicalrep_read_prepare_common(StringInfo in, char *msgtype,
+                               LogicalRepPreparedTxnData *prepare_data)

I don't think the above function needs to return xid because it is
already present as part of prepare_data. Even if it is required for
some reason by the second patch, let's do it as part of that patch, but
I don't think it is required for the second patch.

OK.

4.
/*
- * Write PREPARE to the output stream.
+ * Common code for logicalrep_write_prepare and logicalrep_write_stream_prepare.
*/

Here and at another similar place, we don't need to refer to
logicalrep_write_stream_prepare as that is part of the second patch.

Updated comment in v100

A few comments on 0002 patch:
==========================
1.
+# ---------------------
+# 2PC + STREAMING TESTS
+# ---------------------
+
+# Setup logical replication (streaming = on)
+
+$node_B->safe_psql('postgres', "
+ ALTER SUBSCRIPTION tap_sub_B
+ SET (streaming = on);");
+
+$node_C->safe_psql('postgres', "
+ ALTER SUBSCRIPTION tap_sub_C
+ SET (streaming = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);

This is not the right way to determine if the new streaming option is
enabled on the publisher. Even if there is no restart of apply workers
(and walsender) after you have enabled the option, the above wait will
succeed. You need to do something like below as we are doing in
001_rep_changes.pl:

$oldpid = $node_publisher->safe_psql('postgres',
	"SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
);
$node_subscriber->safe_psql('postgres',
	"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
	"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
) or die "Timed out while waiting for apply to restart";

Fixed in v100 as suggested.

2.
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)

Here, in the comment, I don't see the need to say "uses same DDL ...".

Fixed in v100. Comment removed.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#400vignesh C
vignesh21@gmail.com
In reply to: Peter Smith (#398)

On Fri, Jul 30, 2021 at 9:32 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v100*

v99-0002 --> v100-0001

Differences:

* Rebased to HEAD @ today (needed because some recent commits [1][2] broke v99)

The patch applies cleanly, the tests pass, and the documentation looks good.
A few minor comments:
1) This blank line is not required:
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+
2) A few points end with a punctuation mark and a few don't; we can make
them consistent:
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
3) Similarly here:
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################

Regards,
Vignesh

#401Greg Nancarrow
gregn4422@gmail.com
In reply to: Peter Smith (#398)

On Fri, Jul 30, 2021 at 2:02 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v100*

v99-0002 --> v100-0001

A few minor comments:

(1) doc/src/sgml/protocol.sgml

In the following description, is the word "large" really needed? Also
"the message ... for a ... message" sounds a bit odd, as does
"two-phase prepare".

What about the following:

BEFORE:
+                Identifies the message as a two-phase prepare for a
large in-progress transaction message.
AFTER:
+                Identifies the message as a prepare for an
in-progress two-phase transaction.

(2) src/backend/replication/logical/worker.c

These two comments have a similar format, but one uses a full stop and the
other doesn't, which looks a bit odd since the lines are near each other.

* 1. Replay all the spooled operations - Similar code as for

* 2. Mark the transaction as prepared. - Similar code as for

(3) src/test/subscription/t/023_twophase_stream.pl

Shouldn't the following comment mention, for example, "with streaming"
or something to that effect?

# logical replication of 2PC test

Regards,
Greg Nancarrow
Fujitsu Australia

#402tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
In reply to: Peter Smith (#398)
RE: [HACKERS] logical decoding of two-phase transactions

On Friday, July 30, 2021 12:02 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v100*

v99-0002 --> v100-0001

Thanks for your patch. A few comments on the test file:

1. src/test/subscription/t/022_twophase_cascade.pl

1.1
I saw your test cases for "PREPARE / COMMIT PREPARED" and "PREPARE with a nested ROLLBACK TO SAVEPOINT", but didn't see cases for "PREPARE / ROLLBACK PREPARED". Is it needless or just missing?

1.2
+# check inserts are visible at subscriber(s).
+# All the streamed data (prior to the SAVEPOINT) should be rolled back.
+# (3, 'foobar') should be committed.

I think it should be (9999, 'foobar') here.

1.3
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+

It seems the test is not finished yet. We didn't check the value of 'result'. Besides, maybe we should also check node_C, right?

1.4
+$node_B->append_conf('postgresql.conf',	qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));

You see, the first line uses a TAB but the second line uses a space.
Also, we could use only one statement to append these two settings to run tests a bit faster. Thoughts?
Something like:

$node_B->append_conf(
	'postgresql.conf', qq(
max_prepared_transactions = 10
logical_decoding_work_mem = 64kB
));

Regards
Tang

#403Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#380)

On Wed, Jul 14, 2021 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 12, 2021 at 9:14 AM Peter Smith <smithpb2250@gmail.com> wrote:

Pushed.

As reported by Michael [1], there is one test failure related to this
commit. The failure is as below:

# Failed test 'transaction is prepared on subscriber'
# at t/021_twophase.pl line 324.
# got: '1'
# expected: '2'
# Looks like you failed 1 test of 24.
[12:14:02] t/021_twophase.pl ..................
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/24 subtests
[12:14:12] t/022_twophase_cascade.pl .......... ok 10542 ms ( 0.00 usr 0.00 sys + 2.03 cusr 0.61 csys = 2.64 CPU)
[12:14:31] t/100_bugs.pl ...................... ok 18550 ms ( 0.00 usr 0.00 sys + 3.85 cusr 1.36 csys = 5.21 CPU)
[12:14:31]

I think I know what's going wrong here. The corresponding test is:

# Now do a prepare on publisher and check that it IS replicated
$node_publisher->safe_psql('postgres', "
BEGIN;
INSERT INTO tab_copy VALUES (99);
PREPARE TRANSACTION 'mygid';");

$node_publisher->wait_for_catchup($appname_copy);

# Check that the transaction has been prepared on the subscriber, there will be 2
# prepared transactions for the 2 subscriptions.
$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
is($result, qq(2), 'transaction is prepared on subscriber');

Here, the test is expecting 2 prepared transactions corresponding to
two subscriptions but it waits for just one subscription via
appname_copy. It should wait for the second subscription using
$appname as well.
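
To make the condition concrete, here is a small standalone sketch, not PostgreSQL code. It assumes each subscription reports an LSN up to which it has applied changes; the helper name all_subscriptions_caught_up and the plain integer LSNs are made up for illustration. The count of 2 is only safe to check once every subscription has reached the publisher's position, not just one of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a PostgreSQL API: returns true only when
 * every subscription has applied changes up to at least target_lsn.
 */
bool
all_subscriptions_caught_up(const uint64_t *applied_lsn, size_t nsubs,
                            uint64_t target_lsn)
{
    assert(applied_lsn != NULL || nsubs == 0);

    for (size_t i = 0; i < nsubs; i++)
    {
        /* One lagging subscription is enough to make the check unsafe. */
        if (applied_lsn[i] < target_lsn)
            return false;
    }
    return true;
}
```

Waiting on only the first entry corresponds to the current test: it can return while the second subscription has not yet applied the PREPARE, which matches the "got 1, expected 2" failure.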

What do you think?

[1]: /messages/by-id/YQP02+5yLCIgmdJY@paquier.xyz

--
With Regards,
Amit Kapila.

#404Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#403)
1 attachment(s)

On Sat, Jul 31, 2021 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Here, the test is expecting 2 prepared transactions corresponding to
two subscriptions but it waits for just one subscription via
appname_copy. It should wait for the second subscription using
$appname as well.

What do you think?

I agree with this analysis. The test needs to wait for both
subscriptions to catch up.
Attached is a patch that addresses this issue.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v1-0001-Fix-possible-failure-in-021_twophase-tap-test.patch (application/octet-stream)
From c18e9f0b2d84140e3d57e4a893d0826d7dc7064c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <ajinc@fast.au.fujitsu.com>
Date: Sat, 31 Jul 2021 01:35:01 -0400
Subject: [PATCH v1] Fix possible failure in 021_twophase tap test

Change the test so that it waits for both subscriptions to catchup
before checking the prepared transaction count.
---
 src/test/subscription/t/021_twophase.pl | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl
index 903a771..19f0962 100644
--- a/src/test/subscription/t/021_twophase.pl
+++ b/src/test/subscription/t/021_twophase.pl
@@ -316,7 +316,9 @@ $node_publisher->safe_psql('postgres', "
     INSERT INTO tab_copy VALUES (99);
     PREPARE TRANSACTION 'mygid';");
 
+# Wait for both subscribers to catchup
 $node_publisher->wait_for_catchup($appname_copy);
+$node_publisher->wait_for_catchup($appname);
 
 # Check that the transaction has been prepared on the subscriber, there will be 2
 # prepared transactions for the 2 subscriptions.
-- 
1.8.3.1

#405Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#404)

On Sat, Jul 31, 2021 at 11:12 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Sat, Jul 31, 2021 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Here, the test is expecting 2 prepared transactions corresponding to
two subscriptions but it waits for just one subscription via
appname_copy. It should wait for the second subscription using
$appname as well.

What do you think?

I agree with this analysis. The test needs to wait for both
subscriptions to catch up.
Attached is a patch that addresses this issue.

LGTM. Unless Peter Smith has any comments or thinks otherwise, I'll
push this on Monday.

--
With Regards,
Amit Kapila.

#406Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#398)

On Fri, Jul 30, 2021 at 9:32 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v100*

A few minor comments:
1.
CREATE SUBSCRIPTION regress_testsub CONNECTION
'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
false, two_phase = true);

 \dRs+
+
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);

Spurious line addition.

2.
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+ logicalrep_read_prepare_common(in, "stream prepare", prepare_data);
+
+ return prepare_data->xid;
+}

There is no need to return TransactionId separately. The caller can
take it from prepare_data, if required.

3.
extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
TransactionId *subxid);

+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+ LogicalRepPreparedTxnData *prepare_data);
+
+

Keep the order of declarations the same as that of their definitions in
proto.c, which means moving these after logicalrep_read_rollback_prepared(),
and be careful about extra blank lines.
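
On point 2, a minimal standalone sketch (not the patch's code) of a reader that returns nothing extra and leaves the xid in the output struct for the caller. The field order follows the Stream Prepare layout in the patch's protocol.sgml changes: flags, prepare LSN, end LSN, prepare timestamp, xid, gid, with integers in network byte order. The names SketchPreparedTxnData and sketch_read_stream_prepare, the 200-byte gid buffer, and the 0/-1 return code are assumptions for illustration; the real reader uses the pq_getmsg* routines and reports errors with elog(ERROR).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GID_SIZE 200     /* assumed buffer size for this sketch */

typedef struct SketchPreparedTxnData
{
    uint64_t prepare_lsn;
    uint64_t end_lsn;
    int64_t  prepare_time;      /* microseconds since 2000-01-01 */
    uint32_t xid;
    char     gid[SKETCH_GID_SIZE];
} SketchPreparedTxnData;

/* Read an nbytes-wide big-endian (network order) unsigned integer. */
static uint64_t
read_be(const unsigned char *p, int nbytes)
{
    uint64_t v = 0;

    for (int i = 0; i < nbytes; i++)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Parse the body of a Stream Prepare message, i.e. everything after the
 * 'p' type byte. Returns 0 on success and -1 if the message is malformed.
 * The xid is not returned separately; callers read it from *out.
 */
int
sketch_read_stream_prepare(const unsigned char *buf, size_t len,
                           SketchPreparedTxnData *out)
{
    const size_t fixed = 1 + 8 + 8 + 8 + 4;   /* flags, 2 LSNs, time, xid */
    const unsigned char *gid;
    const unsigned char *nul;
    size_t gidlen;

    assert(buf != NULL && out != NULL);

    if (len <= fixed || buf[0] != 0)          /* flags must currently be 0 */
        return -1;

    out->prepare_lsn = read_be(buf + 1, 8);
    out->end_lsn = read_be(buf + 9, 8);
    out->prepare_time = (int64_t) read_be(buf + 17, 8);
    out->xid = (uint32_t) read_be(buf + 25, 4);

    /* Mirror the patch's sanity checks: end_lsn and xid must be set. */
    if (out->end_lsn == 0 || out->xid == 0)
        return -1;

    /* The gid is a NUL-terminated string that must fit the buffer. */
    gid = buf + fixed;
    nul = memchr(gid, 0, len - fixed);
    if (nul == NULL)
        return -1;
    gidlen = (size_t) (nul - gid);
    if (gidlen >= SKETCH_GID_SIZE)
        return -1;
    memcpy(out->gid, gid, gidlen + 1);
    return 0;
}
```

A caller that needs the transaction id would then use data.xid after a successful call, which is the same shape as apply_handle_stream_prepare() reading prepare_data.xid.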

--
With Regards,
Amit Kapila.

#407vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#404)

On Sat, Jul 31, 2021 at 11:12 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Sat, Jul 31, 2021 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Here, the test is expecting 2 prepared transactions corresponding to
two subscriptions but it waits for just one subscription via
appname_copy. It should wait for the second subscription using
$appname as well.

What do you think?

I agree with this analysis. The test needs to wait for both
subscriptions to catch up.
Attached is a patch that addresses this issue.

The changes look good to me.

Regards,
Vignesh

#408Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#407)

On Sun, Aug 1, 2021 at 3:05 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, Jul 31, 2021 at 11:12 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Sat, Jul 31, 2021 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Here, the test is expecting 2 prepared transactions corresponding to
two subscriptions but it waits for just one subscription via
appname_copy. It should wait for the second subscription using
$appname as well.

What do you think?

I agree with this analysis. The test needs to wait for both
subscriptions to catch up.
Attached is a patch that addresses this issue.

The changes look good to me.

The patch to the test code posted by Ajin LGTM also.

I applied the patch and re-ran the TAP subscription tests. All OK.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#409Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#399)
1 attachment(s)

Attachments:

v101-0001-Add-prepare-API-support-for-streaming-transacti.patch (application/octet-stream)
From e4bb53136bbb537ecd1dd7f10ac0d9c7b752da65 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Mon, 2 Aug 2021 16:43:17 +1000
Subject: [PATCH v101 1/1] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/logicaldecoding.sgml               |  11 +-
 doc/src/sgml/protocol.sgml                      |  66 +++++-
 doc/src/sgml/ref/create_subscription.sgml       |  10 -
 src/backend/commands/subscriptioncmds.c         |  25 ---
 src/backend/replication/logical/proto.c         |  29 ++-
 src/backend/replication/logical/worker.c        |  62 +++++-
 src/backend/replication/pgoutput/pgoutput.c     |  33 ++-
 src/include/replication/logicalproto.h          |   8 +-
 src/test/regress/expected/subscription.out      |  24 +-
 src/test/regress/sql/subscription.sql           |  11 +-
 src/test/subscription/t/022_twophase_cascade.pl | 179 ++++++++++++++-
 src/test/subscription/t/023_twophase_stream.pl  | 284 ++++++++++++++++++++++++
 12 files changed, 663 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 89b8090..0d0de29 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1199,6 +1199,9 @@ OutputPluginWrite(ctx, true);
     <function>stream_abort_cb</function>, <function>stream_commit_cb</function>
     and <function>stream_change_cb</function>) and two optional callbacks
     (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+    Also, if streaming of two-phase commands is to be supported, then additional
+    callbacks must be provided. (See <xref linkend="logicaldecoding-two-phase-commits"/>
+    for details).
    </para>
 
    <para>
@@ -1237,7 +1240,13 @@ stream_start_cb(...);   &lt;-- start of second block of changes
   stream_change_cb(...);
 stream_stop_cb(...);    &lt;-- end of second block of changes
 
-stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+
+[a. when using normal commit]
+stream_commit_cb(...);    &lt;-- commit of the streamed transaction
+
+[b. when using two-phase commit]
+stream_prepare_cb(...);   &lt;-- prepare the streamed transaction
+commit_prepared_cb(...);  &lt;-- commit of the prepared transaction
 </programlisting>
    </para>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 9843953..1b27273 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7410,7 +7410,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7673,6 +7673,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare for a streamed transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8</term>
+<listitem><para>
+                Flags; currently unused (must be 0).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1433905..702934e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 22ae982..5157f44 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -335,25 +335,6 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (opts->twophase &&
-		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
-		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
-	{
-		if (opts->streaming &&
-			IsSet(supported_opts, SUBOPT_STREAMING) &&
-			IsSet(opts->specified_opts, SUBOPT_STREAMING))
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-			/*- translator: both %s are strings of the form "option = value" */
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
 }
 
 /*
@@ -933,12 +914,6 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2d77456..52b65e9 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -145,7 +145,8 @@ logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_da
 }
 
 /*
- * The core functionality for logicalrep_write_prepare.
+ * The core functionality for logicalrep_write_prepare and
+ * logicalrep_write_stream_prepare.
  */
 static void
 logicalrep_write_prepare_common(StringInfo out, LogicalRepMsgType type,
@@ -188,7 +189,8 @@ logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 }
 
 /*
- * The core functionality for logicalrep_read_prepare.
+ * The core functionality for logicalrep_read_prepare and
+ * logicalrep_read_stream_prepare.
  */
 static void
 logicalrep_read_prepare_common(StringInfo in, char *msgtype,
@@ -209,6 +211,8 @@ logicalrep_read_prepare_common(StringInfo in, char *msgtype,
 		elog(ERROR, "end_lsn is not set in %s message", msgtype);
 	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
+	if (prepare_data->xid == InvalidTransactionId)
+		elog(ERROR, "invalid two-phase transaction ID in %s message", msgtype);
 
 	/* read gid (copy it into a pre-allocated buffer) */
 	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
@@ -340,6 +344,27 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	logicalrep_write_prepare_common(out, LOGICAL_REP_MSG_STREAM_PREPARE,
+									txn, prepare_lsn);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+void
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	logicalrep_read_prepare_common(in, "stream prepare", prepare_data);
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 249de80..2389e37 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1053,6 +1053,62 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	logicalrep_read_stream_prepare(s, &prepare_data);
+
+	elog(DEBUG1, "received prepare for streamed transaction %u", prepare_data.xid);
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit).
+	 */
+	apply_spooled_messages(prepare_data.xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare).
+	 */
+	apply_handle_prepare_internal(&prepare_data);
+
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info. */
+	stream_cleanup_files(MyLogicalRepWorker->subid, prepare_data.xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1291,7 +1347,7 @@ apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -2357,6 +2413,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index e4314af..286119c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase
-		 * and streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 63de90d..2e29513 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -196,7 +197,10 @@ extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN
 											   TimestampTz prepare_time);
 extern void logicalrep_read_rollback_prepared(StringInfo in,
 											  LogicalRepRollbackPreparedTxnData *rollback_data);
-
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern void logicalrep_read_stream_prepare(StringInfo in,
+										   LogicalRepPreparedTxnData *prepare_data);
 
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 67f92b3..77b4437 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -282,27 +282,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 88743ab..d42104c 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -215,20 +215,21 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index d7cc999..a47c62d 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -2,11 +2,14 @@
 # Copyright (c) 2021, PostgreSQL Global Development Group
 
 # Test cascading logical replication of 2PC.
+#
+# Includes tests for options 2PC (not-streaming) and also for 2PC (streaming).
+#
 use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 27;
+use Test::More tests => 41;
 
 ###############################
 # Setup a cascade of pub/sub nodes.
@@ -17,20 +20,26 @@ use Test::More tests => 27;
 # node_A
 my $node_A = PostgresNode->new('node_A');
 $node_A->init(allows_streaming => 'logical');
-$node_A->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+logical_decoding_work_mem = 64kB
+));
 $node_A->start;
 # node_B
 my $node_B = PostgresNode->new('node_B');
 $node_B->init(allows_streaming => 'logical');
-$node_B->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+logical_decoding_work_mem = 64kB
+));
 $node_B->start;
 # node_C
 my $node_C = PostgresNode->new('node_C');
 $node_C->init(allows_streaming => 'logical');
-$node_C->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+logical_decoding_work_mem = 64kB
+));
 $node_C->start;
 
 # Create some pre-existing content on node_A
@@ -45,12 +54,29 @@ $node_B->safe_psql('postgres',
 $node_C->safe_psql('postgres',
 	"CREATE TABLE tab_full (a int PRIMARY KEY)");
 
+# Create some pre-existing content on node_A (for streaming tests)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
 # Setup logical replication
 
+# -----------------------
+# 2PC NON-STREAMING TESTS
+# -----------------------
+
 # node_A (pub) -> node_B (sub)
 my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
 $node_A->safe_psql('postgres',
-	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full, test_tab");
 my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
@@ -61,7 +87,7 @@ $node_B->safe_psql('postgres',	"
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
 $node_B->safe_psql('postgres',
-	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full, test_tab");
 my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
@@ -203,6 +229,141 @@ is($result, qq(21), 'Rows committed are present on subscriber B');
 $result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
 is($result, qq(21), 'Rows committed are present on subscriber C');
 
+# ---------------------
+# 2PC + STREAMING TESTS
+# ---------------------
+
+my $oldpid_B = $node_A->safe_psql('postgres', "
+	SELECT pid FROM pg_stat_replication
+	WHERE application_name = '$appname_B';");
+my $oldpid_C = $node_B->safe_psql('postgres', "
+	SELECT pid FROM pg_stat_replication
+	WHERE application_name = '$appname_C';");
+
+# Setup logical replication (streaming = on)
+
+$node_B->safe_psql('postgres', "
+	ALTER SUBSCRIPTION tap_sub_B
+	SET (streaming = on);");
+$node_C->safe_psql('postgres', "
+	ALTER SUBSCRIPTION tap_sub_C
+	SET (streaming = on)");
+
+# Wait for subscribers to finish initialization
+
+$node_A->poll_query_until('postgres', "
+	SELECT pid != $oldpid_B FROM pg_stat_replication
+	WHERE application_name = '$appname_B';"
+) or die "Timed out while waiting for apply to restart";
+$node_B->poll_query_until('postgres', "
+	SELECT pid != $oldpid_C FROM pg_stat_replication
+	WHERE application_name = '$appname_C';"
+) or die "Timed out while waiting for apply to restart";
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED.
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. Cleanup from previous test leaving only 2 rows.
+# 1. Insert one more row.
+# 2. Record a SAVEPOINT.
+# 3. Data is streamed using 2PC.
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts.
+# 5. Then COMMIT PREPARED.
+#
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1).
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (prior to the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows committed are present on subscriber C');
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c72c6b5
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,284 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 18;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgresNode->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+logical_decoding_work_mem = 64kB
+));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgresNode->new('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED.
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# Note: both publisher and subscriber do crash/restart.
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber
+# (the original 2 + inserted 1).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber
+# (the 3334 + inserted 1 = 3335).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1

#410Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#406)

On Sat, Jul 31, 2021 at 9:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 30, 2021 at 9:32 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v100*

Few minor comments:
1.
CREATE SUBSCRIPTION regress_testsub CONNECTION
'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
false, two_phase = true);

\dRs+
+
--fail - alter of two_phase option not supported.
ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);

Spurious line addition.

OK. Fixed in v101.

2.
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in,
LogicalRepPreparedTxnData *prepare_data)
+{
+ logicalrep_read_prepare_common(in, "stream prepare", prepare_data);
+
+ return prepare_data->xid;
+}

There is no need to return TransactionId separately. The caller can
use from prepare_data, if required.

OK. Modified in v101
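
To illustrate the shape of that change, here is a minimal self-contained sketch. It is not the v101 code: PreparedTxnSketch, read_stream_prepare_sketch and the simplified "4-byte xid followed by a NUL-terminated gid" layout are stand-ins invented for this example (the real message carries more fields and is read through a StringInfo). The only point is that the reader returns void and the caller takes the xid from the struct it passed in.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

/* Hypothetical, cut-down stand-in for LogicalRepPreparedTxnData. */
typedef struct PreparedTxnSketch
{
	TransactionId xid;
	char		gid[200];
} PreparedTxnSketch;

/*
 * Hypothetical reader: fills *prepare_data and returns nothing.
 * Assumed layout (simplified): 4-byte big-endian xid, then a
 * NUL-terminated gid string.
 */
void
read_stream_prepare_sketch(const unsigned char *in,
						   PreparedTxnSketch *prepare_data)
{
	assert(in != NULL && prepare_data != NULL);

	prepare_data->xid = ((uint32_t) in[0] << 24) |
		((uint32_t) in[1] << 16) |
		((uint32_t) in[2] << 8) |
		(uint32_t) in[3];

	/* Copy the gid, always leaving room for the terminator. */
	strncpy(prepare_data->gid, (const char *) in + 4,
			sizeof(prepare_data->gid) - 1);
	prepare_data->gid[sizeof(prepare_data->gid) - 1] = '\0';
}
```

A caller that needs the xid simply reads prepare_data.xid after the call, so a separate TransactionId return value would carry no extra information.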

3.
extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
TransactionId *subxid);

+extern void logicalrep_write_stream_prepare(StringInfo out,
ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+ LogicalRepPreparedTxnData *prepare_data);
+
+

Keep the order of declarations the same as its definitions in proto.c
which means move these after logicalrep_read_rollback_prepared() and
be careful about extra blank lines.

OK. Reordered in v101.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#411Peter Smith
smithpb2250@gmail.com
In reply to: tanghy.fnst@fujitsu.com (#402)

On Fri, Jul 30, 2021 at 6:25 PM tanghy.fnst@fujitsu.com
<tanghy.fnst@fujitsu.com> wrote:

On Friday, July 30, 2021 12:02 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v100*

v99-0002 --> v100-0001

Thanks for your patch. A few comments on the test file:

1. src/test/subscription/t/022_twophase_cascade.pl

1.1
I saw your test cases for "PREPARE / COMMIT PREPARED" and "PREPARE with a nested ROLLBACK TO SAVEPOINT", but didn't see one for "PREPARE / ROLLBACK PREPARED". Is it unnecessary, or just missing?

Yes, that test used to exist but it was removed in response to a
previous review (see [1] comment #10; Amit said there were too many
tests).

1.2
+# check inserts are visible at subscriber(s).
+# All the streamed data (prior to the SAVEPOINT) should be rolled back.
+# (3, 'foobar') should be committed.

I think it should be (9999, 'foobar') here.

Good catch. Fixed in v101.

1.3
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+

It seems the test is not finished yet. We didn't check the value of 'result'. Besides, maybe we should also check node_C, right?

Oops. Thanks for finding this! Fixed in v101 by adding the missing tests.
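
As a cross-check on the expected values these count(*) tests compare against: the streamed transaction inserts keys 3..5000 on top of the two pre-existing rows, so keys 1..5000 exist, and it then deletes every key divisible by 3 (the UPDATE does not change the row count). A small sketch of that arithmetic follows; the helper name is hypothetical and not part of the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of rows left in test_tab when keys
 * 1..max_key exist and "DELETE ... WHERE mod(a,3) = 0" has run.
 */
int
rows_after_streamed_xact(int max_key)
{
	int			rows = 0;

	assert(max_key >= 0);

	for (int a = 1; a <= max_key; a++)
	{
		if (a % 3 != 0)			/* row survives the DELETE */
			rows++;
	}
	return rows;
}
```

For max_key = 5000 this is 5000 - 1666 = 3334, the value the COMMIT PREPARED checks expect. In the savepoint test all of that work is rolled back to sp_inner, leaving the 2 original rows plus (9999, 'foobar'), hence the expected count of 3.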

1.4
+$node_B->append_conf('postgresql.conf',        qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));

You see, the first line uses a TAB but the second line uses a space.
Also, we could use only one statement to append these two settings to run tests a bit faster. Thoughts?
Something like:

$node_B->append_conf(
'postgresql.conf', qq(
max_prepared_transactions = 10
logical_decoding_work_mem = 64kB
));

OK. In v101 I changed the config as you suggested for both the 022 and
023 TAP tests.

------
[1]: /messages/by-id/CAHut+Pts_bWx_RrXu+YwbiJva33nTROoQQP5H4pVrF+NcCMkRA@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.

#412Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#400)

On Fri, Jul 30, 2021 at 3:18 PM vignesh C <vignesh21@gmail.com> wrote:

On Fri, Jul 30, 2021 at 9:32 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v100*

v99-0002 --> v100-0001

Differences:

* Rebased to HEAD @ today (needed because some recent commits [1][2] broke v99)

The patch applies neatly, tests pass and documentation looks good.
A few minor comments:
1) This blank line is not required:
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION
'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
false, streaming = true, two_phase = true);
+

Fixed in v101.

2) A few points have a punctuation mark and a few don't; we can make it
consistent:
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################

Fixed in v101.

3) similarly here too:
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################

Fixed in v101.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#413Peter Smith
smithpb2250@gmail.com
In reply to: Greg Nancarrow (#401)

On Fri, Jul 30, 2021 at 4:33 PM Greg Nancarrow <gregn4422@gmail.com> wrote:

On Fri, Jul 30, 2021 at 2:02 PM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v100*

v99-0002 --> v100-0001

A few minor comments:

(1) doc/src/sgml/protocol.sgml

In the following description, is the word "large" really needed? Also
"the message ... for a ... message" sounds a bit odd, as does
"two-phase prepare".

What about the following:

BEFORE:
+                Identifies the message as a two-phase prepare for a
large in-progress transaction message.
AFTER:
+                Identifies the message as a prepare for an
in-progress two-phase transaction.

Updated in v101.

The other nearby messages refer to a “streamed
transaction” so I’ve changed this to say “Identifies the message as a
two-phase prepare for a streamed transaction message.” (e.g. compare
this text with the existing similar text for ‘P’).

BTW, I agree with you that "the message ... for a ... message" seems
odd; it was written in this way only to be consistent with existing
documentation, which all uses the same odd phrasing.

(2) src/backend/replication/logical/worker.c

Similar format comment, but one uses a full-stop and the other
doesn't, looks a bit odd, since the lines are near each other.

* 1. Replay all the spooled operations - Similar code as for

* 2. Mark the transaction as prepared. - Similar code as for

Updated in v101 to make the comments consistent.

(3) src/test/subscription/t/023_twophase_stream.pl

Shouldn't the following comment mention, for example, "with streaming"
or something to that effect?

# logical replication of 2PC test

Fixed as suggested in v101.

------
Kind Regards,
Peter Smith.
Fujitsu Australia.

#414Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#408)

On Sun, Aug 1, 2021 at 3:51 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Sun, Aug 1, 2021 at 3:05 AM vignesh C <vignesh21@gmail.com> wrote:

On Sat, Jul 31, 2021 at 11:12 AM Ajin Cherian <itsajin@gmail.com> wrote:

On Sat, Jul 31, 2021 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Here, the test is expecting 2 prepared transactions corresponding to
two subscriptions but it waits for just one subscription via
appname_copy. It should wait for the second subscription using
$appname as well.

What do you think?

I agree with this analysis. The test needs to wait for both
subscriptions to catch up.
Attached is a patch that addresses this issue.

The changes look good to me.

The patch to the test code posted by Ajin LGTM also.

Pushed.

--
With Regards,
Amit Kapila.

#415Peter Smith
smithpb2250@gmail.com
In reply to: Peter Smith (#409)
1 attachment(s)

Please find attached the latest patch set v102*

Differences:

* Rebased to HEAD @ today.

* This is a documentation change only. A recent commit [1] has changed
the documentation style for the message formats slightly to annotate
the data types. For consistency, the same style change needs to be
adopted for the newly added message of this patch. This same change
also finally addresses some old review comments [2] from Vignesh.

----
[1]: https://github.com/postgres/postgres/commit/a5cb4f9829fbfd68655543d2d371a18a8eb43b84
[2]: /messages/by-id/CALDaNm3U4fGxTnQfaT1TqUkgX5c0CSDvmW12Bfksis8zB_XinA@mail.gmail.com
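
For readers following the message-format change, here is a minimal
standalone sketch (not code from the patch) of how a client might decode
the Stream Prepare message using the annotated field types from
protocol.sgml: Byte1('p'), Int8 flags, two Int64 LSNs, an Int64 timestamp,
an Int32 xid, and a NUL-terminated GID string. The struct name, the
function names and the 200-byte gid buffer are assumptions made for
illustration only. Integers are assumed to be in network byte order, as
elsewhere in the protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the decoded fields (not a PostgreSQL type). */
typedef struct StreamPrepareMsg
{
	uint8_t		flags;			/* Int8(0): flags, currently unused */
	uint64_t	prepare_lsn;	/* Int64 (XLogRecPtr): LSN of the prepare */
	uint64_t	end_lsn;		/* Int64 (XLogRecPtr): end LSN of the prepare */
	int64_t		prepare_time;	/* Int64 (TimestampTz): usec since 2000-01-01 */
	uint32_t	xid;			/* Int32 (TransactionId) */
	char		gid[200];		/* String: GID (buffer size is an assumption) */
} StreamPrepareMsg;

/* Read an unsigned big-endian integer of nbytes (at most 8) bytes. */
static uint64_t
read_be(const uint8_t *p, size_t nbytes)
{
	uint64_t	v = 0;

	for (size_t i = 0; i < nbytes; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Decode one Stream Prepare message from buf[0..len).
 * Returns 0 on success, or -1 if the buffer is malformed.
 */
static int
decode_stream_prepare(const uint8_t *buf, size_t len, StreamPrepareMsg *msg)
{
	/* type byte + flags + two LSNs + timestamp + xid */
	const size_t fixed = 1 + 1 + 8 + 8 + 8 + 4;

	assert(msg != NULL);

	if (len <= fixed || buf[0] != 'p')
		return -1;

	/* The GID must be NUL-terminated inside the buffer and must fit. */
	const uint8_t *gid = buf + fixed;
	const uint8_t *nul = memchr(gid, '\0', len - fixed);

	if (nul == NULL || (size_t) (nul - gid) >= sizeof(msg->gid))
		return -1;

	msg->flags = buf[1];
	msg->prepare_lsn = read_be(buf + 2, 8);
	msg->end_lsn = read_be(buf + 10, 8);
	msg->prepare_time = (int64_t) read_be(buf + 18, 8);
	msg->xid = (uint32_t) read_be(buf + 26, 4);

	/* Reject an invalid (zero) xid, like the check the patch adds. */
	if (msg->xid == 0)
		return -1;

	memcpy(msg->gid, gid, (size_t) (nul - gid) + 1);
	return 0;
}
```

A caller would pass the raw message bytes, starting at the 'p' type byte,
and read the fields from the struct when the function returns 0.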

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v102-0001-Add-prepare-API-support-for-streaming-transacti.patch (application/octet-stream)
From 5e874e60ac788aa15a1a87835e81ad1cfcb3b3b7 Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 3 Aug 2021 10:27:05 +1000
Subject: [PATCH v102] Add prepare API support for streaming transactions.

* Permits the combination of "streaming" and "two_phase" subscription options.

* Adds the prepare API for streaming transactions which will apply the changes
accumulated in the spool-file at prepare time.

* Adds new subscription TAP tests, and new subscription.sql regression tests.

* Updates PG documentation.
---
 doc/src/sgml/logicaldecoding.sgml               |  11 +-
 doc/src/sgml/protocol.sgml                      |  66 +++++-
 doc/src/sgml/ref/create_subscription.sgml       |  10 -
 src/backend/commands/subscriptioncmds.c         |  25 ---
 src/backend/replication/logical/proto.c         |  29 ++-
 src/backend/replication/logical/worker.c        |  62 +++++-
 src/backend/replication/pgoutput/pgoutput.c     |  33 ++-
 src/include/replication/logicalproto.h          |   8 +-
 src/test/regress/expected/subscription.out      |  24 +-
 src/test/regress/sql/subscription.sql           |  11 +-
 src/test/subscription/t/022_twophase_cascade.pl | 179 ++++++++++++++-
 src/test/subscription/t/023_twophase_stream.pl  | 284 ++++++++++++++++++++++++
 12 files changed, 663 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 89b8090..0d0de29 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1199,6 +1199,9 @@ OutputPluginWrite(ctx, true);
     <function>stream_abort_cb</function>, <function>stream_commit_cb</function>
     and <function>stream_change_cb</function>) and two optional callbacks
     (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+    Also, if streaming of two-phase commands is to be supported, then additional
+    callbacks must be provided. (See <xref linkend="logicaldecoding-two-phase-commits"/>
+    for details).
    </para>
 
    <para>
@@ -1237,7 +1240,13 @@ stream_start_cb(...);   &lt;-- start of second block of changes
   stream_change_cb(...);
 stream_stop_cb(...);    &lt;-- end of second block of changes
 
-stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+
+[a. when using normal commit]
+stream_commit_cb(...);    &lt;-- commit of the streamed transaction
+
+[b. when using two-phase commit]
+stream_prepare_cb(...);   &lt;-- prepare the streamed transaction
+commit_prepared_cb(...);  &lt;-- commit of the prepared transaction
 </programlisting>
    </para>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 991994d..3001168 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7411,7 +7411,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7714,6 +7714,70 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase prepare for a streamed transaction message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int8(0)</term>
+<listitem><para>
+                Flags; currently unused.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64 (XLogRecPtr)</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64 (XLogRecPtr)</term>
+<listitem><para>
+                The end LSN of the prepare transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64 (TimestampTz)</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int32 (TransactionId)</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 1433905..702934e 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 22ae982..5157f44 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -335,25 +335,6 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (opts->twophase &&
-		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
-		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
-	{
-		if (opts->streaming &&
-			IsSet(supported_opts, SUBOPT_STREAMING) &&
-			IsSet(opts->specified_opts, SUBOPT_STREAMING))
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-			/*- translator: both %s are strings of the form "option = value" */
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
 }
 
 /*
@@ -933,12 +914,6 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2d77456..52b65e9 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -145,7 +145,8 @@ logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_da
 }
 
 /*
- * The core functionality for logicalrep_write_prepare.
+ * The core functionality for logicalrep_write_prepare and
+ * logicalrep_write_stream_prepare.
  */
 static void
 logicalrep_write_prepare_common(StringInfo out, LogicalRepMsgType type,
@@ -188,7 +189,8 @@ logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 }
 
 /*
- * The core functionality for logicalrep_read_prepare.
+ * The core functionality for logicalrep_read_prepare and
+ * logicalrep_read_stream_prepare.
  */
 static void
 logicalrep_read_prepare_common(StringInfo in, char *msgtype,
@@ -209,6 +211,8 @@ logicalrep_read_prepare_common(StringInfo in, char *msgtype,
 		elog(ERROR, "end_lsn is not set in %s message", msgtype);
 	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
+	if (prepare_data->xid == InvalidTransactionId)
+		elog(ERROR, "invalid two-phase transaction ID in %s message", msgtype);
 
 	/* read gid (copy it into a pre-allocated buffer) */
 	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
@@ -340,6 +344,27 @@ logicalrep_read_rollback_prepared(StringInfo in,
 }
 
 /*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	logicalrep_write_prepare_common(out, LOGICAL_REP_MSG_STREAM_PREPARE,
+									txn, prepare_lsn);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+void
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	logicalrep_read_prepare_common(in, "stream prepare", prepare_data);
+}
+
+/*
  * Write ORIGIN to the output stream.
  */
 void
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 249de80..2389e37 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1053,6 +1053,62 @@ apply_handle_rollback_prepared(StringInfo s)
 }
 
 /*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	logicalrep_read_stream_prepare(s, &prepare_data);
+
+	elog(DEBUG1, "received prepare for streamed transaction %u", prepare_data.xid);
+
+	/*
+	 * 1. Replay all the spooled operations - Similar code as for
+	 * apply_handle_stream_commit (i.e. non two-phase stream commit).
+	 */
+	apply_spooled_messages(prepare_data.xid, prepare_data.prepare_lsn);
+
+	/*
+	 * 2. Mark the transaction as prepared - Similar code as for
+	 * apply_handle_prepare (i.e. two-phase non-streamed prepare).
+	 */
+	apply_handle_prepare_internal(&prepare_data);
+
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info. */
+	stream_cleanup_files(MyLogicalRepWorker->subid, prepare_data.xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
+/*
  * Handle ORIGIN message.
  *
  * TODO, support tracking of multiple origins
@@ -1291,7 +1347,7 @@ apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -2357,6 +2413,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index e4314af..286119c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase
-		 * and streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1030,6 +1021,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 }
 
 /*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
+/*
  * Initialize the relation schema sync cache for a decoding session.
  *
  * The hash table is destroyed at the end of a decoding session. While
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 63de90d..2e29513 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -196,7 +197,10 @@ extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN
 											   TimestampTz prepare_time);
 extern void logicalrep_read_rollback_prepared(StringInfo in,
 											  LogicalRepRollbackPreparedTxnData *rollback_data);
-
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern void logicalrep_read_stream_prepare(StringInfo in,
+										   LogicalRepPreparedTxnData *prepare_data);
 
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 67f92b3..77b4437 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -282,27 +282,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 88743ab..d42104c 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -215,20 +215,21 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index d7cc999..a47c62d 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -2,11 +2,14 @@
 # Copyright (c) 2021, PostgreSQL Global Development Group
 
 # Test cascading logical replication of 2PC.
+#
+# Includes tests for options 2PC (not-streaming) and also for 2PC (streaming).
+#
 use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 27;
+use Test::More tests => 41;
 
 ###############################
 # Setup a cascade of pub/sub nodes.
@@ -17,20 +20,26 @@ use Test::More tests => 27;
 # node_A
 my $node_A = PostgresNode->new('node_A');
 $node_A->init(allows_streaming => 'logical');
-$node_A->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+logical_decoding_work_mem = 64kB
+));
 $node_A->start;
 # node_B
 my $node_B = PostgresNode->new('node_B');
 $node_B->init(allows_streaming => 'logical');
-$node_B->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+logical_decoding_work_mem = 64kB
+));
 $node_B->start;
 # node_C
 my $node_C = PostgresNode->new('node_C');
 $node_C->init(allows_streaming => 'logical');
-$node_C->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+logical_decoding_work_mem = 64kB
+));
 $node_C->start;
 
 # Create some pre-existing content on node_A
@@ -45,12 +54,29 @@ $node_B->safe_psql('postgres',
 $node_C->safe_psql('postgres',
 	"CREATE TABLE tab_full (a int PRIMARY KEY)");
 
+# Create some pre-existing content on node_A (for streaming tests)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
 # Setup logical replication
 
+# -----------------------
+# 2PC NON-STREAMING TESTS
+# -----------------------
+
 # node_A (pub) -> node_B (sub)
 my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
 $node_A->safe_psql('postgres',
-	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full, test_tab");
 my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
@@ -61,7 +87,7 @@ $node_B->safe_psql('postgres',	"
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
 $node_B->safe_psql('postgres',
-	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full, test_tab");
 my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
@@ -203,6 +229,141 @@ is($result, qq(21), 'Rows committed are present on subscriber B');
 $result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
 is($result, qq(21), 'Rows committed are present on subscriber C');
 
+# ---------------------
+# 2PC + STREAMING TESTS
+# ---------------------
+
+my $oldpid_B = $node_A->safe_psql('postgres', "
+	SELECT pid FROM pg_stat_replication
+	WHERE application_name = '$appname_B';");
+my $oldpid_C = $node_B->safe_psql('postgres', "
+	SELECT pid FROM pg_stat_replication
+	WHERE application_name = '$appname_C';");
+
+# Setup logical replication (streaming = on)
+
+$node_B->safe_psql('postgres', "
+	ALTER SUBSCRIPTION tap_sub_B
+	SET (streaming = on);");
+$node_C->safe_psql('postgres', "
+	ALTER SUBSCRIPTION tap_sub_C
+	SET (streaming = on)");
+
+# Wait for subscribers to finish initialization
+
+$node_A->poll_query_until('postgres', "
+	SELECT pid != $oldpid_B FROM pg_stat_replication
+	WHERE application_name = '$appname_B';"
+) or die "Timed out while waiting for apply to restart";
+$node_B->poll_query_until('postgres', "
+	SELECT pid != $oldpid_C FROM pg_stat_replication
+	WHERE application_name = '$appname_C';"
+) or die "Timed out while waiting for apply to restart";
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED.
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. Cleanup from previous test leaving only 2 rows.
+# 1. Insert one more row.
+# 2. Record a SAVEPOINT.
+# 3. Data is streamed using 2PC.
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts.
+# 5. Then COMMIT PREPARED.
+#
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1).
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows committed are present on subscriber C');
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000..c72c6b5
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,284 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 18;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgresNode->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+logical_decoding_work_mem = 64kB
+));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgresNode->new('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED.
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# Note: both publisher and subscriber do crash/restart.
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit after the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber
+# (the original 2 + inserted 1).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber
+# (the 3334 + inserted 1 = 3335).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
1.8.3.1
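
The expected counts in the streaming tests above (3334, and 3335 once the extra row is inserted) follow directly from the three statements in each prepared transaction. As a rough sanity check, here is a minimal sketch in C; it is not part of the patch, and the helper name `expected_rows` is made up for illustration. It assumes keys 1 and 2 already exist, as they do at the start of each test.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): count the rows left in
 * test_tab after the test transaction, assuming keys 1 and 2 already
 * exist, keys first..last are inserted, and every key divisible by 3 is
 * then deleted. The UPDATE in the test only rewrites column b, so it does
 * not change the row count.
 */
static int
expected_rows(int first, int last)
{
	int			rows = 0;

	assert(first >= 1 && last >= first);

	for (int a = 1; a <= last; a++)
	{
		/* the row exists if it was pre-existing or freshly inserted */
		int			present = (a <= 2) || (a >= first);

		/* DELETE FROM test_tab WHERE mod(a,3) = 0 */
		if (present && a % 3 != 0)
			rows++;
	}
	return rows;
}
```

With the values used in the tests, `expected_rows(3, 5000)` gives 3334: 5000 rows exist after the INSERT and 1666 of them have a key divisible by 3. The later test adds key 99999 outside the prepared transaction, after the DELETE has already run, which is why it expects 3335.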

#416Peter Smith
smithpb2250@gmail.com
In reply to: vignesh C (#316)

On Mon, May 10, 2021 at 1:31 PM vignesh C <vignesh21@gmail.com> wrote:

...

2) I felt we can change lsn data type from Int64 to XLogRecPtr
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>
3) I felt we can change xid data type from Int32 to TransactionId
+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>

...

Similar problems related to comments 2 and 3 are being discussed at
[1], we can change it accordingly based on the conclusion in the other
thread.
[1] - /messages/by-id/CAHut+Ps2JsSd_OpBR9kXt1Rt4bwyXAjh875gUpFw6T210ttO7Q@mail.gmail.com

Earlier today the other documentation patch mentioned above was
committed by Tom Lane.

The 2PC patch v102 now fixes your review comments 2 and 3 by matching
the same datatype annotation style of that commit.
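
To illustrate the annotation style referred to here, where the wire type comes first and the PostgreSQL type follows in parentheses (for example Int64 (XLogRecPtr) and Int32 (TransactionId)), here is a rough sketch in C of decoding such fields from a network-byte-order buffer. The helper names are invented for this example and are not PostgreSQL APIs; only the field widths and the conventional `%X/%X` rendering of an LSN are taken from PostgreSQL.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Read an "Int64" field: 8 bytes, most significant byte first. */
static uint64_t
read_be64(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read an "Int32" field: 4 bytes, most significant byte first. */
static uint32_t
read_be32(const unsigned char *p)
{
	uint32_t	v = 0;

	for (int i = 0; i < 4; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Render an Int64 (XLogRecPtr) value in the usual LSN notation:
 * the high and low 32-bit halves in hex, separated by a slash.
 */
static void
format_lsn(uint64_t lsn, char *buf, size_t buflen)
{
	assert(buf != NULL && buflen > 0);
	snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
			 (uint32_t) (lsn >> 32), (uint32_t) lsn);
}
```

For instance, the eight bytes `00 00 00 01 6B 37 21 F0` decode to the LSN `1/6B3721F0`, and the four bytes `00 00 02 E5` decode to the transaction ID 741.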

------
Kind Regards,
Peter Smith
Fujitsu Australia

#417Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Smith (#415)
1 attachment(s)

On Tue, Aug 3, 2021 at 6:17 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v102*

I have made minor modifications in the comments and docs, please see
attached. Can you please check whether the names of contributors in
the commit message are correct or do we need to change it?

--
With Regards,
Amit Kapila.

Attachments:

v103-0001-Add-prepare-API-support-for-streaming-transacti.patch (application/octet-stream)
From 710feb1e820c8abd59df38fa0ecc314027d6c96e Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 3 Aug 2021 09:07:51 +0530
Subject: [PATCH v103] Add prepare API support for streaming transactions in
 logical replication.

Commit a8fd13cab0 added support for prepared transactions to built-in
logical replication via a new option two_phase for a subscription. The
two_phase option was not allowed with the existing streaming option.

This commit permits the combination of "streaming" and "two_phase"
subscription options. It extends the pgoutput plugin and the subscriber
side code to add the prepare API for streaming transactions which will
apply the changes accumulated in the spool-file at prepare time.

Author: Peter Smith and Ajin Cherian
Reviewed-by: Vignesh C, Amit Kapila, Greg Nancarrow
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
---
 doc/src/sgml/logicaldecoding.sgml             |  11 +-
 doc/src/sgml/protocol.sgml                    |  76 ++++-
 doc/src/sgml/ref/create_subscription.sgml     |  10 -
 src/backend/commands/subscriptioncmds.c       |  25 --
 src/backend/replication/logical/proto.c       |  29 +-
 src/backend/replication/logical/worker.c      |  56 +++-
 src/backend/replication/pgoutput/pgoutput.c   |  33 +-
 src/include/replication/logicalproto.h        |   8 +-
 src/test/regress/expected/subscription.out    |  24 +-
 src/test/regress/sql/subscription.sql         |  11 +-
 .../subscription/t/022_twophase_cascade.pl    | 179 ++++++++++-
 .../subscription/t/023_twophase_stream.pl     | 284 ++++++++++++++++++
 12 files changed, 667 insertions(+), 79 deletions(-)
 create mode 100644 src/test/subscription/t/023_twophase_stream.pl

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 89b8090b79..0d0de291f3 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -1199,6 +1199,9 @@ OutputPluginWrite(ctx, true);
     <function>stream_abort_cb</function>, <function>stream_commit_cb</function>
     and <function>stream_change_cb</function>) and two optional callbacks
     (<function>stream_message_cb</function> and <function>stream_truncate_cb</function>).
+    Also, if streaming of two-phase commands is to be supported, then additional
+    callbacks must be provided. (See <xref linkend="logicaldecoding-two-phase-commits"/>
+    for details).
    </para>
 
    <para>
@@ -1237,7 +1240,13 @@ stream_start_cb(...);   &lt;-- start of second block of changes
   stream_change_cb(...);
 stream_stop_cb(...);    &lt;-- end of second block of changes
 
-stream_commit_cb(...);  &lt;-- commit of the streamed transaction
+
+[a. when using normal commit]
+stream_commit_cb(...);    &lt;-- commit of the streamed transaction
+
+[b. when using two-phase commit]
+stream_prepare_cb(...);   &lt;-- prepare the streamed transaction
+commit_prepared_cb(...);  &lt;-- commit of the prepared transaction
 </programlisting>
    </para>
 
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 991994de1d..91ec237c21 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -7411,7 +7411,7 @@ Stream Abort
 </variablelist>
 
 <para>
-The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared)
+The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared, Stream Prepare)
 are available since protocol version 3.
 </para>
 
@@ -7714,6 +7714,80 @@ are available since protocol version 3.
 </listitem>
 </varlistentry>
 
+<varlistentry>
+
+<term>Stream Prepare</term>
+<listitem>
+<para>
+
+<variablelist>
+
+<varlistentry>
+<term>Byte1('p')</term>
+<listitem><para>
+                Identifies the message as a two-phase stream prepare message.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>
+        Int8(0)
+</term>
+<listitem><para>
+                Flags; currently unused.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>
+        Int64 (XLogRecPtr)
+</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>
+        Int64 (XLogRecPtr)
+</term>
+<listitem><para>
+                The end LSN of the prepared transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>
+        Int64 (TimestampTz)
+</term>
+<listitem><para>
+                Prepare timestamp of the transaction. The value is in number
+                of microseconds since PostgreSQL epoch (2000-01-01).
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>
+        Int32 (TransactionId)
+</term>
+<listitem><para>
+                Xid of the transaction.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>String</term>
+<listitem><para>
+                The user-defined GID of the two-phase transaction.
+</para></listitem>
+</varlistentry>
+
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
 </variablelist>
 
 <para>
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 143390593d..702934eba1 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -238,11 +238,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           subscriber as a whole.
          </para>
 
-         <para>
-          The <literal>streaming</literal> option cannot be used with the
-          <literal>two_phase</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
        <varlistentry>
@@ -269,11 +264,6 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl
           to know the actual two-phase state.
          </para>
 
-         <para>
-          The <literal>two_phase</literal> option cannot be used with the
-          <literal>streaming</literal> option.
-         </para>
-
         </listitem>
        </varlistentry>
       </variablelist></para>
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 22ae982328..5157f44058 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -335,25 +335,6 @@ parse_subscription_options(ParseState *pstate, List *stmt_options,
 					 errmsg("subscription with %s must also set %s",
 							"slot_name = NONE", "create_slot = false")));
 	}
-
-	/*
-	 * Do additional checking for the disallowed combination of two_phase and
-	 * streaming. While streaming and two_phase can theoretically be
-	 * supported, it needs more analysis to allow them together.
-	 */
-	if (opts->twophase &&
-		IsSet(supported_opts, SUBOPT_TWOPHASE_COMMIT) &&
-		IsSet(opts->specified_opts, SUBOPT_TWOPHASE_COMMIT))
-	{
-		if (opts->streaming &&
-			IsSet(supported_opts, SUBOPT_STREAMING) &&
-			IsSet(opts->specified_opts, SUBOPT_STREAMING))
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-			/*- translator: both %s are strings of the form "option = value" */
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase = true", "streaming = true")));
-	}
 }
 
 /*
@@ -933,12 +914,6 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 				if (IsSet(opts.specified_opts, SUBOPT_STREAMING))
 				{
-					if ((sub->twophasestate != LOGICALREP_TWOPHASE_STATE_DISABLED) && opts.streaming)
-						ereport(ERROR,
-								(errcode(ERRCODE_SYNTAX_ERROR),
-								 errmsg("cannot set %s for two-phase enabled subscription",
-										"streaming = true")));
-
 					values[Anum_pg_subscription_substream - 1] =
 						BoolGetDatum(opts.streaming);
 					replaces[Anum_pg_subscription_substream - 1] = true;
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 2d774567e0..52b65e9572 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -145,7 +145,8 @@ logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_da
 }
 
 /*
- * The core functionality for logicalrep_write_prepare.
+ * The core functionality for logicalrep_write_prepare and
+ * logicalrep_write_stream_prepare.
  */
 static void
 logicalrep_write_prepare_common(StringInfo out, LogicalRepMsgType type,
@@ -188,7 +189,8 @@ logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn,
 }
 
 /*
- * The core functionality for logicalrep_read_prepare.
+ * The core functionality for logicalrep_read_prepare and
+ * logicalrep_read_stream_prepare.
  */
 static void
 logicalrep_read_prepare_common(StringInfo in, char *msgtype,
@@ -209,6 +211,8 @@ logicalrep_read_prepare_common(StringInfo in, char *msgtype,
 		elog(ERROR, "end_lsn is not set in %s message", msgtype);
 	prepare_data->prepare_time = pq_getmsgint64(in);
 	prepare_data->xid = pq_getmsgint(in, 4);
+	if (prepare_data->xid == InvalidTransactionId)
+		elog(ERROR, "invalid two-phase transaction ID in %s message", msgtype);
 
 	/* read gid (copy it into a pre-allocated buffer) */
 	strlcpy(prepare_data->gid, pq_getmsgstring(in), sizeof(prepare_data->gid));
@@ -339,6 +343,27 @@ logicalrep_read_rollback_prepared(StringInfo in,
 	strlcpy(rollback_data->gid, pq_getmsgstring(in), sizeof(rollback_data->gid));
 }
 
+/*
+ * Write STREAM PREPARE to the output stream.
+ */
+void
+logicalrep_write_stream_prepare(StringInfo out,
+								ReorderBufferTXN *txn,
+								XLogRecPtr prepare_lsn)
+{
+	logicalrep_write_prepare_common(out, LOGICAL_REP_MSG_STREAM_PREPARE,
+									txn, prepare_lsn);
+}
+
+/*
+ * Read STREAM PREPARE from the stream.
+ */
+void
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+	logicalrep_read_prepare_common(in, "stream prepare", prepare_data);
+}
+
 /*
  * Write ORIGIN to the output stream.
  */
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 249de80798..ecaed157f2 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1052,6 +1052,56 @@ apply_handle_rollback_prepared(StringInfo s)
 	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
+/*
+ * Handle STREAM PREPARE.
+ *
+ * Logic is in two parts:
+ * 1. Replay all the spooled operations
+ * 2. Mark the transaction as prepared
+ */
+static void
+apply_handle_stream_prepare(StringInfo s)
+{
+	LogicalRepPreparedTxnData prepare_data;
+
+	if (in_streamed_transaction)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("STREAM PREPARE message without STREAM STOP")));
+
+	/* Tablesync should never receive prepare. */
+	if (am_tablesync_worker())
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("tablesync worker received a STREAM PREPARE message")));
+
+	logicalrep_read_stream_prepare(s, &prepare_data);
+
+	elog(DEBUG1, "received prepare for streamed transaction %u", prepare_data.xid);
+
+	/* Replay all the spooled operations. */
+	apply_spooled_messages(prepare_data.xid, prepare_data.prepare_lsn);
+
+	/* Mark the transaction as prepared. */
+	apply_handle_prepare_internal(&prepare_data);
+
+	CommitTransactionCommand();
+
+	pgstat_report_stat(false);
+
+	store_flush_position(prepare_data.end_lsn);
+
+	in_remote_transaction = false;
+
+	/* unlink the files with serialized changes and subxact info. */
+	stream_cleanup_files(MyLogicalRepWorker->subid, prepare_data.xid);
+
+	/* Process any tables that are being synchronized in parallel. */
+	process_syncing_tables(prepare_data.end_lsn);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
+}
+
 /*
  * Handle ORIGIN message.
  *
@@ -1291,7 +1341,7 @@ apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 	 */
 	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
 
-	/* open the spool file for the committed transaction */
+	/* Open the spool file for the committed/prepared transaction */
 	changes_filename(path, MyLogicalRepWorker->subid, xid);
 	elog(DEBUG1, "replaying changes from file \"%s\"", path);
 
@@ -2357,6 +2407,10 @@ apply_dispatch(StringInfo s)
 		case LOGICAL_REP_MSG_ROLLBACK_PREPARED:
 			apply_handle_rollback_prepared(s);
 			return;
+
+		case LOGICAL_REP_MSG_STREAM_PREPARE:
+			apply_handle_stream_prepare(s);
+			return;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index e4314af13a..286119c8c8 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -71,6 +71,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 								   ReorderBufferTXN *txn,
 								   XLogRecPtr commit_lsn);
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+										ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);
 
 static bool publications_valid;
 static bool in_streaming;
@@ -175,7 +177,7 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 	cb->stream_message_cb = pgoutput_message;
 	cb->stream_truncate_cb = pgoutput_truncate;
 	/* transaction streaming - two-phase commit */
-	cb->stream_prepare_cb = NULL;
+	cb->stream_prepare_cb = pgoutput_stream_prepare_txn;
 }
 
 static void
@@ -280,17 +282,6 @@ parse_output_parameters(List *options, PGOutputData *data)
 		}
 		else
 			elog(ERROR, "unrecognized pgoutput option: %s", defel->defname);
-
-		/*
-		 * Do additional checking for the disallowed combination of two_phase
-		 * and streaming. While streaming and two_phase can theoretically be
-		 * supported, it needs more analysis to allow them together.
-		 */
-		if (data->two_phase && data->streaming)
-			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("%s and %s are mutually exclusive options",
-							"two_phase", "streaming")));
 	}
 }
 
@@ -1029,6 +1020,24 @@ pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
 	cleanup_rel_sync_cache(txn->xid, true);
 }
 
+/*
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn,
+							XLogRecPtr prepare_lsn)
+{
+	Assert(rbtxn_is_streamed(txn));
+
+	OutputPluginUpdateProgress(ctx);
+	OutputPluginPrepareWrite(ctx, true);
+	logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn);
+	OutputPluginWrite(ctx, true);
+}
+
 /*
  * Initialize the relation schema sync cache for a decoding session.
  *
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 63de90d94a..2e29513151 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -67,7 +67,8 @@ typedef enum LogicalRepMsgType
 	LOGICAL_REP_MSG_STREAM_START = 'S',
 	LOGICAL_REP_MSG_STREAM_END = 'E',
 	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
-	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
+	LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+	LOGICAL_REP_MSG_STREAM_PREPARE = 'p'
 } LogicalRepMsgType;
 
 /*
@@ -196,7 +197,10 @@ extern void logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN
 											   TimestampTz prepare_time);
 extern void logicalrep_read_rollback_prepared(StringInfo in,
 											  LogicalRepRollbackPreparedTxnData *rollback_data);
-
+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+											XLogRecPtr prepare_lsn);
+extern void logicalrep_read_stream_prepare(StringInfo in,
+										   LogicalRepPreparedTxnData *prepare_data);
 
 extern void logicalrep_write_origin(StringInfo out, const char *origin,
 									XLogRecPtr origin_lsn);
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 67f92b3878..77b4437b69 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -282,27 +282,29 @@ WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ..
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 ERROR:  unrecognized subscription parameter: "two_phase"
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
-ERROR:  cannot set streaming = true for two-phase enabled subscription
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 \dRs+
                                                                      List of subscriptions
       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
 -----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
- regress_testsub | regress_subscription_user | f       | {testpub}   | f      | f         | p                | off                | dbname=regress_doesnotexist
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
 (1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
-ERROR:  two_phase = true and streaming = true are mutually exclusive options
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+WARNING:  tables were not subscribed, you will have to run ALTER SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
 \dRs+
-                                            List of subscriptions
- Name | Owner | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit | Conninfo 
-------+-------+---------+-------------+--------+-----------+------------------+--------------------+----------
-(0 rows)
+                                                                     List of subscriptions
+      Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two phase commit | Synchronous commit |          Conninfo           
+-----------------+---------------------------+---------+-------------+--------+-----------+------------------+--------------------+-----------------------------
+ regress_testsub | regress_subscription_user | f       | {testpub}   | f      | t         | p                | off                | dbname=regress_doesnotexist
+(1 row)
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
 DROP ROLE regress_subscription_user2;
diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql
index 88743ab33b..d42104c191 100644
--- a/src/test/regress/sql/subscription.sql
+++ b/src/test/regress/sql/subscription.sql
@@ -215,20 +215,21 @@ CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUB
 --fail - alter of two_phase option not supported.
 ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
 
---fail - cannot set streaming when two_phase enabled
+-- but can alter streaming when two_phase enabled
 ALTER SUBSCRIPTION regress_testsub SET (streaming = true);
 
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
 DROP SUBSCRIPTION regress_testsub;
 
--- fail - two_phase and streaming are mutually exclusive.
-CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (streaming = true, two_phase = true);
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
 
 \dRs+
 
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
 
 RESET SESSION AUTHORIZATION;
 DROP ROLE regress_subscription_user;
diff --git a/src/test/subscription/t/022_twophase_cascade.pl b/src/test/subscription/t/022_twophase_cascade.pl
index d7cc99959f..a47c62d8fd 100644
--- a/src/test/subscription/t/022_twophase_cascade.pl
+++ b/src/test/subscription/t/022_twophase_cascade.pl
@@ -2,11 +2,14 @@
 # Copyright (c) 2021, PostgreSQL Global Development Group
 
 # Test cascading logical replication of 2PC.
+#
+# Includes tests for options 2PC (not-streaming) and also for 2PC (streaming).
+#
 use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 27;
+use Test::More tests => 41;
 
 ###############################
 # Setup a cascade of pub/sub nodes.
@@ -17,20 +20,26 @@ use Test::More tests => 27;
 # node_A
 my $node_A = PostgresNode->new('node_A');
 $node_A->init(allows_streaming => 'logical');
-$node_A->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_A->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+logical_decoding_work_mem = 64kB
+));
 $node_A->start;
 # node_B
 my $node_B = PostgresNode->new('node_B');
 $node_B->init(allows_streaming => 'logical');
-$node_B->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+logical_decoding_work_mem = 64kB
+));
 $node_B->start;
 # node_C
 my $node_C = PostgresNode->new('node_C');
 $node_C->init(allows_streaming => 'logical');
-$node_C->append_conf('postgresql.conf',
-	qq(max_prepared_transactions = 10));
+$node_C->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+logical_decoding_work_mem = 64kB
+));
 $node_C->start;
 
 # Create some pre-existing content on node_A
@@ -45,12 +54,29 @@ $node_B->safe_psql('postgres',
 $node_C->safe_psql('postgres',
 	"CREATE TABLE tab_full (a int PRIMARY KEY)");
 
+# Create some pre-existing content on node_A (for streaming tests)
+$node_A->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_A->safe_psql('postgres',
+	"INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Create the same tables on node_B and node_C
+# columns a and b are compatible with same table name on node_A
+$node_B->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+$node_C->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
 # Setup logical replication
 
+# -----------------------
+# 2PC NON-STREAMING TESTS
+# -----------------------
+
 # node_A (pub) -> node_B (sub)
 my $node_A_connstr = $node_A->connstr . ' dbname=postgres';
 $node_A->safe_psql('postgres',
-	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full");
+	"CREATE PUBLICATION tap_pub_A FOR TABLE tab_full, test_tab");
 my $appname_B = 'tap_sub_B';
 $node_B->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_B
@@ -61,7 +87,7 @@ $node_B->safe_psql('postgres',	"
 # node_B (pub) -> node_C (sub)
 my $node_B_connstr = $node_B->connstr . ' dbname=postgres';
 $node_B->safe_psql('postgres',
-	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full");
+	"CREATE PUBLICATION tap_pub_B FOR TABLE tab_full, test_tab");
 my $appname_C = 'tap_sub_C';
 $node_C->safe_psql('postgres',	"
 	CREATE SUBSCRIPTION tap_sub_C
@@ -203,6 +229,141 @@ is($result, qq(21), 'Rows committed are present on subscriber B');
 $result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);");
 is($result, qq(21), 'Rows committed are present on subscriber C');
 
+# ---------------------
+# 2PC + STREAMING TESTS
+# ---------------------
+
+my $oldpid_B = $node_A->safe_psql('postgres', "
+	SELECT pid FROM pg_stat_replication
+	WHERE application_name = '$appname_B';");
+my $oldpid_C = $node_B->safe_psql('postgres', "
+	SELECT pid FROM pg_stat_replication
+	WHERE application_name = '$appname_C';");
+
+# Setup logical replication (streaming = on)
+
+$node_B->safe_psql('postgres', "
+	ALTER SUBSCRIPTION tap_sub_B
+	SET (streaming = on);");
+$node_C->safe_psql('postgres', "
+	ALTER SUBSCRIPTION tap_sub_C
+	SET (streaming = on)");
+
+# Wait for subscribers to finish initialization
+
+$node_A->poll_query_until('postgres', "
+	SELECT pid != $oldpid_B FROM pg_stat_replication
+	WHERE application_name = '$appname_B';"
+) or die "Timed out while waiting for apply to restart";
+$node_B->poll_query_until('postgres', "
+	SELECT pid != $oldpid_C FROM pg_stat_replication
+	WHERE application_name = '$appname_C';"
+) or die "Timed out while waiting for apply to restart";
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED.
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber(s) after the commit.
+###############################
+
+# Insert, update and delete enough rows to exceed the 64kB limit.
+# Then 2PC PREPARE
+$node_A->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check that transaction was committed on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber B, and extra columns have local defaults');
+$result = $node_C->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber C, and extra columns have local defaults');
+
+# check the transaction state is ended on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber C');
+
+###############################
+# Test 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT.
+# 0. Cleanup from previous test leaving only 2 rows.
+# 1. Insert one more row.
+# 2. Record a SAVEPOINT.
+# 3. Data is streamed using 2PC.
+# 4. Do rollback to SAVEPOINT prior to the streamed inserts.
+# 5. Then COMMIT PREPARED.
+#
+# Expect data after the SAVEPOINT is aborted leaving only 3 rows (= 2 original + 1 from step 1).
+###############################
+
+# First, delete the data except for 2 rows (delete will be replicated)
+$node_A->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
+$node_A->safe_psql('postgres', "
+	BEGIN;
+	INSERT INTO test_tab VALUES (9999, 'foobar');
+	SAVEPOINT sp_inner;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	ROLLBACK TO SAVEPOINT sp_inner;
+	PREPARE TRANSACTION 'outer';
+	");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state prepared on subscriber(s)
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber C');
+
+# 2PC COMMIT
+$node_A->safe_psql('postgres', "COMMIT PREPARED 'outer';");
+
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);
+
+# check the transaction state is ended on subscriber
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is ended on subscriber C');
+
+# check inserts are visible at subscriber(s).
+# All the streamed data (after the SAVEPOINT) should be rolled back.
+# (9999, 'foobar') should be committed.
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows committed are present on subscriber B');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber C');
+$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+is($result, qq(3), 'Rows committed are present on subscriber C');
+
 ###############################
 # check all the cleanup
 ###############################
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
new file mode 100644
index 0000000000..c72c6b5ef4
--- /dev/null
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -0,0 +1,284 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test logical replication of 2PC with streaming.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 18;
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgresNode->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+logical_decoding_work_mem = 64kB
+));
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = PostgresNode->new('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+));
+$node_subscriber->start;
+
+# Create some pre-existing content on publisher
+$node_publisher->safe_psql('postgres', "CREATE TABLE test_tab (a int primary key, b varchar)");
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar')");
+
+# Setup structure on subscriber (columns a and b are compatible with same table name on publisher)
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE test_tab (a int primary key, b text, c timestamptz DEFAULT now(), d bigint DEFAULT 999)");
+
+# Setup logical replication (streaming = on)
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub FOR TABLE test_tab");
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres', "
+	CREATE SUBSCRIPTION tap_sub
+	CONNECTION '$publisher_connstr application_name=$appname'
+	PUBLICATION tap_pub
+	WITH (streaming = on, two_phase = on)");
+
+# Wait for subscriber to finish initialization
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+	"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+# Also wait for two-phase to be enabled
+my $twophase_query =
+	"SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+  or die "Timed out while waiting for subscriber to enable twophase";
+
+###############################
+# Check initial data was copied to subscriber
+###############################
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'check initial data was copied to subscriber');
+
+###############################
+# Test 2PC PREPARE / COMMIT PREPARED.
+# 1. Data is streamed as a 2PC transaction.
+# 2. Then do commit prepared.
+#
+# Expect all data is replicated on subscriber side after the commit.
+###############################
+
+# check that 2PC gets replicated to subscriber
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres',  "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(2|2|2), 'Rows inserted by 2PC are rolled back, leaving only the original 2 rows');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.
+# 1. insert, update and delete enough rows to exceed the 64kB limit.
+# 2. Then server crashes before the 2PC transaction is committed.
+# 3. After servers are restarted the pending transaction is committed.
+#
+# Expect all data is replicated on subscriber side after the commit.
+# Note: both publisher and subscriber do crash/restart.
+###############################
+
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_subscriber->stop('immediate');
+$node_publisher->stop('immediate');
+
+$node_publisher->start;
+$node_subscriber->start;
+
+# commit post the restart
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+$node_publisher->wait_for_catchup($appname);
+
+# check inserts are visible
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults');
+
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber
+# (the original 2 + inserted 1).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets aborted
+$node_publisher->safe_psql('postgres', "ROLLBACK PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is aborted on subscriber,
+# but the extra INSERT outside of the 2PC still was replicated
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3|3|3), 'check the outside insert was copied to subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is aborted on subscriber');
+
+###############################
+# Do INSERT after the PREPARE but before COMMIT PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE.
+# 4. Then do a COMMIT PREPARED.
+#
+# Expect 2PC data + the extra row are on the subscriber
+# (the 3334 + inserted 1 = 3335).
+###############################
+
+# First, delete the data except for 2 rows (will be replicated)
+$node_publisher->safe_psql('postgres', "DELETE FROM test_tab WHERE a > 2;");
+
+# Then insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+	BEGIN;
+	INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+	UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+	DELETE FROM test_tab WHERE mod(a,3) = 0;
+	PREPARE TRANSACTION 'test_prepared_tab';});
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is in prepared state on subscriber
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Insert a different record (now we are outside of the 2PC transaction)
+# Note: the 2PC transaction still holds row locks so make sure this insert is for a separate primary key
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab VALUES (99999, 'foobar')");
+
+# 2PC transaction gets committed
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab';");
+
+$node_publisher->wait_for_catchup($appname);
+
+# check that transaction is committed on subscriber
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), count(c), count(d = 999) FROM test_tab");
+is($result, qq(3335|3335|3335), 'Rows inserted by 2PC (as well as outside insert) have committed on subscriber, and extra columns contain local defaults');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is committed on subscriber');
+
+###############################
+# check all the cleanup
+###############################
+
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0), 'check subscription relation status was dropped on subscriber');
+
+$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
-- 
2.28.0.windows.1
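
As an aside on the expected counts in the streaming tests above: the value 3334 can be checked by hand. The following is a minimal C sketch, not part of the patch; the function name expected_rows_after_commit is hypothetical and only mirrors the SQL used in the tests (keys 1..5000 exist after the INSERT, the UPDATE changes no keys, and the DELETE removes every key with mod(a,3) = 0).

```c
#include <assert.h>

/*
 * Illustrative helper (hypothetical, not part of the patch).
 *
 * Counts the rows left in test_tab once the large test transaction commits:
 *   - keys 1 and 2 are pre-existing, keys 3..max_key come from
 *     generate_series(3, max_key), so keys 1..max_key are all present;
 *   - the UPDATE only rewrites column b, so no key is added or removed;
 *   - the DELETE removes every key that is divisible by 3.
 */
static int
expected_rows_after_commit(int max_key)
{
	int			remaining = 0;

	/* The tests always start from the two pre-existing rows. */
	assert(max_key >= 2);

	for (int a = 1; a <= max_key; a++)
	{
		if (a % 3 != 0)
			remaining++;
	}

	return remaining;
}
```

Calling expected_rows_after_commit(5000) gives 5000 - 1666 = 3334, which is the count the tests compare against. The test that inserts one extra row outside the prepared transaction before COMMIT PREPARED then expects 3334 + 1 = 3335.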

#418Peter Smith
smithpb2250@gmail.com
In reply to: Amit Kapila (#417)

On Tue, Aug 3, 2021 at 5:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Aug 3, 2021 at 6:17 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v102*

I have made minor modifications in the comments and docs, please see
attached. Can you please check whether the names of contributors in
the commit message are correct or do we need to change it?

I checked the differences between v102 and v103 and have no review
comments about the latest changes.

The commit message looks ok.

I applied the v103 to the current HEAD; no errors.
The build is ok.
The make check is ok.
The TAP subscription tests are ok.

I also rebuilt the PG docs and verified rendering of the updated pages looks ok.

The patch v103 LGTM.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

#419vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#417)

On Tue, Aug 3, 2021 at 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Aug 3, 2021 at 6:17 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v102*

I have made minor modifications in the comments and docs, please see
attached. Can you please check whether the names of contributors in
the commit message are correct or do we need to change it?

The patch applies neatly, the tests pass and documentation built with
the updates provided. I could not find any comments. The patch looks
good to me.

Regards,
Vignesh

#420tanghy.fnst@fujitsu.com
tanghy.fnst@fujitsu.com
In reply to: vignesh C (#419)
RE: [HACKERS] logical decoding of two-phase transactions

On Tuesday, August 3, 2021 6:03 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, Aug 3, 2021 at 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Aug 3, 2021 at 6:17 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v102*

I have made minor modifications in the comments and docs, please see
attached. Can you please check whether the names of contributors in
the commit message are correct or do we need to change it?

The patch applies neatly, the tests pass and documentation built with
the updates provided. I could not find any comments. The patch looks
good to me.

I did some stress tests on the patch and found no issues.
It also works well when using synchronized replication.
So the patch LGTM.

Regards
Tang

#421Amit Kapila
amit.kapila16@gmail.com
In reply to: tanghy.fnst@fujitsu.com (#420)

On Wed, Aug 4, 2021 at 6:51 AM tanghy.fnst@fujitsu.com
<tanghy.fnst@fujitsu.com> wrote:

On Tuesday, August 3, 2021 6:03 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, Aug 3, 2021 at 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Aug 3, 2021 at 6:17 AM Peter Smith <smithpb2250@gmail.com> wrote:

Please find attached the latest patch set v102*

I have made minor modifications in the comments and docs, please see
attached. Can you please check whether the names of contributors in
the commit message are correct or do we need to change it?

The patch applies neatly, the tests pass and documentation built with
the updates provided. I could not find any comments. The patch looks
good to me.

I did some stress tests on the patch and found no issues.
It also works well when using synchronized replication.
So the patch LGTM.

I have pushed this last patch in the series.

--
With Regards,
Amit Kapila.

#422Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#421)

On Wed, Aug 4, 2021 at 4:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have pushed this last patch in the series.

I have closed this CF entry. Thanks to everyone involved in this work!

--
With Regards,
Amit Kapila.

#423Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#422)

Hi,

On Mon, Aug 9, 2021 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 4, 2021 at 4:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have pushed this last patch in the series.

I have closed this CF entry. Thanks to everyone involved in this work!

I have a question about the two_phase column of the pg_replication_slots
view. With this feature, pg_replication_slots has a new column, two_phase:

         View "pg_catalog.pg_replication_slots"
       Column        |  Type   | Collation | Nullable | Default
---------------------+---------+-----------+----------+---------
 slot_name           | name    |           |          |
 plugin              | name    |           |          |
 slot_type           | text    |           |          |
 datoid              | oid     |           |          |
 database            | name    |           |          |
 temporary           | boolean |           |          |
 active              | boolean |           |          |
 active_pid          | integer |           |          |
 xmin                | xid     |           |          |
 catalog_xmin        | xid     |           |          |
 restart_lsn         | pg_lsn  |           |          |
 confirmed_flush_lsn | pg_lsn  |           |          |
 wal_status          | text    |           |          |
 safe_wal_size       | bigint  |           |          |
 two_phase           | boolean |           |          |

According to the doc, the two_phase field has:

True if the slot is enabled for decoding prepared transactions. Always
false for physical slots.

It seems a bit unnatural to me that replication slots have such a
property, since replication slots have been used to protect WAL and
tuples that are required for logical decoding, physical replication,
backup, etc. from removal. Also, it seems that even if a replication
slot is created with two_phase = off, it's overwritten to on if the
plugin enables the two-phase option. Is there any reason why we can turn
this value on and off on the replication slot side, and is there any
use case where the replication slot’s two_phase is false and the
plugin’s two-phase option is on, or vice versa? I think that we can
have replication slots always have a two_phase_at value and remove the
two_phase field from the view.
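
To make the behaviour I'm describing concrete, here is a toy model in C. It
is only a sketch of what I observed, not the actual code in the tree: the
names ToySlot and toy_start_decoding, and the idea of passing the option and
the start position as plain arguments, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the persistent part of a replication slot. */
typedef struct {
    bool     two_phase;     /* the flag shown in pg_replication_slots */
    uint64_t two_phase_at;  /* position from which prepares are decoded; 0 = unset */
} ToySlot;

/*
 * Start a decoding session on the slot. Returns true if prepared
 * transactions would be decoded at PREPARE time in this session.
 */
bool toy_start_decoding(ToySlot *slot, bool plugin_supports_2pc,
                        bool option_given, uint64_t start_lsn)
{
    assert(slot != NULL);

    /* Either the slot already has the flag, or the caller asks for it now. */
    bool enabled = plugin_supports_2pc && (slot->two_phase || option_given);

    /* A slot created with two_phase = off is switched on and stays on. */
    if (enabled && !slot->two_phase) {
        slot->two_phase = true;
        slot->two_phase_at = start_lsn;
    }
    return enabled;
}
```

In this model a slot that starts with two_phase = false flips to true the
first time the option is given, and later sessions keep decoding prepares
even without the option, which is why the separate boolean looks redundant
to me next to two_phase_at.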

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#424Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#423)

On Tue, Jan 4, 2022 at 9:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

According to the doc, the two_phase field has:

True if the slot is enabled for decoding prepared transactions. Always
false for physical slots.

It seems a bit unnatural to me that replication slots have such a
property, since replication slots have been used to protect WAL and
tuples that are required for logical decoding, physical replication,
backup, etc. from removal. Also, it seems that even if a replication
slot is created with two_phase = off, it's overwritten to on if the
plugin enables the two-phase option. Is there any reason why we can turn
this value on and off on the replication slot side, and is there any
use case where the replication slot’s two_phase is false and the
plugin’s two-phase option is on, or vice versa?

We enable two_phase only when we start streaming from the subscriber
side. This is required because we can't enable it until the initial
sync is complete; otherwise, it could lead to loss of data. See the
comments atop worker.c (the description under the title: TWO_PHASE
TRANSACTIONS).
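
As a rough illustration of that ordering, a toy state machine in C follows.
It is a simplified sketch and not the worker.c implementation: the type
ToySubscription, the TOY_TWOPHASE_* states, and the per-table counters are
hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states, loosely following the description above. */
typedef enum {
    TOY_TWOPHASE_DISABLED,  /* subscription created with two_phase = false */
    TOY_TWOPHASE_PENDING,   /* requested, but initial sync not yet complete */
    TOY_TWOPHASE_ENABLED    /* streaming was started with two_phase on */
} ToyTwoPhaseState;

typedef struct {
    ToyTwoPhaseState state;
    int tables_total;   /* tables that need an initial sync */
    int tables_ready;   /* tables whose initial sync has finished */
} ToySubscription;

ToySubscription toy_create_subscription(bool two_phase_requested, int tables_total)
{
    assert(tables_total >= 0);
    ToySubscription sub = {
        two_phase_requested ? TOY_TWOPHASE_PENDING : TOY_TWOPHASE_DISABLED,
        tables_total,
        0
    };
    return sub;
}

/* Record that one more table has finished its initial sync. */
void toy_table_synced(ToySubscription *sub)
{
    if (sub->tables_ready < sub->tables_total)
        sub->tables_ready++;
}

/*
 * Called whenever streaming (re)starts. Returns true if the subscriber
 * would ask the slot to decode prepared transactions.
 */
bool toy_start_streaming(ToySubscription *sub)
{
    /* Only move PENDING -> ENABLED once every table is synced. */
    if (sub->state == TOY_TWOPHASE_PENDING &&
        sub->tables_ready == sub->tables_total)
        sub->state = TOY_TWOPHASE_ENABLED;

    return sub->state == TOY_TWOPHASE_ENABLED;
}
```

The point of the sketch is that the request made at CREATE SUBSCRIPTION time
and the moment the slot actually starts decoding prepares are separate
events, so something on the slot has to record whether the switch has
happened yet.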

I think that we can
have replication slots always have a two_phase_at value and remove the
two_phase field from the view.

I am not sure how that will work, because we can allow streaming of
prepared transactions when that is enabled at CREATE SUBSCRIPTION
time, and the default for that option is false.

--
With Regards,
Amit Kapila.